12.3 Single-Pass Reduction
The two-pass reduction approach is in part a workaround for the inability of CUDA blocks to synchronize with one another. In the absence of interblock synchronization to determine when processing of the final output can begin, a second kernel invocation is needed.
The second kernel invocation can be avoided by using a combination of atomic operations and shared memory, as described in the threadfenceReduction sample in the CUDA SDK. A single device memory location tracks which thread blocks have finished writing their partial sums. Once all blocks have finished, one block performs the final log-step reduction to write the output.
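Before turning to the full kernel, it helps to see the "last block" idiom in isolation. The following is a minimal sketch, not code from the SDK sample; the names LastBlockIdiom and blocksFinished are placeholders for illustration.

// Minimal sketch of the "last block" idiom (illustrative names)
__device__ unsigned int blocksFinished = 0;

__global__ void
LastBlockIdiom( /* ... */ )
{
    __shared__ bool amLast;
    // ... each block writes its partial result to global memory ...
    __threadfence();   // ensure the write is visible to other blocks
    if ( threadIdx.x == 0 ) {
        // Each block takes a ticket; the block holding the last
        // ticket knows every other block has finished.
        unsigned int ticket = atomicAdd( &blocksFinished, 1 );
        amLast = ( ticket == gridDim.x-1 );
    }
    __syncthreads();   // make amLast visible to the whole block
    if ( amLast ) {
        // ... all partial results are visible; compute the final output ...
        blocksFinished = 0;   // reset for the next kernel launch
    }
}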
Since this kernel performs several log-step reductions from shared memory, the code in Listing 12.3 that conditionally adds based on the templated thread count is pulled into a separate device function for reuse.
Listing 12.5 Reduction4_LogStepShared.
template<unsigned int numThreads>
__device__ void
Reduction4_LogStepShared( int *out, volatile int *partials )
{
    const int tid = threadIdx.x;
    if (numThreads >= 1024) {
        if (tid < 512) {
            partials[tid] += partials[tid + 512];
        }
        __syncthreads();
    }
    if (numThreads >= 512) {
        if (tid < 256) {
            partials[tid] += partials[tid + 256];
        }
        __syncthreads();
    }
    if (numThreads >= 256) {
        if (tid < 128) {
            partials[tid] += partials[tid + 128];
        }
        __syncthreads();
    }
    if (numThreads >= 128) {
        if (tid < 64) {
            partials[tid] += partials[tid + 64];
        }
        __syncthreads();
    }
    // warp synchronous at the end
    if ( tid < 32 ) {
        if (numThreads >= 64) { partials[tid] += partials[tid + 32]; }
        if (numThreads >= 32) { partials[tid] += partials[tid + 16]; }
        if (numThreads >= 16) { partials[tid] += partials[tid +  8]; }
        if (numThreads >=  8) { partials[tid] += partials[tid +  4]; }
        if (numThreads >=  4) { partials[tid] += partials[tid +  2]; }
        if (numThreads >=  2) { partials[tid] += partials[tid +  1]; }
        if ( tid == 0 ) {
            *out = partials[0];
        }
    }
}
The Reduction4_LogStepShared() function, shown in Listing 12.5, writes the reduction of the thread block's partial sums, given by partials, to the memory location specified by out. Listing 12.6 gives the single-pass reduction that uses Reduction4_LogStepShared() as a subroutine.
Listing 12.6 Single-pass reduction kernel (reduction4SinglePass.cuh).
// Global variable used by reduceSinglePass to count blocks
__device__ unsigned int retirementCount = 0;
template <unsigned int numThreads>
__global__ void
reduceSinglePass( int *out, int *partial,
    const int *in, unsigned int N )
{
    extern __shared__ int sPartials[];
    unsigned int tid = threadIdx.x;
    int sum = 0;
    for ( size_t i = blockIdx.x*numThreads + tid;
          i < N;
          i += numThreads*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();

    if ( gridDim.x == 1 ) {
        Reduction4_LogStepShared<numThreads>( &out[blockIdx.x], sPartials );
        return;
    }
    Reduction4_LogStepShared<numThreads>( &partial[blockIdx.x], sPartials );

    __shared__ bool lastBlock;

    // wait for outstanding memory instructions in this thread
    __threadfence();

    // Thread 0 takes a ticket
    if ( tid == 0 ) {
        unsigned int ticket = atomicAdd( &retirementCount, 1 );
        // If the ticket ID is equal to the number of blocks,
        // we are the last block!
        lastBlock = ( ticket == gridDim.x-1 );
    }
    __syncthreads();

    // One block performs the final log-step reduction
    if ( lastBlock ) {
        int sum = 0;
        for ( size_t i = tid; i < gridDim.x; i += numThreads ) {
            sum += partial[i];
        }
        sPartials[threadIdx.x] = sum;
        __syncthreads();
        Reduction4_LogStepShared<numThreads>( out, sPartials );
        retirementCount = 0;
    }
}
The kernel starts out with familiar code that has each thread compute a partial reduction across the input array and write the result to shared memory. Once this is done, the single-block case is treated specially, since the output of the log-step reduction from shared memory can be written directly rather than to the array of partial sums. The remainder of the kernel is executed only when the kernel is launched with multiple thread blocks.
The shared Boolean lastBlock is used to evaluate a predicate that must be communicated to all threads in the final block. The __threadfence() causes the calling thread to wait until its pending memory transactions have been posted to device memory; once it has executed, the thread's writes to global memory are visible to all threads on the device, not just the calling thread or the threads in its block.
As each block exits, it performs an atomicAdd() to check whether it is the one block that needs to perform the final log-step reduction. Since atomicAdd() returns the previous value of the memory location, the block that increments retirementCount and receives a value equal to gridDim.x-1 can be deemed the "last block" and can perform the final reduction. The lastBlock shared memory location communicates that result to all threads in the block, and __syncthreads() must then be called so the write to lastBlock is visible to all threads in the block. The final block performs the log-step reduction of the partial sums and writes the result. Finally, retirementCount is set back to 0 for subsequent invocations of reduceSinglePass().
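The listings show only device code. As a rough illustration of how the kernel might be invoked, the following host-side sketch is not from the sample; the element count, block and thread counts are arbitrary, and dIn, dPartials, and dOut are placeholder names. The template parameter must match the launched block size, and the dynamic shared memory allocation must hold one int per thread.

// Hypothetical host-side launch (illustrative names and sizes)
const unsigned int N          = 1048576;  // number of input elements
const unsigned int numThreads = 256;      // must match the template parameter
const unsigned int numBlocks  = 64;
int *dIn, *dPartials, *dOut;
cudaMalloc( &dIn, N*sizeof(int) );
cudaMalloc( &dPartials, numBlocks*sizeof(int) );  // one partial sum per block
cudaMalloc( &dOut, sizeof(int) );
// ... copy the input array to dIn ...
reduceSinglePass<numThreads><<<
    numBlocks, numThreads, numThreads*sizeof(int)>>>(
        dOut, dPartials, dIn, N );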