
12.2 Two-Pass Reduction

This algorithm operates in two stages. A kernel performs numBlocks reductions in parallel, where numBlocks is the number of blocks used to invoke the kernel; the results are written to an intermediate array. The final result is generated by invoking the same kernel to perform a second pass on the intermediate array with a single block. Listing 12.1 gives a two-pass reduction kernel that computes the sum of an array of integers, along with the host function that invokes it.

Listing 12.1 Two-pass reduction kernel.

__global__ void
Reduction1_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    int sum = 0;
    const int tid = threadIdx.x;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();
    for ( int activeThreads = blockDim.x>>1;
          activeThreads;
          activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
        __syncthreads();
    }
    if ( tid == 0 ) {
        out[blockIdx.x] = sPartials[0];
    }
}
void
Reduction1( int *answer, int *partial,
            const int *in, size_t N,
            int numBlocks, int numThreads )
{
    unsigned int sharedSize = numThreads*sizeof(int);
    Reduction1_kernel<<<numBlocks, numThreads, sharedSize>>>(
        partial, in, N );
    Reduction1_kernel<<<1, numThreads, sharedSize>>>(
        answer, partial, numBlocks );
}

The shared memory array is used to accumulate the reduction within each thread block. Its size depends on the number of threads in the block, so it must be specified when the kernel is launched. Note: The number of threads in the block must be a power of 2!
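
Since nothing in the kernel enforces this requirement, a host-side check is a cheap safeguard. A minimal sketch follows; the isPow2() helper is an addition for illustration, not part of the book's code.

// Returns true if x is a nonzero power of 2. A power of 2 has
// exactly one bit set, so clearing its lowest set bit yields 0.
static bool
isPow2( unsigned int x )
{
    return x != 0 && (x & (x-1)) == 0;
}

// Example: validate the launch configuration before invoking
// Reduction1_kernel.
//     assert( isPow2( numThreads ) );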

The first for loop computes the thread's sum over the input array. If the input pointer is properly aligned, all of the memory transactions performed by this loop are coalesced, maximizing memory bandwidth: on each iteration, adjacent threads read adjacent input elements before striding ahead by the total thread count. Each thread then writes its accumulated sum to shared memory and synchronizes before starting the log-step reduction.

The second for loop performs a log-step reduction over the values in shared memory. The values in the upper half of the active portion of shared memory are added to the values in the lower half, and the number of participating threads is successively halved until the block's result is left in sPartials[0]. This part of the kernel is the one that requires that the thread block size be a power of 2.
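
For example, with a block of eight threads, activeThreads takes the values 4, 2, and 1:

// blockDim.x == 8: three iterations of the log-step loop.
// activeThreads == 4: sPartials[0..3] += sPartials[4..7]
// activeThreads == 2: sPartials[0..1] += sPartials[2..3]
// activeThreads == 1: sPartials[0]    += sPartials[1]
// sPartials[0] now holds the sum of all eight partial sums.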

Finally, the output value of the thread block is written to global memory. This kernel is intended to be invoked twice, as shown in the host function Reduction1() at the end of Listing 12.1: once with numBlocks blocks, where numBlocks is chosen for maximum performance in performing the reduction over the input array, and then with 1 block to accumulate the final output. Note that an array for the partial sums is allocated and passed in separately. Also note that since the kernel declares an unsized shared memory array, the amount of shared memory needed by the kernel must be specified as the third parameter in the <<< >>> syntax.
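
For instance, the partial-sums array needs one element per block. A sketch of the allocations (the buffer names are illustrative, and dIn is assumed to already hold the N input integers on the device):

int *dPartial, *dAnswer;
// One partial sum per block from the first pass ...
cudaMalloc( &dPartial, numBlocks*sizeof(int) );
// ... and a single int for the result of the second pass.
cudaMalloc( &dAnswer, sizeof(int) );
Reduction1( dAnswer, dPartial, dIn, N, numBlocks, numThreads );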

The CUDA SDK discusses several optimizations of this kernel that focus on reducing the amount of conditional code in the log-step reduction. Part of the for loop that performs the log-step reduction—the later part, when the thread count is 32 or fewer—can be implemented with warp-synchronous code. Since the warps in each thread block execute each instruction in lockstep, the __syncthreads() intrinsics are no longer needed when the number of active threads in a block drops below the hardware's warp size of 32. The resulting kernel, located in the reduction2.cu source code file, is shown in Listing 12.2.

IMPORTANT NOTE

When writing warp synchronous code, the volatile keyword must be used for the pointers into shared memory. Otherwise, the compiler may introduce optimizations that change the order of memory operations and the code will not work correctly.

Listing 12.2 Reduction with unrolled, warp-synchronous finish.

__global__ void
Reduction2_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    int sum = 0;
    const int tid = threadIdx.x;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();
    for ( int activeThreads = blockDim.x>>1;
          activeThreads > 32;
          activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
        __syncthreads();
    }
    if ( tid < 32 ) {
        volatile int *wsSum = sPartials;
        if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32];
        wsSum[tid] += wsSum[tid + 16];
        wsSum[tid] += wsSum[tid +  8];
        wsSum[tid] += wsSum[tid +  4];
        wsSum[tid] += wsSum[tid +  2];
        wsSum[tid] += wsSum[tid +  1];
        if ( tid == 0 ) {
            out[blockIdx.x] = wsSum[0];
        }
    }
}
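
Reduction2_kernel() takes the same arguments as Reduction1_kernel(), so its host wrapper can mirror Reduction1(). The actual host code from reduction2.cu is not reproduced here; the wrapper below is a sketch under that assumption.

// Sketch: host wrapper for Reduction2_kernel, mirroring Reduction1().
void
Reduction2( int *answer, int *partial,
            const int *in, size_t N,
            int numBlocks, int numThreads )
{
    unsigned int sharedSize = numThreads*sizeof(int);
    Reduction2_kernel<<<numBlocks, numThreads, sharedSize>>>(
        partial, in, N );
    Reduction2_kernel<<<1, numThreads, sharedSize>>>(
        answer, partial, numBlocks );
}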

The warp synchronous optimization can be taken a step further by lofting the thread count into a template parameter, enabling the log-step reduction to be unrolled completely. Listing 12.3 gives the complete optimized kernel. Following Mark Harris's reduction presentation,1 the conditionals that test the numThreads template parameter are evaluated at compile time, so only the branches needed for a given block size survive in the compiled kernel.

Listing 12.3 Templatized, fully unrolled log-step reduction.

template<unsigned int numThreads>
__global__ void
Reduction3_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    const unsigned int tid = threadIdx.x;
    int sum = 0;
    for ( size_t i = blockIdx.x*numThreads + tid;
          i < N;
          i += numThreads*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();
    if ( numThreads >= 1024 ) {
        if ( tid < 512 ) {
            sPartials[tid] += sPartials[tid + 512];
        }
        __syncthreads();
    }
    if ( numThreads >= 512 ) {
        if ( tid < 256 ) {
            sPartials[tid] += sPartials[tid + 256];
        }
        __syncthreads();
    }
    if ( numThreads >= 256 ) {
        if ( tid < 128 ) {
            sPartials[tid] += sPartials[tid + 128];
        }
        __syncthreads();
    }
    if ( numThreads >= 128 ) {
        if ( tid < 64 ) {
            sPartials[tid] += sPartials[tid + 64];
        }
        __syncthreads();
    }
    // warp synchronous at the end
    if ( tid < 32 ) {
        volatile int *wsSum = sPartials;
        if ( numThreads >= 64 ) { wsSum[tid] += wsSum[tid + 32]; }
        if ( numThreads >= 32 ) { wsSum[tid] += wsSum[tid + 16]; }
        if ( numThreads >= 16 ) { wsSum[tid] += wsSum[tid +  8]; }
        if ( numThreads >=  8 ) { wsSum[tid] += wsSum[tid +  4]; }
        if ( numThreads >=  4 ) { wsSum[tid] += wsSum[tid +  2]; }
        if ( numThreads >=  2 ) { wsSum[tid] += wsSum[tid +  1]; }
        if ( tid == 0 ) {
            out[blockIdx.x] = wsSum[0];
        }
    }
}

To instantiate the function template in Listing 12.3, it must be invoked explicitly from a separate host function. Listing 12.4 shows how Reduction3_kernel is invoked by another function template, and how the host function uses a switch statement on the block size to invoke the correct instantiation.

Listing 12.4 Template instantiations for unrolled reduction.

template<unsigned int numThreads>
void
Reduction3_template( int *answer, int *partial,
                     const int *in, size_t N,
                     int numBlocks )
{
    Reduction3_kernel<numThreads><<<
        numBlocks, numThreads, numThreads*sizeof(int)>>>(
            partial, in, N );
    Reduction3_kernel<numThreads><<<
        1, numThreads, numThreads*sizeof(int)>>>(
            answer, partial, numBlocks );
}

void
Reduction3( int *out, int *partial,
            const int *in, size_t N,
            int numBlocks, int numThreads )
{
    switch ( numThreads ) {
        case    1: return Reduction3_template<   1>( out, partial, in, N, numBlocks );
        case    2: return Reduction3_template<   2>( out, partial, in, N, numBlocks );
        case    4: return Reduction3_template<   4>( out, partial, in, N, numBlocks );
        case    8: return Reduction3_template<   8>( out, partial, in, N, numBlocks );
        case   16: return Reduction3_template<  16>( out, partial, in, N, numBlocks );
        case   32: return Reduction3_template<  32>( out, partial, in, N, numBlocks );
        case   64: return Reduction3_template<  64>( out, partial, in, N, numBlocks );
        case  128: return Reduction3_template< 128>( out, partial, in, N, numBlocks );
        case  256: return Reduction3_template< 256>( out, partial, in, N, numBlocks );
        case  512: return Reduction3_template< 512>( out, partial, in, N, numBlocks );
        case 1024: return Reduction3_template<1024>( out, partial, in, N, numBlocks );
    }
}
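
To put the pieces together, here is a minimal end-to-end sketch of calling Reduction3(). The buffer names, problem size, and launch configuration are illustrative assumptions, not values prescribed by the text.

#include <stdio.h>
#include <stdlib.h>

int
main()
{
    const size_t N = 1<<20;     // assumed problem size
    const int numBlocks  = 64;  // assumed tuning choices; any
    const int numThreads = 256; //   power-of-2 block size works

    // Host input: all ones, so the expected sum is N.
    int *hIn = (int *) malloc( N*sizeof(int) );
    for ( size_t i = 0; i < N; i++ ) hIn[i] = 1;

    int *dIn, *dPartial, *dAnswer;
    cudaMalloc( &dIn, N*sizeof(int) );
    cudaMalloc( &dPartial, numBlocks*sizeof(int) );
    cudaMalloc( &dAnswer, sizeof(int) );
    cudaMemcpy( dIn, hIn, N*sizeof(int), cudaMemcpyHostToDevice );

    Reduction3( dAnswer, dPartial, dIn, N, numBlocks, numThreads );

    int sum;
    cudaMemcpy( &sum, dAnswer, sizeof(int), cudaMemcpyDeviceToHost );
    printf( "sum = %d (expected %zu)\n", sum, N );

    cudaFree( dIn ); cudaFree( dPartial ); cudaFree( dAnswer );
    free( hIn );
    return 0;
}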