12.5_Arbitrary_Block_Sizes

12.5 Arbitrary Block Sizes

So far, all of the reduction implementations that use shared memory require the block size to be a power of 2. With a small amount of additional code, the reduction can be made to work on arbitrary block sizes. Listing 12.8 gives a kernel derived from the very first two-pass kernel given in Listing 12.1, modified to operate on any block size. The floorPow2 variable computes the power of 2 that is less than or equal to the block size, and the contribution from any threads above that power of 2 is added before continuing on to the loop that implements the log-step reduction.

Listing 12.8 Reduction (arbitrary block size) (reduction6AnyBlockSize.cuh).

__global__ void
Reduction6_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[k];
    int sum = 0;
    const int tid = threadIdx.x;
    for ( size_t i = BlockIdx.x*blockDim.x + tid;
        i < N;
            i += blockDim.x*gridDim.x ) {
                sum += in[i];
            }
    sPartials[tid] = sum;
    __syncthreads();
}
// start the shared memory loop on the next power of 2 less
// than the block size. If block size is not a power of 2,
// accumulate the intermediate sums in the remainder range.
int floorPow2 = blockDim.x;
if ( floorPow2 & (floorPow2-1)) {
    while ( floorPow2 & (floorPow2-1)) {
        floorPow2 &= floorPow2-1;
    }
}
if (tid >= floorPow2) {
    sPartials[tid - floorPow2] += sPartials[tid];
}
__syncthreads();
}  
for (int activeThreads = floorPow2>>1;
    activeThreads;
    activeThreads >= 1) {
        if (tid < activeThreads) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
    __syncthreads();
}  
if (tid == 0) {
    out[blockIdx.x] = sPartials[0];
}