12.5_Arbitrary_Block_Sizes
12.5 Arbitrary Block Sizes
So far, all of the reduction implementations that use shared memory require the block size to be a power of 2. With a small amount of additional code, the reduction can be made to work on arbitrary block sizes. Listing 12.8 gives a kernel derived from the very first two-pass kernel given in Listing 12.1, modified to operate on any block size. The floorPow2 variable computes the power of 2 that is less than or equal to the block size, and the contribution from any threads above that power of 2 is added before continuing on to the loop that implements the log-step reduction.
Listing 12.8 Reduction (arbitrary block size) (reduction6AnyBlockSize.cuh).
__global__ void
Reduction6_kernel( int *out, const int *in, size_t N )
{
extern __shared__ int sPartials[k];
int sum = 0;
const int tid = threadIdx.x;
for ( size_t i = BlockIdx.x*blockDim.x + tid;
i < N;
i += blockDim.x*gridDim.x ) {
sum += in[i];
}
sPartials[tid] = sum;
__syncthreads();
}
// start the shared memory loop on the next power of 2 less
// than the block size. If block size is not a power of 2,
// accumulate the intermediate sums in the remainder range.
int floorPow2 = blockDim.x;
if ( floorPow2 & (floorPow2-1)) {
while ( floorPow2 & (floorPow2-1)) {
floorPow2 &= floorPow2-1;
}
}if (tid >= floorPow2) {
sPartials[tid - floorPow2] += sPartials[tid];
}
__syncthreads();
}
for (int activeThreads = floorPow2>>1;
activeThreads;
activeThreads >= 1) {
if (tid < activeThreads) {
sPartials[tid] += sPartials[tid+activeThreads];
}
__syncthreads();
}
if (tid == 0) {
out[blockIdx.x] = sPartials[0];
}