12.5 Reduction with Arbitrary Block Sizes

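So far, every reduction implementation that uses shared memory has required the thread block size to be a power of 2. With a small amount of additional code, these reductions can be made to work on arbitrary block sizes. Listing 12-8 gives a kernel derived from the first two-pass kernel in Listing 12-1, modified to operate on any block size. The floorPow2 variable holds the largest power of 2 that is less than or equal to the block size. Before the log-step reduction loop runs, the contributions from threads whose index is at or beyond that power of 2 are added into the partial sums below it.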
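The floorPow2 computation can be sketched on its own as a small host-side function. This is only an illustration: the function name floorPow2 as a standalone helper is an assumption made here (in the listing it is a local variable), and the sketch assumes the argument is at least 1. It relies on the fact that n & (n-1) clears the lowest set bit of n, so repeating that step until only one bit remains leaves the highest set bit.

```cpp
// Hypothetical helper, not part of the listing: returns the largest power
// of 2 that is less than or equal to n. Assumes n >= 1.
inline unsigned floorPow2( unsigned n )
{
    // n & (n-1) is nonzero exactly when more than one bit is set.
    while ( n & (n-1) ) {
        n &= n-1;   // clear the lowest set bit
    }
    return n;
}
```

For a block of 192 threads this yields 128, so threads 128 through 191 form the remainder range. For a block size that is already a power of 2, such as 256, the value is returned unchanged and there is no remainder range.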

Listing 12-8 Reduction with arbitrary block sizes

(reduction6AnyBlockSize.cuh)

__global__ void
Reduction6_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    int sum = 0;
    const int tid = threadIdx.x;
    // Grid-stride loop: each thread accumulates a private partial sum.
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();

    // Start the shared memory loop on the largest power of 2 less
    // than or equal to the block size. If the block size is not a
    // power of 2, accumulate the intermediate sums in the remainder
    // range.
    int floorPow2 = blockDim.x;
    if ( floorPow2 & (floorPow2-1) ) {
        while ( floorPow2 & (floorPow2-1) ) {
            floorPow2 &= floorPow2-1;
        }
        if ( tid >= floorPow2 ) {
            sPartials[tid - floorPow2] += sPartials[tid];
        }
        __syncthreads();
    }

    // Log-step reduction over the power-of-2 range.
    for ( int activeThreads = floorPow2>>1;
          activeThreads;
          activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
        __syncthreads();
    }

    if ( tid == 0 ) {
        out[blockIdx.x] = sPartials[0];
    }
}
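To see why folding the remainder range first is enough, the shared memory portion of the kernel can be modeled sequentially on the host. The sketch below is an assumption-laden illustration, not part of the listing: reduceBlockOnHost is a hypothetical name, the vector stands in for sPartials, and each loop over tid stands in for the threads that run between two __syncthreads() calls. The sequential order is safe because, in each step, the elements being read and the elements being written never overlap.

```cpp
#include <vector>

// Hypothetical host-side model of the in-block reduction. partials[tid]
// plays the role of sPartials[tid]; its size plays the role of blockDim.x.
inline int reduceBlockOnHost( std::vector<int> partials )
{
    if ( partials.empty() ) {
        return 0;
    }
    const unsigned blockSize = static_cast<unsigned>( partials.size() );

    // Largest power of 2 less than or equal to the block size.
    unsigned floorPow2 = blockSize;
    while ( floorPow2 & (floorPow2-1) ) {
        floorPow2 &= floorPow2-1;
    }

    // Fold the remainder range [floorPow2, blockSize) onto the low end.
    // Since blockSize < 2*floorPow2, the targets all lie below floorPow2.
    for ( unsigned tid = floorPow2; tid < blockSize; tid++ ) {
        partials[tid - floorPow2] += partials[tid];
    }

    // Log-step reduction over the remaining power-of-2 range.
    for ( unsigned active = floorPow2>>1; active; active >>= 1 ) {
        for ( unsigned tid = 0; tid < active; tid++ ) {
            partials[tid] += partials[tid+active];
        }
    }
    return partials[0];
}
```

With three partial sums 1, 2, and 3, floorPow2 is 2, the fold adds element 2 into element 0, and a single log step then adds element 1, leaving the total in element 0.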
