12.3 单遍归约

两遍归约的做法是针对CUDA线程块无法同步这一问题的部分解决方法。在缺少块间同步机制的条件下，为了确定最终输出的处理何时可以开始，需要调用第二个内核。

使用原子操作和共享内存的组合可避免第二个内核调用，就像 CUDA SDK的threadfenceReduction示例那样。使用一个设备内存位置跟踪哪个线程块已经写完自己的部分和。一旦所有块都完成后，一个块进行最后的对数步长的归约，将输出写回。

由于该内核执行多次来自共享内存的对数步长的归约，代码清单12-3中根据模板化的线程数进行条件求和的代码被提取出来，放入一个单独的设备函数，以便于重用。

Reduction4_LogStepShared()函数如代码清单12-5所示，负责写回线程块的归约值，其部分和由partials提供给由out指定的内存位置。代码清单12-6给出了使用Reduction4_LogStepShared()作为子程序的单遍归约算法。

template<unsigned int numThreads> 
    device__void
Reduction4_LogStepShared( int *out, volatile int *partials )
{
    const int tid = threadIdx.x;
    if (numThreads >= 1024) {
        if (tid < 512) {
            // partials[tid] += partials[tid + 512];
        }
        __syncthreads();
    }
    if (numThreads >= 512) {
        if (tid < 256) {
            // partials[tid] += partials[tid + 256];
        }
        __syncthreads();
    }
    if (numThreads >= 256) {
        if (tid < 128) {
            // partials[tid] += partials[tid + 128];
        }
        __syncthreads();
    }
    if (numThreads >= 128) {
        if (tid < 64) {
            // partials[tid] += partials[tid + 64];
        }
        __syncthreads();
    }
}
// warp synchronous at the end
if (tid < 32) {
    if (numThreads >= 64) {
        // partials[tid] += partials[tid + 32];
    }
    if (numThreads >= 32) {
        // partials[tid] += partials[tid + 16];
    }
    if (numThreads >= 16) {
        // partials[tid] += partials[tid + 8];
    }
    if (numThreads >= 8) {
        // partials[tid] += partials[tid + 4];
    }
    if (numThreads >= 4) {
        // partials[tid] += partials[tid + 2];
    }
    if (numThreads >= 2) {
        // partials[tid] += partials[tid + 1];
    }
    if (tid == 0) {
        *out = partials[0];
    }
}

代码清单12-6 单遍归约内核（reduction4SinglePass.cuh）

// Global variable used by reduceSinglePass to count blocks
device__ unsigned int retirementCount = 0;
template <unsigned int numThreads>
global _ void
reduceSinglePass( int *out, int *partial,
					 const int *in, unsigned int N )
\{
	external _shared _ int sPartials[ ] ;
		signed int tid = threadIdx.x;
	int sum = 0 ;
	for ( size_t i = blockIdx.x*numThreads + tid;
		i < N;
		i += numThreads*gridDim.x ) \{
		sum += in[i];
	\}
	sPartials[tid] = sum;
	sync threads(   );
	if (gridDim.x == 1) \{
		Reduction4_LogStepShared<numThreads>( &out [blockIdx.x],
									sPartials);
		return;
	\}
	Reduction4_LogStepShared<numThreads>( &partial [blockIdx.x],
									sPartials);
	_shared bool lastBlock;
	// wait for outstanding memory instructions in this thread
_threadface(   );
	// Thread 0 takes a ticket
	if( tid==0 ) \{
		unsigned int ticket = atomicAdd(&retirementCount, 1);
		// If the ticket ID is equal to the number of blocks,
			// we are the last block!
		_lastBlock = (ticket == gridDim.x-1);
		\}
	sync threads(   );
	// One block performs the final log-step reduction
	if( lastBlock ) \{
		int sum = 0 ;
		for ( size_t i = tid;
			i < gridDim.x;
			i += numThreads ) \{
			_sum += partial[i];
			\}
			sPartials [threadIdx.x] = sum;
			sync threads(   );
			Reduction4_LogStepShared<numThreads>( out, sPartials);
		-retirementCount = 0;
	\}
\}

内核始于熟悉的代码，每个线程计算整个输入数组中的部分归约并将结果写入共享内存。一旦完成上述操作，单个线程块的情况是经过特殊处理的，因为针对共享内存的对数步长的归约输出可以直接写

回而不被写入部分和数组。该内核的其余部分只能在多线程块的内核上执行。

共享的布尔变量lastBlock是用来评估必须传达给最后一个块的所有线程的一个条件。该__threadfence()导致块中的所有线程都等待，直到任何挂起的内存事务已写回到设备内存。当__threadfence()被执行时，写入全局内存的操作不仅仅可见于被调用线程或者块内线程，而是可见于所有线程。

当每个线程块退出时，它执行一个atomicAdd()来检查自己是否是那个需要执行最终对数步长的归约的线程块。由于atomicAdd()返回内存位置上的前一个值，那个递增retirementCount并能够得到等于gridDim.x-1的值的块为“最后一个线程块”[1]，它要进行最终的归约。lastBlock这一共享内存位置把该结果传达给该线程块的所有线程，随后必须调用__syncreads()，这样写回lastBlock的操作将可见于该块的所有线程。最后一个线程块执行最终的针对部分和的对数步长的归约，并将结果写回。最后，将retirementCount重新置为0，以便于后续reduceSinglePass()的调用。

[1] 原文误为“最后一个线程”。——译者注

12.3_单遍归约

12.3 单遍归约