13.5_Warp_Scans
13.5 Warp Scans
So far, we've focused on constructing our Scan implementations from the top down. At the bottom of all three of our Scan implementations, however, lurks an entirely different software approach to Scan. For subarrays of size 32 or less, we use a special warp scan modeled on the Kogge-Stone circuit (Figure 13.11). Kogge-Stone circuits are work-inefficient, meaning they perform many operations despite their small depth, but at the warp level, where execution resources of CUDA hardware are available whether or not the developer uses them, Kogge-Stone works well on CUDA hardware.
Listing 13.13 gives a device routine that is designed to operate on shared memory, the fastest way for threads to exchange data with one another. Because there are no shared memory conflicts and the routine executes at warp granularity, no thread synchronization is needed during updates to the shared memory.
Listing 13.13 scanWarp.
template<class T> inline __device__T scanWarp( volatile T *sPartials) {const int tid = threadIdx.x; const int lane $=$ tid & 31; if ( lane $\geqslant$ 1)sPartials[0] $+ =$ sPartials[-1]; if ( lane $\geqslant$ 2)sPartials[0] $+ =$ sPartials[-2]; if ( lane $\geqslant$ 4)sPartials[0] $+ =$ sPartials[-4]; if ( lane $\geqslant$ 8)sPartials[0] $+ =$ sPartials[-8]; if ( lane $\geqslant$ 16)sPartials[0] $+ =$ sPartials[-16]; return sPartials[0];13.5.1 ZERO PADDING
We can reduce the number of machine instructions needed to implement the warp scan by interleaving the warps' data with 16-element arrays of 0's, enabling the conditionals to be removed. Listing 13.14 gives a version of scanWarp that assumes 16 zero elements preceding the base address in shared memory.
Listing 13.14 scanWarp0.
template<class T> device_T scanWarp0( volatile T *sharedPartials, int idx) {
const int tid = threadIdx.x;
const int lane = tid & 31;
sharedPartials[idx] += sharedPartials[idx - 1];
sharedPartials[idx] += sharedPartials[idx - 2];
sharedPartials[idx] += sharedPartials[idx - 4];
sharedPartials[idx] += sharedPartials[idx - 8];
sharedPartials[idx] += sharedPartials[idx - 16];
return sharedPartials[idx];
}