
13.4 CUDA Implementations

Designing Scan algorithms and studying circuit diagrams is instructive, but to implement Scan in CUDA, we must map the algorithms onto registers, memory and addressing schemes, and correct synchronization. The optimal CUDA implementation of Scan depends on the size of the scan being performed: different schemes are best for warp-sized scans, for scans that fit in shared memory, and for scans that must spill to global memory. Because blocks cannot reliably exchange data through global memory, scans too large to fit in shared memory must perform multiple kernel invocations.
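The warp-sized building block mentioned above can be sketched as follows. This is a minimal illustration, not the book's implementation: it assumes an inclusive scan of `int` values, a full 32-thread warp, and the `__shfl_up_sync` warp-shuffle intrinsic (available on compute capability 3.0 and later), which lets lanes exchange register values without touching shared memory.

```cuda
// Hedged sketch: inclusive Kogge-Stone scan across one 32-thread warp.
// Each lane starts with its own value x and ends with the inclusive
// prefix sum of lanes 0..lane. Assumes all 32 lanes are active.
__device__ int warpInclusiveScan(int x)
{
    const unsigned fullMask = 0xffffffffu;
    int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1) {
        // Pull the value from the lane `offset` positions below.
        int y = __shfl_up_sync(fullMask, x, offset);
        if (lane >= offset)  // lower lanes have no neighbor to pull from
            x += y;
    }
    return x;  // lane 31 now holds the warp's total reduction
}
```

Because the exchange stays in registers, no shared memory or `__syncthreads()` is needed at this level; warp-synchronous execution plus the sync mask in the intrinsic provides the ordering.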

Before examining special cases (such as scanning of predicates), we will examine three approaches to implementing Scan in CUDA.

  • Scan-then-fan (recursive)

  • Reduce-then-scan (recursive)

  • Two-level reduce-then-scan

13.4.1 SCAN-THEN-FAN

The scan-then-fan approach uses a similar decomposition for global and shared memory. Figure 13.12 shows the approach used to scan a threadblock: A scan is performed on each 32-thread warp, and the reduction of that 32-element sub-array is written to shared memory. A single warp then scans the array of partial sums. A single warp is sufficient because CUDA does not support threadblocks with more than 1024 threads. Finally, the base sums are fanned out to each warp's output elements. Note that Figure 13.12 shows an inclusive scan being performed in step 2, so the first element of its output must be fanned out to the second warp, and so on.
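The three steps described above (per-warp scan, scan of partial sums by a single warp, then fan-out) can be sketched as a single-block kernel. This is an illustrative sketch only, with assumed names and simplifications: it handles one block of up to 1024 `int` elements, uses warp shuffles for the intra-warp scans, and assumes `blockDim.x` is a multiple of 32.

```cuda
// Hedged sketch of scan-then-fan for one threadblock (names assumed).
__global__ void blockScanThenFan(const int *in, int *out, int n)
{
    __shared__ int partials[32];   // one partial sum per warp (<= 32 warps)
    const unsigned fullMask = 0xffffffffu;
    int tid  = threadIdx.x;
    int lane = tid & 31;
    int warp = tid >> 5;

    int x = (tid < n) ? in[tid] : 0;

    // Step 1: inclusive scan within each 32-thread warp.
    for (int offset = 1; offset < 32; offset <<= 1) {
        int y = __shfl_up_sync(fullMask, x, offset);
        if (lane >= offset) x += y;
    }

    // The last lane of each warp holds that warp's reduction.
    if (lane == 31) partials[warp] = x;
    __syncthreads();

    // Step 2: a single warp scans the array of per-warp partial sums.
    if (warp == 0) {
        int numWarps = blockDim.x >> 5;
        int p = (lane < numWarps) ? partials[lane] : 0;
        for (int offset = 1; offset < 32; offset <<= 1) {
            int y = __shfl_up_sync(fullMask, p, offset);
            if (lane >= offset) p += y;
        }
        partials[lane] = p;
    }
    __syncthreads();

    // Step 3: fan out the base sums. Because step 2 produced an
    // inclusive scan, warp i adds element i-1 of its output, so the
    // first element fans out to the second warp, and so on.
    if (warp > 0) x += partials[warp - 1];
    if (tid < n) out[tid] = x;
}
```

Note how the inclusive scan in step 2 dictates the off-by-one indexing in step 3, exactly as described for Figure 13.12.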