12.7 Predicate Reduction
Predicates, or truth values (true/false), can be represented compactly, since each predicate occupies only 1 bit. In SM 2.0, NVIDIA added a number of instructions to make predicate manipulation more efficient. The __ballot() and __popc() intrinsics can be used for warp-level reduction, and the __syncthreads_count() intrinsic can be used for block-level reduction.
int __ballot(int p);
__ballot() evaluates a condition for all threads in the warp and returns a 32-bit word, where each bit gives the condition for the corresponding thread in the warp. Since __ballot() broadcasts its result to every thread in the warp, it is effectively a reduction across the warp. Any thread that wants to count the number of threads in the warp for which the condition was true can pass that word to the __popc() intrinsic
int __popc(int i);
which returns the number of set bits in the input word.
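The warp-level count described above can be sketched as follows. This is an illustration under stated assumptions, not code from the text: the kernel name countPositivePerWarp, the helper warpCountPositive, and the "element is positive" predicate are hypothetical. It also uses __ballot_sync() with a full-warp mask, the form that current CUDA toolkits require in place of the legacy __ballot(), and it assumes the block size is a multiple of 32 so that every lane of each warp is active.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each warp counts how many of its 32 input elements
// are positive; lane 0 of the warp writes that count to warpCounts.
// Assumes blockDim.x is a multiple of 32 and that `in` holds one element
// per launched thread, so all 32 lanes of every warp are active.
__global__ void countPositivePerWarp(const int *in, int *warpCounts)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int p = in[tid] > 0;                                // 1-bit predicate
    // Bit i of `votes` is set if lane i's predicate was true; every lane
    // in the warp receives the same word.
    unsigned int votes = __ballot_sync(0xffffffffu, p);
    int count = __popc(votes);                          // number of set bits
    if ((threadIdx.x & 31) == 0) {
        warpCounts[tid / 32] = count;
    }
}

// Hypothetical host helper: runs a single warp over 32 host values and
// returns the number of positive values, or -1 if a CUDA call fails.
int warpCountPositive(const int *hostIn)
{
    int *dIn = nullptr;
    int *dOut = nullptr;
    int result = -1;
    if (cudaMalloc(&dIn, 32 * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, sizeof(int)) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }
    if (cudaMemcpy(dIn, hostIn, 32 * sizeof(int),
                   cudaMemcpyHostToDevice) == cudaSuccess) {
        countPositivePerWarp<<<1, 32>>>(dIn, dOut);
        // The blocking copy also waits for the kernel to finish.
        if (cudaMemcpy(&result, dOut, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess) {
            result = -1;
        }
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Only lane 0 writes the count here, but because the ballot result is broadcast, any lane could use it without further communication.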
SM 2.0 also introduced __syncthreads_count().
int __syncthreads_count(int p);
This intrinsic waits until all warps in the thread block have arrived, then broadcasts to all threads in the block the number of threads for which the input condition was true.
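A minimal sketch of a block-level count built on __syncthreads_count() might look like the following. The kernel name countPositive, the helper countPositiveOnDevice, the predicate, the block size of 256, and the use of atomicAdd() to combine per-block counts into a grid-wide total are all assumptions made for illustration; the text does not prescribe how block results are combined. Because the intrinsic is a barrier, out-of-range threads vote false instead of returning early.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: counts the positive elements of in[0..n-1].
// Each block reduces its predicates with __syncthreads_count(), then one
// thread per block adds the block's count to a global total.
__global__ void countPositive(const int *in, int n, unsigned int *total)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread in the block must reach the barrier, so threads past
    // the end of the array contribute a false predicate.
    int p = (tid < n) && (in[tid] > 0);
    int blockCount = __syncthreads_count(p);
    if (threadIdx.x == 0) {
        atomicAdd(total, (unsigned int)blockCount);
    }
}

// Hypothetical host helper: returns the number of positive values in
// hostIn[0..n-1], or -1 if a CUDA call fails.
int countPositiveOnDevice(const int *hostIn, int n)
{
    const int threads = 256;                  // assumed block size
    int blocks = (n + threads - 1) / threads;
    int *dIn = nullptr;
    unsigned int *dTotal = nullptr;
    unsigned int hTotal = 0;
    int result = -1;
    if (cudaMalloc(&dIn, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&dTotal, sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }
    if (cudaMemcpy(dIn, hostIn, n * sizeof(int),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(dTotal, 0, sizeof(unsigned int)) == cudaSuccess) {
        countPositive<<<blocks, threads>>>(dIn, n, dTotal);
        // The blocking copy also waits for the kernel to finish.
        if (cudaMemcpy(&hTotal, dTotal, sizeof(unsigned int),
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            result = (int)hTotal;
        }
    }
    cudaFree(dIn);
    cudaFree(dTotal);
    return result;
}
```

No shared memory is declared in this sketch: the per-thread predicates are reduced entirely by the intrinsic.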
Since the 1-bit predicates immediately turn into 5-bit values after a warp-level reduction, and into 9- or 10-bit values after a block-level reduction, these intrinsics only serve to reduce the amount of shared memory needed for the lowest-level evaluation and reduction. Still, they greatly amplify the number of elements that can be considered by a single thread block.