13.6 流压缩

扫描的实现常作用于断定结果，即通过评估条件得到的真值（0或1）。正如在本章开头提及的，针对断定结果的排他性扫描可以用来实现流压缩。流压缩这类并行问题，只把一个输入数组中“感兴趣的”元素写到输出。对于断定值为1的那些“感兴趣的”元素，排他性扫描计算出该元素的输出索引。

例如，我们写一个接受int型数组并输出那些奇数值元素的扫描。[1] 我们的实现基于Merrill的先归约再扫描方案，并使用一个固定数目的线程块数b。

1）输入数据的第一遍归约给出每个大小为 $\lceil N / b\rceil$ 的子数组中满足条件的元素数。
2）在大小为b的数组上进行扫描操作，得到每个子数组的输出的基索引。
3）在输入数组上执行扫描，评估条件并使用“种子”值作为每个子数组输出的基索引。

代码清单13-22显示了步骤1的代码：predicateReduceSubarraysOdd()函数调用子程序

predicateReduceSubarrayOdd()和isOdd()来计算每个数组元素的断定值，计算归约值并把它写到基本和的数组中。

代码清单13-22 predicateReduceSubarraysOdd

template<class T> __host__device bool isOdd(T x) {
    return x & 1;
}
template<class T, int numThreads> __device__void predicateReduceSubarrayOdd(
    int *gPartials,
    const T *in,
    size_t iBlock,
    size_t N,
    int elementsPerPartial()
{
    extern volatile __shared__int sPartials();
    const int tid = threadIdx.x;
    size_t baseIndex = iBlock*elementsPerPartial;
    int sum = 0;
    for (int i = tid; i < elementsPerPartial; i += blockDim.x) {
        size_t index = baseIndex + i;
        if (index < N)
            sum += isOdd(in[index]);
    }
    sPartials[tid] = sum;
    __synchreads();
    reduceBlock<int,numThreads>(&gPartials[iBlock], sPartials);
}
/* 
* Compute the reductions of each subarray of size 
*elementsPerPartial, and write them to gPartials. */
template<class T, int numThreads> __global__void predicateReduceSubarraysOdd(
    int *gPartials,
    const T *in,
    size_t N,
    int elementsPerPartial()
{
    extern volatile __shared__int sPartialsksen;
    for (int iBlock = blockIdx.x;
        iBlock*elementsPerPartial < N;
        iBlock += gridDim.x)
    {
        return x & 1;
    }
}

predicateReduceSubarrayOdd<T,numThreads>(gPartials, in, iBlock, N, elementsPerPartial); }

调用代码清单13-23中的内核来计算基本和数组的扫描值。做完此步，每个基本和元素的位置是相应线程块输出数组的起始索引，它的值代表输入数组在此块之前断定为真的元素数目。

代码清单13-23 streamCompactOdd内核

template<class T, bool bZeroPad> global __void streamCompactOdd()
{
    T *out,
        int *outCount,
        const int *gBaseSums,
        const T *in,
        size_t N,
        size_t elementsPerPartial()
{
            extern volatile __shared__int sPartials();
            const int tid = threadIdx.x;
            int sIndex = scanSharedIndex<bZeroPad>(threadIdx.x);
        if (bZeroPad) {
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>
            template<class class>

sPartials[sIndex-16] = 0;  
}  
// exclusive scan element gBaseSums[blockIdx.x]  
int base_sum = 0;  
if (blockIdx.x && gBaseSums) {  
    base_sum = gBaseSums[blockIdx.x-1];  
}  
for (size_t i = 0;  
    i < elementsPerPartial;  
    i += blockDim.x) {  
        size_t index =(blockIdx.x*elementsPerPartial + i + tid;  
        int value = (index < N) ? in[index] : 0;  
        sPartials[sIndex] = (index < N) ? isOdd(value) : 0;  
        __syncthreads();  
        scanBlock<int,bZeroPad>(sPartials+sIndex);  
        __syncthreads();  
        if (index < N && isOdd(value)) {  
            int outIndex = base_sum;  
            if (tid) {  
                outIndex += sPartials[scanSharedIndex<bZeroPad>(tid-1)];  
            }  
            out[outIndex] = value;  
        }  
        __syncthreads();  
// carry forward from this block to the next.  
{  
    intinx = scanSharedIndex<bZeroPad>(blockDim.x-1);  
    base_sum += sPartials[inx];  
}  
__syncthreads();  
}  
if (threadIdx.x == 0 && blockIdx.x == 0) {  
    if (gBaseSums) {  
        *outCount = gBaseSums[gridDim.x-1];  
    }  
    else {  
        intinx = scanSharedIndex<bZeroPad>(blockDim.x-1);  
        *outCount = sPartials[inx];  
}

代码清单13-23显示了步骤3的代码，它接受输入数组以及基本和数据作为参数，计算每个输入数组元素的断定值，并在该元素断定值为真时，把此元素写入到正确索引的输出元素。主机代码类似于代码清单13-12，它们仅有轻微变化，这里不再显示。

[1] 该代码很容易进行修改，以处理更复杂的断定。

13.6_流压缩

13.6 流压缩