12.6 Reduction Using Arbitrary Data Types

So far, we have only developed reduction kernels that can compute the sum of an array of integers. To generalize these kernels to perform a broader set of operations, we turn to C++ templates. With the exception of the algorithms that use atomics, all of the kernels that have appeared so far can be adapted to use templates. In the source code accompanying the book, they are in the CUDA headers reduction1Templated.cuh, reduction2Templated.cuh, and so on. Listing 12.9 gives the templated version of the reduction kernel from Listing 12.1.

Listing 12.9 Templated reduction kernel.

template<typename ReductionType, typename T> __global__ void Reduction Templated( ReductionType *out, const T *in, size_t N) {
    SharedMemory<ReductionType> sPartials;
    ReductionType sum;
    const int tid = threadIdx.x;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
        i < N;
            i += blockDim.x*gridDim.x ) {
                sum += in[i];
            }
}

sPartials [tid] = sum;  
__syncthreads();  
for ( int activeThreads = blockDim.x>>1;  
    activeThreads;  
    activeThreads >= 1) {  
        if ( tid < activeThreads) {  
            sPartials [tid] += sPartials [tid + activeThreads];  
        }  
    __syncthreads();  
}  
if ( tid == 0) {  
    out[blockIdx.x] = sPartials[0];  
}

Note that since we want to be able to compute a variety of output types for a given type of input (for example, we would like to build kernels that compute any combination of the minimum, maximum, sum, or the sum of squares of an array of integers), we've used two different template parameters: T is the type being reduced, and ReductionType is the type used for partial sums and for the final result.

The first few lines of code use the += operator to "rake" through the input, accumulating a partial sum for each thread in the block. Execution then proceeds exactly as in Listing 12.1, except that the code is operating on ReductionType instead of int. To avoid alignment-related compilation errors, this kernel uses an idiom from the CUDA SDK to declare the variable-sized shared memory.

template<class T> struct SharedMemory
{
    __device__ inline operator T*()
    {
        extern __shared__ int __mem[]();
        return (T*) (void *) __mem;
    }
    __device__ inline operator const T*( const
        {
            extern __shared__ int __mem[]();
            return (T*)(void *) __mem;
        }
    };
};

Listing 12.10 shows an example of a class intended to be used with templatd reduction functions such as Reduction_template. This class computes both the sum and the sum of squares of an array of integers.4 Besides defining operator $+=$ , a specialization of the SharedMemory template must be declared; otherwise, the compiler will generate the following error.

Error: Unaligned memory accesses not supported

The reductionTemplated.cu program in the accompanying source code shows how the function templates from the CUDA headers can be invoked.

Reduction1<CReduction_Sumi_isq, int>(...);

Listing 12.10 CReduction_Sumi_isq class.

struct CReduction_Sumi_isq {
public:
    CReduction_Sumi_isq();
    int sum;
    long long sumsq;
    CReduction_Sumi_isq& operator += (int a);
    volatile CReduction_Sumi_isq& operator += (int a) volatile;
    CReduction_Sumi_isq& operator += (const CReduction_Sumi_isq& a);
    volatile CReduction_Sumi_isq& operator += (
        volatile CReduction_Sumi_isq& a) volatile;
};
inline __device__host_
CReduction_Sumi_isq::CReduction_Sumi_isq()
{
    sum = 0;
    sumsq = 0;
}
inline __device__host_
CReduction_Sumi_isq&
CReduction_Sumi_isq::operator += (int a)
{
    sum += a;
    sumsq += (long long) a*a;
    return *this;
}

{ sum += a; sumsq += (long long) a*a; return *this;   
}   
inline _device _host_ CReduction_Sumi_isq& CReduction_Sumi_isq::operator  $+=$  (const CReduction_Sumi_isq& a) { sum += a-sum; sumsq += a.sumsq; return \*this;   
}   
inline _device _host_ volatile CReduction_Sumi_isq& CReduction_Sumi_isq::operator  $+=$  ( volatile CReduction_Sumi_isq& a ) volatile { sum += a-sum; sumsq += a.sumsq; return \*this;   
}   
inline int operator!=( const CReduction_Sumi_isq& a, const CReduction_Sumi_isq& b) { return a-sum != b-sum && a.sumsq != b.sumsq;   
}   
// from Reduction SDK sample: // specialize to avoid unaligned memory // access compile errors   
// template<>   
struct SharedMemory<CReduction_Sumi_isq> { device__inline operator CReduction_Sumi_isq\*() { extern shared CReduction_Sumi_isq __smem_CReduction_Sumi_isq[]; return (CReduction_Sumi_isq\*) smem_CReduction_Sumi_isq; } device__inline operator const CReduction_Sumi_isq\* () cons { extern shared CReduction_Sumi_isq __smem_CReduction_Sumi_isq[]; return (CReduction_Sumi_isq\*) smem_CReduction_Sumi_isq;   
}

12.6_Reduction_Using_Arbitrary_Data_Types

12.6 Reduction Using Arbitrary Data Types