
15.5 Further Optimizations

Two more optimizations are implemented in the sample source code: SM-aware kernel invocation (since SM 1.x has different instruction set support for multiplication, which is in the innermost loop of this computation) and an unrolled inner loop of the kernel.

15.5.1 SM-AWARE CODING

SM 1.x hardware uses a 24-bit multiplier (plenty wide enough to perform the multiplications in the inner loop of this computation), whereas SM 2.x and SM 3.x hardware use 32-bit multipliers. Sometimes the compiler can detect when the participating integers are narrow enough that it can use the 24-bit multiply on SM 1.x-class hardware, but that does not seem to be the case for corrShared_kernel(). To work around the issue, we can use a template on the kernel declaration.

template <bool bSM1>
__global__ void
corrSharedSM_kernel( ... )

The inner loop of the kernel then becomes

for ( int j = 0; j < hTemplate; j++ ) {
    for ( int i = 0; i < wTemplate; i++ ) {
        unsigned char I = LocalBlock[SharedIdx+i];
        unsigned char T = g_Tpix[idx++];
        SumI += I;
        if ( bSM1 ) {
            SumISq += __umul24( I, I );
            SumIT += __umul24( I, T );
        }
        else {
            SumISq += I*I;
            SumIT += I*T;
        }
    }
    SharedIdx += SharedPitch;
}

And the host function that invokes the kernel must detect whether the device is SM 1.x and, if so, invoke the kernel with bSM1=true. In the sample source code, this implementation is given in the corrSharedSM.cuh and corrSharedSMSums.cuh header files.
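A minimal sketch of that host-side dispatch follows, assuming a device index deviceId and a launch configuration grid/block (hypothetical names), with the kernel parameters elided as in the listings above; the actual implementation is in corrSharedSM.cuh.

// Query the device's compute capability and select the matching
// template instantiation. Because bSM1 is a template parameter, it
// must be a compile-time constant, so this runtime check chooses
// between two separately compiled versions of the kernel.
cudaDeviceProp props;
cudaGetDeviceProperties( &props, deviceId );
if ( props.major == 1 ) {
    corrSharedSM_kernel<true><<<grid, block>>>( ... );
}
else {
    corrSharedSM_kernel<false><<<grid, block>>>( ... );
}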

15.5.2 LOOP UNROLLING

Since each thread is accessing adjacent bytes in shared memory, the innermost loop of these kernels generates 4-way bank conflicts on SM 1.x-class hardware. If we rewrite

for ( int j = 0; j < hTemplate; j++ ) {
    for ( int i = 0; i < wTemplate; i++ ) {
        unsigned char I = LocalBlock[SharedIdx+i];
        unsigned char T = g_Tpix[idx++];
        SumI += I;
        SumISq += I*I;
        SumIT += I*T;
    }
    SharedIdx += SharedPitch;
}

as follows
for ( int j = 0; j < hTemplate; j++ ) {
    for ( int i = 0; i < wTemplate/4; i++ ) {
        corrSharedAccumulate<bSM1>( SumI, SumISq, SumIT,
                                    LocalBlock[SharedIdx+i*4+0], g_Tpix[idx++] );
        corrSharedAccumulate<bSM1>( SumI, SumISq, SumIT,
                                    LocalBlock[SharedIdx+i*4+1], g_Tpix[idx++] );
        corrSharedAccumulate<bSM1>( SumI, SumISq, SumIT,
                                    LocalBlock[SharedIdx+i*4+2], g_Tpix[idx++] );
        corrSharedAccumulate<bSM1>( SumI, SumISq, SumIT,
                                    LocalBlock[SharedIdx+i*4+3], g_Tpix[idx++] );
    }
    SharedIdx += SharedPitch;
}
where the corrSharedAccumulate() function encapsulates the template parameter bSM1   
template <bool bSM1>
__device__ void
corrSharedAccumulate( int& SumI, int& SumISq, int& SumIT,
                      unsigned char I, unsigned char T )
{
    SumI += I;
    if ( bSM1 ) {
        SumISq += __umul24( I, I );
        SumIT += __umul24( I, T );
    }
    else {
        SumISq += I*I;
        SumIT += I*T;
    }
}

Although the primary motivation is to decrease bank conflicts due to byte reads—an effect that only occurs on SM 1.x hardware—the resulting kernel is faster on all CUDA hardware.
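One way to confirm that claim on a given device is to bracket the kernel launch with CUDA events; the sketch below assumes the same grid, block, and elided kernel parameters as the earlier listings.

// Time a single kernel launch with CUDA events. cudaEventElapsedTime
// reports the elapsed time between the two recorded events in
// milliseconds.
cudaEvent_t evStart, evStop;
float ms;
cudaEventCreate( &evStart );
cudaEventCreate( &evStop );
cudaEventRecord( evStart, 0 );
corrSharedSM_kernel<false><<<grid, block>>>( ... );
cudaEventRecord( evStop, 0 );
cudaEventSynchronize( evStop );
cudaEventElapsedTime( &ms, evStart, evStop );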
