11.1 Device Memory
If the input and output data are in device memory, optimizing a low-density computation such as SAXPY is a matter of optimizing the global memory access. Besides the alignment and coalescing constraints that affect performance, CUDA kernels are sensitive to the number of blocks and threads per block. The globalRead, globalWrite, globalCopy, and globalCopy2 applications (in the memory/ subdirectory of the source code) generate reports of the bandwidth achieved for a variety of operand sizes, block sizes, and loop unroll factors. A sample report generated by globalCopy2 (which follows a memory access pattern similar to SAXPY's: two reads and one write per loop iteration) is given in Listing 11.3.
Running the globalCopy2.cu application from Chapter 5 (see Listing 5.8) on a GK104 produces the output in Listing 11.3 for 4-byte operands. The top row (unroll factor of 1) corresponds to the naive implementation (similar to Listing 11.2); a slight performance benefit is observed when the loop is unrolled. An unroll factor of 4 gives a speedup of about 10%, delivering 128 GiB/s of bandwidth as opposed to the naive implementation's 116 GiB/s.
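For reference, the naive kernel in question is a simple grid-stride loop that processes one element per iteration. The sketch below illustrates that shape (an illustration, not a reproduction of Listing 11.2; the name saxpyNaive and other details are assumptions):

// Sketch of a naive SAXPY kernel in the spirit of Listing 11.2
// (illustrative, not necessarily the code of that listing).
// Each loop iteration performs two global reads and one global write.
__global__ void
saxpyNaive(
    float *out,
    const float *px,
    const float *py,
    size_t N,
    float alpha )
{
    for ( size_t i = blockIdx.x*blockDim.x+threadIdx.x;
          i < N;
          i += blockDim.x*gridDim.x ) {
        out[i] = alpha*px[i]+py[i];
    }
}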
Interestingly, using the #pragma unroll compiler directive only increases performance to about 118 GiB/s, while modifying the templated kernel from globalCopy2.cu to perform SAXPY increases performance to 135 GiB/s. Listing 11.4 gives the resulting kernel, which is implemented in the stream1Device.cu application (cudahandbook/streaming/).
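The #pragma unroll experiment amounts to annotating the naive kernel's loop with the directive rather than explicitly staging operands in registers. A minimal sketch of that variant follows (the exact kernel behind the 118 GiB/s measurement is not shown here, and the unroll factor of 4 is an assumption chosen to match the best templated factor):

// Sketch of the #pragma unroll variant (assumed form). The compiler
// unrolls the loop by the requested factor where it can, but without
// the explicit register staging of the templated kernel.
__global__ void
saxpyPragmaUnroll(
    float *out,
    const float *px,
    const float *py,
    size_t N,
    float alpha )
{
    #pragma unroll 4
    for ( size_t i = blockIdx.x*blockDim.x+threadIdx.x;
          i < N;
          i += blockDim.x*gridDim.x ) {
        out[i] = alpha*px[i]+py[i];
    }
}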
For most applications, these small performance differences don't justify rewriting kernels in this way. But if kernels are written to be "blocking-agnostic" (i.e., to work correctly for any grid or block size), then the optimal settings can be determined empirically without too much effort, as sketched below.
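One form such an empirical search can take is a host-side harness that times a blocking-agnostic kernel across a range of block sizes with CUDA events (a sketch; the saxpyGPU declaration, the block count of 1500, and the reporting format are assumptions for illustration, and the actual scaffolding in stream1Device.cu may differ):

// Sketch of an empirical block-size sweep (illustrative).
// saxpyGPU stands for any blocking-agnostic SAXPY kernel, such as
// the naive grid-stride version sketched earlier; it must be
// defined elsewhere and visible at link time.
#include <stdio.h>
#include <cuda_runtime.h>

extern __global__ void saxpyGPU( float *, const float *,
                                 const float *, size_t, float );

void
sweepBlockSizes( float *out, const float *px, const float *py,
                 size_t N, float alpha )
{
    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );
    for ( int threads = 32; threads <= 512; threads *= 2 ) {
        const int blocks = 1500;  // assumed; enough to fill the GPU
        cudaEventRecord( start );
        saxpyGPU<<<blocks, threads>>>( out, px, py, N, alpha );
        cudaEventRecord( stop );
        cudaEventSynchronize( stop );
        float ms;
        cudaEventElapsedTime( &ms, start, stop );
        // SAXPY touches 3 floats per element: two reads, one write.
        double GiBps = 3.0*N*sizeof(float) / (ms/1000.0) / (1<<30);
        printf( "%3d threads/block: %6.2f GiB/s\n", threads, GiBps );
    }
    cudaEventDestroy( start );
    cudaEventDestroy( stop );
}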
Listing 11.3 globalCopy2 output (GK104).
Operand size: 4 bytes
Input size: 16M operands
                               Block Size
Unroll      32      64     128     256     512   maxBW  maxThreads
     1   63.21   90.89  104.64  113.45  116.06  116.06         512
     2   66.43   92.89  105.09  116.35  120.66  120.66         512
     3   87.23  100.70  112.07  110.85  121.36  121.36         512
     4   99.54  103.53  113.58  119.52  128.64  128.64         512
     5   94.27  103.56  108.02  122.82  124.88  124.88         512
     6  100.67  104.18  115.10  122.05  122.46  122.46         512
     7   94.56  106.09  116.30  117.63  114.50  117.63         256
     8   58.27   45.10   47.07   46.29   45.18   58.27          32
     9   41.20   34.74   35.87   35.49   34.58   41.20          32
    10   33.59   31.97   32.42   31.43   30.61   33.59          32
    11   27.76   28.17   28.46   27.83   26.79   28.46         128
    12   25.59   26.42   26.54   25.72   24.51   26.54         128
    13   22.69   23.07   23.54   22.50   20.71   23.54         128
    14   22.19   22.40   22.23   21.10   19.00   22.40          64
    15   20.94   21.14   20.98   19.62   17.31   21.14          64
    16   18.86   19.01   18.97   17.66   15.40   19.01          64

Listing 11.4 saxpyGPU (templated unroll).
template<const int n>
__device__ void
saxpy_unrolled(
    float *out,
    const float *px,
    const float *py,
    size_t N,
    float alpha )
{
    // Each thread stages n operands in registers.
    float x[n], y[n];
    size_t i;
    // The main loop stops short of N so the inner loops need no
    // (index < N) conditional; a cleanup loop handles the tail.
    for ( i = n*blockIdx.x*blockDim.x+threadIdx.x;
          i < N-n*blockDim.x*gridDim.x;
          i += n*blockDim.x*gridDim.x ) {
        for ( int j = 0; j < n; j++ ) {