10.12 Optimal Block Sizing and Performance
When the texture coordinates are generated in the "obvious" way, as in tex2d_addressing.cu

    row = blockIdx.y*blockDim.y + threadIdx.y;
    col = blockIdx.x*blockDim.x + threadIdx.x;
    ...
    tex2D( tex, (float) col, (float) row );

then texturing performance is dependent on the block size.
To find the optimal size of a thread block, the tex2D_shmoo.cu and surf2Dmemset_shmoo.cu programs time the performance of thread blocks whose width and height vary from 4 to 64, inclusive. Some combinations of these thread block sizes are not valid because they would exceed the maximum number of threads per block.
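The enumeration the shmoo programs perform amounts to a doubly nested loop over block shapes. The following host-side sketch (an illustration, not the programs' actual code; the timing step is elided as a comment) shows how invalid shapes can be skipped by consulting the device properties:

    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );
    for ( int h = 4; h <= 64; h++ ) {
        for ( int w = 4; w <= 64; w++ ) {
            if ( w*h > prop.maxThreadsPerBlock ) {
                continue;   // too many threads per block; not a valid launch
            }
            dim3 block( w, h );
            // ... launch the kernel with this block size and record its time ...
        }
    }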
For this exercise, the texturing kernel is designed to do as little work as possible (maximizing exposure to the performance of the texture hardware), while still "fooling" the compiler into issuing the code. Each thread computes the floating-point sum of the values it reads, and writes the sum only if the output parameter is non-NULL. The trick is that we never pass a non-NULL pointer to this kernel! The kernel is structured this way because if it never wrote any output, the compiler would see that the kernel was not doing any work and would emit code that did not perform the texturing operations at all.
extern "C" __global__ void
TexSums(float \*out,size_t Width,size_t Height)
{ float sum $= 0.0f$ for ( int row $=$ (blockIdx.y\*blockDim.y $^+$ threadIdx.y; row $<$ Height; row $+ =$ blockDim.y\*gridDim.y) { for (int col $=$ .blockIdx.x\*blockDim.x $^+$ threadIdx.x; col $<$ Width; col $+ =$ blockDim.x\*gridDim.x) { sum $+ =$ tex2D( tex,(float)col,(float)row); } if (out){ out[blocks.x\*blockDim.x+threadIdx.x] $=$ sum; }Even with our "trick," there is a risk that the compiler will emit code that checks the out parameter and exits the kernel early if it's equal to NULL. We'd have to synthesize some output that wouldn't affect performance too much (for example, have each thread block compute the reduction of the sums in shared memory and write them to out). But by compiling the program with the --keep option and using cuobjdump --dump-sass to examine the microcode, we can see that the compiler doesn't check out until after the doubly-nested for loop as executed.
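To make the intent concrete, here is a minimal host-side timing sketch (an illustration only; the grid and block variables and the event names are assumptions, not the shmoo program's actual code). The kernel is launched with out == NULL, so the store is never performed, and CUDA events measure the elapsed time:

    cudaEvent_t evStart, evStop;
    float ms;
    cudaEventCreate( &evStart );
    cudaEventCreate( &evStop );
    cudaEventRecord( evStart, 0 );
    TexSums<<<grid, block>>>( NULL, Width, Height );  // out is always NULL
    cudaEventRecord( evStop, 0 );
    cudaEventSynchronize( evStop );
    cudaEventElapsedTime( &ms, evStart, evStop );     // ms = elapsed time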
10.12.1 RESULTS
On a GeForce GTX 280 (GT200), the optimal block size was found to be 128 threads, which delivered 35.7G/s of bandwidth. Thread blocks of several different shapes with that thread count were about the same speed, all traversing the texture of float in 1.88 ms. On a Tesla M2050, the optimal block size was found to be 192 threads, which delivered 35.4G/s of bandwidth. As with the GT200, differently shaped thread blocks of the same size delivered about the same performance.
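As a sanity check, the bandwidth figure follows directly from the traversal time: if the texture is, say, 4096x4096 floats (an assumed size for illustration, though it is consistent with the reported numbers), the kernel reads 4096 x 4096 x 4 bytes = 67.1MB, and 67.1MB / 1.88 ms = 35.7G/s.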
The shmoo over 2D surface memset was less conclusive: block sizes of at least 128 threads generally had good performance, provided the thread count was evenly divisible by the warp size of 32. The fastest 2D surface memset performance reported on a cg1.4xlarge instance (with ECC disabled) was 48Gb/s.
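For reference, a 2D surface memset kernel along the lines of surf2Dmemset_shmoo.cu might look like the following sketch (a minimal version using the legacy surface reference API, matching the texture reference style used above; the identifier surf and the template parameter are assumptions, not necessarily the program's actual code):

    surface<void, cudaSurfaceType2D> surf;

    template<typename T>
    __global__ void
    surf2Dmemset_kernel( T value, int Width, int Height )
    {
        for ( int row = blockIdx.y*blockDim.y + threadIdx.y;
                  row < Height;
                  row += blockDim.y*gridDim.y )
        {
            for ( int col = blockIdx.x*blockDim.x + threadIdx.x;
                      col < Width;
                      col += blockDim.x*gridDim.x )
            {
                // surf2Dwrite() takes a byte offset in the X direction
                surf2Dwrite( value, surf, col*sizeof(T), row );
            }
        }
    }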
For float-valued data on both boards we tested, the peak bandwidths delivered by texturing and by surface writes are only a fraction of the achievable peaks for global loads and stores.