11.3 Streams
For workloads that can benefit from concurrent memcpy and kernel execution (overlapping data transfers with computation on the GPU), CUDA streams can be used to coordinate execution. The stream3Streams.cu application splits the input and output arrays into k streams and then invokes k host→device memcpys, kernels, and device→host memcpys, each in their own stream. Associating the transfers and computations with different streams lets CUDA know that the computations are completely independent, and CUDA will exploit whatever parallelism opportunities the hardware can support. On GPUs with multiple copy engines, the GPU may be transferring data both to and from device memory while processing other data with the SMs.
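The stream and array bookkeeping this requires is straightforward; a minimal sketch (assuming the element count N divides evenly by the stream count; the nStreams, streamStep, and streams[] names match those used in the excerpt below) might look like this.

// Hypothetical setup sketch, not verbatim from stream3Streams.cu:
// divide an N-element problem evenly across nStreams streams.
cudaStream_t streams[8];                  // assumes nStreams <= 8
size_t streamStep = N / nStreams;         // elements handled per stream
for ( int iStream = 0; iStream < nStreams; iStream++ ) {
    CUDART_CHECK( cudaStreamCreate( &streams[iStream] ) );
}
// ... issue the streamed work and synchronize, then clean up ...
for ( int iStream = 0; iStream < nStreams; iStream++ ) {
    CUDART_CHECK( cudaStreamDestroy( streams[iStream] ) );
}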
Listing 11.6 shows an excerpt from stream3Streams.cu, the same portion of the application as shown in Listing 11.5. On the test system, the output from this application reads as follows.
Measuring times with 128M floats
Testing with default max of 8 streams (set with --maxStreams)
The GPU in question has only one copy engine, so it is not surprising that the case with 2 streams delivers the highest performance. If the kernel execution time were more in line with the transfer time, it would likely be beneficial to split the arrays into more than 2 subarrays. As things stand, the first kernel launch cannot begin processing until the first host→device memcpy is done, and the final device→host memcpy cannot begin until the last kernel launch is done. If the kernel processing took more time, this "overhang" would be more pronounced. For our application, the wall clock time of 273 ms shows that most of the kernel processing time (13.87 ms) has been hidden.
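A rough way to quantify this (an approximation, not a measurement from the original text, assuming a single copy engine and kernel chunks that execute back to back) is that the wall clock time for k streams is limited both by the serialized transfer time and by the dependency chain running through the first host→device chunk, all of the kernel work, and the last device→host chunk:

$$T_{wall} \approx \max\!\left(T_{H2D} + T_{D2H},\ \ \frac{T_{H2D}}{k} + T_{kernel} + \frac{T_{D2H}}{k}\right)$$

When the kernel time is small relative to the transfers, the first term dominates and nearly all of the kernel work is hidden; as the kernel time grows, the second term, and with it the "overhang," takes over.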
Note that in this formulation, partly due to hardware limitations, we are not trying to insert any cudaEventRecord() calls between operations, as we did in Listing 11.5. On most CUDA hardware, trying to record events between the streamed operations in Listing 11.6 would break concurrency and reduce performance. Instead, we bracket the operations with one cudaEventRecord() before and one cudaEventRecord() after.
Listing 11.6 stream3Streams.cu excerpt.
for ( int iStream = 0; iStream < nStreams; iStream++ ) {
    CUDART_CHECK( cudaMemcpyAsync(
        dptrX+iStream*streamStep, hptrX+iStream*streamStep,
        streamStep*sizeof(float), cudaMemcpyHostToDevice, streams[iStream] ) );
    CUDART_CHECK( cudaMemcpyAsync(
        dptrY+iStream*streamStep, hptrY+iStream*streamStep,
        streamStep*sizeof(float), cudaMemcpyHostToDevice, streams[iStream] ) );
}

for ( int iStream = 0; iStream < nStreams; iStream++ ) {
    saxpyGPU<<<nBlocks, nThreads, 0, streams[iStream]>>>(
        dptrOut+iStream*streamStep, dptrX+iStream*streamStep,
        dptrY+iStream*streamStep, streamStep, alpha );
}

for ( int iStream = 0; iStream < nStreams; iStream++ ) {
    CUDART_CHECK( cudaMemcpyAsync(
        hptrOut+iStream*streamStep, dptrOut+iStream*streamStep,
        streamStep*sizeof(float), cudaMemcpyDeviceToHost, streams[iStream] ) );
}
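To time the streamed work as described above (bracketing rather than interleaving events), a minimal sketch might look like the following; the evStart and evStop names are illustrative, not necessarily those used in stream3Streams.cu, and the sketch assumes the legacy default-stream semantics, under which an event recorded in the NULL stream does not complete until all previously issued work in the other streams has finished.

cudaEvent_t evStart, evStop;
CUDART_CHECK( cudaEventCreate( &evStart ) );
CUDART_CHECK( cudaEventCreate( &evStop ) );

CUDART_CHECK( cudaEventRecord( evStart, 0 ) );
// ... issue the streamed memcpys and kernel launches shown above ...
CUDART_CHECK( cudaEventRecord( evStop, 0 ) );
CUDART_CHECK( cudaEventSynchronize( evStop ) );

float ms;
CUDART_CHECK( cudaEventElapsedTime( &ms, evStart, evStop ) );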