3.7 Streams and Events
In CUDA, streams and events were added to enable host↔device memory copies to be performed concurrently with kernel execution. Later versions of CUDA expanded streams' capabilities to support execution of multiple kernels concurrently on the same GPU and to support concurrent execution between multiple GPUs.
CUDA streams are used to manage concurrency between execution units with coarse granularity:
- The GPU and the CPU
- The copy engine(s) that can perform DMA while the SMs are processing
- The streaming multiprocessors (SMs)
- Kernels that are intended to run concurrently
- Separate GPUs that are executing concurrently
The operations requested in a given stream are performed sequentially. In a sense, CUDA streams are like CPU threads in that operations within a CUDA stream are performed in order.
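The in-order behavior of a single stream can be seen in a short runtime-API sketch such as the following; the scale() kernel, buffer sizes, and use of pinned memory are illustrative assumptions, not part of the text.

#include <cuda_runtime.h>

__global__ void scale( float *x, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n ) x[i] *= 2.0f;
}

int main()
{
    const int N = 1<<20;
    float *h = 0, *d = 0;
    cudaHostAlloc( (void **) &h, N*sizeof(float), cudaHostAllocDefault );  // pinned host memory
    cudaMalloc( (void **) &d, N*sizeof(float) );

    cudaStream_t stream;
    cudaStreamCreate( &stream );

    // Within this one stream, the copy-in, the kernel, and the copy-out are
    // guaranteed to execute in the order they were issued.
    cudaMemcpyAsync( d, h, N*sizeof(float), cudaMemcpyHostToDevice, stream );
    scale<<< (N+255)/256, 256, 0, stream >>>( d, N );
    cudaMemcpyAsync( h, d, N*sizeof(float), cudaMemcpyDeviceToHost, stream );

    // The host thread is free to do other work here; wait only when the
    // results are actually needed.
    cudaStreamSynchronize( stream );

    cudaStreamDestroy( stream );
    cudaFree( d );
    cudaFreeHost( h );
    return 0;
}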
3.7.1 SOFTWARE PIPELINING
Because there is only one DMA engine serving the various coarse-grained hardware resources in the GPU, applications must "software-pipeline" the operations performed on multiple streams. Otherwise, the DMA engine will "break concurrency" by enforcing synchronization within the stream between different engines. A detailed description of how to take full advantage of CUDA streams using software pipelining is given in Section 6.5, and more examples are given in Chapter 11.
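The following is a minimal sketch of that software-pipelining idiom using the CUDA runtime API: the same stage is issued across all streams before moving on to the next stage, so requests to the copy engine from different streams are not interleaved in a way that forces serialization. The processChunk() kernel, stream count, and buffer layout are assumptions for illustration.

#include <cuda_runtime.h>

__global__ void processChunk( float *x, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n ) x[i] += 1.0f;
}

int main()
{
    const int nStreams = 4, chunkN = 1<<18;
    const size_t chunkBytes = chunkN*sizeof(float);
    cudaStream_t streams[nStreams];
    float *h[nStreams], *d[nStreams];

    for ( int i = 0; i < nStreams; i++ ) {
        cudaStreamCreate( &streams[i] );
        cudaHostAlloc( (void **) &h[i], chunkBytes, cudaHostAllocDefault );
        cudaMalloc( (void **) &d[i], chunkBytes );
    }

    // Breadth-first ("software-pipelined") issue order: all copies in,
    // then all kernels, then all copies out.
    for ( int i = 0; i < nStreams; i++ )
        cudaMemcpyAsync( d[i], h[i], chunkBytes, cudaMemcpyHostToDevice, streams[i] );
    for ( int i = 0; i < nStreams; i++ )
        processChunk<<< (chunkN+255)/256, 256, 0, streams[i] >>>( d[i], chunkN );
    for ( int i = 0; i < nStreams; i++ )
        cudaMemcpyAsync( h[i], d[i], chunkBytes, cudaMemcpyDeviceToHost, streams[i] );

    for ( int i = 0; i < nStreams; i++ ) {
        cudaStreamSynchronize( streams[i] );
        cudaStreamDestroy( streams[i] );
        cudaFree( d[i] );
        cudaFreeHost( h[i] );
    }
    return 0;
}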
The Kepler architecture reduced the need to software-pipeline streamed operations and, with NVIDIA's Hyper-Q technology (first available with SM 3.5), virtually eliminated the need for software pipelining.
3.7.2 STREAM CALLBACKS
CUDA 5.0 introduced another mechanism for CPU/GPU synchronization that complements the existing mechanisms, which focus on enabling CPU threads to wait until streams are idle or events have been recorded. Stream callbacks are functions provided by the application, registered with CUDA, and later called by CUDA when the stream has reached the point at which cuStreamAddCallback() was called.
Stream execution is suspended for the duration of the stream callback, so for performance reasons, developers should make sure the GPU has work in other streams to process while the callback runs.
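A minimal sketch using the runtime-API analog, cudaStreamAddCallback(), might look like the following; the copyFinished() callback, the buffer size, and the payload passed through userData are illustrative assumptions.

#include <cuda_runtime.h>
#include <stdio.h>

// Called on a driver-owned thread once all prior work in the stream has
// completed. It must not itself call into the CUDA API, and it should be
// short: later work in the stream waits until the callback returns.
static void CUDART_CB copyFinished( cudaStream_t stream, cudaError_t status,
                                    void *userData )
{
    printf( "copy of %zu bytes finished, status=%d\n",
            (size_t) userData, (int) status );
}

int main()
{
    const size_t bytes = 1<<20;
    char *h, *d;
    cudaHostAlloc( (void **) &h, bytes, cudaHostAllocDefault );
    cudaMalloc( (void **) &d, bytes );

    cudaStream_t stream;
    cudaStreamCreate( &stream );

    cudaMemcpyAsync( d, h, bytes, cudaMemcpyHostToDevice, stream );
    cudaStreamAddCallback( stream, copyFinished, (void *) bytes, 0 );

    cudaStreamSynchronize( stream );
    cudaStreamDestroy( stream );
    cudaFree( d );
    cudaFreeHost( h );
    return 0;
}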
3.7.3 THE NULL STREAM
Any of the asynchronous memcpy functions may be called with NULL as the stream parameter, and the memcpy will not be initiated until all preceding operations on the GPU have been completed; in effect, the NULL stream is a join of all the engines on the GPU. Additionally, all streamed memcpy functions are asynchronous, potentially returning control to the application before the memcpy has been performed. The NULL stream is most useful for facilitating CPU/GPU concurrency in applications that have no need for the intra-GPU concurrency facilitated by multiple streams. Once a streamed operation has been initiated with the NULL stream, the application must use synchronization functions such as cuCtxSynchronize() or cudaThreadSynchronize() to ensure that the operation has been completed before proceeding. But the application may request many such operations before performing the synchronization. For example, the application may perform
- an asynchronous host→device memcpy
- one or more kernel launches
- an asynchronous device→host memcpy
before synchronizing with the context. The cuCtxSynchronize() or cudaThreadSynchronize() call returns after the GPU has performed the last-requested operation. This idiom is especially useful when performing smaller memcpys or launching kernels that will not run for long. The CUDA driver takes valuable CPU time to write commands to the GPU, and overlapping that CPU execution with the GPU's processing of the commands can improve performance.
Note: Even in CUDA 1.0, kernel launches were asynchronous; the NULL stream is implicitly used by any kernel launch that does not explicitly specify a stream.
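As a concrete sketch of this idiom with the runtime API, the sequence above might look like the following; cudaDeviceSynchronize() is used here as the current runtime equivalent of cudaThreadSynchronize(), and the doubleIt() kernel and buffer sizes are illustrative assumptions.

#include <cuda_runtime.h>

__global__ void doubleIt( float *x, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n ) x[i] *= 2.0f;
}

int main()
{
    const int N = 1<<20;
    float *h, *d;
    cudaHostAlloc( (void **) &h, N*sizeof(float), cudaHostAllocDefault );
    cudaMalloc( (void **) &d, N*sizeof(float) );

    // All three requests go to the NULL stream; each returns to the CPU
    // immediately, so the driver can keep writing commands while the GPU
    // works through them.
    cudaMemcpyAsync( d, h, N*sizeof(float), cudaMemcpyHostToDevice, 0 );
    doubleIt<<< (N+255)/256, 256 >>>( d, N );
    cudaMemcpyAsync( h, d, N*sizeof(float), cudaMemcpyDeviceToHost, 0 );

    // One synchronization covers everything requested above.
    cudaDeviceSynchronize();

    cudaFree( d );
    cudaFreeHost( h );
    return 0;
}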
3.7.4 EVENTS
CUDA events present another mechanism for synchronization. Introduced at the same time as CUDA streams, "recording" CUDA events is a way for applications to track progress within a CUDA stream. CUDA events work by writing a shared sync memory location when all preceding operations in the CUDA stream have been performed. Querying the CUDA event causes the driver to peek at this memory location and report whether the event has been recorded; synchronizing with the CUDA event causes the driver to wait until the event has been recorded.
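A short sketch of recording, querying, and synchronizing with an event via the runtime API (cudaEventRecord(), cudaEventQuery(), cudaEventSynchronize()) might look like this; the buffer size and copy are illustrative assumptions.

#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 << 20;
    char *h, *d;
    cudaHostAlloc( (void **) &h, bytes, cudaHostAllocDefault );
    cudaMalloc( (void **) &d, bytes );

    cudaStream_t stream;
    cudaStreamCreate( &stream );
    cudaEvent_t ev;
    cudaEventCreate( &ev );

    cudaMemcpyAsync( d, h, bytes, cudaMemcpyHostToDevice, stream );
    cudaEventRecord( ev, stream );   // event fires once the copy is done

    // Nonblocking check: cudaEventQuery() returns cudaErrorNotReady until
    // the event has been recorded, so the CPU can do other work meanwhile.
    // cudaEventSynchronize( ev ) is the blocking alternative to this loop.
    while ( cudaEventQuery( ev ) == cudaErrorNotReady ) {
        // overlap useful CPU work here
    }

    cudaEventDestroy( ev );
    cudaStreamDestroy( stream );
    cudaFree( d );
    cudaFreeHost( h );
    return 0;
}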
Optionally, CUDA events also can write a timestamp derived from a high-resolution timer in the hardware. Event-based timing can be more robust than CPU-based timing, especially for smaller operations, because it is not subject to spurious unrelated events (such as page faults or network traffic) that may affect wall-clock timing by the CPU. Wall-clock times, however, are definitive because they are a better approximation of what the end user sees, so CUDA events are best used for performance tuning during product development.[11]
Timing using CUDA events is best performed in conjunction with the NULL stream. This rule of thumb is motivated by reasons similar to the reasons CPU timing code must serialize execution around RDTSC (Read Time-Stamp Counter): Just as the CPU is a superscalar processor that can execute many instructions at once, the GPU can be operating on multiple streams at the same time. Without explicit serialization, a timing operation may inadvertently include operations that were not intended to be timed or may exclude operations that were supposed to be timed. As with RDTSC, the trick is to bracket the CUDA event recordings with enough work that the overhead of performing the timing itself is negligible.
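A minimal timing sketch in the NULL stream, using the runtime API, might look like the following; the timedKernel() name and problem size are illustrative assumptions.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void timedKernel( float *x, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n ) x[i] = x[i]*x[i];
}

int main()
{
    const int N = 1<<22;
    float *d;
    cudaMalloc( (void **) &d, N*sizeof(float) );

    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    cudaEventRecord( start, 0 );                    // NULL stream
    timedKernel<<< (N+255)/256, 256 >>>( d, N );    // the work being timed
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );                   // wait for the stop event

    float ms = 0.0f;
    cudaEventElapsedTime( &ms, start, stop );       // GPU time between events
    printf( "kernel time: %.3f ms\n", ms );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    cudaFree( d );
    return 0;
}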
CUDA events optionally can cause an interrupt to be signaled by the hardware, enabling the driver to perform a so-called "blocking" wait. Blocking waits suspend the waiting CPU thread, saving CPU clock cycles and power while the driver waits for the GPU. Before blocking waits became available, CUDA developers commonly complained that the CUDA driver burned a whole CPU core waiting for the GPU by polling a memory location. At the same time, blocking waits may take longer due to the overhead of handling the interrupt, so latency-sensitive applications may still wish to use the default polling behavior.
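With the runtime API, a blocking wait can be requested by creating the event with the cudaEventBlockingSync flag, as in the following sketch; the longRunning() kernel and problem size are illustrative assumptions.

#include <cuda_runtime.h>

__global__ void longRunning( float *x, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n ) x[i] += 1.0f;
}

int main()
{
    const int N = 1<<24;
    float *d;
    cudaMalloc( (void **) &d, N*sizeof(float) );

    cudaEvent_t done;
    cudaEventCreateWithFlags( &done, cudaEventBlockingSync );

    longRunning<<< (N+255)/256, 256 >>>( d, N );
    cudaEventRecord( done, 0 );

    // The waiting CPU thread sleeps instead of burning a core polling; the
    // tradeoff is the extra latency of the interrupt when the GPU finishes.
    cudaEventSynchronize( done );

    cudaEventDestroy( done );
    cudaFree( d );
    return 0;
}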