
7.1 Overview

CUDA kernels execute on the GPU and, since the very first version of CUDA, have always executed concurrently with the CPU. In other words, kernel launches are asynchronous: Control is returned to the CPU before the GPU has completed the requested operation. When CUDA was first introduced, there was no need for developers to concern themselves with the asynchrony (or lack thereof) of kernel launches; data had to be copied to and from the GPU explicitly, and those synchronous memcpy commands were enqueued after the commands needed to launch kernels, so they implicitly waited for the kernels to finish. It was not possible to write CUDA code that exposed the asynchrony of kernel launches; the main side effect was to hide driver overhead when performing multiple kernel launches consecutively.
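The behavior is easy to observe. The following minimal sketch (kernel and function names are illustrative, not from the text) shows a launch returning to the CPU immediately and an explicit cudaDeviceSynchronize() call making the CPU wait for the GPU to finish.

#include <cuda_runtime.h>

__global__ void ScaleArray( float *p, float k, size_t N )
{
    size_t i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < N )
        p[i] *= k;
}

void ScaleOnGPU( float *devPtr, size_t N )
{
    ScaleArray<<< (unsigned int)((N+255)/256), 256 >>>( devPtr, 2.0f, N );
    // Control returns here immediately; the GPU may not even have started.
    cudaDeviceSynchronize();   // block the CPU until all pending GPU work is done
}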

With the introduction of mapped pinned memory (host memory that can be directly accessed by the GPU), the asynchrony of kernel launches becomes more important, especially for kernels that write to host memory (as opposed to reading from it). If a kernel is launched and writes host memory without explicit synchronization (such as with CUDA events), the code suffers from a race condition between the CPU and GPU and may not run correctly. Explicit synchronization often is not needed for kernels that read via mapped pinned memory, since any pending writes by the CPU will be posted before the kernel launches. But for kernels that return results to the CPU by writing to mapped pinned memory, synchronizing to avoid read-after-write hazards is essential.
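A hedged sketch of the write-back case described above: the kernel writes its result into mapped pinned host memory, and the CPU records a CUDA event after the launch and waits on it before reading the result. The kernel and variable names are illustrative, and the sketch assumes a device that supports mapped pinned memory.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void WriteResult( int *result )
{
    if ( blockIdx.x == 0 && threadIdx.x == 0 )
        *result = 42;   // write directly to host memory across the bus
}

int main()
{
    int *hostPtr, *devPtr;
    cudaEvent_t evDone;

    cudaSetDeviceFlags( cudaDeviceMapHost );   // must precede context creation
    cudaHostAlloc( (void **) &hostPtr, sizeof(int), cudaHostAllocMapped );
    cudaHostGetDevicePointer( (void **) &devPtr, hostPtr, 0 );
    cudaEventCreate( &evDone );

    WriteResult<<<1,1>>>( devPtr );
    cudaEventRecord( evDone, 0 );

    // Without this synchronization, reading *hostPtr races with the GPU.
    cudaEventSynchronize( evDone );
    printf( "%d\n", *hostPtr );
    return 0;
}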

Once a kernel is launched, it runs as a grid of blocks of threads. Not all blocks necessarily run concurrently; each block is assigned to a streaming multiprocessor (SM), and each SM can maintain the context for multiple blocks. To cover both memory and instruction latencies, the SM generally needs more warps than a single block can contain. The maximum number of blocks per SM cannot be queried, but it is documented by NVIDIA as having been 8 before SM 3.x and 16 on SM 3.x and later hardware.
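For reference, several related per-SM limits can be queried through cudaGetDeviceProperties(). The fields shown below are standard CUDA runtime properties; the wrapper function is illustrative.

#include <stdio.h>
#include <cuda_runtime.h>

void PrintSMLimits( int device )
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, device );
    printf( "SMs:                   %d\n", prop.multiProcessorCount );
    printf( "Warp size:             %d\n", prop.warpSize );
    printf( "Max threads per block: %d\n", prop.maxThreadsPerBlock );
    printf( "Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor );
}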

The programming model makes no guarantees whatsoever as to the order of execution or whether certain blocks or threads can run concurrently. Developers can never assume that all the threads in a kernel launch are executing concurrently. It is easy to launch more threads than the machine can hold, and some will not start executing until others have finished. Given the lack of ordering guarantees, even initialization of global memory at the beginning of a kernel launch is a difficult proposition.
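One common workaround under those constraints, sketched below, is to perform the initialization in a separate launch (or with cudaMemset()) in the same stream as the main kernel; launches into a stream execute in order, so the main kernel observes fully initialized memory. The kernel names are illustrative.

__global__ void InitCounters( unsigned int *counters, size_t N )
{
    size_t i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < N )
        counters[i] = 0;
}

void LaunchWithInitializedCounters( unsigned int *counters, size_t N )
{
    InitCounters<<< (unsigned int)((N+255)/256), 256 >>>( counters, N );
    // Launches into the same (default) stream are serialized, so a
    // subsequent kernel can rely on the counters being zeroed, e.g.:
    // MainKernel<<<grid, block>>>( counters, N );   // hypothetical
}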

Dynamic parallelism, a new feature added with the Tesla K20 (GK110), the first SM 3.5-capable GPU, enables kernels to launch other kernels and synchronize with them. These capabilities address some of the limitations of CUDA on earlier hardware. For example, a dynamically parallel kernel can perform initialization by launching and waiting for a child grid.
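A minimal sketch of that pattern, assuming an SM 3.5-class device and compilation with relocatable device code and the device runtime (-rdc=true): a single thread of the parent launches a child grid to initialize the data and waits for it with a device-side cudaDeviceSynchronize(), per the dynamic-parallelism model introduced with GK110. Kernel names are illustrative.

__global__ void ChildInit( float *data, size_t N )
{
    size_t i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < N )
        data[i] = 0.0f;
}

__global__ void ParentKernel( float *data, size_t N )
{
    if ( threadIdx.x == 0 ) {
        ChildInit<<< (unsigned int)((N+255)/256), 256 >>>( data, N );
        cudaDeviceSynchronize();   // wait for the child grid (device-side, SM 3.5 model)
    }
    __syncthreads();               // the rest of this block now sees initialized data
    // ... main work of the parent kernel ...
}

// Launched from the host with a single block, e.g.:
// ParentKernel<<<1, 256>>>( devData, N );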
