
7.5 Dynamic Parallelism

Dynamic parallelism, a new capability that works only on SM 3.5-class hardware, enables CUDA kernels to launch other CUDA kernels, and also to invoke various functions in the CUDA runtime. When using dynamic parallelism, a subset of the CUDA runtime (known as the device runtime) becomes available for use by threads running on the device.

Dynamic parallelism introduces the idea of "parent" and "child" grids. Any kernel invoked by another CUDA kernel (as opposed to host code, as done in all previous CUDA versions) is a "child kernel," and the invoking grid is its "parent." By default, CUDA supports two (2) nesting levels (one for the parent and one for the child), a number that may be increased by calling cudaDeviceSetLimit() with cudaLimitDevRuntimeSyncDepth.

Dynamic parallelism was designed to address applications that previously had to deliver results to the CPU so the CPU could specify which work to perform on the GPU. Such "handshaking" disrupts CPU/GPU concurrency in the execution pipeline described in Section 2.5.1, in which the CPU produces commands for consumption by the GPU. The GPU's time is too valuable for it to wait for the CPU to read and analyze results before issuing more work. Dynamic parallelism avoids these pipeline bubbles by enabling the GPU to launch work for itself from kernels.
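To make the mechanism concrete, here is a minimal sketch, not taken from the text, of a parent kernel launching a hypothetical child kernel entirely on the GPU. It assumes compilation for SM 3.5 with relocatable device code and linking against the device runtime (see Section 7.5.5).

__global__ void childKernel( int *data, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n )
        data[i] *= 2;                        // arbitrary work on the data
}

__global__ void parentKernel( int *data, int n )
{
    // One thread launches the child grid; otherwise every parent thread
    // would launch its own child.
    if ( threadIdx.x == 0 && blockIdx.x == 0 ) {
        childKernel<<<(n+255)/256, 256>>>( data, n );
        cudaDeviceSynchronize();             // wait for the child grid
    }
}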

Dynamic parallelism can improve performance in several cases.

  • It enables initialization of data structures needed by a kernel before the kernel can begin execution. Previously, such initialization had to be done in host code or by invoking a separate kernel beforehand.

  • It enables simplified recursion for applications such as Barnes-Hut gravitational integration or hierarchical grid evaluation for aerodynamic simulations.

NOTE

Dynamic parallelism only works within a given GPU. Kernels can invoke memory copies or other kernels, but they cannot submit work to other GPUs.

7.5.1 SCOPING AND SYNCHRONIZATION

With the notable exception of block and grid size, child grids inherit most kernel configuration parameters, such as the shared memory configuration (set by cudaDeviceSetCacheConfig()), from their parents. Thread blocks are a unit of scope: Streams and events created by a thread block can only be used by that thread block (they are not even inherited for use by child grids), and they are automatically destroyed when the thread block exits.

NOTE

Resources created on the device via dynamic parallelism are strictly separated from resources created on the host. Streams and events created on the host may not be used on the device via dynamic parallelism, and vice versa.

CUDA guarantees that a parent grid is not considered complete until all of its children have finished. Although the parent may execute concurrently with the child, there is no guarantee that a child grid will begin execution until its parent calls cudaDeviceSynchronize().

If all threads in a thread block exit, execution of the thread block is suspended until all child grids have finished. If that synchronization is not sufficient, developers can use CUDA streams and events for explicit synchronization. As on the host, operations within a given stream are performed in the order of submission. Operations can only execute concurrently if they are specified in different streams, and there is no guarantee that operations will, in fact, execute concurrently. If needed, synchronization primitives such as __syncthreads() can be used to coordinate the order of submission to a given stream.

NOTE

Streams and events created on the device may not be used outside the thread block that created them.

cudaDeviceSynchronize() synchronizes on all pending work launched by any thread in the thread block. It does not, however, perform any interthread synchronization, so if there is a desire to synchronize on work launched by other threads, developers must use __syncthreads() or other block-level synchronization primitives (see Section 8.6.2).
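As an illustration of these rules, the following sketch (with hypothetical kernel names) has thread 0 launch and synchronize on a child grid, then uses __syncthreads() so the rest of the block does not consume the child's results before they are ready.

__global__ void childKernel( float *out, int n );   // assumed to exist

__global__ void parentKernel( float *out, int n )
{
    if ( threadIdx.x == 0 ) {
        childKernel<<<1, 256>>>( out, n );
        cudaDeviceSynchronize();    // waits for work launched in this block
    }
    __syncthreads();                // other threads wait for thread 0
    // All threads in the block may now safely read what the child wrote.
}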

7.5.2 MEMORY MODEL

Parent and child grids share the same global and constant memory storage, but they have distinct local and shared memory.

Global Memory

There are two points in the execution of a child grid when its view of memory is fully consistent with the parent grid: when the child grid is invoked by the parent and when the child grid completes as signaled by a synchronization API invocation in the parent thread.

All global memory operations in the parent thread prior to the child grid's invocation are visible to the child grid. All memory operations of the child grid are visible to the parent after the parent has synchronized on the child grid's completion. Zero-copy memory has the same coherence and consistency guarantees as global memory.

Constant Memory

Constants are immutable and may not be modified from the device during kernel execution. Taking the address of a constant memory object from within a kernel thread has the same semantics as for all CUDA programs, and passing that pointer between parents and their children is fully supported.

Shared and Local Memory

Shared and local memory is private to a thread block or thread, respectively, and is not visible or coherent between parent and child. When an object in one of these locations is referenced outside its scope, the behavior is undefined and would likely cause an error.

If nvcc detects an attempt to misuse a pointer to shared or local memory, it will issue a warning. Developers can use the __isGlobal() intrinsic to determine whether a given pointer references global memory. Pointers to shared or local memory are not valid parameters to cudaMemcpyAsync() or cudaMemsetAsync().
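A sketch of how __isGlobal() might be used as a guard; the helper function and its parameters are illustrative, not from the text.

__device__ void copyForChild( void *dst, const void *src, size_t bytes,
                              cudaStream_t stream )
{
    // Only pointers to global memory are legal arguments here; shared or
    // local pointers would be rejected by the device runtime.
    if ( __isGlobal( dst ) && __isGlobal( src ) ) {
        cudaMemcpyAsync( dst, src, bytes, cudaMemcpyDeviceToDevice, stream );
    }
}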

Local Memory

Local memory is private storage for an executing thread and is not visible outside of that thread. It is illegal to pass a pointer to local memory as a launch argument when launching a child kernel. The result of dereferencing such a local memory pointer from a child will be undefined. To guarantee that this rule is not inadvertently violated by the compiler, all storage passed to a child kernel should be allocated explicitly from the global memory heap.
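The following sketch (hypothetical kernel names) contrasts the illegal pattern with the recommended one, allocating the child's buffer from the global memory heap with device-side malloc().

__global__ void childKernel( int *buf, int n );     // assumed to exist

__global__ void parentKernel( int n )
{
    if ( threadIdx.x == 0 && blockIdx.x == 0 ) {
        // Illegal: int localBuf[64]; childKernel<<<1,64>>>( localBuf, n );
        int *buf = (int *) malloc( n * sizeof(int) );   // global memory heap
        if ( buf ) {
            childKernel<<<1, 256>>>( buf, n );
            cudaDeviceSynchronize();
            free( buf );                                // freed on the device
        }
    }
}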

Texture Memory

Concurrent accesses by parent and child may result in inconsistent data and should be avoided. That said, a degree of coherency between parent and child is enforced by the runtime. A child kernel can use texturing to access memory written by its parent, but writes to memory by a child will not be reflected in the texture memory accesses by a parent until after the parent synchronizes on the child's completion. Texture objects are well supported in the device runtime. They cannot be created or destroyed, but they can be passed in and used by any grid in the hierarchy (parent or child).
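For example, a texture object created by the host can be passed through the parent to a child grid, which reads through it; a rough sketch with hypothetical kernels follows.

__global__ void childKernel( cudaTextureObject_t tex, float *out, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n )
        out[i] = tex1Dfetch<float>( tex, i );   // read via the inherited texture object
}

__global__ void parentKernel( cudaTextureObject_t tex, float *out, int n )
{
    if ( threadIdx.x == 0 && blockIdx.x == 0 )
        childKernel<<<(n+255)/256, 256>>>( tex, out, n );
}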

7.5.3 STREAMS AND EVENTS

Streams and events created by the device runtime can be used only within the thread block that created them. The NULL stream has different semantics in the device runtime than in the host runtime. On the host, synchronizing with the NULL stream forces a "join" of all the other streamed operations on the GPU (as described in Section 6.2.3); on the device, the NULL stream is its own stream, and any interstream synchronization must be performed using events.

When using the device runtime, streams must be created with the cudaStreamNonBlocking flag (a parameter to cudaStreamCreateWithFlags()). The cudaStreamSynchronize() call is not supported; synchronization must be implemented in terms of events and cudaStreamWaitEvent().

Only the interstream synchronization capabilities of CUDA events are supported. As a consequence, cudaEventSynchronize(), cudaEventElapsedTime(), and cudaEventQuery() are not supported. Additionally, because timing is not supported, events must be created by passing the cudaEventDisableTiming flag to cudaEventCreateWithFlags().
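Putting these rules together, a sketch of device-side stream and event usage might look like the following (the child kernels are placeholders).

__global__ void childA( float *a, int n );   // assumed to exist
__global__ void childB( float *a, int n );   // assumed to exist

__global__ void parentKernel( float *a, int n )
{
    if ( threadIdx.x == 0 ) {
        cudaStream_t s1, s2;
        cudaEvent_t done;
        cudaStreamCreateWithFlags( &s1, cudaStreamNonBlocking );
        cudaStreamCreateWithFlags( &s2, cudaStreamNonBlocking );
        cudaEventCreateWithFlags( &done, cudaEventDisableTiming );

        childA<<<1, 256, 0, s1>>>( a, n );
        cudaEventRecord( done, s1 );          // signal childA's completion
        cudaStreamWaitEvent( s2, done, 0 );   // make s2 wait for childA
        childB<<<1, 256, 0, s2>>>( a, n );

        cudaDeviceSynchronize();              // wait for both children
        cudaEventDestroy( done );
        cudaStreamDestroy( s1 );
        cudaStreamDestroy( s2 );
    }
}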

7.5.4 ERROR HANDLING

Any function in the device runtime may return an error (cudaError_t). The error is recorded in a per-thread slot that can be queried by calling cudaGetLastError(). As with the host-based runtime, CUDA makes a distinction between errors that can be returned immediately (e.g., if an invalid parameter is passed to a memcpy function) and errors that must be reported asynchronously (e.g., if a launch performed an invalid memory access). If a child grid causes an error at runtime, CUDA will return an error to the host, not to the parent grid.
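A sketch of device-side error checking after a child launch; the kernel and the error handling are illustrative.

__global__ void childKernel( int *data, int n );    // assumed to exist

__global__ void parentKernel( int *data, int n )
{
    if ( threadIdx.x == 0 ) {
        childKernel<<<1, 256>>>( data, n );
        cudaError_t err = cudaGetLastError();   // per-thread error slot
        if ( err != cudaSuccess ) {
            // e.g., the pending launch pool was full or the launch
            // configuration was invalid; record or handle it here.
        }
    }
}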

7.5.5 COMPILING AND LINKING

Unlike the host runtime, the device runtime is shipped as a static library, and developers must explicitly link against it when using dynamic parallelism. On Windows, the device runtime is cudadevrt.lib; on Linux and MacOS, it is cudadevrt.a. When building with nvcc, this may be accomplished by appending -lcudadevrt to the command line.
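For example, a build line along the following lines (file names are placeholders) compiles for SM 3.5 with relocatable device code, which dynamic parallelism requires, and links the device runtime.

nvcc -arch=sm_35 -rdc=true -o myapp myapp.cu -lcudadevrt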

7.5.6 RESOURCE MANAGEMENT

Whenever a kernel launches a child grid, the child is considered a new nesting level, and the total number of levels is the nesting depth of the program. In contrast, the deepest level at which the program will explicitly synchronize on a child launch is called the synchronization depth. Typically the synchronization depth is one less than the nesting depth of the program, but if the program does not always need to call cudaDeviceSynchronize(), then it may be substantially less than the nesting depth.

The theoretical maximum nesting depth is 24, but in practice it is governed by the device limit cudaLimitDevRuntimeSyncDepth. Any launch that would result in a kernel at a deeper level than the maximum will fail. The default maximum synchronization depth is 2. The limits must be configured before the top-level kernel is launched from the host.
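A host-side sketch of raising the limit before the top-level launch; the depth of 4, the kernel name, and the launch configuration are arbitrary illustrations.

void launchWithDeeperSync( int *d_data, int n )
{
    // Must be set before the top-level kernel is launched from the host.
    cudaDeviceSetLimit( cudaLimitDevRuntimeSyncDepth, 4 );
    topLevelKernel<<<(n+255)/256, 256>>>( d_data, n );   // hypothetical kernel
}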

NOTE

Calling a device runtime function such as cudaMemcpyAsync() may invoke a kernel, increasing the nesting depth by 1.

For parent kernels that never call cudaDeviceSynchronize(), the system does not have to reserve space for the parent kernel. In this case, the memory footprint required for a program will be much less than the conservative maximum. Such a program could specify a shallower maximum synchronization depth to avoid overallocation of backing store.

Memory Footprint

The device runtime system software reserves device memory for the following purposes.

  • To track pending grid launches

  • To save parent-grid state during synchronization

  • To serve as an allocation heap for malloc() and cudaMalloc() calls from kernels

This memory is not available for use by the application, so some applications may wish to reduce the default allocations, and some applications may have to increase the default values in order to operate correctly. To change the default values, developers call cudaDeviceSetLimit(), as summarized in Table 7.3. The limit cudaLimitDevRuntimeSyncDepth is especially important, since each nesting level costs up to 150MB of device memory.

Pending Kernel Launches

When a kernel is launched, all associated configuration and parameter data is tracked until the kernel completes. This data is stored within a system-managed launch pool. The size of the launch pool is configurable by calling cudaDeviceSetLimit() from the host and specifying cudaLimitDevRuntimePendingLaunchCount.

Table 7.3 cudaDeviceSetLimit() Values

Configuration Options

Resource allocation for the device runtime system software is controlled via the cudaDeviceSetLimit() API from the host program. Limits must be set before any kernel is launched and may not be changed while the GPU is actively running programs.

Memory allocated by the device runtime must be freed by the device runtime. Also, memory is allocated by the device runtime out of a preallocated heap whose size is specified by the device limit cudaLimitMallocHeapSize. The named limits in Table 7.3 may be set.
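A host-side sketch of adjusting two of these limits before the top-level launch; the sizes shown are arbitrary illustrations.

// Reserve a 64MB device heap for malloc()/cudaMalloc() calls in kernels
// and room to track more pending child launches. Both limits must be set
// before any kernel is launched.
cudaDeviceSetLimit( cudaLimitMallocHeapSize, 64 * 1024 * 1024 );
cudaDeviceSetLimit( cudaLimitDevRuntimePendingLaunchCount, 16384 );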

7.5.7 SUMMARY

Table 7.4 summarizes the key differences and limitations between the device runtime and the host runtime. Table 7.5 lists the subset of functions that may be called from the device runtime, along with any pertinent limitations.

Table 7.4 Device Runtime Limitations

Table 7.5 CUDA Device Runtime Functions

Streaming Multiprocessors

The streaming multiprocessors (SMs) are the part of the GPU that runs our CUDA kernels. Each SM contains the following.

  • Thousands of registers that can be partitioned among threads of execution

  • Several caches:

    – Shared memory for fast data interchange between threads

    – Constant cache for fast broadcast of reads from constant memory

    – Texture cache to aggregate bandwidth from texture memory

    – L1 cache to reduce latency to local or global memory

  • Warp schedulers that can quickly switch contexts between threads and issue instructions to warps that are ready to execute

  • Execution cores for integer and floating-point operations:

    – Integer and single-precision floating point operations

    – Double-precision floating point

    – Special Function Units (SFUs) for single-precision floating-point transcendental functions

The hardware has so many registers, and can context switch between threads so efficiently, in order to maximize its throughput. The GPU is designed to have enough state to cover both execution latency and the memory latency of the hundreds of clock cycles it may take for data to arrive from device memory after a read instruction is executed.

The SMs are general-purpose processors, but they are designed very differently than the execution cores in CPUs: They target much lower clock rates; they support instruction-level parallelism, but not branch prediction or speculative execution; and they have less cache, if they have any cache at all. For suitable workloads, the sheer computing horsepower in a GPU more than makes up for these disadvantages.

The design of the SM has been evolving rapidly since the introduction of the first CUDA-capable hardware in 2006, with three major revisions, codenamed Tesla, Fermi, and Kepler. Developers can query the compute capability by calling cudaGetDeviceProperties() and examining cudaDeviceProp.major and cudaDeviceProp.minor, or by calling the driver API function cuDeviceComputeCapability(). Compute capability 1.x, 2.x, and 3.x correspond to Tesla-class, Fermi-class, and Kepler-class hardware, respectively. Table 8.1 summarizes the capabilities added in each generation of the SM hardware.
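For example, a minimal host program that performs this query with the CUDA runtime might look like the following.

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );      // query device 0
    printf( "Compute capability %d.%d\n", prop.major, prop.minor );
    return 0;
}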

Table 8.1 SM Capabilities

In Chapter 2, Figures 2.29 through 2.32 show block diagrams of different SMs. CUDA cores can execute integer and single-precision floating-point instructions; one double-precision unit implements double-precision support, if available; and Special Function Units implement reciprocal, reciprocal square root, sine/cosine, and logarithm/exponential functions. Warp schedulers dispatch instructions to these execution units as the resources needed to execute the instruction become available.

This chapter focuses on the instruction set capabilities of the SM. As such, it sometimes refers to the "SASS" instructions, the native instructions into which ptxas or the CUDA driver translate intermediate PTX code. Developers are not able to author SASS code directly; instead, NVIDIA has made these instructions visible to developers through the cuobjdump utility so they can direct optimizations of their source code by examining the compiled microcode.
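For example, assuming an executable or object file named myapp (a placeholder name), a command along these lines dumps the SASS for inspection.

cuobjdump --dump-sass myapp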
