7.2 Syntax
When using the CUDA runtime, a kernel launch is specified using the familiar triple-angle-bracket syntax.
Kernel<<<gridSize, blockSize, sharedMem, Stream>>>( Parameters... )
Kernel specifies the kernel to launch.
gridSize specifies the size of the grid in the form of a dim3 structure.
blockSize specifies the dimension of each threadblock as a dim3.
sharedMem specifies additional shared memory to reserve for each block.
Stream specifies the stream in which the kernel should be launched.
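For concreteness, here is a minimal sketch of a launch; the kernel, names, and configuration are illustrative rather than taken from the text. It scales an array by a constant using a one-dimensional grid, no dynamic shared memory, and an explicit stream.

__global__ void ScaleKernel( float *out, const float *in, float k, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n )
        out[i] = k*in[i];
}

void LaunchScale( float *dOut, const float *dIn, float k, int n, cudaStream_t stream )
{
    dim3 blockSize( 256 );                               // y and z default to 1
    dim3 gridSize( (n+blockSize.x-1) / blockSize.x );    // round up to cover n elements
    // No dynamic shared memory is needed here, so the third parameter is 0
    ScaleKernel<<<gridSize, blockSize, 0, stream>>>( dOut, dIn, k, n );
}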
The dim3 structure used to specify the grid and block sizes has 3 members and, when compiling with C++, a constructor with default parameters such that the y and z members default to 1. See Listing 7.1, which is excerpted from the NVIDIA SDK file vector_types.h.
Listing 7.1 dim3 structure.
struct __device_builtin__ dim3
{
    unsigned int x, y, z;
#if defined(__cplusplus)
    __host__ __device__ dim3(
        unsigned int vx = 1,
        unsigned int vy = 1,
        unsigned int vz = 1 ) : x(vx), y(vy), z(vz) {}
    __host__ __device__ dim3( uint3 v ) : x(v.x), y(v.y), z(v.z) {}
    __host__ __device__ operator uint3(void) {
        uint3 t;
        t.x = x;
        t.y = y;
        t.z = z;
        return t;
    }
#endif /* __cplusplus */
};
Kernels can be launched via the driver API using cuLaunchKernel(), though that function takes the grid and block dimensions as discrete parameters rather than dim3.
CUresult cuLaunchKernel(
    CUfunction kernel,
    unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ,
    unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ,
    unsigned int sharedMemBytes,
    CUstream hStream,
    void **kernelParams,
    void **extra );
As with the triple-angle-bracket syntax, the parameters to cuLaunchKernel() include the kernel to invoke, the grid and block sizes, the amount of shared memory, and the stream. The main difference is in how the parameters to the kernel itself are given: Since the kernel microcode emitted by ptxas contains metadata that describes each kernel's parameters, kernelParams is an array of void *, where each element corresponds to a kernel parameter. Since the type of each parameter is known to the driver, the correct amount of memory (4 bytes for an int, 8 bytes for a double, and so on) will be copied into the command buffer as part of the hardware-specific command used to invoke the kernel.
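As an illustration of how kernelParams works, the sketch below (hypothetical function and variable names; hKernel is assumed to have been obtained with cuModuleGetFunction()) launches a kernel whose parameters are two device pointers, a float, and an int.

#include <cuda.h>

CUresult LaunchScaleDriver( CUfunction hKernel, CUdeviceptr dOut, CUdeviceptr dIn,
                            float k, int n, CUstream hStream )
{
    unsigned int blockDim = 256;
    unsigned int gridDim = (n+blockDim-1) / blockDim;
    // Each element points at one kernel parameter; the driver determines
    // each parameter's size from the kernel's metadata.
    void *kernelParams[] = { &dOut, &dIn, &k, &n };
    return cuLaunchKernel( hKernel,
                           gridDim, 1, 1,      // grid size
                           blockDim, 1, 1,     // block size
                           0,                  // sharedMemBytes
                           hStream,
                           kernelParams,
                           NULL );             // extra
}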
7.2.1 LIMITATIONS
All C++ classes participating in a kernel launch must be "plain old data" (POD) with the following characteristics.
No user-declared constructors
No user-defined copy assignment operator
No user-defined destructor
No nonstatic data members that are not themselves PODs
No private or protected nonstatic data
No base classes
No virtual functions
Note that classes that violate these rules may still be used in CUDA, even in CUDA kernels; they simply cannot be passed as kernel launch parameters. In that case, the classes used by a CUDA kernel can be constructed inside the kernel from the POD input data given at launch, as sketched below.
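The following sketch (hypothetical names) contrasts the two cases: a POD structure passed directly as a kernel parameter, and a class with a user-declared constructor that the kernel builds from that POD input.

struct Params {                  // POD: may be passed to a kernel launch
    float scale;
    int n;
};

class Accumulator {              // user-declared constructor: not POD, so it
public:                          // cannot be a launch parameter
    __device__ Accumulator( float init ) : sum( init ) {}
    __device__ void add( float x ) { sum += x; }
    float sum;
};

__global__ void SumKernel( float *out, const float *in, Params p )
{
    Accumulator acc( 0.0f );     // constructed in the kernel from POD input
    for ( int i = threadIdx.x; i < p.n; i += blockDim.x )
        acc.add( p.scale*in[i] );
    out[threadIdx.x] = acc.sum;
}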
CUDA kernels also do not have return values. They must report their results back via device memory (which must be copied back to the CPU explicitly) or mapped host memory.
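The sketch below (hypothetical names; dResult is assumed to be zero-initialized) shows the usual pattern: the kernel accumulates its result in device memory, and the CPU retrieves it with an explicit copy.

__global__ void AccumulateKernel( float *dResult, const float *in, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n )
        atomicAdd( dResult, in[i] );   // the "return value" lives in device memory
}

cudaError_t GetSum( float *hostResult, float *dResult, const float *dIn, int n )
{
    AccumulateKernel<<<(n+255)/256, 256>>>( dResult, dIn, n );
    // The synchronous memcpy waits for the kernel, then copies the result back
    return cudaMemcpy( hostResult, dResult, sizeof(float), cudaMemcpyDeviceToHost );
}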
7.2.2 CACHES AND COHERENCY
The GPU contains numerous caches to accelerate computation when reuse occurs. The constant cache is optimized for broadcast to the execution units within an SM; the texture cache reduces external bandwidth usage. Neither of these caches is kept coherent with respect to writes to memory by the GPU. For example, there is no protocol to enforce coherency between these caches and the L1 or L2 caches that serve to reduce latency and aggregate bandwidth to global memory. That means two things.
When a kernel is running, it must take care not to write memory that it (or a concurrently running kernel) also is accessing via constant or texture memory.
The CUDA driver must invalidate the constant cache and texture cache before each kernel launch.
For kernels that do not contain TEX instructions, there is no need for the CUDA driver to invalidate the texture cache; as a result, kernels that do not use texture incur less driver overhead.
7.2.3 ASYNCHRONY AND ERROR HANDLING
Kernel launches are asynchronous: As soon as a kernel is submitted to the hardware, it begins executing in parallel with the CPU. This asynchrony complicates error handling. If a kernel encounters an error (for example, if it reads an invalid memory location), the error is reported to the driver (and the application) sometime after the kernel launch. The surest way to check for such errors is to synchronize with the GPU using cudaDeviceSynchronize() or cuCtxSynchronize(). If an error in kernel execution has occurred, the error code "unspecified launch failure" is returned.
Besides explicit CPU/GPU synchronization calls such as cudaDeviceSynchronize() or cuCtxSynchronize(), this error code may be returned by functions that implicitly synchronize with the CPU, such as synchronous memcpy calls.
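For example, a minimal sketch of this pattern with the CUDA runtime (the kernel and names are hypothetical):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void MyKernel( float *p, int n );   // hypothetical kernel, defined elsewhere

void CheckedLaunch( float *dPtr, int n )
{
    MyKernel<<<(n+255)/256, 256>>>( dPtr, n );
    // Synchronizing forces any execution error to be reported here
    cudaError_t status = cudaDeviceSynchronize();
    if ( status != cudaSuccess )
        fprintf( stderr, "Kernel failed: %s\n", cudaGetErrorString( status ) );
}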
Invalid Kernel Launches
It is possible to request a kernel launch that the hardware cannot perform—for example, by specifying more threads per block than the hardware supports. When possible, the driver detects these cases and reports an error rather than trying to submit the launch to the hardware.
The CUDA runtime and the driver API handle this case differently. When an invalid parameter is specified, the driver API's explicit API calls, such as cuLaunchGrid() and cuLaunchKernel(), return error codes. But when using the CUDA runtime, since kernels are launched in-line with C/C++ code, there is no API call to return an error code. Instead, the error is "recorded" into a thread-local slot, and applications can query the error value with cudaGetLastError(). The same error handling mechanism is used for kernel launches that are invalid for other reasons, such as a memory access violation.
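Continuing the previous sketch, an invalid launch can be detected immediately afterward; here the 2048-thread block deliberately exceeds the 1024-thread-per-block limit of the hardware described here.

MyKernel<<<1, 2048>>>( dPtr, n );      // too many threads per block
cudaError_t status = cudaGetLastError();
if ( status != cudaSuccess )           // typically cudaErrorInvalidConfiguration
    fprintf( stderr, "Launch rejected: %s\n", cudaGetErrorString( status ) );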
7.2.4 TIMEOUTS
Because the GPU is not able to context-switch in the midst of kernel execution, a long-running CUDA kernel may negatively impact the interactivity of a system that uses the GPU to interact with the user. As a result, many CUDA systems implement a "timeout" that resets the GPU if it runs too long without context switching.
On WDDM (Windows Display Driver Model), the timeout is enforced by the operating system. Microsoft has documented how this "Timeout Detection and Recovery" (TDR) works; see http://bit.ly/WPPSdQ, which includes the Registry keys that control TDR behavior. TDR can be safely disabled by using the Tesla Compute Cluster (TCC) driver, though the TCC driver is not available for all hardware.
On Linux, the NVIDIA driver enforces a default timeout of 2 seconds. No timeout is enforced on secondary GPUs that are not being used for display. Developers can query whether a runtime limit is being enforced on a given GPU by calling cuDeviceGetAttribute() with CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT, or by examining cudaDeviceProp::kernelExecTimeoutEnabled.
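For example, a short sketch that queries device 0 through the CUDA runtime:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );   // query device 0
    printf( "Kernel execution timeout enforced: %s\n",
            prop.kernelExecTimeoutEnabled ? "yes" : "no" );
    return 0;
}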
7.2.5 LOCAL MEMORY
Since local memory is per-thread, and a grid in CUDA can contain thousands of threads, the amount of local memory needed by a CUDA grid can be considerable. The developers of CUDA took pains to preallocate resources to minimize the likelihood that operations such as kernel launches would fail due to a lack of resources, but in the case of local memory, a conservative allocation simply would have consumed too much memory. As a result, kernels that use a large amount of local memory take longer to launch, and may launch synchronously, because the CUDA driver must allocate memory before performing the kernel launch. Furthermore, if the memory allocation fails, the kernel launch will fail due to a lack of resources.
By default, when the CUDA driver must allocate local memory to run a kernel, it frees the memory after the kernel has finished; this deallocation additionally makes the kernel launch synchronous. This behavior can be inhibited by specifying CU_CTX_LMEM_RESIZE_TO_MAX to cuCtxCreate() or by calling cudaSetDeviceFlags() with cudaDeviceLmemResizeToMax before the primary context is created. In that case, the increased amount of local memory available will persist after launching a kernel that required more local memory than the default.
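A sketch of both opt-in mechanisms (dev denotes a previously obtained CUdevice):

// CUDA runtime: must be called before the primary context is created
cudaSetDeviceFlags( cudaDeviceLmemResizeToMax );

// Driver API equivalent, specified at context creation
CUcontext ctx;
cuCtxCreate( &ctx, CU_CTX_LMEM_RESIZE_TO_MAX, dev );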
7.2.6 SHARED MEMORY
Shared memory is allocated when the kernel is launched, and it stays allocated for the duration of the kernel's execution. Besides static allocations that can be declared in the kernel, shared memory can be declared as an unsized extern array; in that case, the amount of shared memory to allocate for the unsized array is specified as the third parameter of the kernel launch, or the sharedMemBytes parameter to cuLaunchKernel().
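As a sketch (hypothetical kernel and names), the unsized extern array below is sized at launch time; a single block of n threads reverses n floats staged in shared memory.

__global__ void ReverseKernel( float *out, const float *in, int n )
{
    extern __shared__ float tile[];    // size comes from the launch configuration
    int i = threadIdx.x;
    if ( i < n ) tile[i] = in[i];
    __syncthreads();
    if ( i < n ) out[i] = tile[n-1-i];
}

// One block of n threads, with n floats of dynamic shared memory
ReverseKernel<<<1, n, n*sizeof(float)>>>( dOut, dIn, n );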