
3.5 Kernels (Functions)

Kernels are declared with the __global__ keyword in .cu files. When using the CUDA runtime, they can be invoked in-line with the triple-angle-bracket <<< >>> syntax. Chapter 7 gives a detailed description of how kernels can be invoked and how they execute on the GPU.
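
For example, a minimal sketch of a kernel declaration and an in-line launch might look like the following (the kernel name, element count, and block size here are hypothetical):

    __global__ void AddOne( int *p, int n )
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if ( i < n )
            p[i] += 1;
    }

    // Host code: d_p is a device pointer to n ints, allocated elsewhere
    void LaunchAddOne( int *d_p, int n )
    {
        AddOne<<<(n+255)/256, 256>>>( d_p, n );
    }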

The GPU executable code of the module comes in the form of kernels that are invoked with the language integration features of the CUDA runtime (<<< >>> syntax) or the cuLaunchKernel() function in the driver API. At the time of this writing, CUDA does not do any dynamic residency management of the executable code in CUDA modules. When a module is loaded, all of the kernels are loaded into device memory.

Once a module is loaded, kernels may be queried with cuModuleGetFunction(); the kernel's attributes can be queried with cuFuncGetAttribute(); and the kernel may be launched with cuLaunchKernel().

cuLaunchKernel() rendered a whole slew of API entry points obsolete: Functions such as cuFuncSetBlockShape() specified the block size to use the next time a given kernel was launched; functions such as cuParamSetv() specified the parameters to pass the next time a given kernel was launched; and cuLaunch(), cuLaunchGrid(), and cuLaunchGridAsync() launched a kernel using the previously set state. These APIs were inefficient because it took so many calls to set up a kernel launch and because parameters such as block size are best specified atomically with the request to launch the kernel.
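
As a sketch of the newer interface (assuming a module hModule has already been loaded and contains a kernel named "AddOne" taking a device pointer and a count; these names are hypothetical), the block size and the kernel parameters are supplied atomically with the launch request:

    #include <cuda.h>

    void LaunchAddOne( CUmodule hModule, CUdeviceptr dptr, int n )
    {
        CUfunction hKernel;
        cuModuleGetFunction( &hKernel, hModule, "AddOne" );

        void *params[] = { &dptr, &n };          // addresses of the kernel parameters
        cuLaunchKernel( hKernel,
                        (n+255)/256, 1, 1,       // grid dimensions
                        256, 1, 1,               // block dimensions
                        0,                       // dynamic shared memory (bytes)
                        NULL,                    // stream (NULL = default stream)
                        params, NULL );
    }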

The cuFuncGetAttribute() function may be used to query specific attributes of a function (a sketch of its use follows this list), such as

  • The maximum number of threads per block

  • The amount of statically allocated shared memory

  • The size of user-allocated constant memory

  • The amount of local memory used by each thread of the function

  • The number of registers used by each thread of the function

  • The virtual (PTX) and binary architecture versions for which the function was compiled
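
For instance, a few of these attributes might be queried as in the following sketch (assuming hKernel is a CUfunction obtained from cuModuleGetFunction()); each attribute is returned as an int through the first parameter:

    #include <stdio.h>
    #include <cuda.h>

    void PrintKernelAttributes( CUfunction hKernel )
    {
        int maxThreads, sharedBytes, numRegs;
        cuFuncGetAttribute( &maxThreads,  CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, hKernel );
        cuFuncGetAttribute( &sharedBytes, CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES,     hKernel );
        cuFuncGetAttribute( &numRegs,     CU_FUNC_ATTRIBUTE_NUM_REGS,              hKernel );
        printf( "%d threads/block, %d bytes static shared memory, %d registers\n",
                maxThreads, sharedBytes, numRegs );
    }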

When using the driver API, it is usually a good idea to use extern "C" to inhibit the default name-mangling behavior of C++. Otherwise, you have to specify the mangled name to cuModuleGetFunction().
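
A sketch (MyKernel is a hypothetical name): declaring the kernel with extern "C" keeps its name unmangled, so the plain string can be passed to cuModuleGetFunction():

    extern "C" __global__ void MyKernel( float *out, const float *in, int n );

    // Later, after the module has been loaded:
    //     cuModuleGetFunction( &hKernel, hModule, "MyKernel" );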

CUDA Runtime

As executables that were built with the CUDA runtime are loaded, they create global data structures in host memory that describe the CUDA resources to be allocated when a CUDA device is created. Once a CUDA device is initialized, these globals are used to create the CUDA resources all at once. Because these globals are shared process-wide by the CUDA runtime, it is not possible to incrementally load and unload CUDA modules using the CUDA runtime.

Because of the way the CUDA runtime is integrated with the C++ language, kernels and symbols should be specified by name (i.e., not with a string literal) to API functions such as cudaFuncGetAttributes() and cudaMemcpyToSymbol().
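
A sketch (g_constants and MyKernel are hypothetical names) of passing the kernel and a __constant__ variable themselves, rather than strings naming them:

    #include <cuda_runtime.h>

    __constant__ float g_constants[16];

    __global__ void MyKernel( float *out )
    {
        out[threadIdx.x] = g_constants[threadIdx.x];
    }

    void Configure( void )
    {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes( &attr, MyKernel );   // kernel passed by name, not "MyKernel"

        float hostConstants[16] = { 0 };
        cudaMemcpyToSymbol( g_constants, hostConstants, sizeof(hostConstants) );
    }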

Cache Configuration

In Fermi-class architectures, the streaming multiprocessors have L1 caches that can be split as 16K shared/48K L1 cache or 48K shared/16K L1 cache. Initially, CUDA allowed the cache configuration to be specified on a per-kernel basis, using cudaFuncSetCacheConfig() in the CUDA runtime or cuFuncSetCacheConfig() in the driver API. Later, this state was moved to be more global: cuCtxSetCacheConfig() / cudaDeviceSetCacheConfig() specifies the default cache configuration.
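
For example, the per-kernel preference and the default configuration might be requested as in this sketch (MyKernel and hKernel are hypothetical):

    #include <cuda_runtime.h>

    __global__ void MyKernel( float *p ) { p[threadIdx.x] *= 2.0f; }

    void SetCachePreferences( void )
    {
        cudaFuncSetCacheConfig( MyKernel, cudaFuncCachePreferL1 );   // per-kernel preference
        cudaDeviceSetCacheConfig( cudaFuncCachePreferShared );       // device-wide default
    }

    // Driver API equivalents, given a CUfunction hKernel:
    //     cuFuncSetCacheConfig( hKernel, CU_FUNC_CACHE_PREFER_L1 );
    //     cuCtxSetCacheConfig( CU_FUNC_CACHE_PREFER_SHARED );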