3.1 Software Layers

Figure 3.1 Software layers in CUDA.

CUDA libraries, such as cuBLAS, are built on top of the CUDA runtime or driver API. The CUDA runtime is the library targeted by CUDA's integrated C++/GPU toolchain. When the nvcc compiler splits .cu files into host and device portions, the host portion contains automatically generated calls to the CUDA runtime to facilitate operations such as the kernel launches invoked by nvcc's special triple angle bracket <<< >>> syntax.

The CUDA driver API, exported directly by CUDA's user mode driver, is the lowest-level API available to CUDA apps. The driver API calls into the user mode driver, which may in turn call the kernel mode driver to perform operations such as memory allocation. Functions in the driver API and CUDA runtime generally start with cu*() and cuda*(), respectively. Many functions, such as cuEventElapsedTime() and cudaEventElapsedTime(), are essentially identical, with the only difference being the prefix.

3.1.1 CUDA RUNTIME AND DRIVER

The CUDA runtime (often abbreviated CUDART) is the library used by the language integration features of CUDA. Each version of the CUDA toolchain has its own specific version of the CUDA runtime, and programs built with that toolchain will automatically link against the corresponding version of the runtime. A program will not run correctly unless the correct version of CUDART is available in the path.

The CUDA driver is designed to be backward compatible, supporting all programs written against its version of CUDA or older ones. It exports a low-level "driver API" (in cuda.h) that enables developers to closely manage resources and the timing of initialization. The driver version may be queried by calling cuDriverGetVersion().

CUresult CUDAAPI cuDriverGetVersion(int *driverVersion);

This function passes back a decimal value that gives the version of CUDA supported by the driver—for example, 3010 for CUDA 3.1 and 5000 for CUDA 5.0.
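For example, a minimal host program along these lines (a sketch for illustration, not from the text) decodes the value.

#include <stdio.h>
#include <cuda.h>

int main(void)
{
    int version = 0;
    // cuDriverGetVersion() may be called before cuInit().
    if (cuDriverGetVersion(&version) != CUDA_SUCCESS) {
        fprintf(stderr, "Could not query the driver version\n");
        return 1;
    }
    // The value encodes (major * 1000) + (minor * 10), e.g., 5000 for CUDA 5.0.
    printf("Driver supports CUDA %d.%d\n", version / 1000, (version % 1000) / 10);
    return 0;
}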

Table 3.1 summarizes the features that correspond to the version number passed back by cuDriverGetVersion(). For CUDA runtime applications, this information is given by the major and minor members of the cudaDeviceProp structure as described in Section 3.2.2.
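As an illustrative sketch, a CUDA runtime application can read those members with cudaGetDeviceProperties().

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    // Query device 0; the major and minor members give its compute capability.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "No CUDA device available\n");
        return 1;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}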

The CUDA runtime requires that the installed driver have a version greater than or equal to the version of CUDA supported by the runtime. If the driver version is older than the runtime version, the CUDA application will fail to initialize with the error cudaErrorInsufficientDriver (35). CUDA 5.0 introduced the device runtime, a subset of the CUDA runtime that can be invoked from CUDA kernels. A detailed description of the device runtime is given in Chapter 7.
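The runtime API exposes both versions, so an application can check for this condition itself; the following is a sketch for illustration.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);   // CUDA version supported by the installed driver
    cudaRuntimeGetVersion(&runtimeVersion); // CUDA version of the linked runtime
    if (driverVersion < runtimeVersion) {
        // The same condition causes initialization to fail with
        // cudaErrorInsufficientDriver (35).
        fprintf(stderr, "Driver (%d) is older than runtime (%d)\n",
                driverVersion, runtimeVersion);
        return 1;
    }
    printf("Driver %d supports runtime %d\n", driverVersion, runtimeVersion);
    return 0;
}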

Table 3.1. CUDA Driver Features

3.1.2 DRIVER MODELS

Other than Windows Vista and subsequent releases of Windows, all of the operating systems that CUDA runs on—Linux, MacOS, and Windows XP—access the hardware with user mode client drivers. These drivers sidestep the requirement, common to all modern operating systems, that hardware resources be manipulated by kernel code. Modern hardware such as GPUs can finesse that requirement by mapping certain hardware registers—such as the hardware register used to submit work to the hardware—into user mode. Since user mode code is not trusted by the operating system, the hardware must contain protections against rogue writes to the user mode hardware registers. The goal is to prevent user mode code from prompting the hardware to use its direct memory access (DMA) facilities to read or write memory that it should not (such as the operating system's kernel code!).

Hardware designers protect against memory corruption by introducing a level of indirection into the command stream available to user mode software so DMA operations can only be initiated on memory that previously was validated and mapped by kernel code; in turn, driver developers must carefully validate their kernel code to ensure that it only gives access to memory that should be made available. The end result is a driver that can operate at peak efficiency by submitting work to the hardware without having to incur the expense of a kernel transition.

Many operations, such as memory allocation, still require kernel mode transitions because editing the GPU's page tables can only be done in kernel mode. In this case, the user mode driver may take steps to reduce the number of kernel mode transitions—for example, the CUDA memory allocator tries to satisfy memory allocation requests out of a pool.

Unified Virtual Addressing

Unified virtual addressing, described in detail in Section 2.4.5, is available on 64-bit Linux, 64-bit XPDDM, and MacOS. On these platforms, it is made available transparently. As of this writing, UVA is not available on WDDM.

Windows Display Driver Model

For Windows Vista, Microsoft introduced a new desktop presentation model in which the screen output was composed in a back buffer and page-flipped, like a video game. The new Desktop Window Manager (DWM) made more extensive use of GPUs than Windows had previously, so Microsoft decided it would be best to revise the GPU driver model in conjunction with the presentation model. The resulting Windows Display Driver Model (WDDM) is now the default driver model on Windows Vista and subsequent versions. The term XPDDM was created to refer to the driver model used for GPUs on previous versions of Windows.1

As far as CUDA is concerned, these are the two major changes made by WDDM.

  1. WDDM does not permit hardware registers to be mapped into user mode. Hardware commands—even commands to kick off DMA operations—must be invoked by kernel code. The user→kernel transition is too expensive for the user mode driver to submit each command as it arrives, so instead the user mode driver buffers commands for later submission.

  2. Since WDDM was built to enable many applications to use a GPU concurrently, and GPUs do not support demand paging, WDDM includes facilities to emulate paging on a "memory object" basis. For graphics applications, memory objects may be render targets, Z buffers, or textures; for CUDA, memory objects include global memory and CUDA arrays. Since the driver must set up access to CUDA arrays before each kernel invocation, CUDA arrays can be swapped by WDDM. For global memory, which resides in a linear address space (where pointers can be stored), every memory object for a given CUDA context must be resident in order for a CUDA kernel to launch.

The main effect of change 1 above is that work requested of CUDA, such as kernel launches or asynchronous memcpy operations, generally is not submitted to the hardware immediately.

The accepted idiom to force pending work to be submitted is to query the NULL stream: cudaStreamQuery(0) or cuStreamQuery(NULL). If there is no pending work, these calls return quickly. If any work is pending, it is submitted, and since the call is asynchronous, execution may return to the caller before the hardware has finished processing. On non-WDDM platforms, querying the NULL stream is always fast.
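In code, the idiom looks like this (myKernel is a hypothetical kernel, shown only for illustration).

#include <cuda_runtime.h>

__global__ void myKernel(void) { }  // hypothetical placeholder kernel

void launchAndFlush(void)
{
    myKernel<<<1, 1>>>();
    // On WDDM, querying the NULL stream forces any buffered commands to be
    // submitted to the hardware; the call returns without waiting for the
    // hardware to finish processing.
    (void) cudaStreamQuery(0);
}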

The main effect of change 2 above is that CUDA's control of memory allocation is much less concrete. On user mode client drivers, a successful memory allocation means that the memory has been allocated and is no longer available to any other operating system client (such as a game or another CUDA application that may be running). On WDDM, if there are applications competing for time on the same GPU, Windows can and will swap memory objects out in order to enable each application to run. The Windows operating system tries to make this as efficient as possible, but as with all paging, having it never happen is much faster than having it ever happen.

Timeout Detection and Recovery

Because Windows uses the GPU to interact with users, it is important that compute applications not take inordinate amounts of GPU time. Under WDDM, Windows enforces a timeout (2 seconds by default) that, if it elapses, causes a dialog box that says "Display driver stopped responding and has recovered," and the display driver is restarted. If this happens, all work in the CUDA context is lost. See http://bit.ly/16mG0dX.

Tesla Compute Cluster Driver

For compute applications that do not need WDDM, NVIDIA provides the Tesla Compute Cluster (TCC) driver, available only for Tesla-class boards. The TCC driver is a user mode client driver, so it does not require a kernel thunk to submit work to the hardware. The TCC driver may be enabled and disabled using the nvidia-smi tool.

3.1.3 NVCC, PTX, AND MICROCODE

nvcc is the compiler driver used by CUDA developers to turn source code into functional CUDA applications. It can perform many functions, from as complex as compiling, linking, and executing a sample program in one command (a usage encouraged by many of the sample programs in this book) to a simple targeted compilation of a GPU-only .cu file.

Figure 3.2 shows the two recommended workflows for using nvcc for CUDA runtime and driver API applications, respectively. For applications larger than the most trivial size, nvcc is best used strictly for purposes of compiling CUDA code and wrapping CUDA functionality into code that is callable from other tools. This is due to nvcc's limitations.

  • nvcc only works with a specific set of compilers. Many CUDA developers never notice because their compiler of choice happens to be in the set of supported compilers. But in production software development, the amount of CUDA code tends to be minuscule compared to the amount of other code, and the presence or absence of CUDA support may not be the dominant factor in deciding which compiler to use.

  • nvcc makes changes to the compile environment that may not be compatible with the build environment for the bulk of the application.

  • nvcc "pollutes" the namespace with nonstandard built-in types (e.g., int2) and intrinsic names (e.g., _popc()). Only in recent versions of CUDA have the intrinsics symbols become optional and can be used by including the appropriate sm*intrinsics.h header.

For CUDA runtime applications, nvcc embeds GPU code into string literals in the output executable. If the --fatbin option is specified, the executable will automatically load suitable microcode for the target GPU or, if no microcode is available, have the driver automatically compile the PTX into microcode.


Figure 3.2 nvcc workflows.

nvcc and PTX

PTX ("Parallel Thread eExecution") is the intermediate representation of compiled GPU code that can be compiled into native GPU microcode. It is the mechanism that enables CUDA applications to be "future-proof" against instruction set innovations by NVIDIA—as long as the PTX for a given CUDA kernel is available, the CUDA driver can translate it into microcode for whichever GPU the application happens to be running on (even if the GPU was not available when the code was written).

PTX can be compiled into GPU microcode both "offline" and "online." Offline compilation refers to building software that will be executed by some computer in the future. For example, Figure 3.2 highlights the offline portions of the CUDA compilation process. Online compilation, otherwise known as "just-in-time" compilation, refers to compiling intermediate code (such as PTX) for immediate execution on the computer running the application.

nvcc can compile PTX offline by invoking the PTX assembler ptxas, which compiles PTX into the native microcode for a specific version of GPU. The resulting microcode is emitted into a CUDA binary called a "cubin" (pronounced like "Cuban"). Cubin files can be disassembled with cuobjdump --dump-sass; this will dump the SASS mnemonics for the GPU-specific microcode.2

PTX also can be compiled online (JITted) by the CUDA driver. Online compilation happens automatically when running CUDA applications that were built with the --fatbin option (which is the default). .cubin and PTX representations of every kernel are included in the executable, and if it is run on hardware that doesn't support any of the .cubin representations, the driver compiles the PTX version. The driver caches these compiled kernels on disk, since compiling PTX can be time consuming.

Finally, PTX can be generated at runtime and compiled explicitly by the driver by calling cuModuleLoadDataEx(). The driver API does not automate any of the embedding or loading of GPU microcode. Both cubin and PTX images can be given to cuModuleLoadDataEx(); if a cubin is not suitable for the target GPU architecture, an error is returned. A reasonable strategy for driver API developers is to compile and embed PTX, then always JIT-compile it for the installed GPU with cuModuleLoadDataEx(), relying on the driver to cache the compiled microcode.
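A minimal sketch of that strategy follows; it assumes cuInit() has been called, a context is current on the calling thread, and the PTX image defines a hypothetical kernel named myKernel.

#include <cuda.h>

// Load a PTX image and look up one of its kernels.
CUresult loadPtxKernel(CUmodule *module, CUfunction *kernel, const void *ptxImage)
{
    // With no options specified, the driver JIT-compiles the PTX for the
    // current GPU (and caches the resulting microcode on disk).
    CUresult status = cuModuleLoadDataEx(module, ptxImage, 0, NULL, NULL);
    if (status != CUDA_SUCCESS)
        return status;
    // "myKernel" is a hypothetical entry point defined in the PTX image.
    return cuModuleGetFunction(kernel, *module, "myKernel");
}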
