3.11 The CUDA Runtime and CUDA Driver API
The CUDA runtime ("CUDART") facilitates the language integration that makes CUDA so easy to program out of the gate. By automatically taking care of tasks such as initializing contexts and loading modules, and especially by enabling kernel invocations to be written in-line with other C++ code, the CUDA runtime lets developers focus on getting their code working quickly. A handful of CUDA abstractions, such as CUDA modules, are not accessible via the CUDA runtime.
In contrast, the driver API exposes all CUDA abstractions and enables them to be manipulated by developers as needed for the application. The driver API does not provide any performance benefit. Instead, it enables explicit resource management for applications that need it, like large-scale commercial applications with plug-in architectures.
The driver API is not noticeably faster than the CUDA runtime. If you are looking to improve performance in your CUDA application, look elsewhere.
Most CUDA features are available to both the CUDA runtime and the driver API, but a few are exclusive to one or the other. Table 3.9 summarizes the differences.
Table 3.9 CUDA Runtime versus Driver API Features
Between the two APIs, operations like memcpy tend to be functionally identical, but the interfaces can be quite different. The stream APIs are almost identical.
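For example, creating a stream differs only in that the driver API function takes an extra flags word (a minimal sketch; error checking omitted):
cudaStream_t stream;
cudaError_t status = cudaStreamCreate( &stream );

CUstream hStream;
CUresult result = cuStreamCreate( &hStream, 0 );  // 0 requests default behavior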
The event APIs have minor differences, with CUDART providing a separate cudaEventCreateWithFlags() function if the developer wants to specify a flags word (needed to create a blocking event).
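Here is a minimal sketch of creating a blocking event in each API (error checking omitted):
cudaEvent_t event;
cudaError_t status = cudaEventCreateWithFlags( &event, cudaEventBlockingSync );

CUevent hEvent;
CUresult result = cuEventCreate( &hEvent, CU_EVENT_BLOCKING_SYNC );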
The memcpy functions are the family where the interfaces differ the most, despite identical underlying functionality. CUDA supports three kinds of memory (host, device, and CUDA array), all permutations of participating memory types, and 1D, 2D, or 3D memcpys. So the memcpy interfaces must comprise either a large family of specialized functions or a small number of functions that support many types of memcpy.
The simplest memcpys in CUDA copy between host and device memory, but even those function interfaces are different: CUDART uses void * for both host and device pointers and a single memcpy function with a direction parameter, while the driver API uses void * for host memory, CUdeviceptr for device memory, and three separate functions (cuMemcpyHtoD(), cuMemcpyDtoH(), and cuMemcpyDtoD()) for the different memcpy directions. Here are equivalent CUDA runtime and driver API formulations of the three permutations of host↔device memcpy.
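A minimal sketch (the variable names are hypothetical and error checking is omitted; dst and src are host pointers, devDst and devSrc are device allocations, typed void * for the runtime and CUdeviceptr for the driver API):
// CUDA runtime: a single function; the last parameter gives the direction
cudaMemcpy( devDst, src, bytes, cudaMemcpyHostToDevice );
cudaMemcpy( dst, devSrc, bytes, cudaMemcpyDeviceToHost );
cudaMemcpy( devDst, devSrc, bytes, cudaMemcpyDeviceToDevice );

// Driver API: a separate function per direction; device memory is CUdeviceptr
cuMemcpyHtoD( devDst, src, bytes );
cuMemcpyDtoH( dst, devSrc, bytes );
cuMemcpyDtoD( devDst, devSrc, bytes );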
For 2D and 3D memcpy's, the driver API implements a handful of functions that take a descriptor struct and support all permutations of memcpy, including lower-dimension memcpy's. For example, if desired, cuMemcpy3D() can be used to perform a 1D host→device memcpy instead of cuMemcpyHtoD().
CUDA_MEMCPY3D cp = {0};                    // zero-initialize; unused members stay 0
cp.dstMemoryType = CU_MEMORYTYPE_DEVICE;   // destination is device memory
cp.dstDevice = dptr;
cp.srcMemoryType = CU_MEMORYTYPE_HOST;     // source is host memory
cp.srcHost = host;
cp.WidthInBytes = bytes;                   // 1D copy: width in bytes...
cp.Height = cp.Depth = 1;                  // ...with height and depth of 1
status = cuMemcpy3D(&cp);
CUDART uses descriptor structures for the more complicated memcpys (e.g., cudaMemcpy3D()), while using different functions to cover the different memory types. Like cuMemcpy3D(), CUDART's cudaMemcpy3D() function takes a descriptor struct that can describe any permutation of memcpy, including inter-dimensional memcpys (e.g., performing a 1D copy to or from the row of a 2D CUDA array, or copying 2D CUDA arrays to or from slices of 3D CUDA arrays). Its descriptor struct is slightly different in that it embeds other structures; the two APIs' 3D memcpy structures are compared side by side in Table 3.10.
Usage of both 3D memcpy functions is similar. They are designed to be zero-initialized, and developers set only the members needed for a given operation. For example, a host→3D CUDA array copy may be performed as follows.
struct cudaMemcpy3DParms cp = {0};   // zero-initialize; offsets default to 0
cp.srcPtr.ptr = host;
cp.srcPtr.pitch = pitch;
cp.dstArray = hArray;
cp.extent.width = Width;
cp.extent.height = Height;
cp.extent.depth = Depth;
cp.kind = cudaMemcpyHostToDevice;
status = cudaMemcpy3D(&cp);
For a 3D copy that covers the entire CUDA array, the source and destination offsets are set to 0 by the first line and don't have to be referenced again. Unlike parameters to a function, the code need only reference the members needed by the copy, and if the program must perform more than one similar copy (e.g., to populate more than one CUDA array or device memory region), the descriptor struct can be reused.
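As a brief sketch of such reuse (hArray2 is a hypothetical second CUDA array with the same extent), only the members that change need to be updated before the descriptor is submitted again:
cp.dstArray = hArray2;    // everything else in cp still describes the copy
status = cudaMemcpy3D(&cp);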
Table 3.10 3D Memcpy Structures
struct cudaMemcpy3DParms
{
    struct cudaArray *srcArray;
    struct cudaPos srcPos;
    struct cudaPitchedPtr srcPtr;
    struct cudaArray *dstArray;
    struct cudaPos dstPos;
    struct cudaPitchedPtr dstPtr;
    struct cudaExtent extent;
    enum cudaMemcpyKind kind;
};
struct cudaPos
{
    size_t x;
    size_t y;
    size_t z;
};
struct cudaPitchedPtr
{
    void *ptr;
    size_t pitch;
    size_t xsize;
    size_t ysize;
};
Software Environment
This chapter gives an overview of the CUDA development tools and the software environments that can host CUDA applications. Sections are devoted to the various tools in the NVIDIA toolkit.
nvcc: the CUDA compiler driver
ptxas: the PTX assembler
cuobjdump: the CUDA object file dump utility
nvidia-smi: the NVIDIA System Management Interface
Section 4.5 describes Amazon's EC2 (Elastic Compute Cloud) service and how to use it to access GPU-capable servers over the Internet. This chapter is intended more as a reference than as a tutorial. Example usages are given in Part III of this book.