
5.7 Memory Copy

CUDA has three different memory types—host memory, device memory, and CUDA arrays—and a full complement of functions to copy between them. For host↔device memcpy, an additional set of functions provides asynchronous memcpy between pinned host memory and device memory or CUDA arrays. Additionally, a set of peer-to-peer memcpy functions enables memory to be copied between GPUs.

The CUDA runtime and the driver API take very different approaches. For 1D memcpy, the driver API defines a family of functions with strongly typed parameters: the host-to-device, device-to-host, and device-to-device memcpy functions are separate.

CUresult cuMemcpyHtoD(CUdeviceptr dstDevice, const void *srcHost, size_t ByteCount);
CUresult cuMemcpyDtoH(void *dstHost, CUdeviceptr srcDevice, size_t ByteCount);
CUresult cuMemcpyDtoD(CUdeviceptr dstDevice, CUdeviceptr srcDevice, size_t ByteCount);

In contrast, the CUDA runtime tends to define functions that take an extra "memcpy kind" parameter whose value depends on the memory types of the source and destination pointers.

enum cudaMemcpyKind
{
    cudaMemcpyHostToHost = 0,
    cudaMemcpyHostToDevice = 1,
    cudaMemcpyDeviceToHost = 2,
    cudaMemcpyDeviceToDevice = 3,
    cudaMemcpyDefault = 4
};
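
To make the contrast concrete, here is a minimal sketch (not from the text) that round-trips a small buffer through device memory with the CUDA runtime, naming the direction of each copy explicitly with cudaMemcpyKind values. The buffer size is an arbitrary choice, and error checking is omitted.

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    const size_t N = 1024;
    float hostIn[1024], hostOut[1024];
    float *devPtr;

    for ( size_t i = 0; i < N; i++ )
        hostIn[i] = (float) i;
    cudaMalloc( (void **) &devPtr, N*sizeof(float) );

    // The "memcpy kind" parameter names the direction of each copy.
    cudaMemcpy( devPtr, hostIn, N*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( hostOut, devPtr, N*sizeof(float), cudaMemcpyDeviceToHost );

    printf( "hostOut[10] = %f\n", hostOut[10] );
    cudaFree( devPtr );
    return 0;
}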

For more complex memcpy operations, both APIs use descriptor structures to specify the memcpy.

5.7.1 SYNCHRONOUS VERSUS ASYNCHRONOUS MEMCPY

Because most aspects of memcpy (dimensionality, memory type) are independent of whether the memory copy is asynchronous, this section examines the difference in detail, and later sections include minimal coverage of asynchronous memcpy.

By default, any memcpy involving host memory is synchronous: The function does not return until after the operation has been performed.[20] Even when operating on pinned memory, such as memory allocated with cudaMallocHost(), synchronous memcpy routines must wait until the operation is completed because the application may rely on that behavior.[21]

When possible, synchronous memcpy should be avoided for performance reasons. Even when streams are not being used, keeping all operations asynchronous improves performance by enabling the CPU and GPU to run concurrently. If nothing else, the CPU can set up more GPU operations such as kernel launches and other memcpy while the GPU is running! If CPU/GPU concurrency is the only goal, there is no need to create any CUDA streams; calling an asynchronous memcpy with the NULL stream will suffice.

While memcpys involving host memory are synchronous by default, any memory copy not involving host memory (device↔device or device↔array) is asynchronous. The GPU hardware internally enforces serialization on these operations, so there is no need for the functions to wait until the GPU has finished before returning.

Asynchronous memcpy functions have the suffix Async(). For example, the driver API function for asynchronous host→device memcpy is cuMemcpyHtoDAsync(), and the CUDA runtime function is cudaMemcpyAsync().
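
As a minimal sketch (not from the text) of the NULL-stream pattern described above, the following assumes a pinned buffer allocated with cudaMallocHost(); the copy is issued asynchronously, the CPU is free to do other work, and cudaDeviceSynchronize() waits for the GPU to finish. Error checking is omitted.

#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;       // arbitrary transfer size
    void *pinned, *device;
    cudaMallocHost( &pinned, bytes );   // pinned (page-locked) host memory
    cudaMalloc( &device, bytes );

    // Asynchronous copy into the NULL stream (stream 0).
    cudaMemcpyAsync( device, pinned, bytes, cudaMemcpyHostToDevice, 0 );

    // ... CPU work here can overlap with the transfer ...

    cudaDeviceSynchronize();            // wait for the GPU to finish
    cudaFree( device );
    cudaFreeHost( pinned );
    return 0;
}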

The hardware that implements asynchronous memcpy has evolved over time. The very first CUDA-capable GPU (the GeForce 8800 GTX) did not have any copy engines, so asynchronous memcpy only enabled CPU/GPU concurrency. Later GPUs added copy engines that could perform 1D transfers while the SMs were running, and still later, fully capable copy engines were added that could accelerate 2D and 3D transfers, even if the copy involved converting between pitch layouts and the block-linear layouts used by CUDA arrays. Additionally, early CUDA hardware only had one copy engine, whereas today, it sometimes has two. More than two copy engines wouldn't necessarily make sense: because a single copy engine can saturate the PCI Express bus in one direction, only two copy engines are needed to maximize both bus performance and concurrency between bus transfers and GPU computation.

The number of copy engines can be queried by calling cuDeviceGetAttribute() with CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, or by examining cudaDeviceProp::asyncEngineCount.
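
A short sketch (not from the text) that queries the copy engine count both ways; device ordinal 0 is an arbitrary choice, and error checking is omitted.

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    // CUDA runtime: read cudaDeviceProp::asyncEngineCount.
    struct cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );
    printf( "asyncEngineCount (runtime): %d\n", prop.asyncEngineCount );

    // Driver API: query CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT.
    CUdevice dev;
    int count;
    cuInit( 0 );
    cuDeviceGet( &dev, 0 );
    cuDeviceGetAttribute( &count, CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev );
    printf( "asyncEngineCount (driver): %d\n", count );
    return 0;
}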

5.7.2 UNIFIED VIRTUAL ADDRESSING

Unified Virtual Addressing enables CUDA to make inferences about memory types based on address ranges. Because CUDA tracks which address ranges contain device addresses and which contain host addresses, the direction of a copy need not be specified explicitly: passing cudaMemcpyDefault as the cudaMemcpyKind parameter of cudaMemcpy() causes the direction to be inferred from the pointer values. The driver API added a cuMemcpy() function that similarly infers the memory types from the addresses.

CUresult cuMemcpy(CUdeviceptr dst, CUdeviceptr src, size_t ByteCount);

The CUDA runtime equivalent, not surprisingly, is called cudaMemcpy().

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);
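
A minimal sketch (not from the text) of UVA-based inference with the CUDA runtime: both copies pass cudaMemcpyDefault, and the direction is inferred from the addresses. The buffer size is arbitrary, and error checking is omitted.

#include <cuda_runtime.h>
#include <string.h>

int main()
{
    const size_t bytes = 4096;
    char host[4096];
    void *device;

    memset( host, 0xA5, bytes );
    cudaMalloc( &device, bytes );

    // Neither call names a direction; CUDA infers it from the addresses.
    cudaMemcpy( device, host, bytes, cudaMemcpyDefault );  // host -> device
    cudaMemcpy( host, device, bytes, cudaMemcpyDefault );  // device -> host

    cudaFree( device );
    return 0;
}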

5.7.3 CUDA RUNTIME

Table 5.10 summarizes the memcpy functions available in the CUDA runtime.

Table 5.10 Memcpy Functions (CUDA Runtime)

Table 5.11 cudaMemcpy3DParms Structure Members

1D and 2D memcpy functions take base pointers, pitches, and sizes as required. The 3D memcpy routines take a descriptor structure, cudaMemcpy3DParms, defined as follows.

struct cudaMemcpy3DParms
{
    struct cudaArray *srcArray;
    struct cudaPos srcPos;
    struct cudaPitchedPtr srcPtr;
    struct cudaArray *dstArray;
    struct cudaPos dstPos;
    struct cudaPitchedPtr dstPtr;
    struct cudaExtent extent;
    enum cudaMemcpyKind kind;
};

Table 5.11 summarizes the meaning of each member of the cudaMemcpy3DParms structure. The cudaPos and cudaExtent structures are defined as follows.

struct cudaExtent {
    size_t width;
    size_t height;
    size_t depth;
};
struct cudaPos {
    size_t x;
    size_t y;
    size_t z;
};
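
As a sketch (not from the text) of how these structures fit together, the following performs a 3D host→device copy between a linear host volume and a pitched device allocation; the 64x32x16 volume size is an arbitrary choice, and error checking is omitted. Because no CUDA array participates, the extent width is expressed in bytes.

#include <cuda_runtime.h>
#include <stdlib.h>

int main()
{
    const size_t w = 64, h = 32, d = 16;          // width/height/depth in elements
    struct cudaExtent extent = make_cudaExtent( w*sizeof(float), h, d );
    float *hostVol = (float *) malloc( w*h*d*sizeof(float) );

    struct cudaPitchedPtr devVol;
    cudaMalloc3D( &devVol, extent );              // pitched device allocation

    struct cudaMemcpy3DParms cp = {0};
    cp.srcPtr = make_cudaPitchedPtr( hostVol, w*sizeof(float), w, h );
    cp.dstPtr = devVol;
    cp.extent = extent;
    cp.kind = cudaMemcpyHostToDevice;
    cudaMemcpy3D( &cp );

    cudaFree( devVol.ptr );
    free( hostVol );
    return 0;
}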

5.7.4 DRIVER API

Table 5.12 summarizes the driver API's memcpy functions.

Table 5.12 Memcpy Functions (Driver API)


cuMemcpy3D() is designed to implement a strict superset of all previous memcpy functionality. Any 1D, 2D, or 3D memcpy may be performed between any host, device, or CUDA array memory, and any offset into either the source or destination may be applied. The WidthInBytes, Height, and Depth members of the input structure, CUDA_MEMCPY3D, define the dimensionality of the memcpy: Height==0 implies a 1D memcpy, and Depth==0 implies a 2D memcpy. The source and destination memory types are given by the srcMemoryType and dstMemoryType structure elements, respectively.

Structure elements that are not needed by cuMemcpy3D() are defined to be ignored. For example, if a 1D host→device memcpy is requested, the srcPitch, srcHeight, dstPitch, and dstHeight elements are ignored. If srcMemoryType is CU_MEMORYTYPE_HOST, the srcDevice and srcArray elements are ignored. This API semantic, coupled with the C idiom that initializing a structure with {0} zero-initializes it, enables memory copies to be described very concisely. Most other memcpy functions can be implemented in a few lines of code, such as the following.

CUresult
my_cuMemcpyHtoD(CUdeviceptr dstDevice, const void *srcHost, size_t N)
{
    CUDA_MEMCPY3D cp = {0};
    cp.srcMemoryType = CU_MEMORYTYPE_HOST;
    cp.srcHost = srcHost;
    cp.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    cp.dstDevice = dstDevice;
    cp.WidthInBytes = N;
    return cuMemcpy3D( &cp );
}
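
In the same spirit, a 2D copy between pitched allocations needs only a few more fields filled in. The following is a hypothetical helper (the name my_cuMemcpyHtoD2D is not from the text); per the semantics described above, Depth is left at zero by the {0} initialization to indicate a 2D memcpy.

CUresult
my_cuMemcpyHtoD2D(CUdeviceptr dstDevice, size_t dstPitch,
                  const void *srcHost, size_t srcPitch,
                  size_t WidthInBytes, size_t Height)
{
    CUDA_MEMCPY3D cp = {0};
    cp.srcMemoryType = CU_MEMORYTYPE_HOST;
    cp.srcHost = srcHost;
    cp.srcPitch = srcPitch;
    cp.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    cp.dstDevice = dstDevice;
    cp.dstPitch = dstPitch;
    cp.WidthInBytes = WidthInBytes;
    cp.Height = Height;
    return cuMemcpy3D( &cp );
}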


Streams and Events

CUDA is best known for enabling fine-grained concurrency, with hardware facilities that enable threads to closely collaborate within blocks using a combination of shared memory and thread synchronization. But it also has hardware and software facilities that enable more coarse-grained concurrency:

  • CPU/GPU concurrency: Since they are separate devices, the CPU and GPU can operate independently of each other.

  • Memcpy/kernel processing concurrency: For GPUs that have one or more copy engines, host↔device memcpy can be performed while the SMs are processing kernels.

  • Kernel concurrency: SM 2.x-class and later hardware can run up to 4 kernels in parallel.

  • Multi-GPU concurrency: For problems with enough computational density, multiple GPUs can operate in parallel. (Chapter 9 is dedicated to multi-GPU programming.)

CUDA streams enable these types of concurrency. Within a given stream, operations are performed in sequential order, but operations in different streams may be performed in parallel. CUDA events complement CUDA streams by providing the synchronization mechanisms needed to coordinate the parallel execution enabled by streams. CUDA events may be asynchronously "recorded" into a stream, and the CUDA event becomes signaled when the operations preceding the CUDA event have been completed.
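
As a minimal sketch (not from the text) of these mechanisms, the following issues work into two streams so that the streams may overlap, and records an event in one stream so the host can wait on just that stream's work. The kernel and sizes are placeholders, pinned memory is used so the copies can be asynchronous, and error checking is omitted.

#include <cuda_runtime.h>

__global__ void scale( float *p, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n ) p[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h0, *d0, *h1, *d1;
    cudaMallocHost( (void **) &h0, n*sizeof(float) );   // pinned host memory
    cudaMallocHost( (void **) &h1, n*sizeof(float) );
    cudaMalloc( (void **) &d0, n*sizeof(float) );
    cudaMalloc( (void **) &d1, n*sizeof(float) );

    cudaStream_t s0, s1;
    cudaStreamCreate( &s0 );
    cudaStreamCreate( &s1 );
    cudaEvent_t done0;
    cudaEventCreate( &done0 );

    // Within a stream, operations execute in order; the two streams may overlap.
    cudaMemcpyAsync( d0, h0, n*sizeof(float), cudaMemcpyHostToDevice, s0 );
    scale<<<(n+255)/256, 256, 0, s0>>>( d0, n );
    cudaMemcpyAsync( h0, d0, n*sizeof(float), cudaMemcpyDeviceToHost, s0 );
    cudaEventRecord( done0, s0 );   // signaled once preceding work in s0 completes

    cudaMemcpyAsync( d1, h1, n*sizeof(float), cudaMemcpyHostToDevice, s1 );
    scale<<<(n+255)/256, 256, 0, s1>>>( d1, n );

    cudaEventSynchronize( done0 );  // CPU waits only for stream s0's work
    cudaDeviceSynchronize();        // then drain everything else

    cudaEventDestroy( done0 );
    cudaStreamDestroy( s0 ); cudaStreamDestroy( s1 );
    cudaFree( d0 ); cudaFree( d1 );
    cudaFreeHost( h0 ); cudaFreeHost( h1 );
    return 0;
}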

CUDA events may be used for CPU/GPU synchronization, for synchronization between the engines on the GPU, and for synchronization between GPUs. They also provide a GPU-based timing mechanism that cannot be perturbed by system events such as page faults or interrupts from disk or network controllers. Wall clock timers are best for overall timing, but CUDA events are useful for optimizing kernels or figuring out which of a series of pipelined GPU operations is taking the longest. All of the performance results reported in this chapter were gathered on a cg1.4xlarge cloud-based server from Amazon's EC2 service, as described in Section 4.5.
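
As a sketch (not from the text) of event-based timing, the following measures a pinned host→device transfer with cudaEventElapsedTime(), which reports the elapsed time in milliseconds between the two recorded events. The transfer size is an arbitrary choice, and error checking is omitted.

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    const size_t bytes = 64 << 20;   // 64 MB, arbitrary
    void *host, *device;
    cudaMallocHost( &host, bytes );
    cudaMalloc( &device, bytes );

    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    cudaEventRecord( start, 0 );
    cudaMemcpyAsync( device, host, bytes, cudaMemcpyHostToDevice, 0 );
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );    // wait until the copy (and stop event) complete

    float ms;
    cudaEventElapsedTime( &ms, start, stop );
    printf( "%.2f ms (%.2f GB/s)\n", ms, bytes / ms / 1e6 );

    cudaEventDestroy( start ); cudaEventDestroy( stop );
    cudaFree( device ); cudaFreeHost( host );
    return 0;
}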
