5.1 Host Memory

In CUDA, host memory refers to memory accessible to the CPU(s) in the system. By default, this memory is pageable, meaning the operating system may move it or evict it to disk. Because the physical location of pageable memory may change without notice, it cannot be accessed by peripherals like GPUs. To enable "direct memory access" (DMA) by hardware, operating systems allow host memory to be "page-locked," and for performance reasons, CUDA includes APIs that make these operating system facilities available to application developers. So-called pinned memory that has been page-locked and mapped for direct access by CUDA GPU(s) enables

  • Faster transfer performance

  • Asynchronous memory copies (i.e., memory copies that return control to the caller before the memory copy necessarily has finished; the GPU does the copy in parallel with the CPU)

  • Mapped pinned memory that can be accessed directly by CUDA kernels

Because the virtual→physical mapping for pageable memory can change unpredictably, GPUs cannot access pageable memory at all. CUDA copies pageable memory using a pair of staging buffers of pinned memory that are allocated by the driver when a CUDA context is allocated. Chapter 6 includes hand-crafted pageable memcpy routines that use CUDA events to do the synchronization needed to manage this double-buffering.

5.1.1 ALLOCATING PINNED MEMORY

Pinned memory is allocated and freed using special functions provided by CUDA: cudaHostAlloc()/cudaFreeHost() for the CUDA runtime, and cuMemHostAlloc()/cuMemFreeHost() for the driver API. These functions work with the host operating system to allocate page-locked memory and map it for DMA by the GPU(s).

CUDA keeps track of memory it has allocated and transparently accelerates memory copies that involve host pointers allocated with cuMemHostAlloc()/cudaHostAlloc(). Additionally, some functions (notably the asynchronous memcpy functions) require pinned memory.
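As a minimal sketch (the buffer size, stream usage, and function name are illustrative, and error checking is omitted), pinned memory might be allocated with the runtime API and used for an asynchronous copy as follows.

#include <cuda_runtime.h>

// Sketch: allocate pinned host memory with the runtime API and use it for an
// asynchronous host->device copy. Sizes and the stream are illustrative.
void pinnedAsyncCopyExample( void )
{
    const size_t bytes = 1 << 20;
    void *hostPtr = NULL, *devPtr = NULL;
    cudaStream_t stream;

    cudaStreamCreate( &stream );
    cudaHostAlloc( &hostPtr, bytes, cudaHostAllocDefault );   // pinned allocation
    cudaMalloc( &devPtr, bytes );

    // Because hostPtr is pinned, this copy can proceed asynchronously.
    cudaMemcpyAsync( devPtr, hostPtr, bytes, cudaMemcpyHostToDevice, stream );
    cudaStreamSynchronize( stream );

    cudaFree( devPtr );
    cudaFreeHost( hostPtr );
    cudaStreamDestroy( stream );
}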

The bandwidthTest SDK sample enables developers to easily compare the performance of pinned memory versus normal pageable memory. The --memory=pinned option causes the test to use pinned memory instead of pageable memory. Table 5.1 lists the bandwidthTest numbers for a cg1.4xlarge instance in Amazon EC2, running Windows 7 x64 (numbers in MB/s). Because it involves a significant amount of work for the host, including a kernel transition, allocating pinned memory is expensive.

Table 5.1 Pinned versus Pageable Bandwidth

CUDA 2.2 added several features to pinned memory. Portable pinned memory can be accessed by any GPU; mapped pinned memory is mapped into the CUDA address space for direct access by CUDA kernels; and write-combined pinned memory enables faster bus transfers on some systems. CUDA 4.0 added two important features that pertain to host memory: Existing host memory ranges can be page-locked in place using host memory registration, and Unified Virtual Addressing (UVA) enables all pointers to be unique process-wide, including host and device pointers. When UVA is in effect, the system can infer from the address range whether memory is device memory or host memory.

5.1.2 PORTABLE PINNED MEMORY

By default, pinned memory allocations are accessible only to the GPU that is current when cudaHostAlloc() or cuMemHostAlloc() is called. By specifying the cudaHostAllocPortable flag to cudaHostAlloc(), or the CU_MEMHOSTALLOC_PORTABLE flag to cuMemHostAlloc(), applications can request that the pinned allocation be mapped for all GPUs instead. Portable pinned allocations benefit from the transparent acceleration of memcpy described earlier and can participate in asynchronous memcpy for any GPU in the system. For applications that intend to use multiple GPUs, it is good practice to specify all pinned allocations as portable.
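A minimal sketch of requesting a portable pinned allocation with the runtime API (the size is illustrative and error checking is omitted):

// Sketch: cudaHostAllocPortable requests a pinned allocation that is mapped
// for all GPUs in the system, not just the GPU current at allocation time.
void *p = NULL;
cudaHostAlloc( &p, 1<<20, cudaHostAllocPortable );
cudaFreeHost( p );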

NOTE

When UVA is in effect, all pinned memory allocations are portable.

5.1.3 MAPPED PINNED MEMORY

By default, pinned memory allocations are not mapped into the CUDA address space: the GPU can access them directly, but only through memcpy functions; CUDA kernels cannot read or write the host memory. On GPUs of SM 1.2 capability and higher, however, CUDA kernels are able to read and write host memory directly; the allocations just need to be mapped into the device memory address space.

To enable mapped pinned allocations, applications using the CUDA runtime must call cudaSetDeviceFlags() with the cudaDeviceMapHost flag before any initialization has been performed. Driver API applications specify the CU_CTX_MAP_HOST flag to cuCtxCreate().

Once mapped pinned memory has been enabled, it may be allocated by calling cudaHostAlloc() with the cudaHostAllocMapped flag, or cuMemHostAlloc() with the CU_MEMHOSTALLOC_DEVICEMAP flag. Unless UVA is in effect, the application then must query the device pointer corresponding to the allocation with cudaHostGetDevicePointer() or cuMemHostGetDevicePointer(). The resulting device pointer then can be passed to CUDA kernels. Best practices with mapped pinned memory are described in the section "Mapped Pinned Memory Usage."
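The sequence just described might look like the following sketch with the runtime API; the kernel name is a placeholder and error checking is omitted.

// Sketch: enable mapped pinned memory, allocate it, and query the device
// pointer to pass to a kernel. writeIndices is a placeholder kernel.
__global__ void writeIndices( int *p ) { p[threadIdx.x] = threadIdx.x; }

void mappedPinnedExample( void )
{
    int *hostPtr = NULL, *devPtr = NULL;

    cudaSetDeviceFlags( cudaDeviceMapHost );      // must precede CUDA initialization
    cudaHostAlloc( (void **) &hostPtr, 256*sizeof(int), cudaHostAllocMapped );

    // Without UVA, the device pointer must be queried explicitly.
    cudaHostGetDevicePointer( (void **) &devPtr, hostPtr, 0 );

    writeIndices<<<1,256>>>( devPtr );
    cudaDeviceSynchronize();                      // kernel writes are now visible in hostPtr

    cudaFreeHost( hostPtr );
}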

NOTE

When UVA is in effect, all pinned memory allocations are mapped.

5.1.4 WRITE-COMBINED PINNED MEMORY

Write-combined memory, also known as write-combining or Uncacheable Speculative Write Combining (USWC) memory, was created to enable the CPU to write to GPU frame buffers quickly and without polluting the CPU cache. To that end, Intel added a new page table kind that steered writes into special write-combining buffers instead of the main processor cache hierarchy. Later, Intel also added "nontemporal" store instructions (e.g., MOVNTPS and MOVNTI) that enabled applications to steer writes into the write-combining buffers on a per-instruction basis. In general, memory fence instructions (such as MFENCE) are needed to maintain coherence with WC memory. These operations are not needed for CUDA applications because they are done automatically when the CUDA driver submits work to the hardware.

For CUDA, write-combining memory can be requested by calling cudaHostAlloc() with the cudaHostAllocWriteCombined flag, or cuMemHostAlloc() with the CU_MEMHOSTALLOC_WRITECOMBINED flag. Besides setting the page table entries to bypass the CPU caches, this memory also is not snooped during PCI Express bus transfers. On systems with front side buses (pre-Opteron and pre-Nehalem), avoiding the snoops improves PCI Express transfer performance. There is little, if any, performance advantage to WC memory on NUMA systems.
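For reference, a brief sketch of requesting such an allocation with the runtime API (the size is illustrative):

// Sketch: write-combined pinned allocation -- fast for the CPU to write and
// for the GPU to read across PCI Express, but very slow for the CPU to read.
void *wc = NULL;
cudaHostAlloc( &wc, 1<<20, cudaHostAllocWriteCombined );
cudaFreeHost( wc );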

Reading WC memory with the CPU is very slow (about 6x slower), unless the reads are done with the MOVNTDQA instruction (new with SSE4). On NVIDIA's integrated GPUs, write-combined memory is as fast as the system memory carveout—system memory that was set aside at boot time for use by the GPU and is not available to the CPU.

Despite the purported benefits, as of this writing, there is little reason for CUDA developers to use write-combined memory. It's just too easy for a host memory pointer to WC memory to "leak" into some part of the application that would try to read the memory. In the absence of empirical evidence to the contrary, it should be avoided.

NOTE

When UVA is in effect, write-combined pinned allocations are not mapped into the unified address space.

5.1.5 REGISTERING PINNED MEMORY

CUDA developers don't always get the opportunity to allocate host memory they want the GPU(s) to access directly. For example, a large, extensible application may have an interface that passes pointers to CUDA-aware plugins, or the application may be using an API for some other peripheral (notably high-speed networking) that has its own dedicated allocation function for much the same reason CUDA does. To accommodate these usage scenarios, CUDA 4.0 added the ability to register pinned memory.

Pinned memory registration decouples allocation from the page-locking and mapping of host memory. It takes an already-allocated virtual address range, page-locks it, and maps it for the GPU. Just as with cudaHostAlloc(), the memory optionally may be mapped into the CUDA address space or made portable (accessible to all GPUs).

The cuMemHostRegister()/cudaHostRegister() and cuMemHostUnregister()/cudaHostUnregister() functions register and unregister host memory for access by the GPU(s), respectively. The memory range to register must be page-aligned: In other words, both the base address and the size must be evenly divisible by the page size of the operating system. Applications can allocate page-aligned address ranges in two ways.

  • Allocate the memory with operating system facilities that traffic in whole pages, such as VirtualAlloc() on Windows or valloc() or mmap() on other platforms.

  • Given an arbitrary address range (say, memory allocated with malloc() or operator new[]), clamp the address range to the next-lower page boundary and pad the size to the next page boundary, as shown in the sketch after this list.
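A minimal sketch of that second approach on a POSIX system follows; the rounding arithmetic and the registerRange() helper name are illustrative, not part of any CUDA API.

#include <cuda_runtime.h>
#include <stdint.h>
#include <unistd.h>      // sysconf

// Sketch: page-lock an arbitrary host range by clamping it to page boundaries,
// then registering it so the GPU(s) can access it.
cudaError_t registerRange( void *base, size_t size )
{
    size_t    pageSize = (size_t) sysconf( _SC_PAGESIZE );
    uintptr_t start = (uintptr_t) base & ~(uintptr_t)(pageSize - 1);                        // round down
    uintptr_t end   = ((uintptr_t) base + size + pageSize - 1) & ~(uintptr_t)(pageSize - 1); // round up
    return cudaHostRegister( (void *) start, (size_t)(end - start), cudaHostRegisterPortable );
}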

NOTE

Even when UVA is in effect, registered pinned memory that has been mapped into the CUDA address space has a different device pointer than the host pointer. Applications must call cudaHostGetDevicePointer()/cuMemHostGetDevicePointer() in order to obtain the device pointer.

5.1.6 PINNED MEMORY AND UVA

When UVA (Unified Virtual Addressing) is in effect, all pinned memory allocations are both mapped and portable. The exceptions to this rule are write-combined memory and registered memory. For those, the device pointer may differ from the host pointer, and applications still must query it with cudaHostGetDevicePointer()/cuMemHostGetDevicePointer().

UVA is supported on all 64-bit platforms except Windows Vista and Windows 7. On Windows Vista and Windows 7, only the TCC driver (which may be enabled or disabled using nvidia-smi) supports UVA. Applications can query whether UVA is in effect by calling cudaGetDeviceProperties() and examining the cudaDeviceProp::unifiedAddressing structure member, or by calling cuDeviceGetAttribute() with CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
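A brief sketch of that query with the runtime API (device 0 is assumed for illustration):

#include <cuda_runtime.h>
#include <stdio.h>

// Sketch: report whether UVA is in effect for device 0.
void reportUVA( void )
{
    cudaDeviceProp prop;
    if ( cudaGetDeviceProperties( &prop, 0 ) == cudaSuccess ) {
        printf( "UVA %s in effect on device 0\n", prop.unifiedAddressing ? "is" : "is not" );
    }
}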

5.1.7 MAPPED PINNED MEMORY USAGE

For applications whose performance is limited by PCI Express transfers, mapped pinned memory can be a boon. Because CUDA kernels can read or write host memory directly, mapped pinned memory eliminates the need for some memory copies, reducing overhead. Here are some common idioms for using mapped pinned memory.

  • Posting writes to host memory: Multi-GPU applications often must stage results back to system memory for interchange with other GPUs; writing these results via mapped pinned memory avoids an extraneous device→host memory copy. Write-only access patterns to host memory are appealing because there is no latency to cover.

  • Streaming: These workloads otherwise would use CUDA streams to coordinate concurrent memcpys to and from device memory, while kernels do their processing on device memory.

  • "Copy with panache": Some workloads benefit from performing computations as data is transferred across PCI Express. For example, the GPU may compute subarray reductions while transferring data for Scan.

Caveats

Mapped pinned memory is not a panacea. Here are some caveats to consider when using it.

  • Texturing from mapped pinned memory is possible, but very slow.

  • It is important that mapped pinned memory be accessed with coalesced memory transactions (see Section 5.2.9). The performance penalty for uncoalesced memory transactions ranges from 6× to 2×. But even on SM 2.x and later GPUs, whose caches were supposed to make coalescing an obsolete consideration, the penalty is significant.

  • Polling host memory with a kernel (e.g., for CPU/GPU synchronization) is not recommended.

  • Do not try to use atomics on mapped pinned host memory, either for the host (locked compare-exchange) or the device (atomicAdd()). On the CPU side, the facilities to enforce mutual exclusion for locked operations are not visible to peripherals on the PCI Express bus. Conversely, on the GPU side, atomic operations only work on local device memory locations because they are implemented using the GPU's local memory controller.

5.1.8 NUMA, THREAD AFFINITY, AND PINNED MEMORY

Beginning with the AMD Opteron and Intel Nehalem, CPU memory controllers were integrated directly into CPUs. Previously, the memory had been attached to the so-called "front-side bus" (FSB) of the "northbridge" of the chipset. In multi-CPU systems, the northbridge could service memory requests from any CPU, and memory access performance was reasonably uniform from one CPU to another. With the introduction of integrated memory controllers, each CPU has its own dedicated pool of "local" physical memory that is directly attached to that CPU. Although any CPU can access any other CPU's memory, "nonlocal" accesses—accesses by one CPU to memory attached to another CPU—are performed across the AMD HyperTransport (HT) or Intel QuickPath Interconnect (QPI), incurring latency penalties and bandwidth limitations. To contrast with the uniform memory access times exhibited by systems with FSBs, these system architectures are known as NUMA for nonuniform memory access.

As you can imagine, performance of multithreaded applications can be heavily dependent on whether memory references are local to the CPU that is running the current thread. For most applications, however, the higher cost of a nonlocal access is offset by the CPUs' on-board caches. Once nonlocal memory is fetched into a CPU, it remains in-cache until evicted or needed by a memory access to the same page by another CPU. In fact, it is common for NUMA systems to include a System BIOS option to "interleave" memory physically between CPUs. When this BIOS option is enabled, the memory is evenly divided between CPUs on a per-cache line (typically 64 bytes) basis, so, for example, on a 2-CPU system, about 50% of memory accesses will be nonlocal on average.

For CUDA applications, PCI Express transfer performance can be dependent on whether memory references are local. If there is more than one I/O hub (IOH) in the system, the GPU(s) attached to a given IOH have better performance and reduce demand for QPI bandwidth when the pinned memory is local. Because some high-end NUMA systems are hierarchical but don't associate the pools of memory bandwidth strictly with CPUs, NUMA APIs refer to nodes that may or may not strictly correspond with CPUs in the system.

If NUMA is enabled on the system, it is good practice to allocate host memory on the same node as a given GPU. Unfortunately, there is no official CUDA API to affiliate a GPU with a given CPU. Developers with a priori knowledge of the system design may know which node to associate with which GPU. Then platform-specific, NUMA-aware APIs may be used to perform these memory allocations, and host memory registration (see Section 5.1.5) can be used to pin those virtual allocations and map them for the GPU(s).

Listing 5.1 gives a code fragment to perform NUMA-aware allocations on Linux, and Listing 5.2 gives a code fragment to perform NUMA-aware allocations on Windows.

Listing 5.1 NUMA-aware allocation (Linux).

#include <numa.h>     // also requires <stdio.h> and <stdbool.h>

bool numNodes( int *p )
{
    if ( numa_available() >= 0 ) {
        *p = numa_max_node() + 1;
        return true;
    }
    return false;
}

void *pageAlignedNumaAlloc( size_t bytes, int node )
{
    void *ret;
    printf( "Allocating on node %d\n", node ); fflush( stdout );
    ret = numa_alloc_onnode( bytes, node );
    return ret;
}

void pageAlignedNumaFree( void *p, size_t bytes )
{
    numa_free( p, bytes );
}

Listing 5.2 NUMA-aware allocation (Windows).

#include <windows.h>  // also requires <stdio.h> and <stdbool.h>

bool numNodes( int *p )
{
    ULONG maxNode;
    if ( GetNumaHighestNodeNumber( &maxNode ) ) {
        *p = (int) maxNode + 1;
        return true;
    }
    return false;
}

void *pageAlignedNumaAlloc( size_t bytes, int node )
{
    void *ret;
    printf( "Allocating on node %d\n", node ); fflush( stdout );
    ret = VirtualAllocExNuma( GetCurrentProcess(), NULL, bytes,
                              MEM_COMMIT|MEM_RESERVE, PAGE_READWRITE, node );
    return ret;
}

void pageAlignedNumaFree( void *p )
{
    VirtualFreeEx( GetCurrentProcess(), p, 0, MEM_RELEASE );
}
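As a usage sketch only (not from the text), these helpers might be combined with host memory registration as follows; the allocPinnedOnNode() name is illustrative, the Linux signatures from Listing 5.1 are assumed, and bytes is assumed to be a multiple of the OS page size.

#include <cuda_runtime.h>

// Sketch: allocate host memory on a chosen NUMA node, then page-lock and map it
// for the GPU(s) by registering it. The node is assumed to have been selected by
// the application from a priori knowledge of the system topology.
void *allocPinnedOnNode( size_t bytes, int node )
{
    void *p = pageAlignedNumaAlloc( bytes, node );
    if ( p && cudaHostRegister( p, bytes,
                 cudaHostRegisterPortable | cudaHostRegisterMapped ) != cudaSuccess ) {
        pageAlignedNumaFree( p, bytes );
        return NULL;
    }
    return p;
}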