3.8 Host Memory
"Host" memory is CPU memory—the stuff we all were managing with malloc()/free() and new[]/delete[] for years before anyone had heard of CUDA. On all operating systems that run CUDA, host memory is virtualized; memory protections enforced by hardware are in place to protect CPU processes from reading or writing each other's memory without special provisions.[12] "Pages" of memory, usually 4K or 8K in size, can be relocated without changing their virtual address; in particular, they can be swapped to disk, effectively enabling the computer to have more virtual memory than physical memory. When a page is marked "nonresident," an attempt to access the page will signal a "page fault" to the operating system, which will prompt the operating system to find a physical page available to copy the data from disk and resume execution with the virtual page pointing to the new physical location.
The operating system component that manages virtual memory is called the "virtual memory manager" or VMM. Among other things, the VMM monitors memory activity and uses heuristics to decide when to "evict" pages to disk and resolves the page faults that happen when evicted pages are referenced.
The VMM provides services to hardware drivers to facilitate direct access of host memory by hardware. In modern computers, many peripherals, including disk controllers, network controllers, and GPUs, can read or write host memory using a facility known as "direct memory access" or DMA. DMA gives two performance benefits: it avoids a data copy, and it enables the hardware to operate concurrently with the CPU. A third benefit is that hardware may achieve better bus performance over DMA.
To facilitate DMA, operating system VMMs provide a service called "page-locking." Memory that is page-locked has been marked by the VMM as ineligible for eviction, so its physical address cannot change. Once memory is page-locked, drivers can program their DMA hardware to reference the physical addresses of the memory. This hardware setup is a separate and distinct operation from the page-locking itself. Because page-locking makes the underlying physical memory unavailable for other uses by the operating system, page-locking too much memory can adversely affect performance.
Memory that is not page-locked is known as "pageable." Memory that is page-locked is sometimes known as "pinned" memory, since its physical address cannot be changed by the operating system (it has been pinned in place).
3.8.1 PINNED HOST MEMORY
"Pinned" host memory is allocated by CUDA with the functions cuMemHostAlloc() / CUDAHostAlloc(). This memory is page-locked and set up for DMA by the current CUDA context.
CUDA tracks the memory ranges allocated in this way and automatically accelerates memcpy operations that reference pinned memory. Asynchronous memcpy operations only work on pinned memory. Applications can determine whether a given host memory address range is pinned using the cuMemHostGetFlags() function.
In the context of operating system documentation, the terms page-locked and pinned are synonymous, but for CUDA purposes, it may be easier to think of "pinned" memory as host memory that has been page-locked and mapped for access by the hardware. "Page-locking" refers only to the operating system mechanism for marking host memory pages as ineligible for eviction.
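As an illustration, here is a minimal sketch using the CUDA runtime API: it allocates pinned memory with cudaHostAlloc(), stages an asynchronous host-to-device copy out of it, and queries the allocation's flags with cudaHostGetFlags() (the runtime analog of cuMemHostGetFlags()). The buffer size, stream usage, and error-checking macro are arbitrary choices made for the example.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Abort on any CUDA error; kept deliberately simple for the sketch.
#define CHECK(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(err_)); \
        exit(1); \
    } \
} while (0)

int main(void)
{
    const size_t N = 1 << 20;
    float *hPinned = NULL;
    float *dBuf = NULL;
    cudaStream_t stream;

    CHECK(cudaStreamCreate(&stream));
    // Page-locked, DMA-mapped host allocation.
    CHECK(cudaHostAlloc((void **)&hPinned, N * sizeof(float), cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&dBuf, N * sizeof(float)));

    for (size_t i = 0; i < N; i++)
        hPinned[i] = (float)i;

    // Because hPinned is pinned, this copy can proceed asynchronously
    // with respect to the CPU.
    CHECK(cudaMemcpyAsync(dBuf, hPinned, N * sizeof(float),
                          cudaMemcpyHostToDevice, stream));
    CHECK(cudaStreamSynchronize(stream));

    // Query the flags the allocation was created with.
    unsigned int flags;
    CHECK(cudaHostGetFlags(&flags, hPinned));
    printf("host allocation flags: 0x%x\n", flags);

    CHECK(cudaFree(dBuf));
    CHECK(cudaFreeHost(hPinned));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}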
3.8.2 PORTABLE PINNED MEMORY
Portable pinned memory is mapped for all CUDA contexts after being page-locked. The underlying mechanism for this operation is complicated: When a portable pinned allocation is performed, it is mapped into all CUDA contexts before returning. Additionally, whenever a CUDA context is created, all portable pinned memory allocations are mapped into the new CUDA context before returning. For either portable memory allocation or context creation, any failure to perform these mappings will cause the allocation or context creation to fail. Happily, as of CUDA 4.0, if UVA (unified virtual addressing) is in force, all pinned allocations are portable.
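As a sketch, the portable attribute can be requested explicitly through the runtime API by passing cudaHostAllocPortable to cudaHostAlloc(); the same pinned buffer can then feed asynchronous copies on every device in the system. On UVA platforms the flag is implied, but it is harmless and documents intent. The buffer size below is arbitrary, and error checking is omitted for brevity.

#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 1 << 20;
    void *hPortable = NULL;
    int deviceCount = 0;

    // Portable pinned allocation: mapped into every CUDA context.
    cudaHostAlloc(&hPortable, bytes, cudaHostAllocPortable);
    cudaGetDeviceCount(&deviceCount);

    for (int dev = 0; dev < deviceCount; dev++) {
        void *dBuf = NULL;
        cudaSetDevice(dev);
        cudaMalloc(&dBuf, bytes);
        // The same pinned buffer is a valid source for asynchronous
        // copies on every device because the allocation is portable.
        cudaMemcpyAsync(dBuf, hPortable, bytes, cudaMemcpyHostToDevice, 0);
        cudaDeviceSynchronize();
        cudaFree(dBuf);
    }
    cudaFreeHost(hPortable);
    return 0;
}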
3.8.3 MAPPED PINNED MEMORY
Mapped pinned memory is mapped into the address space of the CUDA context, so kernels may read or write the memory. By default, pinned memory is not mapped into the CUDA address space, so it cannot be corrupted by spurious writes by a kernel. For integrated GPUs, mapped pinned memory enables "zero copy": Since the host (CPU) and device (GPU) share the same memory pool, they can exchange data without explicit copies.
For discrete GPUs, mapped pinned memory enables host memory to be read or written directly by kernels. For small amounts of data, this has the benefit of eliminating the overhead of explicit memory copy commands. Mapped pinned memory can be especially beneficial for writes, since the kernel does not have to wait for them to complete; there is no latency to cover. As of CUDA 4.0, if UVA (unified virtual addressing) is in effect, all pinned allocations are mapped.
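The sketch below shows mapped pinned memory being written directly by a kernel: zero copy on an integrated GPU, direct bus traffic on a discrete one. It assumes the runtime API; on pre-UVA platforms cudaSetDeviceFlags(cudaDeviceMapHost) must be called before the context is created, and cudaHostGetDevicePointer() returns the device alias of the host allocation. The kernel dimensions and buffer size are arbitrary.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void fill(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;  // on a discrete GPU this write travels over the bus to host memory
}

int main(void)
{
    const int n = 256;
    int *hMapped = NULL, *dAlias = NULL;

    cudaSetDeviceFlags(cudaDeviceMapHost);      // enable mapped pinned memory
    cudaHostAlloc((void **)&hMapped, n * sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dAlias, hMapped, 0);

    fill<<<(n + 127) / 128, 128>>>(dAlias, n);
    cudaDeviceSynchronize();                    // wait until the kernel's writes have landed

    printf("hMapped[100] = %d\n", hMapped[100]);
    cudaFreeHost(hMapped);
    return 0;
}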
3.8.4 HOST MEMORY REGISTRATION
Since developers (especially library developers) don't always get to allocate memory they want to access, CUDA 4.0 added the ability to "register" existing virtual address ranges for use by CUDA. The cuMemHostRegister() / cudaHostRegister() functions take a virtual address range and page-lock and map it for the current GPU (or for all GPUs, if CU_MEMHOSTREGISTER_PORTABLE or cudaHostRegisterPortable is specified). Host memory registration has a perverse relationship with UVA (unified virtual addressing), in that any address range eligible for registration must not have been included in the virtual address ranges reserved for UVA purposes when the CUDA driver was initialized.
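As a sketch of registration through the runtime API, the code below page-locks an existing page-aligned allocation with cudaHostRegister(), uses it as the source of an asynchronous copy, and then returns it to pageable status with cudaHostUnregister(). The alignment via posix_memalign() is a precaution for older CUDA versions, which required registered ranges to be page-aligned; the buffer size is arbitrary and error checking is omitted.

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t bytes = 1 << 20;
    void *hBuf = NULL;
    void *dBuf = NULL;

    // Ordinary host allocation, aligned to a 4KB page boundary.
    if (posix_memalign(&hBuf, 4096, bytes) != 0)
        return 1;
    cudaMalloc(&dBuf, bytes);

    // Page-lock and map the existing allocation for the current GPU;
    // cudaHostRegisterPortable would extend the mapping to all GPUs.
    cudaHostRegister(hBuf, bytes, cudaHostRegisterDefault);

    cudaMemcpyAsync(dBuf, hBuf, bytes, cudaMemcpyHostToDevice, 0);
    cudaDeviceSynchronize();

    cudaHostUnregister(hBuf);   // the memory becomes pageable again
    cudaFree(dBuf);
    free(hBuf);
    return 0;
}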