README

Glossary / TLA Decoder

aliasing - Creating more than one way to access the same memory. Examples: A mapped pinned buffer in CUDA is aliased by a host pointer and a device pointer; a texture reference bound to device memory aliases the device memory.
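
Mapped pinned memory creates exactly this kind of alias. A minimal sketch using the CUDA runtime (function name ours; error checking omitted, and some platforms require cudaSetDeviceFlags(cudaDeviceMapHost) before allocating):

    #include <cuda_runtime.h>

    void mapped_alias_example(void)
    {
        int *hostPtr, *devPtr;

        /* Allocate page-locked host memory that is mapped into the
           device address space. */
        cudaHostAlloc((void **)&hostPtr, 4096, cudaHostAllocMapped);

        /* Obtain the device pointer that aliases the same allocation. */
        cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);

        /* hostPtr (CPU) and devPtr (GPU) now name the same memory. */
        hostPtr[0] = 42;

        cudaFreeHost(hostPtr);
    }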

AOS - See array of structures.

API - Application programming interface.

array of structures - Memory layout in which the elements that describe an object are contiguous in memory (as if declared in a structure). Contrast with structure of arrays.
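
A sketch contrasting the two layouts (the particle fields are illustrative):

    /* Array of structures: all fields of one particle are contiguous. */
    struct ParticleAOS { float x, y, z; };
    struct ParticleAOS particlesAOS[1024];       /* x y z x y z ... */

    /* Structure of arrays: one array per field. On the GPU this layout
       often coalesces better, because adjacent threads reading x[i],
       x[i+1], ... touch adjacent memory. */
    struct ParticlesSOA {
        float x[1024];
        float y[1024];
        float z[1024];
    };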

asynchronous - Function calls that return before the requested operation has been performed. For correct results, CUDA applications that use asynchronous operations must subsequently synchronize the CPU with the GPU using CUDA streams or events.
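
A minimal sketch, assuming dst is page-locked host memory and src is device memory (function name ours; error checking omitted):

    #include <cuda_runtime.h>

    void copy_async(void *dst, const void *src, size_t numBytes)
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        /* Returns as soon as the copy is queued, not when it finishes. */
        cudaMemcpyAsync(dst, src, numBytes, cudaMemcpyDeviceToHost, stream);
        /* dst is not guaranteed to contain the data yet... */

        cudaStreamSynchronize(stream);   /* CPU/GPU synchronization */
        /* ...but after synchronizing, it is. */

        cudaStreamDestroy(stream);
    }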

computational density - Amount of computation relative to external memory traffic.

constant memory – Read-only memory, optimized for broadcast when many threads read the same memory location.

CPU - Central processing unit. The conductor of the orchestra that is a modern computer, be it x86, x86-64, or ARM.

CUDA array - 1D, 2D, or 3D array whose layout is opaque to developers. Applications can read or write CUDA arrays using memcpy functions. CUDA kernels can read CUDA arrays via texture fetch, or read or write them using surface load/store intrinsics.
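
A minimal sketch that allocates a 2D CUDA array of floats and copies host data into its opaque layout (function name ours; error checking omitted):

    #include <cuda_runtime.h>

    void fill_cuda_array(const float *hostData, size_t width, size_t height)
    {
        struct cudaChannelFormatDesc desc =
            cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
        cudaArray_t array;

        cudaMallocArray(&array, &desc, width, height, 0);

        /* The width parameters of this memcpy are in bytes. */
        cudaMemcpy2DToArray(array, 0, 0, hostData,
                            width * sizeof(float),   /* source pitch   */
                            width * sizeof(float),   /* width in bytes */
                            height, cudaMemcpyHostToDevice);

        cudaFreeArray(array);
    }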

CUDART - CUDA runtime. The "high-level" API that comes with language integration.

DDI - Device driver interface. Examples of DDIs include XPDDM and WDDM.

demand paging – A system in which the operating system can mark pages nonresident, so that the hardware signals an interrupt when an application tries to access a nonresident page. The operating system can use this facility to mark pages nonresident that have not been accessed “in a while” according to some heuristic, writing their contents to disk to free up physical memory for more active virtual pages. If an application accesses the page again, the page is reloaded “on demand” (possibly to a different physical page). To date, GPUs implement a reasonably competent virtual memory system that decouples virtual and physical addresses, but they do not implement hardware for demand paging.

device memory – Memory that is readily accessible to the GPU. CUDA arrays, global memory, constant memory, and local memory are all different ways to manipulate device memory.

DMA – Direct memory access. When peripherals read or write CPU memory asynchronously and independently of the CPU.

driver - Software that uses OS facilities to expose a peripheral's hardware capabilities.

Driver API - The "low-level" API that enables full access to CUDA's facilities.

dynamic instruction count - The number of machine instructions actually executed by a program. Contrast with static instruction count.

ECC – Error correction code. Some CUDA hardware protects the external memory interface of the GPU by setting aside 12.5% of video memory (1 bit per 8 bits of accessible memory) and using it to detect and sometimes to correct errors in the memory transactions. nvidia-smi or the NVIDIA Management Library can be used to query whether correctable (single-bit) or uncorrectable (double-bit) errors have occurred.
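
Whether ECC is enabled can also be queried from within a program; a minimal sketch using the CUDA runtime:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   /* device 0 */
        printf("ECC is %s\n", prop.ECCEnabled ? "enabled" : "disabled");
        return 0;
    }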

front side bus (FSB) - Chipset interface to memory on non-NUMA system configurations.

global memory - Device memory that is read or written by CUDA kernels using pointers.

GPU - Graphics processing unit.

GPU time – Time as measured by CUDA events, as opposed to the system timer. Such times can be used to direct optimization, but they do not give an accurate picture of overall performance. Contrast with wall clock time.
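
A minimal sketch of measuring GPU time with CUDA events (function name ours; error checking omitted):

    #include <cuda_runtime.h>

    float gpu_time_ms(void)
    {
        cudaEvent_t start, stop;
        float ms = 0.0f;

        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        /* ... launch the kernel(s) being timed here ... */
        cudaEventRecord(stop, 0);

        cudaEventSynchronize(stop);             /* wait for the GPU */
        cudaEventElapsedTime(&ms, start, stop);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;   /* GPU time only; wall clock time would include more */
    }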

HPC - High performance computing.

ILP - See instruction level parallelism.

instruction level parallelism – The fine-grained parallelism between operations during program execution.

intrinsic function - A function that directly corresponds to a low-level machine instruction.
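
For example, the __popc() intrinsic compiles to a single population-count instruction; a minimal sketch (kernel name ours):

    __global__ void count_bits(const unsigned int *in, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __popc(in[i]);   /* count of set bits in in[i] */
    }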

JIT - Just-in-time compilation. See also online compilation.

kernel mode - Privileged execution mode that can perform sensitive operations such as editing page tables.

kernel thunk – The transition from user mode to kernel mode. This operation takes several thousand clock cycles, so on operating systems that require a kernel thunk to submit commands to the hardware, drivers queue up hardware commands in user mode and submit them in batches when the thunk is performed.

lane - Thread within a warp. In a 1D threadblock, the lane ID may be computed as threadIdx.x & 31.

MMU – Memory management unit. The hardware in the CPU or GPU that translates virtual addresses to physical addresses and signals a problem when invalid addresses are specified.

node - A unit of memory bandwidth in NUMA systems. In inexpensive NUMA systems, nodes typically correspond to physical CPUs.

NUMA – Nonuniform memory access. Refers to the memory architecture of AMD Opteron or Intel Nehalem processors, where the memory controller is integrated into the CPU for lower latency and higher performance.

occupancy – The ratio of the number of warps resident in an SM to the theoretical maximum.
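
A minimal sketch, assuming a CUDA runtime recent enough (6.5 or later) to provide the occupancy API:

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void kernel(void) { }

    int main(void)
    {
        int blocksPerSM;
        struct cudaDeviceProp prop;

        cudaGetDeviceProperties(&prop, 0);
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                      kernel, 256, 0);

        /* Resident threads (hence warps) vs. the theoretical maximum. */
        printf("occupancy: %.2f\n",
               (double)(blocksPerSM * 256) /
               (double)prop.maxThreadsPerMultiProcessor);
        return 0;
    }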

online compilation - Compilation done at runtime, not when the developer builds the application.
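
A minimal sketch of online compilation, assuming a toolkit (CUDA 7.0 or later) that ships NVRTC; the function name is ours and error checking is omitted. The resulting PTX can then be JITted onto the installed GPU, e.g., with cuModuleLoadData():

    #include <nvrtc.h>
    #include <stdlib.h>

    char *compile_to_ptx(const char *source)
    {
        nvrtcProgram prog;
        size_t ptxSize;
        char *ptx;

        nvrtcCreateProgram(&prog, source, "online.cu", 0, NULL, NULL);
        nvrtcCompileProgram(prog, 0, NULL);   /* default options */

        nvrtcGetPTXSize(prog, &ptxSize);
        ptx = (char *)malloc(ptxSize);
        nvrtcGetPTX(prog, ptx);

        nvrtcDestroyProgram(&prog);
        return ptx;   /* caller frees */
    }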

opt-in – An API provision where the developer must request a behavior change at the interface level. For example, creating a blocking event is an “opt-in” because the developer must pass a special flag to the event creation APIs. Opt-ins are a way to expose new functionality without running the risk of regressions due to existing applications relying on the old behavior.

opt-out - An API provision to suppress a legacy behavior—for example, creating an event with timing disabled.
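
A minimal sketch of both provisions, using the event creation flags mentioned in the two entries above (variable names ours):

    #include <cuda_runtime.h>

    void create_events(void)
    {
        cudaEvent_t blocking, noTiming;

        /* Opt-in: request blocking synchronization, off by default. */
        cudaEventCreateWithFlags(&blocking, cudaEventBlockingSync);

        /* Opt-out: suppress the default timing behavior. */
        cudaEventCreateWithFlags(&noTiming, cudaEventDisableTiming);

        cudaEventDestroy(blocking);
        cudaEventDestroy(noTiming);
    }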

pageable memory – Memory that is eligible for eviction by the VMM. Operating system designers prefer memory to be pageable because it enables the operating system to "swap out" pages to disk and make the physical memory available for some other purpose.

page fault – The execution fault that happens when an application accesses virtual memory that is marked nonresident by the operating system. If the access was valid, the operating system updates its data structures (perhaps by pulling the page into physical memory and updating the physical address to point there) and resumes execution. If the access was not valid, the operating system signals an exception in the application.

page-locked memory – Memory that has been physically allocated and marked as nonpageable by the operating system. Usually this is to enable hardware to access the memory via DMA.
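
A minimal sketch (function name ours; error checking omitted):

    #include <cuda_runtime.h>

    void pinned_example(size_t numBytes)
    {
        void *p;

        /* Page-locked allocation: the VMM will not evict it, so the
           GPU's DMA engines can read or write it directly. */
        cudaMallocHost(&p, numBytes);

        /* ... use p as the host end of (possibly async) memcpys ... */

        cudaFreeHost(p);
    }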

PCIe - PCI Express bus, used by CUDA for data interchange between host and device memory.

pinned memory - See page-locked memory.

pitched memory allocation – An allocation in which the number of bytes per row (the pitch) is specified separately from, and may be greater than, the number of elements per row multiplied by the element size. Pitched allocations are used to satisfy alignment constraints that must hold for every row of the array.

pitch-linear layout – The memory layout used for a pitched memory allocation, specified by a “tuple” of a base address and the number of bytes per row (the “pitch”).
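
A minimal sketch of allocating and addressing a pitched allocation (names ours; error checking omitted):

    #include <cuda_runtime.h>

    __global__ void zero_2d(float *base, size_t pitch, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;

        if (x < width && y < height) {
            /* Rows are pitch bytes apart, not width*sizeof(float). */
            float *row = (float *)((char *)base + y * pitch);
            row[x] = 0.0f;
        }
    }

    void alloc_pitched(int width, int height)
    {
        float *d;
        size_t pitch;   /* bytes per row; may exceed width*sizeof(float) */

        cudaMallocPitch((void **)&d, &pitch, width * sizeof(float), height);
        cudaFree(d);
    }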

predicate - A single bit or Boolean true/false value. In C, an integer may be converted to a predicate by evaluating whether it is nonzero (true) or zero (false).

process - Unit of execution in multitasking operating systems, with its own address space and lifetime management of resources (such as file handles). When the process exits, all resources associated with it are "cleaned up" by the operating system.

PTE - Page table entry.

PTX - Parallel Thread eXecution, the intermediate assembly language and bytecode used as input to the driver's JIT process when compiling onto a specific GPU architecture.

SASS – The assembly-level, native instruction set for CUDA GPUs. The meaning of the acronym has been lost in the mists of time, but Shader ASSembly language seems like a plausible guess!

SBIOS - System BIOS ("basic input/output system"). The firmware that controls the most basic aspects of a computer system's I/O subsystem, such as whether to enable CPU or chipset features that may not be supported by certain operating systems. The SBIOS is lower-level than the operating system.

shared memory – On-chip GPU memory used by CUDA kernels as a fast "scratchpad" to hold temporary results.
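
A minimal sketch, assuming 256-thread 1D blocks (kernel name ours):

    __global__ void reverse_within_block(const float *in, float *out)
    {
        __shared__ float tile[256];   /* one element per thread */
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[i];    /* stage global data on-chip */
        __syncthreads();              /* make all writes visible   */

        /* Fast shared-memory reads replace scattered global reads. */
        out[i] = tile[blockDim.x - 1 - threadIdx.x];
    }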

SIMD – Single instruction, multiple data—a primitive for parallel programming that involves performing a uniform operation across different data in parallel. The streaming multiprocessors in CUDA hardware operate in SIMD manner across 32 threads. SSE instructions in x86 hardware operate in SIMD manner on packed data across wide registers.

SM – Streaming multiprocessor—one of the core execution units of the GPU. The number of SMs in a GPU may range from 2 to dozens. Additionally, the instruction set of a GPU may be designated with a version number—for example, "SM 2.0."

SMX - Streaming multiprocessor, as implemented in Kepler-class (SM 3.x) hardware.

SSE - Streaming SIMD extensions. An instruction set extension added to x86 in the late 1990s that could perform four single-precision floating point operations in a single instruction. Later additions have enabled SIMD operations on integers and have widened the operations from 128 bits to 256 bits.

static instruction count – The number of machine instructions in a program; the amount of memory occupied by the program's code grows with the static instruction count. Contrast with dynamic instruction count.

structure of arrays (SOA) – Memory layout that uses an array for each data element that describes an object. Contrast with array of structures (AOS).

synchronous - An adjective used to describe functions that do not return until the requested operation has completed.

TCC – Tesla Compute Cluster driver, an XPDDM-class driver that can run on Windows Vista and later. It does not get the benefits of WDDM (Desktop Window Manager acceleration, graphics interoperability, emulated paging), but it can submit commands to the hardware without performing a kernel thunk, and it can implement the 64-bit unified address space.

Thrust - C++-based productivity library for CUDA, loosely based on the STL.
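
A minimal sketch in Thrust's STL-like style (function name ours):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    void sort_on_gpu(void)
    {
        thrust::device_vector<int> v(4);
        v[0] = 3; v[1] = 1; v[2] = 4; v[3] = 1;

        thrust::sort(v.begin(), v.end());   /* executes on the device */
    }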

TLA - Three-letter acronym.

TLS - Thread local storage.

ulp - Unit of least precision (also "unit in the last place"): the least significant digit in the mantissa of a floating point value.

user mode – The unprivileged execution mode, where memory is generally pageable and hardware resources can only be accessed through APIs that interact with the operating system's kernel mode software.

UVA – Unified virtual addressing. An addressing mode in which pinned host memory and the device memories of the GPUs share a single virtual address space, enabling the runtime to infer from a pointer's value which memory it refers to.

VMM - Virtual memory manager. The part of the operating system that manages memory: allocation, page-locking, managing page faults, and so on.

wall clock time - Time as measured by reading a system clock before and after performing a set of operations. The wall clock time includes all system effects and gives the most accurate measure of overall performance. Contrast with GPU time.

warp - The basic unit of execution for streaming multiprocessors. For the first three generations of CUDA hardware, warps have had exactly 32 threads, so the warp ID in a 1D threadblock may be computed as threadIdx.x >> 5. Also see lane.
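
A minimal sketch of both computations (kernel name ours; assumes a 1D block and 32-thread warps):

    __global__ void warp_and_lane(int *warpIDs, int *laneIDs)
    {
        int warpID = threadIdx.x >> 5;   /* which warp in the block */
        int laneID = threadIdx.x & 31;   /* which lane in the warp  */

        warpIDs[threadIdx.x] = warpID;
        laneIDs[threadIdx.x] = laneID;
    }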

WDDM – Windows Display Driver Model. This driver model, new with Windows Vista, moved most of the display driver logic from kernel mode into user mode.

XPDDM – Windows XP Display Driver Model. Architecturally, this driver model actually dates back to Windows NT 4.0 (c. 1996). This acronym was invented at the same time as "WDDM" for contrast.