
3.3 Contexts

Contexts are analogous to processes on CPUs. With few exceptions, they are containers that manage the lifetimes of all other objects in CUDA, including the following.

  • All memory allocations (including linear device memory, host memory, and CUDA arrays)

  • Modules

  • CUDA streams

  • CUDA events

  • Texture and surface references

  • Device memory for kernels that use local memory

  • Internal resources for debugging, profiling, and synchronization

  • The pinned staging buffers used for pageable memcpy

The CUDA runtime does not provide direct access to CUDA contexts. It performs context creation through deferred initialization: every CUDA library call or kernel invocation checks whether a CUDA context is current and, if necessary, creates one (using the state previously set by calls such as cudaSetDevice(), cudaSetDeviceFlags(), cudaGLSetGLDevice(), etc.).

Many applications prefer to explicitly control the timing of this deferred initialization. To force the CUDA runtime to initialize without any other side effects, call

cudaFree(0);
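
The following is a minimal sketch of forcing this initialization explicitly; the choice of device 0 and the error handling are illustrative assumptions, not requirements.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Select the device whose context should be created (device 0 assumed here).
    cudaError_t status = cudaSetDevice(0);
    if (status == cudaSuccess) {
        // cudaFree(0) does nothing except trigger the deferred context creation.
        status = cudaFree(0);
    }
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA initialization failed: %s\n", cudaGetErrorString(status));
        return 1;
    }
    printf("CUDA context created for device 0\n");
    return 0;
}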

CUDA runtime applications can access the current-context stack (described below) via the driver API.

For functions that specify per-context state in the driver API, the CUDA runtime conflates contexts and devices. Instead of cuCtxSynchronize(), the CUDA runtime has cudaDeviceSynchronize(); instead of cuCtxSetCacheConfig(), the CUDA runtime has cudaDeviceSetCacheConfig().

Current Context

Instead of the current-context stack, the CUDA runtime provides the cudaSetDevice() function, which sets the current context for the calling thread. A device can be current to more than one CPU thread at a time.3

3.3.1 LIFETIME AND SCOPING

All of the resources allocated in association with a CUDA context are destroyed when the context is destroyed. With few exceptions, the resources created for a given CUDA context may not be used with any other CUDA context. This restriction applies not only to memory but also to objects such as CUDA streams and CUDA events.

3.3.2 PREALLOCATION OF RESOURCES

CUDA tries to avoid "lazy allocation," in which resources are allocated only as they are needed, so that operations do not fail for lack of resources. For example, pageable memory copies cannot fail with an out-of-memory condition because the pinned staging buffers needed to perform them are allocated at context creation time. If CUDA is not able to allocate these buffers, the context creation fails.

There are some isolated cases where CUDA does not preallocate all the resources that it might need for a given operation. The amount of memory needed to hold local memory for a kernel launch can be prohibitive, so CUDA does not preallocate the maximum theoretical amount needed. As a result, a kernel launch may fail if it needs more local memory than the default allocated by CUDA for the context.
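
If a launch does fail for lack of local memory, one knob that affects the context's local memory allocation is the per-thread stack size limit, which is backed by local memory. The sketch below (CUDA runtime API; the 64KB value is an illustrative assumption) raises that limit and reads it back.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Raise the per-thread stack size (which resides in local memory) to 64KB.
    // The value is an illustrative assumption, not a recommendation.
    cudaError_t status = cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);    // read the limit back
    printf("Per-thread stack size: %zu bytes\n", stackSize);
    return 0;
}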

3.3.3 ADDRESS SPACE

Besides objects that are automatically destroyed ("cleaned up") when the context is destroyed, the key abstraction embodied in a context is its address space: the private set of virtual memory addresses that it can use to allocate linear device memory or to map pinned host memory. These addresses are unique per context. The same address for different contexts may or may not be valid and certainly will not resolve to the same memory location unless special provisions are made. The address space of a CUDA context is separate and distinct from the CPU address space used by CUDA host code. In fact, unlike shared-memory multi-CPU systems, CUDA contexts on multi-GPU configurations do not share an address space. When UVA (unified virtual addressing) is in effect, the CPU and GPU(s) share the same address space, in that any given allocation has a unique address within the process, but the CPUs and GPUs can only read or write each other's memory under special circumstances, such as mapped pinned memory (see Section 5.1.3) or peer-to-peer memory (see Section 9.2.2).
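
As a sketch of what a "unique address within the process" means when UVA is in effect, the CUDA runtime can report which device owns a given pointer from the address alone; the code below assumes a UVA-capable platform and omits most error handling.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    void *devicePtr = NULL;
    cudaMalloc(&devicePtr, 1024);        // under UVA, this address is unique process-wide

    struct cudaPointerAttributes attr;
    cudaError_t status = cudaPointerGetAttributes(&attr, devicePtr);
    if (status == cudaSuccess) {
        // The address alone identifies the owning device.
        printf("Allocation %p belongs to device %d\n", devicePtr, attr.device);
    }

    cudaFree(devicePtr);
    return 0;
}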

3.3.4 CURRENT CONTEXT STACK

Most CUDA entry points do not take a context parameter. Instead, they operate on the "current context," which is stored in a thread-local storage (TLS) handle in the CPU thread. In the driver API, each CPU thread has a stack of current contexts; creating a context pushes the new context onto the stack.

The current-context stack has three main applications.

  • Single-threaded applications can drive multiple GPU contexts.

  • Libraries can create and manage their own CUDA contexts without interfering with their callers' CUDA contexts.

  • Libraries can be agnostic with respect to which CPU thread calls into the CUDA-aware library.

The original motivation for adding the current-context stack to CUDA was to enable a single-threaded CUDA application to drive multiple CUDA contexts. After creating and initializing each CUDA context, the application can pop it off the current-context stack, making it a "floating" context. Since only one CUDA context at a time may be current to a CPU thread, a single-threaded CUDA application drives multiple contexts by pushing and popping the contexts in turn, keeping all but one of the contexts "floating" at any given time.
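
A minimal driver API sketch of this pattern follows; it assumes two CUDA-capable devices and omits error checking and the actual work issued to each GPU.

#include <cuda.h>

void driveTwoGPUs(void)
{
    CUdevice dev0, dev1;
    CUcontext ctx0, ctx1, popped;

    cuInit(0);
    cuDeviceGet(&dev0, 0);
    cuDeviceGet(&dev1, 1);

    // cuCtxCreate() makes the new context current, so pop each one to leave it "floating."
    cuCtxCreate(&ctx0, 0, dev0);
    cuCtxPopCurrent(&popped);
    cuCtxCreate(&ctx1, 0, dev1);
    cuCtxPopCurrent(&popped);

    // Push a context only while issuing work to its GPU, then pop it again.
    cuCtxPushCurrent(ctx0);
    /* launch kernels and memcpys for device 0 here */
    cuCtxPopCurrent(&popped);

    cuCtxPushCurrent(ctx1);
    /* launch kernels and memcpys for device 1 here */
    cuCtxPopCurrent(&popped);

    cuCtxDestroy(ctx0);
    cuCtxDestroy(ctx1);
}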

On most driver architectures, pushing and popping a CUDA context is inexpensive enough that a single-threaded application can keep multiple GPUs busy. On WDDM (Windows Display Driver Model) drivers, which run only on Windows Vista and later, popping the current context is only fast if there are no GPU commands pending. If there are commands pending, the driver will incur a kernel thunk to submit the commands before popping the CUDA context.4

Another benefit of the current-context stack is the ability to drive a given CUDA context from different CPU threads. Applications using the driver API can "migrate" a CUDA context to other CPU threads by popping the context with cuCtxPopCurrent(), then calling cuCtxPushCurrent() from another thread. Libraries can use this functionality to create CUDA contexts without the knowledge or involvement of their callers. For example, a CUDA-aware plugin library could create its own CUDA context on initialization, then pop it and keep it floating except when called by the main application. The floating context enables the library to be completely agnostic about which CPU thread is used to call into it. When used in this way, the containment enforced by CUDA contexts is a mixed blessing. On the one hand, the floating context's memory cannot be polluted by spurious writes by third-party CUDA kernels, but on the other hand, the library can only operate on CUDA resources that it allocated.
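
The sketch below illustrates the plugin pattern described above, using placeholder function names (libraryInit(), libraryDoWork()) that are not part of any real API; error checking is omitted.

#include <cuda.h>

static CUcontext g_libCtx;             /* the library's private, floating context */

void libraryInit(CUdevice dev)
{
    CUcontext popped;
    cuCtxCreate(&g_libCtx, 0, dev);    /* create the library's own context... */
    cuCtxPopCurrent(&popped);          /* ...and leave it floating */
}

void libraryDoWork(void)
{
    CUcontext popped;
    cuCtxPushCurrent(g_libCtx);        /* works from whichever CPU thread calls in */
    /* operate on resources allocated in g_libCtx here */
    cuCtxPopCurrent(&popped);          /* restore the caller's current context */
}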

Attaching and Detaching Contexts

Until CUDA 4.0, every CUDA context had a "usage count" set to 1 when the context was created. The functions cuCtxAttach() and cuCtxDetach() incremented and decremented the usage count, respectively. The usage count was intended to enable libraries to "attach" to CUDA contexts created by the application into which the library was linked. This way, the application and its libraries could interoperate via a CUDA context that was created by the application.

If a CUDA context is already current to the CPU thread when the CUDA runtime is first invoked, the runtime attaches to that context instead of creating a new one. The CUDA runtime did not provide access to the usage count of a context. As of CUDA 4.0, the usage count is deprecated, and cuCtxAttach() and cuCtxDetach() have no side effects.

3.3.5 CONTEXT STATE

The cuCtxSetLimit() and cuCtxGetLimit() functions configure limits related to CPU-like functionality: in-kernel malloc() and printf(). The cuCtxSetCacheConfig() function specifies the preferred cache configuration to use when launching kernels (whether to allocate 16K or 48K to shared memory and L1 cache). This is a hint, since any kernel that uses more than 16K of shared memory requires the configuration with 48K of shared memory. Additionally, the context state can be overridden by kernel-specific state (cuFuncSetCacheConfig()). These states have context scope (in other words, they are not specified for each kernel launch) because they are expensive to change.
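
A driver API sketch of this context state follows; the limit values are illustrative assumptions, and the CUfunction handle is assumed to have been loaded from a module beforehand.

#include <cuda.h>

void configureContext(CUfunction kernel /* assumed already loaded from a module */)
{
    // In-kernel printf() FIFO and malloc() heap sizes are context-wide limits.
    cuCtxSetLimit(CU_LIMIT_PRINTF_FIFO_SIZE, 4 * 1024 * 1024);
    cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE, 16 * 1024 * 1024);

    size_t heapSize;
    cuCtxGetLimit(&heapSize, CU_LIMIT_MALLOC_HEAP_SIZE);    // read a limit back

    // Context-wide preference: favor L1 cache over shared memory.
    cuCtxSetCacheConfig(CU_FUNC_CACHE_PREFER_L1);

    // Per-kernel override: this kernel prefers the 48K shared memory configuration.
    cuFuncSetCacheConfig(kernel, CU_FUNC_CACHE_PREFER_SHARED);
}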