9.5 Single-Threaded Multi-GPU
When using the CUDA runtime, a single-threaded application can drive multiple GPUs by calling cudaSetDevice() to specify which GPU will be operated by the calling CPU thread. This idiom is used in Listing 9.1 to switch between the source and destination GPUs during the peer-to-peer memcpy, as well as in the single-threaded, multi-GPU implementation of N-body described in Section 9.5.2. In the driver API, CUDA maintains a stack of current contexts so that subroutines can easily change and restore the caller's current context.
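The runtime idiom can be sketched as follows. This is a minimal illustration, not code from the book's listings: the function name zeroOnEachGPU and the error-checking macro are hypothetical, and error handling is reduced to an early return. The key point is that after cudaSetDevice(), all subsequent runtime calls from the calling thread target the selected GPU.

```cuda
#include <cuda_runtime.h>

// Hypothetical error-checking macro for brevity
#define CHECK( call ) do { \
    cudaError_t status_ = (call); \
    if ( status_ != cudaSuccess ) return status_; \
} while ( 0 )

// Sketch: one CPU thread visits every visible GPU in turn,
// allocating and clearing a buffer on each.
cudaError_t
zeroOnEachGPU( size_t N )
{
    int deviceCount;
    CHECK( cudaGetDeviceCount( &deviceCount ) );
    for ( int dev = 0; dev < deviceCount; dev++ ) {
        void *p;
        // Subsequent runtime calls from this thread operate on 'dev'
        CHECK( cudaSetDevice( dev ) );
        CHECK( cudaMalloc( &p, N ) );
        CHECK( cudaMemset( p, 0, N ) );
        CHECK( cudaFree( p ) );
    }
    return cudaSuccess;
}
```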
9.5.1 CURRENT CONTEXT STACK
Driver API applications can manage the current context with the current-context stack: cuCtxPushCurrent() makes a new context current, pushing it onto the top of the stack, and cuCtxPopCurrent() pops the current context and restores the previous current context. Listing 9.2 gives a driver API version of chMemcpyPeerToPeer(), which uses cuCtxPopCurrent() and cuCtxPushCurrent() to perform a peer-to-peer memcpy between two contexts.
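The push/pop idiom enables a subroutine to operate on a given context without disturbing its caller. The following sketch shows the pattern in isolation; withContext() is a hypothetical helper, and it assumes the context was already created by the caller.

```cuda
#include <cuda.h>

// Sketch: temporarily make 'ctx' current for this thread, do some work
// against it, then restore whatever context was current on entry.
CUresult
withContext( CUcontext ctx )
{
    CUresult status;
    CUcontext popped;

    // Push 'ctx' onto the current-context stack, making it current
    status = cuCtxPushCurrent( ctx );
    if ( status != CUDA_SUCCESS )
        return status;

    // ... issue driver API calls against 'ctx' here ...

    // Pop 'ctx'; the caller's previous current context is restored
    status = cuCtxPopCurrent( &popped );
    return status;
}
```

Because the stack is per-thread, a subroutine written this way is safe to call regardless of which context (if any) the caller had current.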
The current-context stack was introduced in CUDA 2.2, and at the time, the CUDA runtime and driver API could not be used in the same application. That restriction has been relaxed in subsequent versions.
Listing 9.2 chMemcpyPeerToPeer (driver API version).
CUresult
chMemcpyPeerToPeer(
void *_dst, CUcontext dstContext, int dstDevice,
const void *_src, CUcontext srcContext, int srcDevice,
size_t N)
{
CUresult status;
CUdeviceptr dst = (CUdeviceptr) (intptr_t) _dst;
CUdeviceptr src = (CUdeviceptr) (intptr_t) _src;