
9.2 Peer-to-Peer

When multiple GPUs are used by a CUDA program, they are known as "peers" because the application generally treats them equally, as if they were coworkers collaborating on a project. CUDA enables two flavors of peer-to-peer: explicit memcpy and peer-to-peer addressing.

9.2.1 PEER-TO-PEER MEMCPY

Memory copies can be performed between the memories of any two different devices. When UVA (unified virtual addressing) is in effect, the ordinary family of memcpy functions can be used for peer-to-peer memcpy, since CUDA can infer which device "owns" which memory. If UVA is not in effect, the peer-to-peer memcpy must be done explicitly using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync().

NOTE

CUDA can copy memory between any two devices, not just devices that can directly address one another's memory. If necessary, CUDA will stage the memory copy through host memory, which can be accessed by any device in the system.

Peer-to-peer memcpy operations do not run concurrently with any other operation. Any pending operations on either GPU must complete before the peer-to-peer memcpy can begin, and no subsequent operations can start to execute until after the peer-to-peer memcpy is done. When possible, CUDA will use direct peer-to-peer mappings between the two pointers. The resulting copies are faster and do not have to be staged through host memory.
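As a concrete illustration, here is a minimal sketch of an explicit peer-to-peer copy from device 0 to device 1. The device ordinals, buffer size, and omission of full error checking are illustrative assumptions, not prescribed by the text.

// Sketch: explicitly copy N bytes from device 0 to device 1.
// Device ordinals and buffer size are arbitrary examples.
#include <cuda_runtime.h>

void copyBetweenDevices(void)
{
    const size_t N = 1 << 20;
    void *d0 = 0, *d1 = 0;

    cudaSetDevice(0);
    cudaMalloc(&d0, N);   // source buffer owned by device 0

    cudaSetDevice(1);
    cudaMalloc(&d1, N);   // destination buffer owned by device 1

    // Explicit peer-to-peer copy; works whether or not UVA is in effect.
    // If no direct mapping exists, CUDA stages through host memory.
    cudaMemcpyPeer(d1, 1, d0, 0, N);

    cudaSetDevice(0);
    cudaFree(d0);
    cudaSetDevice(1);
    cudaFree(d1);
}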

9.2.2 PEER-TO-PEER ADDRESSING

Peer-to-peer mappings of device memory, shown in Figure 2.20, enable a kernel running on one GPU to read or write memory that resides on another GPU. Since GPUs can read or write peer memory only at PCI Express rates, developers must partition the workload so that

  1. Each GPU has about an equal amount of work to do.

  2. The GPUs only need to interchange modest amounts of data.

Examples of such systems include a pipelined computer vision system, in which each GPU stage computes an intermediate data structure (e.g., the locations of identified features) that must be analyzed further by the next GPU in the pipeline, and a large so-called "stencil" computation, in which separate GPUs perform most of the computation independently but must exchange edge data between computation steps.

In order for peer-to-peer addressing to work, the following conditions must hold.

  • Unified virtual addressing (UVA) must be in effect.

  • Both GPUs must be SM 2.x or higher and must be based on the same chip.

  • The GPUs must be on the same I/O hub.

cudaDeviceCanAccessPeer() or cuDeviceCanAccessPeer() may be called to query whether one device can map another device's memory.

cudaError_t cudaDeviceCanAccessPeer(int *canAccessPeer, int device, int peerDevice);

CUresult cuDeviceCanAccessPeer(int *canAccessPeer, CUdevice device, CUdevice peerDevice);
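As a minimal sketch of how this query might be used in practice (the enumeration loop and output format are illustrative assumptions):

// Sketch: report which device pairs can map one another's memory.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int dev = 0; dev < deviceCount; dev++) {
        for (int peer = 0; peer < deviceCount; peer++) {
            if (dev == peer)
                continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, dev, peer);
            printf("Device %d %s map device %d's memory\n",
                   dev, canAccess ? "can" : "cannot", peer);
        }
    }
    return 0;
}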

Peer-to-peer mappings are not enabled automatically; they must be specifically requested by calling cudaDeviceEnablePeerAccess() or cuCtxEnablePeerAccess().

cudaError_t cudaDeviceEnablePeerAccess(int peerDevice, unsigned int flags);

CUresult cuCtxEnablePeerAccess(CUcontext peerContext, unsigned int Flags);

Once peer-to-peer access has been enabled, all memory in the peer device (including new allocations) is accessible to the current device until cudaDeviceDisablePeerAccess() or cuCtxDisablePeerAccess() is called.

Peer-to-peer access uses a small amount of extra memory (to hold more page tables) and makes memory allocation more expensive, since the memory must be mapped for all participating devices. Peer-to-peer functionality enables contexts to read and write memory belonging to other contexts, both via memcpy (which may be implemented by staging through system memory) and directly by having kernels read or write global memory pointers.

The cudaDeviceEnablePeerAccess() function maps the memory belonging to another device. Peer-to-peer memory addressing is asymmetric; it is possible for GPU A to map GPU B's allocations without GPU B being able to map GPU A's. In order for two GPUs to see each other's memory, each GPU must explicitly map the other's memory.

// tell device 1 to map device 0 memory
cudaSetDevice(1);
cudaDeviceEnablePeerAccess(0, 0);  // flags parameter is reserved and must be 0
// tell device 0 to map device 1 memory
cudaSetDevice(0);
cudaDeviceEnablePeerAccess(1, 0);
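Once both mappings are in place, a kernel launched on either device can dereference pointers that were allocated on the other. The following is a minimal sketch of that idea; the kernel, names, and launch configuration are illustrative assumptions, not from the text.

// Sketch: a kernel running on device 1 writes memory owned by device 0.
// Assumes the symmetric peer mappings established above and UVA in effect.
__global__ void fill(int *p, int value, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i] = value;
}

void peerWriteExample(void)
{
    const size_t n = 1 << 20;
    int *d0 = 0;

    cudaSetDevice(0);
    cudaMalloc(&d0, n * sizeof(int));   // allocation owned by device 0

    // Launch on device 1; d0 is a valid pointer here because
    // device 1 has peer access to device 0's memory.
    cudaSetDevice(1);
    unsigned int blocks = (unsigned int)((n + 255) / 256);
    fill<<<blocks, 256>>>(d0, 42, n);
    cudaDeviceSynchronize();

    cudaSetDevice(0);
    cudaFree(d0);
}

When the mappings are no longer needed, a matching cudaDeviceDisablePeerAccess() call on each device tears them down.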

NOTE

On GPU boards with PCI Express 3.0-capable bridge chips (such as the Tesla K10), the GPUs can communicate at PCI Express 3.0 speeds even if the board is plugged into a PCI Express 2.0 slot.
