9.4 Inter-GPU Synchronization
CUDA events may be used for inter-GPU synchronization via cu(da)StreamWaitEvent(). If there is a producer/consumer relationship between two GPUs, the application can have the producer GPU record an event and then have the consumer GPU insert a stream-wait on that event into its command stream. When the consumer GPU encounters the stream-wait, it stops processing commands until the producer GPU has passed the point of execution where cu(da)EventRecord() was called.
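Using the runtime API, the producer/consumer pattern might be sketched as follows. The device indices, stream and event handles, and the produceKernel()/consumeKernel() names are all illustrative placeholders, not part of any listing in this chapter:

```cuda
// Hypothetical setup: one stream per GPU, one event owned by the producer.
cudaStream_t producerStream, consumerStream;
cudaEvent_t  produced;

cudaSetDevice( producerDevice );
cudaStreamCreate( &producerStream );
cudaEventCreate( &produced );

cudaSetDevice( consumerDevice );
cudaStreamCreate( &consumerStream );

// Producer GPU: do work, then record the event in its stream.
cudaSetDevice( producerDevice );
produceKernel<<<grid, block, 0, producerStream>>>( d_buffer );
cudaEventRecord( produced, producerStream );

// Consumer GPU: insert a stream-wait on the producer's event. Commands
// issued to consumerStream after this point will not run until the
// producer GPU has passed the cudaEventRecord() call above.
cudaSetDevice( consumerDevice );
cudaStreamWaitEvent( consumerStream, produced, 0 );
consumeKernel<<<grid, block, 0, consumerStream>>>( d_buffer );
```

Note that the CPU never blocks here; both the event-record and the stream-wait are asynchronous commands queued to the GPUs.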
NOTE
In CUDA 5.0, the device runtime, described in Section 7.5, does not enable any inter-GPU synchronization whatsoever. That limitation may be relaxed in a future release.
Listing 9.1 gives chMemcpyPeerToPeer(), an implementation of peer-to-peer memcpy that uses portable memory and inter-GPU synchronization to implement the same type of memcpy that CUDA uses under the covers, if no direct mapping between the GPUs exists. The function works similarly to the chMemcpyHtoD() function in Listing 6.2 that performs host→device memcpy:
A staging buffer is allocated in host memory, and the memcpy begins by having the source GPU copy source data into the staging buffer and record an event. But unlike the host→device memcpy, there is never any need for the CPU to synchronize, because all synchronization is done by the GPUs. Because both the memcpy and the event-record are asynchronous, immediately after kicking off the initial memcpy and event-record, the CPU can request that the destination GPU wait on that event and kick off a memcpy of the same buffer. Two staging buffers and two CUDA events are needed, so the two GPUs can copy to and from staging buffers concurrently, much as the CPU and GPU concurrently operate on staging buffers during the host→device memcpy. The CPU loops over the input and output buffers, issuing memcpy and event-record commands and ping-ponging between staging buffers, until it has requested copies for all bytes and all that's left to do is wait for both GPUs to finish processing.
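The loop described above might be sketched as follows. This is a simplified sketch, not the book's listing: the names (staging[], srcStream, dstStream, evStaged[], evDrained[]) are illustrative, and for clarity it uses a pair of events per staging buffer, with an assumed extra "drained" event so the source GPU cannot overwrite a staging buffer the destination GPU is still reading:

```cuda
int stage = 0;
while ( N ) {
    size_t thisCopy = (N < stagingSize) ? N : stagingSize;

    // Source GPU: wait until the destination has drained this staging
    // buffer (a wait on a never-recorded event completes immediately),
    // then overwrite it and record "staged."
    cudaSetDevice( srcDevice );
    cudaStreamWaitEvent( srcStream, evDrained[stage], 0 );
    cudaMemcpyAsync( staging[stage], src, thisCopy,
                     cudaMemcpyDeviceToHost, srcStream );
    cudaEventRecord( evStaged[stage], srcStream );

    // Destination GPU: wait for the staged data, copy it out,
    // and record "drained" so the buffer may be reused.
    cudaSetDevice( dstDevice );
    cudaStreamWaitEvent( dstStream, evStaged[stage], 0 );
    cudaMemcpyAsync( dst, staging[stage], thisCopy,
                     cudaMemcpyHostToDevice, dstStream );
    cudaEventRecord( evDrained[stage], dstStream );

    src = (const char *) src + thisCopy;
    dst = (char *) dst + thisCopy;
    N  -= thisCopy;
    stage = 1 - stage;               // ping-pong between staging buffers
}
// All commands have been issued; wait for the destination GPU to finish.
cudaSetDevice( dstDevice );
cudaStreamSynchronize( dstStream );
```

The CPU only issues commands in the loop; the final cudaStreamSynchronize() is what makes the overall operation synchronous.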
NOTE
As with the implementations in the CUDA support provided by NVIDIA, our peer-to-peer memcpy is synchronous.
Listing 9.1 chMemcpyPeerToPeer().
cudaError_t
chMemcpyPeerToPeer( void *dst, int dstDevice, const void *src, int srcDevice, size_t N )