9.1 Overview
Systems with multiple GPUs generally contain multi-GPU boards with a PCI Express bridge chip (such as the GeForce GTX 690), multiple PCI Express slots, or both, as described in Section 2.3. The GPUs in such a system are separated from one another by PCI Express, so there is always a huge disparity between the bandwidth of the memory connected directly to a GPU (its device memory) and the bandwidth of its connections to other GPUs and to the CPU.
Many CUDA features designed to run on multiple GPUs, such as peer-to-peer addressing, require the GPUs to be identical. For applications that can make assumptions about the target hardware (such as vertical applications built for specific hardware configurations), this requirement is innocuous enough. But applications targeting systems with a variety of GPUs (say, a low-power one for everyday use and a powerful one for gaming) may have to use heuristics to decide which GPU(s) to use or load-balance the workload across GPUs so the faster ones contribute more computation to the final output, commensurate with their higher performance.
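As one illustration of such a heuristic (a sketch of our own, not taken from any particular application), a program might enumerate the devices and pick the one with the most multiprocessors as a rough proxy for compute throughput:

    #include <cuda_runtime.h>

    // Hypothetical heuristic: choose the CUDA device with the most
    // multiprocessors, a rough proxy for relative performance.
    int chooseFastestDevice( void )
    {
        int deviceCount = 0, best = 0, bestSMs = -1;
        cudaGetDeviceCount( &deviceCount );
        for ( int i = 0; i < deviceCount; i++ ) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties( &prop, i );
            if ( prop.multiProcessorCount > bestSMs ) {
                bestSMs = prop.multiProcessorCount;
                best = i;
            }
        }
        return best;
    }

A real load balancer would weigh clock rate and memory bandwidth as well, but multiprocessor count alone is often enough to distinguish a low-power GPU from a high-end one.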
A key ingredient of all CUDA applications that use multiple GPUs is portable pinned memory. As described in Section 5.1.2, portable pinned memory is pinned memory that is mapped for all CUDA contexts such that any GPU can read or write the memory directly.
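In the CUDA runtime, such memory is requested by passing the cudaHostAllocPortable flag to cudaHostAlloc(). A minimal sketch (error handling mostly omitted; the function name is ours):

    #include <cuda_runtime.h>

    // Allocate pinned memory that is mapped into every CUDA context,
    // so any GPU in the system can read or write it directly.
    void *allocPortablePinned( size_t numBytes )
    {
        void *p = NULL;
        if ( cudaSuccess != cudaHostAlloc( &p, numBytes, cudaHostAllocPortable ) )
            return NULL;
        return p;  // free later with cudaFreeHost()
    }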
CPU Threading Models
Until CUDA 4.0, the only way to drive multiple GPUs was to create a CPU thread for each one. cudaSetDevice() had to be called once per CPU thread, before any other CUDA code executed, to tell CUDA which device to initialize when that thread began operating on CUDA. Whichever CPU thread made the call then had exclusive access to the GPU, because the CUDA driver had not yet been made thread-safe in a way that would enable multiple threads to access the same GPU at the same time.
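A minimal sketch of that one-thread-per-GPU pattern, assuming POSIX threads; gpuWorker() and the work it performs are hypothetical placeholders:

    #include <cuda_runtime.h>
    #include <pthread.h>
    #include <stdint.h>

    // Hypothetical per-thread worker: bind this CPU thread to one GPU,
    // then issue all of its CUDA work against that device.
    static void *gpuWorker( void *arg )
    {
        int device = (int)(intptr_t) arg;
        cudaSetDevice( device );  // must precede other CUDA calls (pre-4.0 rule)
        // ... allocate memory, launch kernels, copy back results ...
        return NULL;
    }

    int main( void )
    {
        int deviceCount = 0;
        cudaGetDeviceCount( &deviceCount );
        pthread_t threads[16];
        for ( int i = 0; i < deviceCount && i < 16; i++ )
            pthread_create( &threads[i], NULL, gpuWorker, (void *)(intptr_t) i );
        for ( int i = 0; i < deviceCount && i < 16; i++ )
            pthread_join( threads[i], NULL );
        return 0;
    }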
In CUDA 4.0, cudaSetDevice() was modified to implement the semantics that everyone had previously expected: It tells CUDA which GPU should perform subsequent CUDA operations. Having multiple threads operate on the same GPU at the same time may incur a slight performance hit, but it should be expected to work. Our example N-body application, however, only has one CPU thread operating on any given device at a time. The multithreaded formulation has each of several threads operate on a specific device, and the single-threaded formulation has one thread operate on each of the devices in turn.
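With the CUDA 4.0 semantics, the single-threaded formulation can be sketched as a simple loop over the devices; launchMyKernelOn() below is a hypothetical placeholder for the per-device work, not code from our N-body example:

    #include <cuda_runtime.h>

    // A minimal sketch: one CPU thread drives every device in turn.
    void runOnAllDevices( void )
    {
        int deviceCount = 0;
        cudaGetDeviceCount( &deviceCount );
        for ( int i = 0; i < deviceCount; i++ ) {
            cudaSetDevice( i );        // subsequent CUDA calls target device i
            // launchMyKernelOn( i );  // issue asynchronous work for device i
        }
        for ( int i = 0; i < deviceCount; i++ ) {
            cudaSetDevice( i );
            cudaDeviceSynchronize();   // wait for device i to finish
        }
    }

Because kernel launches are asynchronous, the first loop can queue work on every device before the second loop waits for any of them, so a single CPU thread still keeps all GPUs busy concurrently.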