14.6 Multiple GPUs and Scalability

Because its computational density is so high, N-Body scales well across multiple GPUs. The body descriptions are held in portable pinned memory so that every GPU in the system can reference them easily. For a system containing $k$ GPUs, each GPU is assigned $N / k$ forces to compute.$^{8}$ Our multi-GPU implementation of N-Body is featured in Chapter 9: the rows are divided evenly among the GPUs, the input data is broadcast to all GPUs via portable pinned memory, and each GPU computes its output independently. CUDA applications that use multiple GPUs can be multithreaded or single-threaded, and Chapter 9 includes optimized N-Body implementations that illustrate both approaches.
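The even division of rows described above can be sketched in host code as follows. This is a minimal illustration, not the Chapter 9 implementation: the names `RowRange` and `rowsForGpu` are hypothetical, and the handling of a body count that is not a multiple of the GPU count (giving the first few GPUs one extra row) is an assumption, since the text only covers the evenly divisible case.

```cpp
#include <cassert>
#include <cstddef>

// Half-open range [begin, end) of force rows assigned to one GPU.
// (Hypothetical type, introduced only for this sketch.)
struct RowRange {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper: split numBodies rows among numGpus devices.
// Each GPU gets numBodies / numGpus rows; when the division is not exact,
// the first (numBodies % numGpus) GPUs each take one extra row (assumption).
inline RowRange rowsForGpu(std::size_t numBodies,
                           std::size_t numGpus,
                           std::size_t gpu)
{
    assert(numGpus > 0 && gpu < numGpus);
    const std::size_t base  = numBodies / numGpus;  // N / k
    const std::size_t extra = numBodies % numGpus;  // leftover rows
    // GPUs before this one hold gpu * base rows plus one extra row each
    // for as many of them as fall within the first `extra` GPUs.
    const std::size_t begin = gpu * base + (gpu < extra ? gpu : extra);
    const std::size_t end   = begin + base + (gpu < extra ? 1 : 0);
    return {begin, end};
}
```

Each GPU (or the CPU thread driving it) would then compute forces only for the bodies in its own range, while reading the positions of all $N$ bodies from the shared input buffer.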

For N-Body, the single-threaded and multithreaded implementations have the same performance, since there is little work for the CPU to do. Table 14.3 summarizes the scalability of the multithreaded implementation for a problem size of $96\mathrm{K}$ bodies on up to 4 GPUs. Efficiency is the measured performance expressed as a percentage of perfect scaling. There is room for improvement over these results, since the reported performance includes allocating and freeing device memory on each GPU for each timestep.
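The efficiency figure defined above can be written as a small helper. This is a sketch under stated assumptions: the function name `scalingEfficiencyPercent` is hypothetical, and "perfect scaling" is taken to mean $k$ times the single-GPU rate, with both rates expressed in the same throughput unit.

```cpp
#include <cassert>

// Scaling efficiency in percent: measured multi-GPU throughput relative to
// perfect scaling, taken here as numGpus times the single-GPU throughput.
// (Hypothetical helper; both rates must use the same unit.)
inline double scalingEfficiencyPercent(double singleGpuRate,
                                       double multiGpuRate,
                                       int numGpus)
{
    assert(singleGpuRate > 0.0 && numGpus > 0);
    const double perfectRate = singleGpuRate * numGpus;  // ideal k-GPU rate
    return 100.0 * multiGpuRate / perfectRate;
}
```

A result below 100 means that the multi-GPU run fell short of perfect scaling by that margin; per-timestep overheads such as device memory allocation and freeing would be one source of such a shortfall.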

Table 14.3 N-Body Scalability