13.8 Further Reading (Parallel Prefix Sum Circuits)
There is a rich literature on circuits to compute parallel prefix sums. Besides the Brent-Kung, Sklansky, and Kogge-Stone formulations, other examples of scan circuits include Ladner-Fischer and more recent work by Lin and Hsiao. Hinze describes an algebra of scans that can be used to reason about Scan implementations. The details of his work are outside the scope of this book, but his paper is highly recommended reading.
Sean Baxter's Web site, http://www.moderngpu.com, is an excellent resource for optimized Scan and its applications.
Brent, Richard P., and H.T. Kung. A regular layout for parallel adders. IEEE Transactions on Computers C-31, 1982, pp. 260-264.
Hinze, Ralf. An algebra of scans. In Mathematics of Program Construction, Springer, 2004, Stirling, Scotland, pp. 186-210.
Kogge, Peter M., and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers C-22, 1973, pp. 783-791.
Sklansky, J. Conditional-sum addition logic. IRE Transactions on Electronic Computers EC-9 (2), June 1960, pp. 226-231.
14 N-Body
N-Body computations are a family of computations that model a set of particles (known as bodies), each of which must consider all the other bodies during the computation. Example applications of N-Body computation include (but are not limited to) the following; a sketch of the core all-pairs calculation appears after the list.
Gravitational simulation in which stars exert gravitational forces
Molecular modeling in which ions exert electrostatic forces
Particle systems in computer graphics to simulate water and fire
"Boids," a technique for computer animation designed to simulate flocking behavior
Typically the paths of the bodies are simulated per timestep, and computing each timestep costs O(N²) operations for N bodies. In most formulations, the forces quickly decrease with distance, leading to hierarchical algorithms in which (for example) the mass and location of the center of mass for a collection of bodies are used to avoid performing the full O(N²) computation otherwise needed. Barnes-Hut algorithms reduce the runtime to O(N lg N) by introducing a spatial hierarchy that approximates the forces between clusters of objects; for applications where the "leaf nodes" of the computation contain k bodies, O(k²) computations must be performed in a given leaf. It is this portion of the computation at which GPUs excel.
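As a concrete example of the approximation, the following host-side sketch (hypothetical names; the same assumed float4 layout, which CUDA's vector types also provide to host code) collapses a cluster of k bodies into a single pseudo-body at its center of mass, so that distant bodies can interact with one pseudo-body instead of all k members.

// Replace a cluster of k bodies with one pseudo-body: position is the
// mass-weighted average of the members' positions; mass is the total.
float4 CenterOfMass( const float4 *bodies, int k )
{
    float4 com = { 0.0f, 0.0f, 0.0f, 0.0f };
    for ( int j = 0; j < k; j++ ) {
        com.x += bodies[j].w * bodies[j].x;
        com.y += bodies[j].w * bodies[j].y;
        com.z += bodies[j].w * bodies[j].z;
        com.w += bodies[j].w;               // total mass
    }
    com.x /= com.w;
    com.y /= com.w;
    com.z /= com.w;
    return com;
}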
N-Body workloads have proven to be among the most effective ways for GPUs to approach their theoretical limits in processing power. In their GPU Gems 3 paper "Fast N-Body Simulation with CUDA," Nyland et al. frequently cite this theoretical limit in explaining why further performance improvements are not possible. The GPU in question, the NVIDIA GeForce 8800 GTX, was so effective at N-Body computations that it outperformed the custom GRAPE-6 hardware that had been specifically designed to perform astrophysics computations.
In the hopes that readers will be able to "plug in" their computation and find the fastest method, this chapter illustrates several different ways to implement N-Body and related computations using CUDA.
A naïve implementation illustrates the technique and underscores the effectiveness of caches and the importance of loop unrolling.
A shared memory implementation (for our gravitational computation, the fastest) duplicates Nyland et al.'s result, tiling the computation over thread-block-sized collections of bodies to minimize memory latencies in the innermost loop; a sketch of this tiling follows the list.
A constant memory implementation, inspired by Stone et al.'s implementation of Direct Coulomb Summation (DCS), uses constant memory to hold body descriptions, freeing shared memory for other uses.
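As a preview of the shared memory strategy (a sketch under the stated assumptions, not this chapter's final code), the kernel below stages one thread-block-sized tile of body descriptions into shared memory at a time, synchronizes, and lets every thread accumulate forces against the tile before the next tile is loaded. It assumes the same float4 layout as above and that N is a multiple of blockDim.x.

// Tiled all-pairs sketch: each float4 is read from device memory once
// per block and then reused by all blockDim.x threads in that block.
__global__ void
SharedNBodyKernel( float3 *forces, const float4 *bodies, int N, float softeningSq )
{
    extern __shared__ float4 tile[];
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    float4 bi = bodies[i];
    float3 acc = { 0.0f, 0.0f, 0.0f };
    for ( int base = 0; base < N; base += blockDim.x ) {
        tile[threadIdx.x] = bodies[base + threadIdx.x];
        __syncthreads();
        for ( int j = 0; j < blockDim.x; j++ ) {
            float4 bj = tile[j];
            float dx = bj.x - bi.x;
            float dy = bj.y - bi.y;
            float dz = bj.z - bi.z;
            float invDist = rsqrtf( dx*dx + dy*dy + dz*dz + softeningSq );
            float s = bj.w * invDist * invDist * invDist;
            acc.x += dx*s; acc.y += dy*s; acc.z += dz*s;
        }
        __syncthreads();  // don't overwrite the tile while others read it
    }
    forces[i] = acc;
}

The kernel would be launched with blockDim.x*sizeof(float4) bytes of dynamic shared memory per block.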
Because readers' applications may not happen to be gravitational N-Body, these different implementations are not presented with the singular goal of optimizing that particular computation. It may make sense to adapt a different implementation, depending on the target SM architecture, problem size, and details of the central calculation.
Since gravitational N-Body has been presented as a poster child for the theoretical performance of GPUs, with speedups of up to 400x reported, the chapter concludes by presenting an implementation optimized for CPUs. Rewriting the calculation to use SSE (Streaming SIMD Extensions) and multithreading yields a speedup of more than 300x. Nevertheless, as reported in Section 14.9, a GK104-based GPU is significantly faster than a high-end server with a pair of Intel Xeon E5-2670 CPUs. The CUDA implementation is faster, more readable, and more maintainable than the optimized CPU implementation.
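To give a flavor of the CPU optimization, here is a sketch of an SSE inner loop that processes four bodies per iteration. It assumes a structure-of-arrays layout with 16-byte-aligned x[], y[], z[], and mass[] arrays, N a multiple of 4, and hypothetical names; the multithreading that distributes bodies across cores is omitted.

#include <xmmintrin.h>

// Accumulate the acceleration on body (xi, yi, zi) from all N bodies,
// four at a time. _mm_rsqrt_ps is an approximate (~12-bit) reciprocal
// square root, one of the tricks that makes the SSE version fast.
void AccelOnBody( float *ax, float *ay, float *az,
                  float xi, float yi, float zi,
                  const float *x, const float *y, const float *z,
                  const float *mass, int N, float softeningSq )
{
    __m128 px = _mm_set1_ps( xi );
    __m128 py = _mm_set1_ps( yi );
    __m128 pz = _mm_set1_ps( zi );
    __m128 soft = _mm_set1_ps( softeningSq );
    __m128 accx = _mm_setzero_ps();
    __m128 accy = _mm_setzero_ps();
    __m128 accz = _mm_setzero_ps();
    for ( int j = 0; j < N; j += 4 ) {
        __m128 dx = _mm_sub_ps( _mm_load_ps( x+j ), px );
        __m128 dy = _mm_sub_ps( _mm_load_ps( y+j ), py );
        __m128 dz = _mm_sub_ps( _mm_load_ps( z+j ), pz );
        __m128 distSq = _mm_add_ps( _mm_add_ps( _mm_mul_ps( dx, dx ),
                                                _mm_mul_ps( dy, dy ) ),
                                    _mm_add_ps( _mm_mul_ps( dz, dz ), soft ) );
        __m128 invDist = _mm_rsqrt_ps( distSq );
        __m128 s = _mm_mul_ps( _mm_load_ps( mass+j ),
                   _mm_mul_ps( invDist, _mm_mul_ps( invDist, invDist ) ) );
        accx = _mm_add_ps( accx, _mm_mul_ps( dx, s ) );
        accy = _mm_add_ps( accy, _mm_mul_ps( dy, s ) );
        accz = _mm_add_ps( accz, _mm_mul_ps( dz, s ) );
    }
    // Horizontal sum of each register's four lanes
    float t[4];
    _mm_storeu_ps( t, accx ); *ax = t[0]+t[1]+t[2]+t[3];
    _mm_storeu_ps( t, accy ); *ay = t[0]+t[1]+t[2]+t[3];
    _mm_storeu_ps( t, accz ); *az = t[0]+t[1]+t[2]+t[3];
}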
Throughout the chapter, performance results are reported using a server-class machine with two Xeon E5-2670 "Sandy Bridge" CPUs and up to four GK104 GPUs that are underclocked to conserve power and minimize heat dissipation. Rather than reporting results in GFLOPS, we report performance in terms of body-body interactions per second.
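For example (with hypothetical numbers), an all-pairs timestep on N = 32,768 bodies performs 32,768² ≈ 1.07 billion body-body interactions; if that timestep completes in 50 ms, the machine is sustaining roughly 21.5 billion interactions per second.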