
14.8 Conclusion

Since instruction sets and architectures differ, performance is measured in body-body interactions per second rather than GFLOPS. Performance was measured on a dual-socket Sandy Bridge system with two E5-2670 CPUs (similar to Amazon's cc2.8xlarge instance type), 64GB of RAM, and four GK104 GPUs clocked at about 800MHz. The GK104s are on two dual-GPU boards plugged into 16-lane PCI Express 3.0 slots.
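As a point of reference, the metric is easy to derive: one pass of the all-pairs algorithm over N bodies evaluates on the order of N*N body-body interactions, so the rate is simply that count divided by the elapsed time. The helper below is an illustrative sketch, not code from the chapter; the function name and parameters are our own.

#include <stddef.h>

// Illustrative helper (not from the chapter): convert a timed run of the
// all-pairs algorithm into body-body interactions per second. One timestep
// evaluates N*N interactions (N*(N-1) if self-interactions are skipped).
double
interactionsPerSecond( size_t N, int timesteps, double elapsedSeconds )
{
    return (double) N * (double) N * timesteps / elapsedSeconds;
}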

Table 14.4 summarizes the speedups due to CPU optimizations. All measurements were performed on a server with dual Xeon E5-2670 CPUs (2.6GHz). On this system, the generic CPU code in Listing 14.2 performs 17.2M interactions per second; the single-threaded SSE code performs 307M interactions per second, some 17.8x faster! As expected, multithreading the SSE code achieves good speedups, with 32 CPU threads delivering 5650M interactions per second, about 18x as fast as one thread. Between porting to SSE and multithreading, the total CPU speedup on this platform is more than 300x.

Table 14.4 Speedups Due to CPU Optimizations

Because we got such a huge performance improvement from our CPU optimizations, the performance comparisons aren't as pronounced in favor of GPUs as most.12 The highest-performing kernel in our testing (the shared memory implementation in Listing 14.4, with a loop unroll factor of 4) delivered 45.2 billion body-body interactions per second, exactly 8x faster than the fastest multithreaded SSE implementation. This result understates the performance advantages of CUDA in some ways, since the server used for testing had two high-end CPUs, and the GPUs are derated to reduce power consumption and heat dissipation.
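For readers who want to see the shape of that kernel, the following is a minimal sketch of the shared memory formulation; it is not the book's Listing 14.4. Each thread accumulates the acceleration of one body while the block cooperatively stages tiles of positions in shared memory, and the inner loop is unrolled by a factor of 4 with a pragma (one way to express the unrolling). It assumes N is a multiple of blockDim.x, a launch with N/blockDim.x blocks and blockDim.x*sizeof(float4) bytes of dynamic shared memory, and posMass[] packing (x, y, z, mass) per body.

__global__ void
ComputeNBodyGravitation_SharedSketch( float4 *accel, const float4 *posMass,
                                      float softeningSquared, int N )
{
    extern __shared__ float4 sharedPos[];
    int i = blockIdx.x*blockDim.x + threadIdx.x;    // one thread per body
    float4 me = posMass[i];
    float ax = 0.0f, ay = 0.0f, az = 0.0f;
    for ( int tile = 0; tile < N; tile += blockDim.x ) {
        // Stage the next tile of positions in shared memory.
        sharedPos[threadIdx.x] = posMass[tile+threadIdx.x];
        __syncthreads();
        // Unrolling the inner loop reduces loop overhead and exposes
        // more independent instructions to the compiler.
        #pragma unroll 4
        for ( int j = 0; j < blockDim.x; j++ ) {
            float4 body = sharedPos[j];
            float dx = body.x - me.x;
            float dy = body.y - me.y;
            float dz = body.z - me.z;
            float distSqr = dx*dx + dy*dy + dz*dz + softeningSquared;
            float invDist = rsqrtf( distSqr );
            float invDistCube = invDist*invDist*invDist;
            float s = body.w * invDistCube;
            ax += dx*s;  ay += dy*s;  az += dz*s;
        }
        __syncthreads();
    }
    accel[i] = make_float4( ax, ay, az, 0.0f );
}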

Furthermore, future improvements can be had for both technologies. For CPUs, porting this workload to the AVX ("Advanced Vector eXtensions") instruction set would potentially double performance, though it would run only on Sandy Bridge and later chips; also, the optimized CPU implementation does not exploit the symmetry of the body-body forces. For GPUs, NVIDIA's GK110 is about twice as big (and presumably about twice as fast) as the GK104.
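To give a sense of what the AVX port would involve, the routine below is a hedged sketch (not the book's Listing 14.9, and not tested against it) of the core body-body interaction widened from four to eight lanes: each 128-bit SSE intrinsic has a 256-bit counterpart, with the second body's position and mass broadcast across all lanes. The function name and argument layout are illustrative assumptions.

#include <immintrin.h>

// Sketch of an AVX body-body interaction: accumulate the acceleration of
// eight bodies (x0, y0, z0) due to one other body (x1, y1, z1, mass1),
// whose components are broadcast across all eight lanes.
static inline void
bodyBodyInteractionAVX( __m256 *ax, __m256 *ay, __m256 *az,
                        __m256 x0, __m256 y0, __m256 z0,
                        __m256 x1, __m256 y1, __m256 z1,
                        __m256 mass1, __m256 softeningSquared )
{
    // r_01: vector from each of the eight bodies to the other body
    __m256 dx = _mm256_sub_ps( x1, x0 );
    __m256 dy = _mm256_sub_ps( y1, y0 );
    __m256 dz = _mm256_sub_ps( z1, z0 );

    // d^2 = dot(r_01, r_01) + softening^2
    __m256 distSqr = _mm256_add_ps(
        _mm256_add_ps( _mm256_mul_ps( dx, dx ), _mm256_mul_ps( dy, dy ) ),
        _mm256_mul_ps( dz, dz ) );
    distSqr = _mm256_add_ps( distSqr, softeningSquared );

    // invDistCube = 1/d^3, via the approximate reciprocal square root
    __m256 invDist = _mm256_rsqrt_ps( distSqr );
    __m256 invDistCube =
        _mm256_mul_ps( invDist, _mm256_mul_ps( invDist, invDist ) );

    // s = m_1 / d^3
    __m256 s = _mm256_mul_ps( mass1, invDistCube );

    // a_i += r_01 * s
    *ax = _mm256_add_ps( *ax, _mm256_mul_ps( dx, s ) );
    *ay = _mm256_add_ps( *ay, _mm256_mul_ps( dy, s ) );
    *az = _mm256_add_ps( *az, _mm256_mul_ps( dz, s ) );
}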

Comparing the source code for Listings 14.1 and 14.9 (the GPU and SSE implementations of the core body-body interaction code), though, it becomes clear that performance isn't the only reason to favor CUDA over optimizing CPU code. Dr. Vincent Natoli alluded to this tradeoff in his June 2010 article "Kudos for CUDA."13

Similarly, we have found in many cases that the expression of algorithmic parallelism in CUDA in fields as diverse as oil and gas, bioinformatics, and finance is more elegant, compact, and readable than equivalently optimized CPU code, preserving and more clearly presenting the underlying algorithm. In a recent project we reduced 3500 lines of highly optimized C code to a CUDA kernel of about 800 lines. The optimized C was peppered with inline assembly, SSE macros, unrolled loops, and special cases, making it difficult to read, extract algorithmic meaning, and extend in the future. By comparison, the CUDA code was cleaner and more readable. Ultimately it will be easier to maintain.

Although it was feasible to develop an SSE implementation of this application, whose core body-body computation takes about 50 lines of code to express (Listing 14.8), it's hard to imagine what the source code would look like for an SSE-optimized implementation of a workload like Boids, where each body must evaluate conditions and the code becomes divergent when running on CUDA hardware. SSE supports divergence both in the form of predication (using masks and Boolean instruction sequences such as ANDPS/ANDNOTPS/ORPS to construct the result) and branching (often using MOVMSKPS to extract evaluated conditions), but achieving the theoretical speedups on such workloads would require a large engineering investment unless the vector code can be extracted automatically by a vectorizing compiler.
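To make the predication idiom concrete, the fragment below is an illustrative sketch (not code from the chapter) of a four-wide conditional select built from a comparison mask and ANDPS/ANDNOTPS/ORPS, plus a MOVMSKPS-based test that a branching implementation could use to skip work when no lane needs it. The function names are hypothetical.

#include <xmmintrin.h>

// result[i] = (x[i] < threshold[i]) ? a[i] : b[i], for all four lanes.
static inline __m128
selectLessThan( __m128 x, __m128 threshold, __m128 a, __m128 b )
{
    __m128 mask  = _mm_cmplt_ps( x, threshold );  // all-ones where x < threshold
    __m128 takeA = _mm_and_ps( mask, a );         // a where the condition holds
    __m128 takeB = _mm_andnot_ps( mask, b );      // b where it does not
    return _mm_or_ps( takeA, takeB );             // merge the two halves
}

// Returns nonzero if any of the four lanes satisfies x < threshold, so a
// branching implementation can skip the predicated code path entirely.
static inline int
anyLessThan( __m128 x, __m128 threshold )
{
    return _mm_movemask_ps( _mm_cmplt_ps( x, threshold ) );
}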
