1.2 Code
Developers want CUDA code that is illustrative but not a toy; useful without requiring a technical dive into a far-afield topic; and high-performance without obscuring the path implementors took from their initial port to the final version. To that end, this book presents three types of code examples, each designed to address one of those considerations: microbenchmarks, microdemos, and optimization journeys.
1.2.1 MICROBENCHMARKS
Microbenchmarks are designed to answer a very specific CUDA performance question, such as how much uncoalesced memory transactions degrade device memory bandwidth or how long the WDDM driver takes to perform a kernel thunk. They are designed to be compiled standalone and will look familiar to many CUDA programmers who have already implemented microbenchmarks of their own. In a sense, I wrote a set of microbenchmarks to obviate the need for other people to do the same.
1.2.2 MICRODEMOS
Microdemos are small applications designed to shed light on specific questions of how the hardware or software behaves. Like microbenchmarks, they are small and self-contained, but instead of highlighting a performance question, they highlight a question of functionality. For example, the chapter on texturing includes microdemos that illustrate how to texture from 1D device memory, how the float→int conversion is performed, how different texture addressing modes work, and how the linear interpolation performed by the texture hardware is affected by the 9-bit weights.
Like the microbenchmarks, these microdemos are offered in the spirit in which developers probably wanted to write them, or at least have them available. I wrote them so you don't have to!
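To make the last of those questions concrete, here is a minimal host-side C++ sketch, not one of the book's microdemos, that emulates a 1D linearly filtered fetch with and without weight quantization. It assumes the interpolation weight is held in a 9-bit fixed-point format with 8 fractional bits, unnormalized coordinates, and clamp addressing. The round-to-nearest quantization and the function names `quantizeWeight9Bit` and `emulateTex1DLinear` are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Quantize an interpolation weight in [0, 1] to 1.8 fixed point:
// 9 bits total, 8 of them fractional, so 0.0 and 1.0 are both exact.
// ASSUMPTION: round to nearest; real hardware may quantize differently.
inline float quantizeWeight9Bit(float alpha)
{
    assert(alpha >= 0.0f && alpha <= 1.0f);
    const long fixedPoint = std::lround(alpha * 256.0f);  // 0..256
    return static_cast<float>(fixedPoint) / 256.0f;
}

// Host-side emulation of a 1D linearly filtered texture fetch.
// ASSUMPTIONS: unnormalized coordinates, clamp addressing, and texel
// centers at i + 0.5 (hence the 0.5 offset below).
inline float emulateTex1DLinear(const std::vector<float>& texels,
                                float x,
                                bool use9BitWeights)
{
    assert(!texels.empty());
    const float xB = x - 0.5f;
    const float base = std::floor(xB);
    float alpha = xB - base;  // fractional part, in [0, 1)
    if (use9BitWeights) {
        alpha = quantizeWeight9Bit(alpha);
    }
    // Clamp both texel indices into the valid range.
    const int last = static_cast<int>(texels.size()) - 1;
    const int i0 = std::min(std::max(static_cast<int>(base), 0), last);
    const int i1 = std::min(std::max(static_cast<int>(base) + 1, 0), last);
    return (1.0f - alpha) * texels[i0] + alpha * texels[i1];
}
```

With a two-texel texture holding 0 and 1, `emulateTex1DLinear(texels, 0.501f, true)` returns 0 because the weight of roughly 0.001 rounds down to 0/256, whereas passing `false` returns roughly 0.001. Under these assumptions, the quantized path can produce only 257 distinct values between two adjacent texels.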
1.2.3 OPTIMIZATION JOURNEYS
Many papers on CUDA present their results as a fait accompli, perhaps with some side comments on tradeoffs between different approaches that were investigated before settling on the final approach presented in the paper. Authors often have length limits and deadlines that work against presenting more complete treatments of their work.
For a few topics central to the data-parallel programming enabled by CUDA, this book includes optimization journeys in the spirit of Mark Harris's "Optimizing Parallel Reduction in CUDA," a presentation that walks the reader through seven increasingly complex, increasingly fast implementations. The topics addressed this way include reduction, parallel prefix sum ("scan"), and the N-body problem.
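As a rough illustration of what successive steps in such a journey can look like, the following host-side C++ sketch emulates two ways of organizing a log-step reduction within a single thread block: interleaved addressing and sequential addressing. It is not code from this book or from the presentation. The function names are hypothetical, the inner loop stands in for the threads of a block, and the input length is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if n is a nonzero power of two (assumed by both sketches below).
inline bool isPowerOfTwo(std::size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// Earlier step: interleaved addressing. At each pass, only "threads" whose
// index is a multiple of 2*stride do work, so active threads are scattered.
inline int reduceInterleaved(std::vector<int> data)
{
    const std::size_t n = data.size();
    assert(isPowerOfTwo(n));
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t tid = 0; tid < n; ++tid) {  // one iteration per thread
            if (tid % (2 * stride) == 0) {
                data[tid] += data[tid + stride];
            }
        }
        // A GPU kernel would need a block-wide barrier here.
    }
    return data[0];
}

// Later step: sequential addressing. The active "threads" are always the
// contiguous range [0, stride), and the range halves on every pass.
inline int reduceSequential(std::vector<int> data)
{
    const std::size_t n = data.size();
    assert(isPowerOfTwo(n));
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        for (std::size_t tid = 0; tid < stride; ++tid) {
            data[tid] += data[tid + stride];
        }
        // A GPU kernel would need a block-wide barrier here.
    }
    return data[0];
}
```

Both functions return the same sum. On a GPU, the second form keeps the active threads contiguous, which avoids the divergent branching of the first. That is the kind of difference an optimization journey measures from one version to the next.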