
1.4 Road Map

The remaining chapters in Part I provide architectural overviews of CUDA hardware and software.

  • Chapter 2 details both the CUDA hardware platforms and the GPUs themselves.

  • Chapter 3 similarly covers the CUDA software architecture.

  • Chapter 4 covers the CUDA software environment, including descriptions of CUDA software tools and Amazon's EC2 environment.

In Part II, Chapters 5 to 10 cover various aspects of the CUDA programming model in great depth.

  • Chapter 5 covers memory, including device memory, constant memory, shared memory, and texture memory.

  • Chapter 6 covers streams and events: the mechanisms used for "coarse-grained" parallelism between the CPU and GPU, between hardware units of the GPU such as the copy engines and the streaming multiprocessors, or between discrete GPUs. (A minimal sketch of the stream and event APIs follows this list.)

  • Chapter 7 covers kernel execution, including the dynamic parallelism feature that is new in SM 3.5 and CUDA 5.0. (A second sketch following this list previews dynamic parallelism.)

  • Chapter 8 covers every aspect of streaming multiprocessors.

  • Chapter 9 covers multi-GPU applications, including peer-to-peer operations and embarrassingly parallel operations, with N-body as an example.

  • Chapter 10 covers every aspect of CUDA texturing.
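
As a taste of the stream and event APIs that Chapter 6 develops, here is a minimal sketch of overlapping work between the CPU and GPU with the CUDA runtime: a host-to-device copy and a kernel are queued into one stream, and an event lets the CPU wait for exactly that work. The scale kernel and the buffer size are illustrative placeholders, not code from the book.

    #include <cuda_runtime.h>

    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= a;
    }

    int main()
    {
        const int N = 1 << 20;
        float *h = 0, *d = 0;
        cudaStream_t stream;
        cudaEvent_t done;

        // Pinned host memory is required for truly asynchronous copies.
        cudaHostAlloc((void **) &h, N * sizeof(float), cudaHostAllocDefault);
        cudaMalloc((void **) &d, N * sizeof(float));
        cudaStreamCreate(&stream);
        cudaEventCreate(&done);
        for (int i = 0; i < N; i++) h[i] = 1.0f;

        // The copy and the kernel execute in order within the stream,
        // while the CPU is free to do other work after issuing them.
        cudaMemcpyAsync(d, h, N * sizeof(float), cudaMemcpyHostToDevice, stream);
        scale<<<(N + 255) / 256, 256, 0, stream>>>(d, 2.0f, N);
        cudaEventRecord(done, stream);

        cudaEventSynchronize(done);  // CPU blocks until the GPU work is finished

        cudaEventDestroy(done);
        cudaStreamDestroy(stream);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }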
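
The dynamic parallelism feature mentioned for Chapter 7 can be previewed with an equally small sketch: a parent kernel launches a child kernel directly from the device. This is a hedged illustration, not the book's code; it assumes compilation for SM 3.5 or later with relocatable device code (e.g., nvcc -arch=sm_35 -rdc=true -lcudadevrt), and the kernel names are invented for the example.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void childKernel(int parentBlock)
    {
        printf("child thread %d launched by parent block %d\n",
               threadIdx.x, parentBlock);
    }

    __global__ void parentKernel()
    {
        // With dynamic parallelism, device code may launch kernels using
        // the same <<< >>> syntax as host code.
        if (threadIdx.x == 0)
            childKernel<<<1, 4>>>(blockIdx.x);
    }

    int main()
    {
        parentKernel<<<2, 32>>>();
        cudaDeviceSynchronize();  // waits for the parent and all child grids
        return 0;
    }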

Finally, in Part III, Chapters 11 to 15 discuss various targeted CUDA applications.

  • Chapter 11 describes bandwidth-bound, streaming workloads such as vector-vector multiplication.

  • Chapters 12 and 13 describe reduction and parallel prefix sum (also known as scan), both important building blocks in parallel programming. (A minimal reduction sketch follows this list.)

  • Chapter 14 describes N-body, an important family of applications with high computational density that derive a particular benefit from GPU computing.

  • Chapter 15 takes an in-depth look at normalized cross-correlation, an image processing operation used for feature extraction. It features the only code in the book that uses texturing and shared memory together to deliver optimal performance.
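
As a preview of the reduction pattern that Chapter 12 develops, the following sketch reduces an array to per-block partial sums in shared memory. It is written from first principles, not taken from the book, and assumes a power-of-two block size of 256; Chapter 12 treats progressively faster variants of the same idea.

    #include <cuda_runtime.h>

    // Each block reduces one slice of the input to a single partial sum.
    // Assumes blockDim.x == 256 (a power of two).
    __global__ void reduceSum(const float *in, float *partial, int n)
    {
        __shared__ float sm[256];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;

        sm[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction: halve the number of active threads each step.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sm[tid] += sm[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            partial[blockIdx.x] = sm[0];
    }

A host would launch this as reduceSum<<<(n + 255) / 256, 256>>>(d_in, d_partial, n) and then sum the per-block results, or run the kernel again over them.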


Chapter 2: Hardware Architecture

This chapter provides more detailed descriptions of CUDA platforms, from the system level to the functional units within the GPUs. The first section discusses the many different ways that CUDA systems can be built. The second section discusses address spaces and how CUDA's memory model is implemented in hardware and software. The third section discusses CPU/GPU interactions, with special attention paid to how commands are submitted to the GPU and how CPU/GPU synchronization is performed. Finally, the chapter concludes with a high-level description of the GPUs themselves: functional units such as copy engines and streaming multiprocessors, with block diagrams of the different types of streaming multiprocessors over three generations of CUDA-capable hardware.
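
To make the asynchrony of CPU/GPU interaction concrete (again as a hedged sketch rather than the chapter's own code), note that a kernel launch merely enqueues a command for the GPU and returns immediately; the CPU must synchronize explicitly before consuming results. The increment kernel is a placeholder.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void increment(float *x)
    {
        x[threadIdx.x] += 1.0f;
    }

    int main()
    {
        float *d;
        cudaMalloc((void **) &d, 256 * sizeof(float));
        cudaMemset(d, 0, 256 * sizeof(float));

        // The launch is asynchronous: it submits a command to the GPU
        // and returns to the CPU immediately.
        increment<<<1, 256>>>(d);

        // ... the CPU may do unrelated work while the GPU drains its queue ...

        cudaDeviceSynchronize();  // block until all submitted GPU work completes
        printf("kernel done\n");

        cudaFree(d);
        return 0;
    }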