
5.6 Shared Memory

Shared memory is used to exchange data between CUDA threads within a block. Physically, it is implemented as per-SM memory that can be accessed very quickly. In terms of speed, shared memory is perhaps 10x slower than register accesses but 10x faster than accesses to global memory. As a result, shared memory is often a critical resource for reducing the external bandwidth needed by CUDA kernels.

Since developers explicitly allocate and reference shared memory, it can be thought of as a "manually managed" cache or "scratchpad" memory. Developers can request different cache configurations at both the device and the kernel level: cudaDeviceSetCacheConfig() / cuCtxSetCacheConfig() specify the preferred cache configuration for a CUDA device, while cudaFuncSetCacheConfig() / cuFuncSetCacheConfig() specify the preferred cache configuration for a given kernel. If both are specified, the per-kernel request takes precedence; in any case, the requirements of the kernel may override the developer's preference.
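
The following is a minimal sketch of both calls through the CUDA runtime; MyKernel and its body are hypothetical and used only for illustration.

#include <cuda_runtime.h>

__global__ void MyKernel( float *out ) { out[threadIdx.x] = 0.0f; }

int main()
{
    // Device-level preference: favor shared memory over L1 cache.
    cudaDeviceSetCacheConfig( cudaFuncCachePreferShared );

    // Per-kernel preference: favor L1 for this kernel only (takes precedence).
    cudaFuncSetCacheConfig( MyKernel, cudaFuncCachePreferL1 );
    return 0;
}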

Kernels that use shared memory typically are written in three phases (a sketch follows the list).

  • Load shared memory and __syncthreads()

  • Process shared memory and __syncthreads()

  • Write results
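
Below is a minimal sketch of the three-phase pattern, assuming a hypothetical kernel SubtractBlockMax() that subtracts each block's maximum from its elements; the block size is assumed to be a power of 2, and the dynamic shared memory size is supplied at launch time.

#include <cfloat>

__global__ void SubtractBlockMax( float *out, const float *in, size_t N )
{
    extern __shared__ float tile[];            // sized by the launch configuration
    size_t i = blockIdx.x*blockDim.x + threadIdx.x;

    // Phase 1: load shared memory
    tile[threadIdx.x] = ( i < N ) ? in[i] : -FLT_MAX;
    __syncthreads();

    // Phase 2: process shared memory (tree reduction to the block maximum)
    for ( unsigned int active = blockDim.x/2; active > 0; active >>= 1 ) {
        if ( threadIdx.x < active )
            tile[threadIdx.x] = fmaxf( tile[threadIdx.x], tile[threadIdx.x+active] );
        __syncthreads();
    }
    float blockMax = tile[0];

    // Phase 3: write results
    if ( i < N )
        out[i] = in[i] - blockMax;
}

// Launched with the dynamic shared memory size as the third parameter:
//     SubtractBlockMax<<<numBlocks, threadsPerBlock, threadsPerBlock*sizeof(float)>>>( dOut, dIn, N );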

Developers can make the compiler report the amount of shared memory used by a given kernel with the nvcc options -Xptxas -v,-abi=no. At runtime, the amount of shared memory used by a kernel may be queried with cuFuncGetAttribute( CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES ).
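
A minimal sketch of the driver API query, assuming hKernel has already been obtained (for example, with cuModuleGetFunction()); the helper name is hypothetical, and the CUDA runtime equivalent is shown in the trailing comment.

#include <cuda.h>

// Returns the statically allocated shared memory (in bytes) used by hKernel.
int StaticSharedBytes( CUfunction hKernel )
{
    int sharedBytes = 0;
    cuFuncGetAttribute( &sharedBytes,
                        CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES,
                        hKernel );
    return sharedBytes;
}

// CUDA runtime equivalent:
//     cudaFuncAttributes attr;
//     cudaFuncGetAttributes( &attr, MyKernel );   // attr.sharedSizeBytes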

5.6.1 UNSIZED SHARED MEMORY DECLARATIONS

Any shared memory declared in the kernel itself is automatically allocated for each block when the kernel is launched. If the kernel also includes an unsized declaration of shared memory, the amount of memory needed by that declaration must be specified when the kernel is launched.
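
A minimal sketch, assuming a hypothetical kernel ReverseWithinBlock() and hypothetical host variables: with the CUDA runtime, the per-block size of the unsized declaration is passed as the third execution configuration parameter (with the driver API, it is the sharedMemBytes parameter of cuLaunchKernel()).

__global__ void ReverseWithinBlock( float *out, const float *in )
{
    extern __shared__ float tile[];            // unsized declaration
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[blockDim.x-1-threadIdx.x];
}

// Host code: one float per thread of dynamically sized shared memory.
//     ReverseWithinBlock<<<numBlocks, threadsPerBlock, threadsPerBlock*sizeof(float)>>>( dOut, dIn );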

If there is more than one extern shared memory declaration, they are aliased with respect to one another, so the declarations

extern __shared__ char sharedChars[];
extern __shared__ int sharedInts[];

enable the same shared memory to be addressed as 8-bit or 32-bit values, as needed. One motivation for this type of aliasing is to use wider types to read and write global memory, while using narrower ones for kernel computations.
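
A minimal sketch of the idiom, assuming a hypothetical kernel ToUpper() that uppercases ASCII text: global memory is read and written 32 bits at a time, while the processing in shared memory touches one byte at a time.

extern __shared__ char sharedChars[];
extern __shared__ int  sharedInts[];

// Each block processes blockDim.x ints (blockDim.x*4 characters).
__global__ void ToUpper( int *out, const int *in, size_t numInts )
{
    size_t i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < numInts )
        sharedInts[threadIdx.x] = in[i];            // wide (32-bit) load from global
    __syncthreads();
    for ( int j = 0; j < 4; j++ ) {                 // narrow (8-bit) processing in shared
        int idx = 4*threadIdx.x + j;
        char c = sharedChars[idx];
        if ( c >= 'a' && c <= 'z' )
            sharedChars[idx] = c - ('a'-'A');
    }
    __syncthreads();
    if ( i < numInts )
        out[i] = sharedInts[threadIdx.x];           // wide (32-bit) store to global
}

// Launched with threadsPerBlock*sizeof(int) bytes of shared memory per block.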

NOTE

If you have more than one kernel that uses unsized shared memory, they must be compiled in separate files.

5.6.2 WARP-SYNCHRONOUS CODING

Shared memory variables that participate in warp-synchronous programming must be declared as volatile to prevent the compiler from applying optimizations that would render the code incorrect.
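
A minimal sketch, assuming a hypothetical helper warpReduce() called from a block reduction kernel once only the first warp's threads remain active; because no __syncthreads() separates the steps, the shared memory pointer must be declared volatile.

// Assumes blockDim.x >= 64 and that sdata[] holds at least 64 partial sums.
__device__ void warpReduce( volatile int *sdata, int tid )
{
    sdata[tid] += sdata[tid+32];
    sdata[tid] += sdata[tid+16];
    sdata[tid] += sdata[tid+ 8];
    sdata[tid] += sdata[tid+ 4];
    sdata[tid] += sdata[tid+ 2];
    sdata[tid] += sdata[tid+ 1];
}

// Called from a reduction kernel as:
//     if ( threadIdx.x < 32 )
//         warpReduce( sdata, threadIdx.x );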

5.6.3 POINTERS TO SHARED MEMORY

It is valid—and often convenient—to use pointers to refer to shared memory. Example kernels that use this idiom include the reduction kernels in Chapter 12 (Listing 12.3) and the scanBlock kernel in Chapter 13 (Listing 13.3).
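
A minimal sketch of the idiom, with hypothetical names: a __device__ function operates on a shared memory tile through an ordinary pointer.

// Clears n elements of the tile it is given a pointer to.
__device__ void zeroTile( int *tile, int n )
{
    for ( int i = threadIdx.x; i < n; i += blockDim.x )
        tile[i] = 0;
}

__global__ void ClearAndPublish( int *out )
{
    extern __shared__ int tile[];
    zeroTile( tile, blockDim.x );   // pointer to shared memory passed to a device function
    __syncthreads();
    out[blockIdx.x*blockDim.x + threadIdx.x] = tile[threadIdx.x];
}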
