5.4 Local Memory
Local memory contains the stack for every thread in a CUDA kernel. It is used as follows:

- To implement the application binary interface (ABI)—that is, the calling convention
- To spill data out of registers
- To hold arrays whose indices cannot be resolved by the compiler (illustrated in the sketch after this list)
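As a minimal sketch of the third case (the kernel name and array size are hypothetical), the following kernel indexes a per-thread array with a value known only at runtime; because the compiler cannot resolve the index, temp cannot be held in registers and is placed in local memory.

// Hypothetical kernel: temp is indexed by a value known only at runtime,
// so the compiler must place it in local memory rather than registers.
__global__ void
DynamicIndex(int *out, const int *in, int k)
{
    int temp[8];
    for (int j = 0; j < 8; j++) {
        temp[j] = in[8*threadIdx.x + j];
    }
    out[threadIdx.x] = temp[k & 7]; // index cannot be resolved at compile time
}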
In early implementations of CUDA hardware, any use of local memory was the "kiss of death." It slowed things down so much that developers were encouraged to take whatever measures were needed to eliminate it. With the advent of an L1 cache in Fermi, these performance concerns are less urgent, provided the local memory traffic is confined to L1.[16]
Developers can make the compiler report the amount of local memory needed by a given kernel with the nvcc options -Xptxas -v,-abi=no. At runtime, the amount of local memory used by a kernel may be queried with cuFuncGetAttribute(CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES).
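For example, the following minimal sketch queries the per-thread local memory footprint through the driver API. It assumes a precompiled module kernels.cubin containing a kernel named GlobalCopy; both names are placeholders, and error checking is omitted for brevity.

#include <stdio.h>
#include <cuda.h>

int main()
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule module;
    CUfunction kernel;
    int localBytes;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&module, "kernels.cubin");             // placeholder module
    cuModuleGetFunction(&kernel, module, "GlobalCopy"); // placeholder kernel
    cuFuncGetAttribute(&localBytes,
                       CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES, kernel);
    printf("Local memory per thread: %d bytes\n", localBytes);
    cuCtxDestroy(ctx);
    return 0;
}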
Paulius Micikevicius of NVIDIA gave a good presentation on how to determine whether local memory usage was impacting performance and what to do about it.[17] Register spilling can incur two costs: an increased number of instructions and an increase in the amount of memory traffic.
The L1 and L2 performance counters can be used to determine whether the memory traffic is impacting performance. If it is, the following strategies can help.

- At compile time, specify a higher limit with --maxrregcount. Increasing the number of registers available to each thread decreases both the instruction count and the memory traffic. The __launch_bounds__ directive may be used to tune this parameter when the kernel is compiled online by PTXAS (see the sketch after this list).
- Use noncaching loads for global memory, e.g., nvcc -Xptxas -dlcm=cg.
- Increase the L1 cache size to 48K by calling cudaFuncSetCacheConfig() or cudaDeviceSetCacheConfig() (also shown in the sketch after this list).
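Here is a minimal sketch of the first and third strategies; the kernel and its launch bounds are hypothetical. __launch_bounds__ tells PTXAS the maximum block size and the minimum number of resident blocks per multiprocessor to assume when allocating registers, and cudaFuncSetCacheConfig() requests the 48K L1 configuration for the kernel.

#include <cuda_runtime.h>

// Hypothetical kernel: cap the block size at 256 threads and ask PTXAS to
// provision registers for at least 4 resident blocks per multiprocessor.
__global__ void
__launch_bounds__(256, 4)
MyKernel(float *out, const float *in)
{
    size_t i = (size_t) blockIdx.x*blockDim.x + threadIdx.x;
    out[i] = in[i];
}

int main()
{
    // Request the 48K L1 / 16K shared memory configuration for this kernel.
    cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1);
    // ... allocate memory and launch MyKernel as usual ...
    return 0;
}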
When launching a kernel that uses more local memory than the default allocation, the CUDA driver must allocate a new local memory buffer before the kernel can launch. As a result, the kernel launch may take extra time; may cause an unexpected CPU/GPU synchronization; and, if the driver is unable to allocate the buffer for local memory, may fail. By default, the CUDA driver will free these larger local memory allocations after the kernel has launched. This behavior can be inhibited by specifying the CU_CTX_LMEM_RESIZE_TO_MAX flag to cuCtxCreate() or by calling cudaSetDeviceFlags() with the cudaDeviceLmemResizeToMax flag set, as in the sketch below.
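A minimal sketch of the latter follows, assuming the application knows in advance that its kernels need enlarged local memory allocations. The flag must be set before the runtime creates its context.

#include <cuda_runtime.h>

int main()
{
    // Must precede any call that creates the CUDA context.
    cudaSetDeviceFlags(cudaDeviceLmemResizeToMax);
    cudaFree(0); // force context creation with the flag in effect
    // ... kernels that grow the local memory buffer now keep it
    //     allocated across launches ...
    return 0;
}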
It is not difficult to build a templated function that illustrates the "performance cliff" when register spills occur. The templated GlobalCopy kernel in Listing 5.10 implements a simple memcpy routine that uses a local array temp to stage global memory references. The template parameter n specifies the number of elements in temp and thus the number of loads and stores to perform in the inner loop of the memory copy.
As a quick review of the SASS microcode emitted by the compiler will confirm, the compiler can keep temp in registers until n becomes too large.
Listing 5.10 GlobalCopy kernel.
template<class T, const int n>
__global__ void
GlobalCopy(T *out, const T *in, size_t N)
{
    T temp[n];
    size_t i;
    for (i = n*blockIdx.x*blockDim.x + threadIdx.x;
         i < N - n*blockDim.x*gridDim.x;
         i += n*blockDim.x*gridDim.x) {
        for (int j = 0; j < n; j++) {