
8.1 Memory

8.1.1 REGISTERS

Each SM contains thousands of 32-bit registers that are allocated to threads as specified when the kernel is launched. Registers are both the fastest and most plentiful memory in the SM. As an example, the Kepler-class (SM 3.0) SMX contains 65,536 registers or 256K, while the texture cache is only 48K.
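
The 256K figure is simply 65,536 registers times 4 bytes per 32-bit register. As a minimal sketch, assuming a CUDA runtime recent enough to expose the regsPerMultiprocessor field of cudaDeviceProp, the same arithmetic can be checked against whatever device happens to be installed. Device ordinal 0 is an arbitrary choice for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // Query device 0 (assumption: any valid device ordinal would do).
    cudaError_t status = cudaGetDeviceProperties(&prop, 0);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(status));
        return 1;
    }

    // Each register is 32 bits (4 bytes) wide.
    size_t regBytes = (size_t)prop.regsPerMultiprocessor * 4;

    printf("%s (SM %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("  registers per SM:    %d (%zu KB)\n",
           prop.regsPerMultiprocessor, regBytes / 1024);
    printf("  registers per block: %d\n", prop.regsPerBlock);
    return 0;
}
```

On an SM 3.0 part, the first figure should match the 65,536 registers cited above.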

CUDA registers can contain integer or floating-point data; for hardware capable of performing double-precision arithmetic (SM 1.3 and higher), the operands are contained in even-valued register pairs. On SM 2.0 and higher hardware, register pairs also can hold 64-bit addresses.

CUDA hardware also supports wider memory transactions: The built-in int2/float2 and int4/float4 data types, residing in aligned register pairs or quads, respectively, may be read or written using single 64- or 128-bit-wide loads or stores. Once in registers, the individual data elements can be referenced as .x/.y (for int2/float2) or .x/.y/.z/.w (for int4/float4).
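
The following sketch illustrates the idea with float4. The kernel name scaleFloat4 and the problem size are hypothetical, chosen only for illustration. Each thread reads one float4, operates on its .x/.y/.z/.w members in registers, and writes the result back. Whether the compiler actually emits a single 128-bit load and store can be confirmed by inspecting the generated code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: scales n float4 elements by k.
__global__ void scaleFloat4(float4 *out, const float4 *in, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = in[i];   // 16-byte aligned, so eligible for one 128-bit load
        v.x *= k;           // elements are addressed individually in registers
        v.y *= k;
        v.z *= k;
        v.w *= k;
        out[i] = v;         // eligible for one 128-bit store
    }
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float4);

    float4 *hostData = new float4[n];
    for (int i = 0; i < n; i++)
        hostData[i] = make_float4((float)i, i + 1.0f, i + 2.0f, i + 3.0f);

    // cudaMalloc allocations are aligned well beyond the 16 bytes float4 needs.
    float4 *devIn = NULL, *devOut = NULL;
    if (cudaMalloc(&devIn, bytes) != cudaSuccess ||
        cudaMalloc(&devOut, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(devIn, hostData, bytes, cudaMemcpyHostToDevice);

    scaleFloat4<<<(n + 255) / 256, 256>>>(devOut, devIn, 2.0f, n);

    // The blocking copy also surfaces any error from the kernel launch.
    cudaError_t status = cudaMemcpy(hostData, devOut, bytes,
                                    cudaMemcpyDeviceToHost);
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        return 1;
    }
    printf("out[1] = (%g, %g, %g, %g)\n",
           hostData[1].x, hostData[1].y, hostData[1].z, hostData[1].w);

    cudaFree(devIn);
    cudaFree(devOut);
    delete[] hostData;
    return 0;
}
```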

Developers can cause nvcc to report the number of registers used by a kernel by specifying the command-line option --ptxas-options --verbose. The number of registers used by a kernel affects the number of threads that can fit in an SM and often must be tuned carefully for optimal performance. The maximum number of registers used for a compilation may be specified with nvcc's --maxrregcount N option.
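
Register usage can also be read back at runtime. The sketch below uses cudaFuncGetAttributes() to report the register count the compiler settled on for one kernel. The saxpy kernel and the file name regs.cu in the comments are hypothetical stand-ins for illustration.

```cuda
// Possible build lines (regs.cu is a hypothetical file name):
//   nvcc --ptxas-options=-v regs.cu      reports register usage at compile time
//   nvcc --maxrregcount=32 regs.cu       caps registers per thread at 32
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel; it exists only so there is something to inspect.
__global__ void saxpy(float *y, const float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t status = cudaFuncGetAttributes(&attr, saxpy);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(status));
        return 1;
    }

    // numRegs is the per-thread register count chosen by the compiler.
    printf("registers per thread:    %d\n", attr.numRegs);
    // A nonzero local size can indicate spilling when registers are capped.
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    printf("max threads per block:   %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```

Rebuilding with a lower register cap and comparing the reported numbers is one way to see the trade-off between registers per thread and threads per SM.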

Register Aliasing

Because registers can hold floating-point or integer data, some intrinsics serve only to coerce the compiler into changing its view of a variable. The __int_as_float() and __float_as_int() intrinsics cause a variable to "change personalities" between 32-bit integers and single-precision floating point.

float __int_as_float(int i);  
int __float_as_int(float f);
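
As an illustration, the following sketch uses these two intrinsics to expose the raw bit pattern of a float and to compute its absolute value by clearing the sign bit. The kernel name inspectFloat and the test values are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reports each float's bit pattern and absolute value.
__global__ void inspectFloat(int *bits, float *absVal, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int b = __float_as_int(in[i]);               // same 32 bits, integer view
        bits[i] = b;
        absVal[i] = __int_as_float(b & 0x7fffffff);  // clear the sign bit (bit 31)
    }
}

int main()
{
    const int n = 4;
    float hIn[n] = { 1.0f, -2.5f, 0.15625f, -0.0f };
    int hBits[n];
    float hAbs[n];

    float *dIn = NULL, *dAbs = NULL;
    int *dBits = NULL;
    if (cudaMalloc(&dIn, sizeof(hIn)) != cudaSuccess ||
        cudaMalloc(&dAbs, sizeof(hAbs)) != cudaSuccess ||
        cudaMalloc(&dBits, sizeof(hBits)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    inspectFloat<<<1, n>>>(dBits, dAbs, dIn, n);

    // The blocking copies also surface any error from the kernel launch.
    cudaMemcpy(hBits, dBits, sizeof(hBits), cudaMemcpyDeviceToHost);
    cudaError_t status = cudaMemcpy(hAbs, dAbs, sizeof(hAbs),
                                    cudaMemcpyDeviceToHost);
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        return 1;
    }

    for (int i = 0; i < n; i++)
        printf("%g -> bits 0x%08x, |x| = %g\n",
               hIn[i], (unsigned int)hBits[i], hAbs[i]);

    cudaFree(dIn);
    cudaFree(dAbs);
    cudaFree(dBits);
    return 0;
}
```

No conversion instruction is implied by either intrinsic. For 1.0f, the reported bit pattern is the IEEE 754 encoding 0x3f800000.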

The __double2loint(), __double2hiint(), and __hiloint2double() intrinsics similarly cause registers to change personality (usually in-place). __double_as_longlong() and __longlong_as_double() coerce register pairs in-place; __double2loint() and __double2hiint() return the least and the most significant 32 bits of the input operand, respectively; and __hiloint2double() constructs a double out of the high and low halves.

int __double2loint( double d );
int __double2hiint( double d );
double __hiloint2double( int hi, int lo );
double __longlong_as_double( long long int i );
long long int __double_as_longlong( double d );
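
A short sketch of the double-precision intrinsics follows, assuming hardware with double-precision support. The hypothetical kernel splitDouble splits each input into its two 32-bit halves, then flips the sign bit in the high half and reassembles the halves to negate the value without any floating-point arithmetic.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: splits doubles into halves and negates via the sign bit.
__global__ void splitDouble(int *hi, int *lo, double *negated,
                            const double *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int h = __double2hiint(in[i]);   // most significant 32 bits
        int l = __double2loint(in[i]);   // least significant 32 bits
        hi[i] = h;
        lo[i] = l;
        // The sign bit of a double is bit 31 of the high half.
        int flipped = (int)((unsigned int)h ^ 0x80000000u);
        negated[i] = __hiloint2double(flipped, l);
    }
}

int main()
{
    const int n = 3;
    double hIn[n] = { 1.0, -2.0, 3.141592653589793 };
    double hNeg[n];
    int hHi[n], hLo[n];

    double *dIn = NULL, *dNeg = NULL;
    int *dHi = NULL, *dLo = NULL;
    if (cudaMalloc(&dIn, sizeof(hIn)) != cudaSuccess ||
        cudaMalloc(&dNeg, sizeof(hNeg)) != cudaSuccess ||
        cudaMalloc(&dHi, sizeof(hHi)) != cudaSuccess ||
        cudaMalloc(&dLo, sizeof(hLo)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    splitDouble<<<1, n>>>(dHi, dLo, dNeg, dIn, n);

    // The blocking copies also surface any error from the kernel launch.
    cudaMemcpy(hHi, dHi, sizeof(hHi), cudaMemcpyDeviceToHost);
    cudaMemcpy(hLo, dLo, sizeof(hLo), cudaMemcpyDeviceToHost);
    cudaError_t status = cudaMemcpy(hNeg, dNeg, sizeof(hNeg),
                                    cudaMemcpyDeviceToHost);
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        return 1;
    }

    for (int i = 0; i < n; i++)
        printf("%.15g -> hi 0x%08x, lo 0x%08x, negated %.15g\n",
               hIn[i], (unsigned int)hHi[i], (unsigned int)hLo[i], hNeg[i]);

    cudaFree(dIn);
    cudaFree(dNeg);
    cudaFree(dHi);
    cudaFree(dLo);
    return 0;
}
```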

8.1.2 LOCAL MEMORY