8.2 Integer Support

The SMs have the full complement of 32-bit integer operations, including the following.

Addition with optional negation of an operand for subtraction
Multiplication and multiply-add
Integer division
Logical operations
Condition code manipulation
Conversion to/from floating point
Miscellaneous operations (e.g., SIMD instructions for narrow integers, population count, find first set bit)

CUDA exposes most of this functionality through standard C operators. Non-standard operations, such as 24-bit multiplication, may be accessed using inline PTX assembly or intrinsic functions.

8.2.1 MULTIPLICATION

Multiplication is implemented differently on Tesla- and Fermi-class hardware. Tesla implements a 24-bit multiplier, while Fermi implements a 32-bit multiplier. As a consequence, full 32-bit multiplication on SM 1.x hardware requires four instructions. For performance-sensitive code targeting Tesla-class hardware, it is a performance win to use the intrinsics for 24-bit multiply.$^{8}$

Table 8.4 Multiplication Intrinsics

Table 8.4 shows the intrinsics related to multiplication.
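To make the difference between a 24-bit and a full 32-bit multiply concrete, the following is a minimal host-side sketch of what the unsigned intrinsics `__umul24()` and `__umulhi()` compute. The names `ref_umul24` and `ref_umulhi` are hypothetical, chosen for illustration; this is a reference model of the documented results, not the hardware or header implementation.

```cpp
#include <cstdint>

// Hypothetical host-side model of __umul24(): the upper 8 bits of each
// operand are ignored, the low 24 bits are multiplied, and the least
// significant 32 bits of the 48-bit product are returned.
inline uint32_t ref_umul24(uint32_t x, uint32_t y) {
    uint64_t product = static_cast<uint64_t>(x & 0xFFFFFFu) *
                       static_cast<uint64_t>(y & 0xFFFFFFu);
    return static_cast<uint32_t>(product);
}

// Hypothetical host-side model of __umulhi(): the most significant
// 32 bits of the full 64-bit product of two 32-bit operands.
inline uint32_t ref_umulhi(uint32_t x, uint32_t y) {
    uint64_t product = static_cast<uint64_t>(x) * static_cast<uint64_t>(y);
    return static_cast<uint32_t>(product >> 32);
}
```

When both operands fit in 24 bits, as thread and block indices often do, the 24-bit result matches the ordinary 32-bit product; when they do not, the upper bits are silently dropped, so the substitution is only safe if the value ranges are known.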

8.2.2 MISCELLANEOUS (BIT MANIPULATION)

The CUDA compiler implements a number of intrinsics for bit manipulation, as summarized in Table 8.5.

Table 8.5 Bit Manipulation Intrinsics

On SM 2.x and later architectures, these intrinsics map to single instructions. On pre-Fermi architectures, they are valid but may compile into many instructions. When in doubt, disassemble and look at the microcode! 64-bit variants have "ll" (two ells for "long long") appended to the intrinsic name: __clzll(), __ffsll(), __popcll(), and __brevll().
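The edge cases of these intrinsics are easy to misremember, so here is a minimal host-side sketch of the 32-bit variants. The `ref_` names are hypothetical, and the loops are written for clarity rather than speed; they model the documented results of `__clz()`, `__popc()`, `__ffs()`, and `__brev()` and say nothing about how the hardware computes them.

```cpp
#include <cstdint>

// Model of __clz(): number of consecutive zero bits starting at bit 31.
// Returns 32 when x is 0.
inline int ref_clz(uint32_t x) {
    int count = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1)
        ++count;
    return count;
}

// Model of __popc(): number of bits set to 1.
inline int ref_popc(uint32_t x) {
    int count = 0;
    while (x != 0) {
        x &= x - 1;  // clear the least significant set bit
        ++count;
    }
    return count;
}

// Model of __ffs(): 1-based position of the least significant set bit.
// Returns 0 when x is 0.
inline int ref_ffs(uint32_t x) {
    if (x == 0) return 0;
    int position = 1;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++position;
    }
    return position;
}

// Model of __brev(): reverse the order of the 32 bits.
inline uint32_t ref_brev(uint32_t x) {
    uint32_t result = 0;
    for (int i = 0; i < 32; ++i) {
        result = (result << 1) | (x & 1u);
        x >>= 1;
    }
    return result;
}
```

Note the asymmetry at zero: the leading-zero count saturates at 32, while find-first-set reports 0 to mean "no bit set" because its positions are 1-based.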

8.2.3 FUNNEL SHIFT (SM 3.5)

GK110 added a 64-bit "funnel shift" instruction that concatenates two 32-bit values together (the least significant and most significant halves are specified as separate 32-bit inputs, but the hardware operates on an aligned register pair), shifts the resulting 64-bit value left or right, and then returns the most significant (for left shift) or least significant (for right shift) 32 bits.

Funnel shift may be accessed with the intrinsics given in Table 8.6. These intrinsics are implemented as inline device functions (using inline PTX assembler) in sm_35_intrinsics.h. By default, the shift count is masked to its least significant 5 bits (that is, taken modulo 32); the _lc and _rc intrinsics instead clamp the shift value to the range 0..32.
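The difference between the wrapping and clamping variants shows up only for shift counts of 32 or more. The following host-side sketch models the four funnel shift intrinsics with ordinary 64-bit arithmetic; the `ref_` names are hypothetical, and the code is a reference model under the semantics described above, not the contents of sm_35_intrinsics.h.

```cpp
#include <cstdint>

// Concatenate hi:lo into one 64-bit value.
inline uint64_t ref_concat(uint32_t lo, uint32_t hi) {
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

// Model of __funnelshift_l(): shift left by (shift mod 32) and return
// the most significant 32 bits.
inline uint32_t ref_funnelshift_l(uint32_t lo, uint32_t hi, uint32_t shift) {
    return static_cast<uint32_t>((ref_concat(lo, hi) << (shift & 31u)) >> 32);
}

// Model of __funnelshift_lc(): as above, but the shift is clamped to 32.
inline uint32_t ref_funnelshift_lc(uint32_t lo, uint32_t hi, uint32_t shift) {
    uint32_t clamped = shift < 32u ? shift : 32u;
    return static_cast<uint32_t>((ref_concat(lo, hi) << clamped) >> 32);
}

// Model of __funnelshift_r(): shift right by (shift mod 32) and return
// the least significant 32 bits.
inline uint32_t ref_funnelshift_r(uint32_t lo, uint32_t hi, uint32_t shift) {
    return static_cast<uint32_t>(ref_concat(lo, hi) >> (shift & 31u));
}

// Model of __funnelshift_rc(): as above, but the shift is clamped to 32.
inline uint32_t ref_funnelshift_rc(uint32_t lo, uint32_t hi, uint32_t shift) {
    uint32_t clamped = shift < 32u ? shift : 32u;
    return static_cast<uint32_t>(ref_concat(lo, hi) >> clamped);
}
```

With a shift of 32, the wrapping forms behave like a shift of 0 (the left shift returns hi, the right shift returns lo), while the clamping forms shift a full word (the left shift returns lo, the right shift returns hi).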

Applications for funnel shift include the following.

Multiword shift operations
Memory copies between misaligned buffers using aligned loads and stores
Rotate

Table 8.6 Funnel Shift Intrinsics

To right-shift data sizes greater than 64 bits, use repeated __funnelshift_r() calls, operating from the least significant to the most significant word. The most significant word of the result is computed using operator>>, which shifts in zero or sign bits as appropriate for the integer type. To left-shift data sizes greater than 64 bits, use repeated __funnelshift_l() calls, operating from the most significant to the least significant word. The least significant word of the result is computed using operator<<. If the hi and lo parameters are the same, the funnel shift effects a rotate operation.
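The right-shift recipe and the rotate case described above can be sketched on the host as follows. The function names, the little-endian word order (`words[0]` least significant), and the restriction to shift counts below 32 are assumptions made for illustration; in device code the call to the hypothetical `ref_funnelshift_r()` model would be replaced by `__funnelshift_r()`.

```cpp
#include <cstddef>
#include <cstdint>

// Host-side model of __funnelshift_r(): low 32 bits of (hi:lo) >> (shift mod 32).
inline uint32_t ref_funnelshift_r(uint32_t lo, uint32_t hi, uint32_t shift) {
    uint64_t value = (static_cast<uint64_t>(hi) << 32) | lo;
    return static_cast<uint32_t>(value >> (shift & 31u));
}

// Right-shift an n-word unsigned integer in place by 0..31 bits.
// Working from the least significant word upward means each step reads
// words[i + 1] before it has been overwritten.
inline void multiword_shift_right(uint32_t* words, size_t n, uint32_t shift) {
    if (n == 0) return;
    for (size_t i = 0; i + 1 < n; ++i)
        words[i] = ref_funnelshift_r(words[i], words[i + 1], shift);
    // The most significant word has no neighbor above it, so an ordinary
    // shift brings in zeros (the words are unsigned).
    words[n - 1] >>= (shift & 31u);
}

// Passing the same value as both halves turns the funnel shift into a rotate.
inline uint32_t rotate_right(uint32_t x, uint32_t shift) {
    return ref_funnelshift_r(x, x, shift);
}
```

The left-shift case mirrors this: iterate from the most significant word down, and finish the least significant word with an ordinary left shift.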
