14.5_Warp_Shuffle
14.5 Warp Shuffle
SM 3.x added a warp shuffle instruction (described in Section 8.6.1) that enables threads to interchange data between registers without writing the data to shared memory. The __shf1() intrinsic can be used to broadcast one thread's register value to all other threads in the warp. As shown in Listing 14.4, instead of using tiles sized to the threadblock and using shared memory, we can use tiles of size 32 (corresponding to the warp size) and broadcast the body description read by each thread to the other threads within the warp.
Interestingly, this strategy has lower performance than the shared memory implementation (34 billion as opposed to 45.2 billion interactions per second). The warp shuffle instruction takes about as long as a read from shared memory, and the computation is tiled at the warp size (32 threads) rather than a thread block size. So it seems the benefits of warp shuffle are best realized when replacing both a write and a read to shared memory, not just a read.
Warp shuffle should only be used if the kernel needs shared memory for other purposes.
Listing 14.7 ComputeNBodyGravitation_Shuffle.
__global__ void
ComputeNBodyGravitation_Shuffle( float \*force, float \*posMass, float softeningSquared, size_t N)
{ const int laneid $=$ threadIdx.x & 31; for ( int i $=$ blockIdx.x\*blockDim.x + threadIdx.x; i $< \mathbb{N}$ . i $+ =$ blockDim.x\*gridDim.x ) { float acc[3] $= \{0\}$ ; float4 myPosMass $=$ ((float4 \*) posMass)[i]; for ( int j $= 0$ ;j $< \mathbb{N}$ ;j $+ = 32$ ) { float4 shufSrcPosMass $=$ ((float4 \*) posMass)[j+laneid]; #pragma unroll 32 for ( int k $= 0$ ;k $< 32$ ;k++){ float fx,fy,fz; float4 shufDstPosMass; shufDstPosMass.x $=$ _shfl(shufSrcPosMass.x,k); shufDstPosMass.y $=$ _shfl(shufSrcPosMass.y,k); shufDstPosMass.z $=$ _shfl(shufSrcPosMass.z,k); shufDstPosMass.w $=$ _shfl(shufSrcPosMass.w,k); bodyBodyInteraction( &fx,&fy,&fz, myPosMass.x,myPosMass.y,myPosMass.z, shufDstPosMass.x, shufDstPosMass.y, shufDstPosMass.z, shufDstPosMass.w, softeningSquared); acc[0] $+ =$ fx; acc[1] $+ =$ fy; acc[2] $+ =$ fz; } } force[3\*i+0] $=$ acc[0]; force[3\*i+1] $=$ acc[1]; force[3\*i+2] $=$ acc[2]; }