6.7 Concurrent Kernel Processing
SM 2.x-class and later GPUs can run multiple kernels concurrently, provided the kernels are launched in different streams and their grids are small enough that a single kernel will not fill the whole GPU. The code in Listing 6.5 (lines 9-14) will cause kernels to run concurrently, provided the number of blocks in each kernel launch is small enough. Since the kernels can only communicate through global memory, we can add some instrumentation to AddKernel() to track how many kernels are running concurrently. Using the following "kernel concurrency tracking" structure
static const int g_maxStreams = 8;

typedef struct KernelConcurrencyData_st {
    int mask;      // mask of active kernels
    int maskMax;   // atomic max of mask popcount
    int masks[g_maxStreams];
    int count;     // number of active kernels
    int countMax;  // atomic max of kernel count
    int counts[g_maxStreams];
} KernelConcurrencyData;

we can add code to AddKernel() to "check in" and "check out" at the beginning and end of the function, respectively. The "check in" takes the "kernel ID" parameter kid (a value in the range 0..NumStreams-1 passed to the kernel), computes the mask 1<<kid corresponding to the kernel ID, and atomically OR's that value into a global. Note that atomicOr() returns the value that was in the memory location before the OR was performed. As a result, the return value has one bit set for every kernel that was active when the atomic OR operation was performed.
Similarly, this code tracks the number of active kernels by incrementing kernelData->count and calling atomicMax() on a shared global.
// check in, and record active kernel mask and count
// as seen by this kernel.
if ( kernelData && blockIdx.x == 0 && threadIdx.x == 0 ) {
    int myMask = atomicOr( &kernelData->mask, 1<<kid );
    kernelData->masks[kid] = myMask | (1<<kid);
    int myCount = atomicAdd( &kernelData->count, 1 );
    atomicMax( &kernelData->countMax, myCount+1 );
    kernelData->counts[kid] = myCount+1;
}

At the bottom of the kernel, similar code clears the mask and decrements the active-kernel count.
// check out
if ( kernelData && blockIdx.x == 0 && threadIdx.x == 0 ) {
    atomicAnd( &kernelData->mask, ~(1<<kid) );
    atomicAdd( &kernelData->count, -1 );
}

The kernelData parameter refers to a device variable declared at file scope.
__device__ KernelConcurrencyData g_kernelData;
Remember that the pointer to g_kernelData must be obtained by calling cudaGetSymbolAddress(). It is possible to write code that references &g_kernelData, but CUDA's language integration will not correctly resolve the address.
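A minimal sketch of that lookup follows; note that the launch configuration, the stream array, and AddKernel's parameter list here are illustrative placeholders, not the sample program's actual code.

```cuda
// Hypothetical host-side setup: obtain the device address of the
// file-scope symbol g_kernelData, then pass it to the kernel as a
// parameter instead of referencing &g_kernelData directly.
KernelConcurrencyData *kernelData = 0;
cudaError_t status =
    cudaGetSymbolAddress( (void **) &kernelData, g_kernelData );
if ( cudaSuccess == status ) {
    // numBlocks, streams[], and the parameter list are placeholders;
    // kid is the kernel ID in the range 0..NumStreams-1.
    AddKernel<<<numBlocks, 256, 0, streams[kid]>>>( kernelData, kid );
}
```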
The concurrencyKernelKernel.cu program adds support for a blocksPerSM command-line option to specify the number of blocks with which to launch these kernels, and it generates a report on the number of kernels that were active. Two sample invocations of concurrencyKernelKernel are as follows.
$ ./concurrencyKernelKernel -blocksPerSM 2
Using 2 blocks per SM on GPU with 14 SMs = 28 blocks
Timing sequential operations...
Kernel data:
Masks: ( 0x1 0x0 0x0 0x0 0x0 0x0 0x0 )
Up to 1 kernels were active: ( 0x1 0x0 0x0 0x0 0x0 0x0 0x0 )
Timing concurrent operations...
Kernel data:
Masks: ( 0x1 0x3 0x7 0xe 0x1c 0x38 0x60 0xe0 )
Up to 3 kernels were active: ( 0x1 0x2 0x3 0x3 0x3 0x3 0x2 0x3 )

$ ./concurrencyKernelKernel -blocksPerSM 3
Using 3 blocks per SM on GPU with 14 SMs = 42 blocks
Timing sequential operations...
Kernel data:
Masks: ( 0x1 0x0 0x0 0x0 0x0 0x0 0x0 )
Up to 1 kernels were active: ( 0x1 0x0 0x0 0x0 0x0 0x0 0x0 )
Timing concurrent operations...
Kernel data:
Masks: ( 0x1 0x3 0x6 0xc 0x10 0x30 0x60 0x80 )
Up to 2 kernels were active: ( 0x1 0x2 0x2 0x2 0x1 0x2 0x2 0x1 )

Note that blocksPerSM is the number of blocks specified to each kernel launch, so a total of numStreams*blocksPerSM blocks are launched in numStreams separate kernels. You can see that the hardware can run more kernels concurrently when the kernel grids are smaller, but there is no performance benefit to concurrent kernel processing for the workload discussed in this chapter.
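As a hedged sketch (not the book's concurrencyKernelKernel.cu itself), the multi-stream launch pattern that makes this concurrency possible looks like the following; the kernel body, stream count, and grid size are placeholders.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for AddKernel(); kid is the kernel ID.
__global__ void TrivialKernel( int kid ) { }

int main()
{
    const int numStreams = 8;
    const int blocksPerLaunch = 2;   // small grid so kernels can overlap
    cudaStream_t streams[numStreams];

    for ( int i = 0; i < numStreams; i++ )
        cudaStreamCreate( &streams[i] );

    // Each launch goes into its own stream; on SM 2.x and later hardware,
    // grids this small may execute concurrently.
    for ( int i = 0; i < numStreams; i++ )
        TrivialKernel<<<blocksPerLaunch, 256, 0, streams[i]>>>( i );

    cudaDeviceSynchronize();
    for ( int i = 0; i < numStreams; i++ )
        cudaStreamDestroy( streams[i] );
    return 0;
}
```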
6.8 GPU/GPU Synchronization: cudaStreamWaitEvent()
Up to this point, all of the synchronization functions described in this chapter have pertained to CPU/GPU synchronization: they either wait for or query the status of a GPU operation. The cudaStreamWaitEvent() function is asynchronous with respect to the CPU and causes the specified stream to wait until an event has been recorded. The stream and event need not be associated with the same CUDA device. Section 9.3 describes how such inter-GPU synchronization may be performed and uses the feature to implement a peer-to-peer memcpy (see Listing 9.1).
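A hedged sketch of a cross-device wait follows (the device pair, kernel names, and stream names are placeholders, and error checking is omitted); it only illustrates the ordering that cudaStreamWaitEvent() establishes.

```cuda
#include <cuda_runtime.h>

__global__ void KernelA( void ) { }
__global__ void KernelB( void ) { }

int main()
{
    // Assumes at least two CUDA devices are present.
    cudaStream_t stream0, stream1;
    cudaEvent_t ev;

    cudaSetDevice( 0 );
    cudaStreamCreate( &stream0 );
    cudaEventCreateWithFlags( &ev, cudaEventDisableTiming );
    KernelA<<<1, 1, 0, stream0>>>();
    cudaEventRecord( ev, stream0 );

    cudaSetDevice( 1 );
    cudaStreamCreate( &stream1 );
    // GPU-side wait: stream1 executes no further work until ev completes,
    // even though ev was recorded on device 0. The CPU does not block here.
    cudaStreamWaitEvent( stream1, ev, 0 );
    KernelB<<<1, 1, 0, stream1>>>();   // ordered after KernelA

    cudaStreamSynchronize( stream1 );
    return 0;
}
```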
6.8.1 STREAMS AND EVENTS ON MULTI-GPU SYSTEMS: NOTES AND LIMITATIONS
- Streams and events exist in the scope of the context (or device) in which they were created. When cuCtxDestroy() or cudaDeviceReset() is called, the associated streams and events are destroyed.
- Kernel launches and cu(da)EventRecord() can only use CUDA streams in the same context/device.
- cu(da)Memcpy() can be called with any stream, but it is best to call it from the source context/device.
- cu(da)StreamWaitEvent() may be called on any event, using any stream.