6.4 CUDA Events: Timing

CUDA events work by submitting a command to the GPU that, when the preceding commands have been completed, causes the GPU to write a 32-bit memory location with a known value. The CUDA driver implements cuEventQuery() and cuEventSynchronize() by examining that 32-bit value. Besides the 32-bit "tracking" value, the GPU also can write a 64-bit timer value sourced from a high-resolution, GPU-based clock.
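The user-visible side of this mechanism can be sketched as follows, using the CUDA runtime equivalents cudaEventQuery() and cudaEventSynchronize(). This is a minimal sketch, assuming some previously enqueued GPU work; error checking is elided for brevity.

```
// Sketch: polling vs. blocking on a CUDA event (runtime API).
cudaEvent_t ev;
cudaEventCreate( &ev );

// ... enqueue GPU work (kernel launches, async memcpys) here ...

cudaEventRecord( ev, NULL );   // GPU writes the tracking value when prior work is done

// Nonblocking: the driver examines the tracking value and returns immediately.
while ( cudaEventQuery( ev ) == cudaErrorNotReady ) {
    // event not yet signaled; the CPU is free to do other work here
}

// Blocking alternative: wait until the GPU has written the tracking value.
cudaEventSynchronize( ev );

cudaEventDestroy( ev );
```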

Because they use a GPU-based clock, timing using CUDA events is less subject to perturbations from system events such as page faults or interrupts, and the function to compute elapsed times from timestamps is portable across all operating systems. That said, the so-called "wall clock" times of operations are ultimately what users see, so CUDA events are best used in a targeted fashion to tune kernels or other GPU-intensive operations, not to report absolute times to the user.

The stream parameter to cuEventRecord() is for interstream synchronization, not for timing. When using CUDA events for timing, it is best to record them in the NULL stream. The rationale is similar to the reason the instructions that superscalar CPUs use to read time stamp counters (e.g., RDTSC on x86) are serializing instructions that flush the pipeline: Forcing a "join" on all the GPU engines eliminates any possible ambiguity about the operations being timed. Just make sure the cu(da)EventRecord() calls bracket enough work that the timing delivers meaningful results.
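Putting these recommendations together, a typical timing sequence can be sketched as follows. This is a minimal sketch, assuming a kernel myKernel() and initialized device pointers deviceIn and deviceOut; error checking is elided for brevity.

```
// Sketch: timing a kernel with CUDA events recorded in the NULL stream.
cudaEvent_t startEvent, stopEvent;
cudaEventCreate( &startEvent );
cudaEventCreate( &stopEvent );

cudaEventRecord( startEvent, NULL );            // record in the NULL stream
myKernel<<<blocks, threads>>>( deviceIn, deviceOut, N );
cudaEventRecord( stopEvent, NULL );

cudaEventSynchronize( stopEvent );              // wait for the stop event to be signaled
float ms = 0.0f;
cudaEventElapsedTime( &ms, startEvent, stopEvent );  // elapsed GPU time in milliseconds

cudaEventDestroy( startEvent );
cudaEventDestroy( stopEvent );
```

Because cudaEventElapsedTime() computes the difference between the two GPU timestamps, the result does not depend on the host operating system's clock.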

Finally, note that CUDA events are intended to time GPU operations. If any synchronous CUDA operations intervene between the two event records, the events end up timing the resulting CPU/GPU synchronization rather than just the GPU work, invalidating the measurement.

CUDART_CHECK( cudaEventRecord( startEvent, NULL ) );
// synchronous memcpy - invalidates CUDA event timing
CUDART_CHECK( cudaMemcpy( deviceIn, hostIn, N*sizeof(int), cudaMemcpyHostToDevice ) );
CUDART_CHECK( cudaEventRecord( stopEvent, NULL ) );

The example explored in the next section illustrates how to use CUDA events for timing.