A.1 Timing

The CUDA Handbook library includes a portable timing library that uses QueryPerformanceCounter() on Windows and gettimeofday() on non-Windows platforms. An example usage is as follows.

float   
TimeNULLKernelLaunches(int cIterations  $= 1000000$  ） { chTimerTimestamp start，stop;

chTimerGetTime( &start ); for ( int i = 0; i < cIterations; i++ ) { NullKernel<<<1,1>>(); } CUDAThreadSynchronize(); chTimerGetTime( &stop ); return le6*chTimerElapsedTime( &start, &stop ) / (float) cIterations; }

This function times the specified number of kernel launches and returns the microseconds per launch. chTimerTimestamp is a high-resolution timestamp. Usually it is a 64-bit counter that increases monotonically over time, so two timestamps are needed to compute a time interval.

The chTimerGetTime() function takes a snapshot of the current time. The chTimerElapsedTime() function returns the number of seconds that elapsed between two timestamps. The resolution of these timers is very fine (perhaps a microsecond), so chTimerElapsedTime() returns double.

#ifndef _WIN32
#include <windows.h>
typedef LARGE_INTEGER chTimerTimestamp;
#else
typedef struct timeval chTimerTimestamp;
#endif
void chTimerGetTime(chTimerTimestamp *p);
double chTimerElapsedTime( chTimerTimestamp *pStart, chTimerTimestamp *pEnd );
double chTimerBandwidth( chTimerTimestamp *pStart, chTimerTimestamp *pEnd, double cBytes );

We may use CUDA events when measuring performance in isolation on the CUDA-capable GPU, such as when measuring device memory bandwidth of a kernel. Using CUDA events for timing is a two-edged sword: They are less affected by spurious system-level events, such as network traffic, but that sometimes can lead to overly optimistic timing results.

A.1_Timing

A.1 Timing