
11.2 Asynchronous Memcpy

Unless the input and output data can stay resident on the GPU, the logistics of streaming the data through the GPU—copying the input and output data to and from device memory—become the primary consideration. The two tools best suited to improve transfer performance are pinned memory and asynchronous memcpy (which can only operate on pinned memory).
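As a quick illustration of the idiom (a minimal sketch, not taken from the book's sample applications; the names hptr, dptr, and stream are placeholders), pinned host memory is allocated with cudaMallocHost() rather than malloc(), and cudaMemcpyAsync() then queues the copy into a stream and returns immediately.

#include <cuda_runtime.h>

int main()
{
    const size_t N = 1048576;
    float *hptr, *dptr;
    cudaStream_t stream;

    cudaStreamCreate( &stream );
    cudaMallocHost( &hptr, N*sizeof(float) );   // pinned (page-locked) host allocation
    cudaMalloc( &dptr, N*sizeof(float) );

    for ( size_t i = 0; i < N; i++ ) hptr[i] = (float) i;

    // Queues the copy in the stream and returns without waiting for it to finish
    cudaMemcpyAsync( dptr, hptr, N*sizeof(float), cudaMemcpyHostToDevice, stream );

    // ... kernel launches into the same stream could be queued here ...

    cudaStreamSynchronize( stream );            // block until all queued work is done

    cudaFree( dptr );
    cudaFreeHost( hptr );
    cudaStreamDestroy( stream );
    return 0;
}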

The stream2Async.cu application illustrates the effect of moving the pageable host memory of stream1Device.cu into pinned memory and invoking the memcpys asynchronously.

Measuring times with 128M floats (use --N to specify number of Mfloats)
Memcpy(host->device): 181.03 ms (5931.33 MB/s)
Kernel processing: 13.87 ms (116152.99 MB/s)
Memcpy (device->host): 90.07 ms (5960.35 MB/s)

Total time (wall clock): 288.68 ms (5579.29 MB/s)
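For reference, the reported bandwidths follow directly from the transfer sizes (this arithmetic is a check of mine, not part of the program's output). Each array of 128M floats occupies 128 × 2^20 × 4 ≈ 537 MB, so the host→device figure covers both input arrays: about 1074 MB / 0.18103 s ≈ 5931 MB/s. The kernel figure counts two reads and one write per element, roughly 1611 MB / 0.01387 s ≈ 116,000 MB/s, and the wall-clock figure divides the same 1611 MB by the total 288.68 ms, giving about 5579 MB/s.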

Listing 11.5 contrasts the timed portions of stream1Device.cu (which performs synchronous transfers) and stream2Async.cu (which performs asynchronous transfers). In both cases, four CUDA events are used to record the times at the start, after the host→device transfers, after the kernel launch, and at the end. For stream2Async.cu, all of these operations are requested of the GPU in quick succession, and the GPU records the event times as it performs them. For stream1Device.cu, the GPU event-based times are a bit suspect, since the cudaMemcpy() calls must wait for the GPU to finish before returning, causing a pipeline bubble before the cudaEventRecord() calls for evHtoD and evDtoH are processed.

Note that even though it uses the slower, naive implementation of saxpyGPU (from Listing 11.2), this application completes the computation almost twice as fast by wall clock time: 289 ms versus 570.22 ms. The combination of faster transfers and asynchronous execution delivers much better performance.

Despite the improved performance, the application output highlights another performance opportunity: Some of the kernel processing can be performed concurrently with transfers. The next two sections describe two different methods to overlap kernel execution with transfers.

Listing 11.5 Synchronous (stream1Device.cu) versus asynchronous (stream2Async.cu).

//
// from stream1Device.cu
//
cudaEventRecord( evStart, 0 );
cudaMemcpy( dptrX, hptrX, ..., cudaMemcpyHostToDevice );
cudaMemcpy( dptrY, hptrY, ..., cudaMemcpyHostToDevice );
cudaEventRecord( evHtoD, 0 );
saxpyGPU<<<nBlocks, nThreads>>>( dptrOut, dptrX, dptrY, N, alpha );
cudaEventRecord( evKernel, 0 );
cudaMemcpy( hptrOut, dptrOut, N*sizeof(float), cudaMemcpyDeviceToHost );
cudaEventRecord( evDtoH, 0 );
cudaDeviceSynchronize();
//
// from stream2Async.cu
//
cudaEventRecord( evStart, 0 );
cudaMemcpyAsync( dptrX, hptrX, ..., cudaMemcpyHostToDevice, NULL );
cudaMemcpyAsync( dptrY, hptrY, ..., cudaMemcpyHostToDevice, NULL );
cudaEventRecord( evHtoD, 0 );
saxpyGPU<<<nBlocks, nThreads>>>( dptrOut, dptrX, dptrY, N, alpha );
cudaEventRecord( evKernel, 0 );
cudaMemcpyAsync( hptrOut, dptrOut, N*sizeof(float), ..., NULL );
cudaEventRecord( evDtoH, 0 );
cudaDeviceSynchronize();
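
For completeness, here is a sketch of how the three reported times can be derived from the four recorded events (this is not the exact code from either application; error checking is omitted).

float msHtoD, msKernel, msDtoH, msTotal;
// Event timestamps are taken on the GPU, so the elapsed times reflect when the
// GPU actually executed the surrounding operations, not when they were requested.
cudaEventElapsedTime( &msHtoD,   evStart,  evHtoD );
cudaEventElapsedTime( &msKernel, evHtoD,   evKernel );
cudaEventElapsedTime( &msDtoH,   evKernel, evDtoH );
cudaEventElapsedTime( &msTotal,  evStart,  evDtoH );

Because cudaDeviceSynchronize() has already been called, all four events are guaranteed to have been recorded before their elapsed times are queried.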