11.4 Mapped Pinned Memory
For transfer-bound, streaming workloads such as SAXPY, reformulating the application to use mapped pinned memory for both the input and output confers a number of benefits.
As shown by the excerpt from stream4Mapped.cu (Listing 11.7), it eliminates the need to call cudaMemcpy(), and it eliminates the need to allocate device memory.

For discrete GPUs, mapped pinned memory still performs bus transfers, but it minimizes the amount of "overhang" alluded to in the previous section. Instead of waiting for a host→device memcpy to finish, input data can be processed by the SMs as soon as it arrives over the bus. Instead of waiting for a kernel to complete before initiating a device→host transfer, the data is posted to the bus as soon as the SMs are done processing it.
For integrated GPUs, host and device memory exist in the same memory pool, so mapped pinned memory enables "zero copy" and eliminates any need to transfer data over the bus at all.
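Listing 11.7 shows only the timed portion of the computation; it relies on the input and output arrays having been allocated as mapped pinned memory. A minimal sketch of that allocation pattern follows (the pointer names are illustrative, not copied from stream4Mapped.cu):

// Sketch of the allocation pattern the listing assumes; error checking
// is omitted and the names are illustrative.
float *hptrX, *hptrY, *hptrOut;   // host pointers (pinned and mapped)
float *dptrX, *dptrY, *dptrOut;   // device aliases of the same memory

// Must be called before the CUDA context is created.
cudaSetDeviceFlags( cudaDeviceMapHost );

cudaHostAlloc( (void **) &hptrX,   N*sizeof(float), cudaHostAllocMapped );
cudaHostAlloc( (void **) &hptrY,   N*sizeof(float), cudaHostAllocMapped );
cudaHostAlloc( (void **) &hptrOut, N*sizeof(float), cudaHostAllocMapped );

cudaHostGetDevicePointer( (void **) &dptrX,   hptrX,   0 );
cudaHostGetDevicePointer( (void **) &dptrY,   hptrY,   0 );
cudaHostGetDevicePointer( (void **) &dptrOut, hptrOut, 0 );

The device pointers returned by cudaHostGetDevicePointer() are the ones passed to the kernel launch in the listing.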
Listing 11.7 stream4Mapped excerpt.
chTimerGetTime( &chStart );
cudaEventRecord( evStart, 0 );
saxpyGPU<<<nBlocks, nThreads>>>( dptrOut, dptrX, dptrY, N, alpha );
cudaEventRecord( evStop, 0 );
cudaDeviceSynchronize();

Mapped pinned memory works especially well when writing to host memory (for example, to deliver the result of a reduction to the host) because, unlike reads, there is no need to wait until writes arrive before continuing execution. Workloads that read mapped pinned memory are more problematic. If the GPU cannot sustain full bus performance while reading from mapped pinned memory, the lower transfer performance may overwhelm the benefit of a smaller overhang. Also, for some workloads, the SMs have better things to do than drive (and wait for) PCI Express bus traffic.
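As an illustration of the write case, a reduction can deliver its result straight into mapped pinned host memory. The kernel below is a hypothetical sketch (its name and single-block formulation are not taken from the book's samples); out is a device alias, obtained with cudaHostGetDevicePointer(), of a one-element mapped pinned allocation.

// Hypothetical sketch: a single-block sum reduction whose final result is
// stored directly into mapped pinned host memory through 'out'. The write
// is posted to the bus when issued; after cudaDeviceSynchronize(), the host
// reads the value without any cudaMemcpy().
__global__ void reduceSum( float *out, const float *in, size_t N )
{
    extern __shared__ float partials[];
    float sum = 0.0f;
    for ( size_t i = threadIdx.x; i < N; i += blockDim.x )
        sum += in[i];
    partials[threadIdx.x] = sum;
    __syncthreads();
    for ( int active = blockDim.x>>1; active; active >>= 1 ) {
        if ( threadIdx.x < active )
            partials[threadIdx.x] += partials[threadIdx.x+active];
        __syncthreads();
    }
    if ( threadIdx.x == 0 )
        *out = partials[0];   // write lands directly in host memory
}

A launch such as reduceSum<<<1, 256, 256*sizeof(float)>>>( dptrSum, dptrIn, N ) would be followed by cudaDeviceSynchronize(), after which the host simply reads the sum through its host pointer.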
In the case of our application, on our test system, mapped pinned memory is a definite win.
Measuring times with 128M floats (use --N to specify number of Mfloats)
Total time: 204.54 ms (7874.45 MB/s)

It completes the computation in 204.54 ms, significantly faster than the 273 ms of the second-fastest implementation. The effective bandwidth of 7.9 GB/s shows that the GPU is pushing data in both directions over PCI Express.
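That figure is consistent with the amount of data SAXPY moves across the bus: three arrays of 128M floats (two read, one written) total 3 × 512 MiB ≈ 1611 MB, and 1611 MB / 204.54 ms ≈ 7874 MB/s.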
Not all combinations of systems and GPUs can sustain such high levels of performance with mapped pinned memory. If there's any doubt, keep the data in device memory and use the asynchronous memcpy formulations, similar to stream2Async.cu.
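A runtime capability check is a reasonable first gate for that decision; the sketch below (not part of the book's sample code) queries cudaDeviceProp::canMapHostMemory and falls back to the device-memory path when mapping is unavailable. Even when the device supports mapping, the performance caveat above still applies, so measuring both paths is worthwhile.

// Sketch: decide at startup whether the mapped pinned memory path
// is even available on this device.
cudaDeviceProp prop;
int device = 0;
cudaGetDeviceProperties( &prop, device );
if ( prop.canMapHostMemory ) {
    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags( cudaDeviceMapHost );
    // ... allocate with cudaHostAllocMapped, as in stream4Mapped.cu ...
}
else {
    // ... keep the data in device memory and use asynchronous memcpys,
    //     as in stream2Async.cu ...
}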