6.6 Mapped Pinned Memory
Mapped pinned memory can be used to overlap PCI Express transfers and kernel processing, especially for device→host copies, where there is no need to cover the long latency to host memory. Mapped pinned memory has stricter alignment requirements than the native GPU memcpy, since its memory transactions must be coalesced. Uncoalesced memory transactions run two to six times slower when using mapped pinned memory.
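As a reminder of the mechanics, mapped pinned memory is allocated with cudaHostAlloc() and the cudaHostAllocMapped flag, and kernels access it through a device pointer obtained from cudaHostGetDevicePointer(). The sketch below illustrates the pattern; the buffer size, kernel name, and launch configuration are placeholders, and error checking is elided.

```cuda
int *hostPtr, *devPtr;
size_t bytes = N*sizeof(int);       // N is illustrative

// On platforms without unified virtual addressing, host-memory mapping
// must be enabled before the CUDA context is created.
cudaSetDeviceFlags( cudaDeviceMapHost );

// Allocate page-locked host memory that is mapped into the device's
// address space.
cudaHostAlloc( (void **) &hostPtr, bytes, cudaHostAllocMapped );

// Obtain the device pointer corresponding to the host allocation.
// Kernels dereferencing this pointer read and write host memory
// directly over PCI Express.
cudaHostGetDevicePointer( (void **) &devPtr, hostPtr, 0 );

// A kernel writing through devPtr delivers its output to host memory
// with no separate memcpy; synchronize before the CPU reads it.
SomeKernel<<<numBlocks, numThreads>>>( devPtr /* ... */ );
cudaDeviceSynchronize();
```

Because the writes travel over the bus as they are issued, the device→host "copy" is folded into kernel execution, which is the overlap this section exploits.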
A naïve port of our concurrencyMemcpyKernelMapped.cu program yields an interesting result: On a cg1.4xlarge instance in Amazon EC2, mapped pinned memory runs very slowly for values of cycles below 64.
For small values of cycles, the kernel takes a long time to run, as if cycles were greater than 200! Only NVIDIA can say for certain what causes this performance anomaly, but it is not difficult to work around: by unrolling the inner loop of the kernel, we create more work per thread, and performance improves.
Listing 6.6 AddKernel() with loop unrolling.
template<const int unrollFactor>
__device__ void
AddKernel_helper( int *out, const int *in, size_t N, int increment, int cycles )
{
    for ( size_t i = unrollFactor*blockIdx.x*blockDim.x+threadIdx.x;
          i < N;
          i += unrollFactor*blockDim.x*gridDim.x )
    {
        int values[unrollFactor];
        for ( int iUnroll = 0; iUnroll < unrollFactor; iUnroll++ ) {