Figure 15.4 Image in shared memory.

The total amount of shared memory needed per block is then the pitch multiplied by the number of rows (block height plus template height).

sharedMem = sharedPitch*(threads.y+hTemplate);
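As a worked illustration of the formula above, the following host-side sketch computes the per-block byte count. The helper name sharedMemBytes() and the sample numbers are hypothetical and do not come from the listing; sharedPitch is taken as an input, assumed to be measured in bytes, and how it is derived is outside this sketch. The code makes no CUDA calls, so it should compile with any C++ compiler as well as nvcc.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes of dynamic shared memory needed by one block.
// The tile has (blockHeight + hTemplate) rows, each sharedPitch bytes wide.
// sharedPitch is assumed to be a byte count supplied by the caller.
inline std::size_t sharedMemBytes(int sharedPitch, int blockHeight, int hTemplate)
{
    assert(sharedPitch > 0 && blockHeight > 0 && hTemplate > 0);
    return static_cast<std::size_t>(sharedPitch) *
           static_cast<std::size_t>(blockHeight + hTemplate);
}
```

For instance, a pitch of 96 bytes with a block height of 8 and a template height of 52 gives 96 * (8 + 52) = 5760 bytes per block.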

The host code that launches corrShared_kernel(), shown in Listing 15.6, checks whether the launch would require more shared memory per block than the device provides. If so, it falls back to corrTexTex2D() with 8x8 thread blocks, which works for any template size.

Listing 15.6 corrShared() (host code).

void
corrShared(
float *dCorr, int CorrPitch,
int wTile,
int wTemplate, int hTemplate,
float cPixels,
float fDenomExp,
int sharedPitch,
int xOffset, int yOffset,
int xTemplate, int yTemplate,
int xUL, int yUL, int w, int h,
dim3 threads, dim3 blocks,
int sharedMem)
{
    int device;
    cudaDeviceProp props;
    cudaError_t status;
    CUDA_CHECK(cudaGetDevice(&device));
    CUDA_CHECK(cudaGetDeviceProperties(&props, device));
    if (sharedMem > props.sharedMemPerBlock) {
        dim3 threads88(8, 8, 1);
        dim3 blocks88;
        blocks88.x = INTCEIL(w, 8);
        blocks88.y = INTCEIL(h, 8);
        blocks88.z = 1;
        return corrTexTex2D(dCorr, CorrPitch, wTile, wTemplate, hTemplate, cPixels, fDenomExp, sharedPitch, xOffset, yOffset, xTemplate, yTemplate, xUL, yUL, w, h, threads88, blocks88, sharedMem);
    }
    corrShared_kernel<<<blocks, threads, sharedMem>>>(dCorr, CorrPitch, wTile, wTemplate, hTemplate, (float) xOffset, (float) yOffset, cPixels, fDenomExp, sharedPitch, (float) xUL, (float) yUL, w, h);
}
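The fallback decision in Listing 15.6 can be isolated as plain host logic, which makes it easy to check without a GPU. The sketch below is only an illustration under stated assumptions: LaunchPlan, planLaunch(), and intCeil() are hypothetical names, intCeil() assumes that INTCEIL(a, b) means integer division rounded up, and the shared memory limit is passed in as a parameter rather than queried from the device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for INTCEIL, assumed to round a/b up.
inline int intCeil(int a, int b) { return (a + b - 1) / b; }

// Hypothetical plain-data description of a launch (not a CUDA type).
struct LaunchPlan {
    bool useSharedKernel;  // false means fall back to the texture path
    int blockX, blockY;    // threads per block
    int gridX, gridY;      // blocks in the grid
};

// Mirrors the branch in the listing: if the request exceeds the
// per-block limit, use 8x8 blocks and enough of them to cover w x h;
// otherwise keep the caller's configuration.
inline LaunchPlan planLaunch(std::size_t sharedMem,
                             std::size_t sharedMemPerBlock,
                             int w, int h,
                             int blockX, int blockY,
                             int gridX, int gridY)
{
    assert(w > 0 && h > 0);
    if (sharedMem > sharedMemPerBlock) {
        return LaunchPlan{false, 8, 8, intCeil(w, 8), intCeil(h, 8)};
    }
    return LaunchPlan{true, blockX, blockY, gridX, gridY};
}
```

With a 100x50 output region, a request that exceeds the limit yields a 13x7 grid of 8x8 blocks, because 100/8 and 50/8 both round up. Keeping the limit as a parameter is a design choice for this sketch; the listing itself reads it from the device properties at run time.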