10.9 2D Texturing: Copy Avoidance
When CUDA was first introduced, CUDA kernels could read from CUDA arrays only via texture. Applications could write to CUDA arrays only with memory copies; in order for CUDA kernels to write data that would then be read through texture, they had to write to device memory and then perform a device→array memcpy. Since then, two mechanisms have been added that remove this step for 2D textures.
- A 2D texture can be bound to a pitch-allocated range of linear device memory.
- Surface load/store intrinsics enable CUDA kernels to write to CUDA arrays directly.
3D texturing from device memory and 3D surface load/store are not supported.
For applications that read most or all of the texture contents with a regular access pattern (such as a video codec), or for applications that must work on Tesla-class hardware, it is best to keep the data in device memory. For applications that perform random (but localized) accesses when texturing, it is probably best to keep the data in CUDA arrays and use surface read/write intrinsics.
10.9.1 2D TEXTURING FROM DEVICE MEMORY
Texturing from 2D device memory does not have any of the benefits of "block linear" addressing—a cache line fill into the texture cache pulls in a horizontal span of texels, not a 2D or 3D block of them—but unless the application performs random access into the texture, the benefits of avoiding a copy from device memory to a CUDA array likely outweigh the penalties of losing block linear addressing.
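In kernel code, fetches from a texture bound to pitched device memory look identical to fetches from a texture bound to a CUDA array. The following sketch illustrates this; the texture reference name, the `float` element type, and the kernel itself are assumptions for illustration, not part of the microdemos.

```c
// Texture reference, bound on the host to pitch-allocated device memory.
texture<float, 2, cudaReadModeElementType> tex;

// Each thread fetches one texel via tex2D() and writes it to a pitched
// output buffer. The fetch syntax is the same as for a CUDA array.
__global__ void
CopyThroughTexture( float *out, size_t outPitch, int Width, int Height )
{
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    if ( col < Width && row < Height ) {
        float *outRow = (float *) ((char *) out + row*outPitch);
        outRow[col] = tex2D( tex, (float) col, (float) row );
    }
}
```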
To bind a 2D texture reference to a device memory range, call cudaBindTexture2D().
```c
cudaBindTexture2D(
    NULL,
    &tex,
    texDevice,
    &channelDesc,
    inWidth,
    inHeight,
    texPitch );
```

The above call binds the texture reference tex to the 2D device memory range given by texDevice/texPitch. The base address and pitch must conform to hardware-specific alignment constraints: the base address must be aligned with respect to cudaDeviceProp::textureAlignment, and the pitch must be aligned with respect to cudaDeviceProp::texturePitchAlignment. The microdemo tex2d_addressing_device.cu is identical to tex2d_addressing.cu, except that it uses device memory to hold the texture data. The two programs are designed to be similar enough that the differences can be reviewed with diff. A device pointer/pitch tuple is declared instead of a CUDA array.
```
< cudaArray *texArray = 0;
---
> T *texDevice = 0;
> size_t texPitch;
```

cudaMallocPitch() is called instead of cudaMallocArray().
cudaMallocPitch() delegates selection of the base address and pitch to the driver, so the code will continue to work on future generations of hardware (which have a tendency to increase alignment requirements).
```
< CUDART_CHECK(cudaMallocArray( &texArray,
<                               &channelDesc,
<                               inWidth,
<                               inHeight));
---
> CUDART_CHECK(cudaMallocPitch( &texDevice,
>                               &texPitch,
>                               inWidth*sizeof(T),
>                               inHeight));
```

Next, cudaBindTexture2D() is called instead of cudaBindTextureToArray().
```
< CUDART_CHECK(cudaBindTextureToArray( tex, texArray ));
---
> CUDART_CHECK(cudaBindTexture2D( NULL,
>                                 &tex,
>                                 texDevice,
>                                 &channelDesc,
>                                 inWidth,
>                                 inHeight,
>                                 texPitch ));
```

The final difference is that instead of freeing the CUDA array, cudaFree() is called on the pointer returned by cudaMallocPitch().
```
< CUDART_CHECK( cudaFreeArray( texArray ) );
---
> CUDART_CHECK( cudaFree( texDevice ) );
```

10.9.2 2D SURFACE READ/WRITE
As with 1D surface read/write, Fermi-class hardware enables kernels to write directly into CUDA arrays with intrinsic surface read/write functions.
```c
template<class Type>
Type surf2Dread( surface<void, 2> surfRef,
                 int x, int y,
                 boundaryMode = cudaBoundaryModeTrap );

template<class Type>
void surf2Dwrite( surface<void, 2> surfRef,
                  Type data,
                  int x, int y,
                  boundaryMode = cudaBoundaryModeTrap );
```
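As an illustration of how the two intrinsics compose, the following sketch copies one surface to another. The surfIn/surfOut reference names and the kernel itself are illustrative, not part of the microdemos.

```c
surface<void, 2> surfIn;
surface<void, 2> surfOut;

// Copy a Width x Height region from surfIn to surfOut.  Note that the
// x coordinates passed to surf2Dread()/surf2Dwrite() are byte offsets,
// so each column index is scaled by sizeof(T).
template<typename T>
__global__ void
surf2Dcopy_kernel( int Width, int Height )
{
    for ( int row = blockIdx.y*blockDim.y + threadIdx.y;
          row < Height;
          row += blockDim.y*gridDim.y ) {
        for ( int col = blockIdx.x*blockDim.x + threadIdx.x;
              col < Width;
              col += blockDim.x*gridDim.x ) {
            T value = surf2Dread<T>( surfIn, col*sizeof(T), row );
            surf2Dwrite( value, surfOut, col*sizeof(T), row );
        }
    }
}
```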
The surface reference declaration and corresponding CUDA kernel for a 2D surface memset, given in surf2Dmemset.cu, are as follows.
```c
surface<void, 2> surf2D;

template<typename T>
__global__ void
surf2Dmemset_kernel( T value,
                     int xOffset, int yOffset,
                     int Width, int Height )
{
    for ( int row = blockIdx.y*blockDim.y + threadIdx.y;
          row < Height;
          row += blockDim.y*gridDim.y ) {
        for ( int col = blockIdx.x*blockDim.x + threadIdx.x;
              col < Width;
              col += blockDim.x*gridDim.x ) {
            surf2Dwrite( value,
                         surf2D,
                         (xOffset+col)*sizeof(T),
                         yOffset+row );
        }
    }
}
```

Remember that the X offset parameter to surf2Dwrite() is given in bytes.
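On the host side, the kernel needs a CUDA array that was created with the cudaArraySurfaceLoadStore flag and bound to the surface reference. A sketch of that setup follows; the variable names and launch configuration are assumptions, and error checking is omitted for brevity.

```c
// Host-side sketch: create a surface-writable CUDA array, bind it to
// the surf2D reference, and launch the 2D memset kernel.
cudaArray *array = 0;
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

// The array must be created with cudaArraySurfaceLoadStore, or
// surface writes to it will fail.
cudaMallocArray( &array, &desc, Width, Height, cudaArraySurfaceLoadStore );
cudaBindSurfaceToArray( surf2D, array );

// Set every float in the Width x Height array to 3.14f.
surf2Dmemset_kernel<float><<<dim3(8,8), dim3(16,16)>>>(
    3.14f, 0, 0, Width, Height );
```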