10.13 Texturing Quick References

10.13.1 HARDWARE CAPABILITIES

Hardware Limits

Queries—Driver API

Most of the hardware limits listed above can be queried by calling cuDeviceGetAttribute() with the following values.

Queries—CUDA Runtime

The following members of cudaDeviceProp contain the hardware limits listed above.
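
For a runtime-API program, a minimal sketch of such a query might look like the following; it assumes cudaGetDeviceProperties() on device 0 and reads only a few of the texture-limit members of cudaDeviceProp (error handling omitted for brevity).

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    // Query the properties of device 0.
    cudaGetDeviceProperties( &prop, 0 );

    // A few of the texture-related limits reported by the CUDA runtime.
    printf( "Max 1D texture width:   %d\n", prop.maxTexture1D );
    printf( "Max 2D texture size:    %d x %d\n",
            prop.maxTexture2D[0], prop.maxTexture2D[1] );
    printf( "Max 3D texture size:    %d x %d x %d\n",
            prop.maxTexture3D[0], prop.maxTexture3D[1], prop.maxTexture3D[2] );
    printf( "Max 2D layered texture: %d x %d, %d layers\n",
            prop.maxTexture2DLayered[0], prop.maxTexture2DLayered[1],
            prop.maxTexture2DLayered[2] );
    return 0;
}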

10.13.2 CUDA RUNTIME

1D Textures

2D Textures

  • If surface load/store is desired, specify the cudaArraySurfaceLoadStore flag, as in the sketch below.
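
As a minimal sketch of such an allocation (assuming a float-valued 2D CUDA array; the helper name and the width/height parameters are hypothetical):

#include <cuda_runtime.h>

// Allocate a 2D CUDA array that can also be written with surface stores.
// The function name and the width/height parameters are hypothetical;
// error handling is left to the caller.
cudaError_t allocate2DSurfaceArray( cudaArray_t *array, size_t width, size_t height )
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    // cudaArraySurfaceLoadStore requests surface load/store capability.
    return cudaMallocArray( array, &desc, width, height, cudaArraySurfaceLoadStore );
}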

3D Textures

1D Layered Textures

2D Layered Textures

10.13.3 DRIVER API

1D Textures

The texture size limit for device memory is not queryable; it is 2^27 elements on all CUDA-capable GPUs. The texture size limit for 1D CUDA arrays may be queried by calling cuDeviceGetAttribute() with CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_WIDTH.
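
A minimal sketch of that query with the driver API might look like the following (assuming device 0 and omitting error handling):

#include <cuda.h>
#include <stdio.h>

int main()
{
    CUdevice dev;
    int maxTex1DArrayWidth = 0;

    // Initialize the driver API and obtain a handle to device 0.
    cuInit( 0 );
    cuDeviceGet( &dev, 0 );

    // Maximum width of a 1D texture bound to a CUDA array.
    cuDeviceGetAttribute( &maxTex1DArrayWidth,
                          CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_WIDTH,
                          dev );
    printf( "Max 1D CUDA array texture width: %d\n", maxTex1DArrayWidth );
    return 0;
}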

2D Textures

3D Textures

1D Layered Textures

2D Layered Textures

PART III

Streaming Workloads

Streaming workloads are among the simplest that can be ported to CUDA: computations where each data element can be computed independently of the others, often with such low computational density that the workload is bandwidth-bound. Streaming workloads do not use many of the hardware resources of the GPU, such as caches and shared memory, that are designed to optimize reuse of data.

Since GPUs give the biggest benefits on workloads with high computational density, it is worth reviewing the cases in which it still makes sense to port streaming workloads to the GPU.

  • If the input and output are in device memory, it doesn't make sense to transfer the data back to the CPU just to perform one operation.

  • If the GPU has much better instruction-level support than the CPU for the operation (e.g., Black-Scholes options computation, which uses Special Function Unit instructions intensively), the GPU can outperform the CPU despite memory transfer overhead.

  • Having the GPU operate concurrently with the CPU can approximately double performance, even if the two process the data at the same speed.

  • The CUDA code for a given workload may be more readable or maintainable than highly optimized CPU code for the same computation.

  • On integrated systems (i.e., systems-on-a-chip with the CPU and a CUDA-capable GPU operating on the same memory), there is no transfer overhead; CUDA can use "zero-copy" methods and avoid the copy entirely, as in the sketch following this list.
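
As a minimal sketch of the zero-copy approach (the kernel myKernel and the helper launchZeroCopy are hypothetical; some platforms also require cudaSetDeviceFlags(cudaDeviceMapHost) before the allocation):

#include <cuda_runtime.h>

// Hypothetical kernel that reads and writes the mapped buffer in place.
__global__ void myKernel( float *data, size_t N )
{
    for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
          i < N;
          i += blockDim.x*gridDim.x ) {
        data[i] *= 2.0f;
    }
}

cudaError_t launchZeroCopy( size_t N )
{
    float *hostPtr = 0, *devicePtr = 0;

    // Allocate mapped pinned host memory; the GPU can access it directly,
    // so no explicit cudaMemcpy() is needed.
    cudaError_t status = cudaHostAlloc( (void **) &hostPtr,
                                        N*sizeof(float),
                                        cudaHostAllocMapped );
    if ( cudaSuccess != status )
        return status;

    // Obtain the device-side pointer that aliases the host allocation.
    status = cudaHostGetDevicePointer( (void **) &devicePtr, hostPtr, 0 );
    if ( cudaSuccess == status ) {
        myKernel<<<256, 256>>>( devicePtr, N );
        status = cudaDeviceSynchronize();
    }
    cudaFreeHost( hostPtr );
    return status;
}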

This chapter covers every aspect of streaming workloads, giving different formulations of the same workload to highlight the different issues that arise. The workload in question—the SAXPY operation from the BLAS library—performs a scalar multiplication and vector addition together in a single operation.

Listing 11.1 gives a trivial C implementation of SAXPY. For corresponding elements in the two input arrays, one element is scaled by a constant, added to the other, and written to the output array. Both input arrays and the output array consist of N elements. Since GPUs have a native multiply-add instruction, the innermost loop of SAXPY has an extremely modest number of instructions per memory access.

Listing 11.1 saxpyCPU.

void
saxpyCPU( float *out, const float *x, const float *y, size_t N, float alpha )
{
    for ( size_t i = 0; i < N; i++ ) {
        out[i] = alpha*x[i] + y[i];
    }
}

Listing 11.2 gives a trivial CUDA implementation of SAXPY. This version works for any grid or block size, and it performs adequately for most applications. This kernel is so bandwidth-bound that most applications would benefit more from restructuring the application to increase the computational density than from optimizing this tiny kernel.

Listing 11.2 saxpyGPU.

__global__ void
saxpyGPU( float *out, const float *x, const float *y, size_t N, float alpha )
{
    for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
          i < N;
          i += blockDim.x*gridDim.x ) {
        out[i] = alpha*x[i] + y[i];
    }
}
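
A minimal sketch of a host-side launch might look like the following; the device pointers dOut, dX, and dY are hypothetical and assumed to have been allocated with cudaMalloc() and populated by the caller. Because the kernel uses a grid-stride loop, the block and grid sizes shown are merely reasonable defaults.

// Hypothetical host-side launch helper; dOut, dX, and dY are device
// pointers previously allocated with cudaMalloc() and filled by the caller.
void launchSaxpy( float *dOut, const float *dX, const float *dY,
                  size_t N, float alpha )
{
    const int threadsPerBlock = 256;
    // Round up, then cap the grid size; the grid-stride loop in saxpyGPU
    // covers any elements beyond blocks*threadsPerBlock.
    int blocks = (int) ((N + threadsPerBlock - 1) / threadsPerBlock);
    if ( blocks > 1024 )
        blocks = 1024;
    saxpyGPU<<<blocks, threadsPerBlock>>>( dOut, dX, dY, N, alpha );
}

Capping the grid keeps the launch configuration valid for arbitrarily large N while still providing enough blocks to occupy the machine.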

The bulk of this chapter discusses how to move data to and from host memory efficiently, but first we'll spend a moment examining how to improve this kernel's performance when operating on device memory.