README
Index
64-bit addressing, xxii device pointers, 132 and UVA, 30-31
A
Absolute value, 260
Address spaces, 22-32
Adobe CS5, 5
Affinity, 15-16, 128-130 all() intrinsic, 271
Amazon Machine Image (AMI), 113-114
Amazon Web Services, 109-117
AMBER molecular modeling package, 427
Amdahl's Law, 35-36, 188, 195
AMI, see Amazon Machine Image any() intrinsic, 271
ARM, 19
Array of structures (AOS), 429-430
Arrays, CUDA, see CUDA Arrays
Asynchronous operations kernel launch, 205-206, 209 memory copy, 178-181
atomicAdd() intrinsic, 201, 236 and reduction, 376-377 and single-pass reduction, 373-376
atomicAnd() intrinsic, 151, 236
atomicCAS() intrinsic, 152, 236
atomicExch() intrinsic, 153, 236
atomicOr() intrinsic, 200, 236
Atomic operations in global memory, 152-155, 216
in host memory, 237
and reduction, 367, 373-377
in shared memory, 239-240
Availability zones, 112
AWS, see Amazon Web Services
B
Ballot instruction, xxii, 271
Barriers, memory, 240-241
Bit reversal, 242
Block ID, 212-213, 275
Block-levelprimitives,272
blockDim, 213, 275
blockIdx, 213, 275
Blocking waits, 79, 186
Block-levelprimitives,272
Block linear addressing, 308-309
Boids, 421, 447
brev() intrinsic, 242
Bridge chip, PCI Express, 19-21
Brook, 5
BSD license, 7, 471
Buck, Ian, 5
byte_perm() intrinsic, 242
C
Cachecoherency,209
Cache configuration, 75
Callbacks, stream, 77
chLib, see CUDA Handbook Library
chTimerGetTime(), 175, 471-472
Clockregister,275
clock() intrinsic, 275
clock64() intrinsic, 275
clz() intrinsic, 242
Coalescing constraints, 143-147
Coherency, 209
Command buffers, 32-35
Concurrency
CPU/GPU, 174-178
inter-engine, 187-196
inter-GPU, 202
kernel execution, 199-201
Condition codes, 267
Constant memory, 156-158
and dynamic parallelism, 224
and N-body, 434-436
and normalized cross-correlation, 456-459
Contexts, 67-71
Convergence, 268-269
Copy-on-write, 25
cuArray3DGetDescriptor(), 313
cuArray3DCreate(), 312
cuArrayCreate(),312
cuCtxAttach(), 70
cuCtxCreate(),
andblockingwaits,39
and local memory usage, 159, 211
and mapped pinned memory, 124
cuCtxDestroy(), 70, 202
cuCtxDetach(),70
cuCtxGetLimit(), 71
cuCtxPopCurrent(), 70, 294-296
cuCtxPushCurrent(), 70, 294-296
cuCtxSetCacheConfig(), 71, 75
cuCtxSetLimit(), 71
cuCtxSynchronize(), 77, 209
CUDA arrays, 82, 308-313
vs.devicememory,313
CUDA By Example, xxi-xxii
CUDA Handbook Library, 471-479
Command line parsing, 476-477
Driver API support, 474-475
Error handling, 477-479
Shmoos, 475-476
Threading, 472-474
Timing, 471-472
CUDA runtime
lazy initialization, 53
memory copies, 166-169
vs. driver API, 87-92
CUDA MEMCPY3D structure, 92
CUDABindTexture(), 85, 155, 315
CUDABindTexture2D(), 85, 315, 338
CUDABindTextureToArray(), 85, 315
CUDADeviceProp structure, 61-63
asyncEngineCount member, 166
integrated member, 18
kernelExecTimeoutEnabled member, 210
maxTexture1DLayered member, 343
maxTexture2DLayered member,343
maxTexture3D member, 210
totalGlobalMem member, 75, 137
unifiedAddressing member, 127
cudadevicereset(),202
CUDADeviceSetCacheConfig(), 75, 162-163
CUDADeviceSynchronize(), 77, 209
device runtime, 223
in multi-GPU N-body, 297-299
CUDAEventCreate(),
andblockingwaits,39
and disabling timing, 225
CUDAEventCreateWithFlags(), 89-90
CUDAEventQuery(),186
cudaeventRecord(),183-184,359
cudaEventSynchronize(),89-90,183-184
cudaExtent structure, 135, 168, 311
CUDAFree(),133
and deferred initialization, 67
cudaFree (0), 67
CUDAFuncSetCacheConfig(), 75, 162-163
CUDAGetDeviceCount(), 60
CUDAGetLastError(), 210
and device runtime, 225
luaGetSymbolAddress(), 139, 157, 201
and scan, 405
cudahostAlloc(),81
cudaeHostGetDevicePointer(),81
cudahostRegister(),81,126
CUDAHostUnregister(), 81, 126
CUDAAlloc(),133
cudamalloc3D(),75,134
CUDAAlloc3DArray(), 341-343
and layered textures, 343
CUDAAllocArray(), 309-310, 341-342
cudaMallocPitch(),134,339
cudamemcpy(),31,166
cudememcpyAsync(),165,359-361
cudaMemcpy3DParms structure, 92, 168
cudaMemcpyFromSymbol(),138,157
cudamemcpyKind enumeration, 164
cudMemcpyToSymbol(),138,157
and notrimalized cross-correlation, 456-458
cudMemcpyToSymbolAsync()
and N-body computations, 435-436
cudaMemset(), 139
cudamemset2D(),139
cudaPitchedPtr structure, 134, 342
CUDAPointerAttributes structure, 141, 291-292
cudaPos structure, 169, 342
CUDASetDevice(),288
CUDASetDeviceFlags()
andblockingwaits,39
and local memory usage, 159, 211
and mapped pinned memory, 124
cudaDeviceSetLimit(), 135-136
input values, 227-228
and malloc() in kernels, 136
and synchronization depth, 222, 226-227
cudaStreamCreate()
and device runtime, 225
nonblocking streams, 225
cudaStreamQuery(), 186-187
and kernel thunks, 56
CUDAWaitEvent(),41,202, 292-293
cuDeviceComputeCapability(),60
cuDeviceGet(),60,66
cuDeviceGetAttribute(), 60
asynchronous engine count, 166
integrated GPU, 18
kernel execution timeout, 210
texturing dimensions, 341
unified addressing, 127
cuDeviceGetCount(),60,66
cuDeviceGetName(),66
cuDeviceTotalMem(),138
cuDriverGetVersion(),53
cuEventCreate(),184 andblockingwaits,39
cuFuncGetAttribute(), 74 and local memory usage, 158
cuFuncSetCacheConfig(), 75, 163
cuInit(),59,65-67
cuLaunchGrid(),210
cuLaunchKernel(), 73-74, 207-208
cuMemAlloc(), 76, 133
cuMemAllocPitch(), 135 and coalescing, 145
cuMemcpy(), 31, 166
cuMemcpy3D(),91,166
cuMemcpyDtoD(),164
cuMemcpyDtoH(), 164
cuMemcpyHtoD(),164
cuMemcpyHtoDAsync(), 165
cuMemFree(), 76, 133
cuMemGetAddressRange(), 141
cuMemGetInfo(), 76
cuMemHostAlloc(), 124-125, 135 and mapped pinned memory, 124 and write combining memory, 12
cuMemHostGetDevicePointer(),124
cuMemHostGetFlags(),80
cuMemHostRegister(), 81, 126 and UVA, 31, 126
cuMemset*(),139-140
cuModuleGetFunction(), 73
cuModuleGetGlobal(), 73, 139, 157
cuModuleGetTexRef(),73
cuModuleLoadDataEx(), 103-104
cuobjdump, 105-106, 275
cuPointerGetAttribute(),142,291
Current context stack, 69-70
cuStreamAddCallback(),77
cuStreamCreate(),89
cuStreamQuery(), and kernel thunks, 56
cuStreamSynchronize(),89
cuTexRefSetAddress(),85,155 and state changes,332
cuTexRefSetAddress2D(),85
cuTexRefSetArray(),85 and state changes,332
cuTexRefSetFormat(),316-317
D
_dadd_rn() intrinsic, 249
suppressing multiply-add, 253
Demand paging, 25
Device memory vs.CUDA arrays,313
Devices, 59-63
dim3 structure, 207
Direct memory access, 27-28,79-80
Direct3D, 3, 86-87
Divergence, 267-269
DMA, see Direct Memory Access
dmulrn() intrinsic,249
suppressing multiply-add, 253
double2float() intrinsic, 234
double2loint() intrinsic, 234
double as long long() intrinsic, 234
Driver API
vs.CUDA runtime,87-92
facilities, 474-475
memory copies, 169-171
Driver models
User mode client driver, 54-55
WDDM (Windows Display Driver Model), 55-56
XPDDM (Windows XP Driver Model), 55
Dynamic parallelism, xxii, 222-230
E
EBS, see Elastic Block Storage
EC2, see Elastic Compute Cloud
ECC, see Error correcting codes
Elastic Block Storage, 113
Elastic Compute Cloud, 109-117
Error correcting codes (ECC), 155-156
Events, 78-79
and CPU/CPU concurrency, 183
queries, 186
and timing, 186-187
Extreme values, floating point, 247-248
F
fadd_rn() intrinsic, 249
suppressing multiply-add, 251
False sharing, 15-16
fdivide_rn() intrinsic, 251
Fermi
comparison with Tesla, 43-46
instruction set, 279-285
ffs() intrinsic
float_as_int() intrinsic, 234, 251
float2 structure, 235
float4 structure, 235, 318
loat2half() intrinsic, 253
Floating point
conversion, 249-250
double precision, 253
extreme values, 247-248
formats, 245
half precision, 253
intrinsicsofconversion,250
intrinsicsofrounding,249
library, 259-265
representations, 245
rounding,248-249
single precision, 250-253
streaming multiprocessor support, 244-265
fmul_rn() intrinsic, 249
suppressing multiply-add, 251
Front-side bus, 12-13
Functions (Cufunction), 73-75
Funnel shift, 243-244
G
Gelsinger, Pat, 4
GL Utility Library, 335
Global memory
allocating, 132-137
and dynamic parallelism, 224
pointers, 131-132
querying total amount, 75-76
static allocations, 138-139
Glossary, 481-486
GLUT, see GL Utility Library
GPGPU (general-purpose GPU programming), 5
Graphics interoperability, 86-87
gridDim,213,275
H
halftofloat() intrinsic, 253
hiloint2double() intrinsic, 234
Hostinterface,39-41
Host memory
allocating, 122-123
mapped, 28-29, 81, 124, 127
pinned, 27-28, 80, 122-123
portable, 29-30, 81, 123-124, 287-288
registering, 81, 125-126
and UVA, 126-127
Host memory registration, see Registration
HT, see HyperTransport
Hyper-Q, 77
HyperTransport, 14-15
1
Integrated GPUs, 17-19
Interleaving, see Memory interleaving
Intra-GPU synchronization, 39-40
Inter-GPU synchronization, 41
Intrinsics
for block-level primitives, 272
for floating point conversion, 250
for rounding, 249
for SFU, 252
for warp shuffle, 271
int2 structure, 235
int4 structure, 235, 319
int_as_float() intrinsic, 234, 251
I/O hub, 14-17
isglobal() intrinsic, 142, 224
isochronousbandwidth,12
K
Kandrot, Edwards, xxi
Kepler
instruction set, 279-285
Kernel mode
vs.usermode,26
Kernelthunk,26
and stream and event queries, 186
and WDDM, 55-56
Kernels, 73-75
declaring, 73
launch overhead, 174-178
launching, 206-208
L
Lanes,PCIExpress,12
Lanes, thread, 213
Layered textures, 342-343
Lazy allocation, 25
Linux
driver model, 54-55
in EC2, 114
Local memory, 158-161
and context creation, 159
and dynamic parallelism, 224-225
long_as-double() intrinsic, 234
Loop unrolling, 430-431
M
make CUDAPitchedPtr function, 342
Mapped file 1/0, 25
Mapped pinned memory, 81, 124, 361-362
Math library, floating point, 259-265
Maximum, 269
Memset, see Memory set
Memorycopy,27-28,164-171
asynchronous,165-166
CUDA runtime v. driver API, 90-92
driver overhead, 179-180
functions, CUDA runtime, 166-169
functions, driver API, 169-170
pageable, 80, 183-184
peer-to-peer, 288-289, 293-296
Memory interleaving, 16
Memory set, 139-140
Microbenchmarks, 6
Kernel launch overhead, 174-178
Memory allocation, 135-137
Memory copy overhead (device@host), 181
Memory copy overhead (host@device), 179-180
Global memory bandwidth, 147-151
Register spilling, 159-161
Microdemos, 7
Concurrency, CPU/GPU, 183-186
concurrency, inter-engine, 189-196
concurrency, intra-GPU, 189-196
concurrency, kernel execution, 199-201
float@half conversion, 253-258
pageablememcpy,183-186
peer-to-peer memcpy, 293-294
spin locks, 152-155
surface read/write, 1D, 333-335
surface read/write, 2D, 340
texturing: 9-bit interpolation, 329-331
texturing: addressing modes, 335-333
texturing: increasing address space coverage, 318-321
texturing:unnormized coordinates,325-328
threadID,216-220
Minimum, 269
Modules, 71-73
Moore's Law, 4
mul24() intrinsic, 44, 242
mul64hi() intrinsic, 242
mulhi() intrinsic, 242
Multiple GPU programming
with current context stack, 294-296
and multiple CPU threads, 299-303
and inter-GPU synchronization, 292-294
hardware, 19-22
and N-body, 296-302
scalability, 438
and single CPU thread, 294-299
Multithreading
and N-body, 442-444
N
name mangling, 74
N-body, 421-447
and constant memory, 434-436
and multiple GPUs, 296-302
and shared memory, 432-434
Nehalem [Intel i7], 15
Newton-Raphson iteration, 440
Nonblocking streams, 183, 225
Nonuniform memory access (NUMA)
hardware, 14-17
software, 128-130
Normalized cross-correlation, 449-452
Northbridge, 12-14
NULL stream, 77-78, 178-182 and concurrency breaks, 181, 196 and nonblocking streams, 183
NUMA, see Nonuniform memory access
nvcc, 57-58, 93-100
code generation options, 99-100
compilation trajectories, 94-95
compiler/linker options, 96-97
environment options, 95-96
miscellaneous options, 97-98
passthroughoptions,97
nvidia-smi, 106-109
0
Occupancy, 220-222
OpenGL, 86-87, 335-337
Open source, 7-8, 471
Opteron, 14
Optimization journeys, 7
N-body, 428-434
normalized cross-correlation, 452-464
reduction, 367-372
SAXPY (host memory), 358-363
Scan, 394-407
P
Page, memory, 23-24
Page table, 24-25
Page table entry (PTE), 23-25
Parallel prefix sum, see Scan
PCIe, see PCI Express
PCI Express, 12 integration with CPUs, 17
Peer-to-peer, 21, 143
mappings, 31-32
memory copies, 288-289
Performance counters, 272
Pinned memory, 27-28
registering, 125-126
Pitch, 133-135, 307-308
popc () intrinsic, 242
Pointers, 131-132
Pointer queries, 140-142
Population count, 242
Portable pinned memory, 81, 123-124, 288
__prof_trigger() intrinsic, 272
PTE, see page table entry
PTX (parallel thread execution), 57-59, 100-104, 411
ptxas, the PTX assembler, 100-104
command line options, 101-103
Q
QPI, see QuickPath Interconnect
Queries
amount of global memory, 75-76
device attributes, 60-63
event, 186
pointer, 140-142
stream, 56, 186
QuickPath Interconnect, 14-15
R
RDTSC instruction, 78
Reciprocal, 251
Reciprocal square root, 251-252, 440 accuracy by SFU, 252
Reduction, 365-383 of arbitrary data types, 378-381 with atomics, 376-377 of predicates, 382 single-pass 373-376 two-pass, 367-372 warps, 382-383
Registers, 233-234
Registration, host memory, 28, 31, 81, 125-126
Rotation (bitwise), 243-244
S
S3, see Simple Storage Service
_sad() intrinsic
_saturate() intrinsic, 253
Sanders, Jason, xxi
SASS, see Streaming Assemblysaturate () intrinsic, 253
SAXPY (scaled vector addition), 354-363
Scalable Link Interface (SLI), 19-21
Scan (parallel prefix sum), 385-419 and circuit design, 390-393
exclusive v. inclusive, 386, 391
reduce-then-scan (recursive), 400-403
reduce-then-scan (single pass), 403-407
scan-then-fan, 394-400
and stream compaction, 414-417
warp scan, 407-414
and warp shuffle, 410-414
SDK (Software Development Kit)
SFU, see Special Function Unit
Shared memory, 162-164
atomic operations, 239-240
and dynamic parallelism, 242
and N-body, 432-434
and normalized cross-correlation, 459-460
pointers, 164
and Scan, 395-396
un-sized declarations, 163
and the volatile keyword, 164
and warp synchronous code, 164
shfl() intrinsics, 271-272
Shmoo, 475-477
and kernel concurrency, 191
Shuffle instruction, 271
Simple Storage Service (S3), 112-113
SLI, see Scalable Link Interface
SOC, see System on a Chip
Software pipelining of streams, 76-77, 192-193
Special Function Unit, 251-252
Spin locks, 152-154
SSE, see Streaming SIMD Extensions
Stream callbacks, 77
Stream compaction, 414-417
Streaming Assembly (SASS), 105, 275-285
forwarp scan,412-414
Streaming Multiprocessors, (SMs), 46-50, 231-285
Streaming SIMD Extensions (SSE), 4
and N-body, 440-441
Streaming workloads, 353-363
in device memory, 355-357
and mapped pinned memory, 361-362
and streams, 359-361
Streams, 76-78
and software pipelining, 76-77, 359-361
NULL stream, 77-78, 181, 196
queries, 56, 186
string literals
to reference kernels and symbols, 74, 138-139
Structure of Arrays (S0A), 429
Surface load/store
1D,333-335
2D,340
SASS instructions, 283-284
Surface references, 85-86
1D,333-334
2D,340
Stream callbacks, 77
Streaming workloads, 353-363
Sum of absolute differences, 242
surf1Dread() intrinsic, 333
surf1Dwrite() intrinsic, 333-335
Synchronous operations
Memorycopy,165-166
syncthreads() intrinsic, 163, 240
avoiding - see warp synchronous code
and reduction, 368-369
and scan, 395-397
syncthreads_and() intrinsic, 272
syncthreads_count() intrinsic, 272, 365
syncthreads_or() intrinsic, 272
Symmetric multiprocessors, 13-14
System on a chip (SOC), 19
T
TCC, see Tesla Compute Cluster driver
TDR, see Timeout Detection and Recovery
Tesla
comparison with Fermi, 43-46
instruction set, 276-279
Tesla Compute Cluster driver, 57
Texture references, 82-85
tex1Dfetch() intrinsic, 318
Texturing, 305-349
1D,314-317
2D,335-339
3D, 340-342
and coalescing constraints, 317-318
and normalized cross-correlation, 452-456
from device memory, 155, 338-339
hardware capabilities, 345-347
from host memory, 321-323
from layered textures, 342-343
with normalized coordinates, 331-332
quick reference, 345-350
withunnormized coordinates,323-331
Thread affinity, 128-131
threadfence() intrinsic, 240
threadfence_block() intrinsic, 240
threadfence_system() intrinsic, 241
Thread ID, 216
threadIdx, 213, 275
Threads, CPU, and affinity, 128-129 library support, 472-474 Threads, GPU, 213
Timeout Detection and Recovery (TDR), 56-57
Timing, CPU-based, 471-472
Timing, GPU-based CUDA events, 78-79 hardware, 39
TLB, see Translation Lookaside Buffer
Translation Lookaside Buffer, 25
U
umul24 () intrinsic, 463 umul64hi () intrinsic, 463 umulhi () intrinsic, 463 Unified virtual addressing (UVA), xxii, 30-31, 55, 69, 126-127 and mapped pinned memory, 124, 125 and memcpy functions, 166 inferring device from address, 291-292 usad () intrinsic, 242 User mode v. kernel mode, 26 UVA, see unified virtual addressing
V
valloc(), 126
Video instructions, scalar, 272-274
Video instructions, vector, 273-274
VirtualAlloc(), 126
VirtualAllocExNuma(), 130
VirtualFreeEx(), 130
volatile keyword and shared memory, 164
W
Warp-level primitives, 270-272
Warp shuffle, xxii, 271-272 and N-body, 436-437 and reduction, 382-383 and scan, 410-412
Warp synchronous code, 164 for reduction, 369-372 and the volatile keyword, 174
Warps, 213 and occupancy, 220
WDDM, see Windows Display Driver Model
Width-in-bytes, see Pitch
Windows, 55-57, 64-67
Windows Display Driver Model, 55-56
Write combining memory, 18, 124-125
Z
Zero-copy, 19, 361-362
This page intentionally left blank