Index

64-bit addressing, xxii device pointers, 132 and UVA, 30-31

A

Absolute value, 260

Address spaces, 22-32

Adobe CS5, 5

Affinity, 15-16, 128-130 all() intrinsic, 271

Amazon Machine Image (AMI), 113-114

Amazon Web Services, 109-117

AMBER molecular modeling package, 427

Amdahl's Law, 35-36, 188, 195

AMI, see Amazon Machine Image any() intrinsic, 271

ARM, 19

Array of structures (AOS), 429-430

Arrays, CUDA, see CUDA Arrays

Asynchronous operations kernel launch, 205-206, 209 memory copy, 178-181

atomicAdd() intrinsic, 201, 236 and reduction, 376-377 and single-pass reduction, 373-376

atomicAnd() intrinsic, 151, 236

atomicCAS() intrinsic, 152, 236

atomicExch() intrinsic, 153, 236

atomicOr() intrinsic, 200, 236

Atomic operations in global memory, 152-155, 216

in host memory, 237

and reduction, 367, 373-377

in shared memory, 239-240

Availability zones, 112

AWS, see Amazon Web Services

B

Ballot instruction, xxii, 271

Barriers, memory, 240-241

Bit reversal, 242

Block ID, 212-213, 275

Block-levelprimitives,272

blockDim, 213, 275

blockIdx, 213, 275

Blocking waits, 79, 186

Block-levelprimitives,272

Block linear addressing, 308-309

Boids, 421, 447

brev() intrinsic, 242

Bridge chip, PCI Express, 19-21

Brook, 5

BSD license, 7, 471

Buck, Ian, 5

byte_perm() intrinsic, 242

C

Cachecoherency,209

Cache configuration, 75

Callbacks, stream, 77

chLib, see CUDA Handbook Library

chTimerGetTime(), 175, 471-472

Clockregister,275

clock() intrinsic, 275

clock64() intrinsic, 275

clz() intrinsic, 242

Coalescing constraints, 143-147

Coherency, 209

Command buffers, 32-35

Concurrency

CPU/GPU, 174-178

inter-engine, 187-196

inter-GPU, 202

kernel execution, 199-201

Condition codes, 267

Constant memory, 156-158

and dynamic parallelism, 224

and N-body, 434-436

and normalized cross-correlation, 456-459

Contexts, 67-71

Convergence, 268-269

Copy-on-write, 25

cuArray3DGetDescriptor(), 313

cuArray3DCreate(), 312

cuArrayCreate(),312

cuCtxAttach(), 70

cuCtxCreate(),

andblockingwaits,39

and local memory usage, 159, 211

and mapped pinned memory, 124

cuCtxDestroy(), 70, 202

cuCtxDetach(),70

cuCtxGetLimit(), 71

cuCtxPopCurrent(), 70, 294-296

cuCtxPushCurrent(), 70, 294-296

cuCtxSetCacheConfig(), 71, 75

cuCtxSetLimit(), 71

cuCtxSynchronize(), 77, 209

CUDA arrays, 82, 308-313

vs.devicememory,313

CUDA By Example, xxi-xxii

CUDA Handbook Library, 471-479

Command line parsing, 476-477

Driver API support, 474-475

Error handling, 477-479

Shmoos, 475-476

Threading, 472-474

Timing, 471-472

CUDA runtime

lazy initialization, 53

memory copies, 166-169

vs. driver API, 87-92

CUDA MEMCPY3D structure, 92

CUDABindTexture(), 85, 155, 315

CUDABindTexture2D(), 85, 315, 338

CUDABindTextureToArray(), 85, 315

CUDADeviceProp structure, 61-63

asyncEngineCount member, 166

integrated member, 18

kernelExecTimeoutEnabled member, 210

maxTexture1DLayered member, 343

maxTexture2DLayered member,343

maxTexture3D member, 210

totalGlobalMem member, 75, 137

unifiedAddressing member, 127

cudadevicereset(),202

CUDADeviceSetCacheConfig(), 75, 162-163

CUDADeviceSynchronize(), 77, 209

device runtime, 223

in multi-GPU N-body, 297-299

CUDAEventCreate(),

andblockingwaits,39

and disabling timing, 225

CUDAEventCreateWithFlags(), 89-90

CUDAEventQuery(),186

cudaeventRecord(),183-184,359

cudaEventSynchronize(),89-90,183-184

cudaExtent structure, 135, 168, 311

CUDAFree(),133

and deferred initialization, 67

cudaFree (0), 67

CUDAFuncSetCacheConfig(), 75, 162-163

CUDAGetDeviceCount(), 60

CUDAGetLastError(), 210

and device runtime, 225

luaGetSymbolAddress(), 139, 157, 201

and scan, 405

cudahostAlloc(),81

cudaeHostGetDevicePointer(),81

cudahostRegister(),81,126

CUDAHostUnregister(), 81, 126

CUDAAlloc(),133

cudamalloc3D(),75,134

CUDAAlloc3DArray(), 341-343

and layered textures, 343

CUDAAllocArray(), 309-310, 341-342

cudaMallocPitch(),134,339

cudamemcpy(),31,166

cudememcpyAsync(),165,359-361

cudaMemcpy3DParms structure, 92, 168

cudaMemcpyFromSymbol(),138,157

cudamemcpyKind enumeration, 164

cudMemcpyToSymbol(),138,157

and notrimalized cross-correlation, 456-458

cudMemcpyToSymbolAsync()

and N-body computations, 435-436

cudaMemset(), 139

cudamemset2D(),139

cudaPitchedPtr structure, 134, 342

CUDAPointerAttributes structure, 141, 291-292

cudaPos structure, 169, 342

CUDASetDevice(),288

CUDASetDeviceFlags()

andblockingwaits,39

and local memory usage, 159, 211

and mapped pinned memory, 124

cudaDeviceSetLimit(), 135-136
input values, 227-228
and malloc() in kernels, 136
and synchronization depth, 222, 226-227
cudaStreamCreate()
and device runtime, 225
nonblocking streams, 225
cudaStreamQuery(), 186-187
and kernel thunks, 56

CUDAWaitEvent(),41,202, 292-293

cuDeviceComputeCapability(),60

cuDeviceGet(),60,66

cuDeviceGetAttribute(), 60
asynchronous engine count, 166
integrated GPU, 18
kernel execution timeout, 210
texturing dimensions, 341
unified addressing, 127

cuDeviceGetCount(),60,66

cuDeviceGetName(),66

cuDeviceTotalMem(),138

cuDriverGetVersion(),53

cuEventCreate(),184 andblockingwaits,39

cuFuncGetAttribute(), 74 and local memory usage, 158

cuFuncSetCacheConfig(), 75, 163

cuInit(),59,65-67

cuLaunchGrid(),210

cuLaunchKernel(), 73-74, 207-208

cuMemAlloc(), 76, 133

cuMemAllocPitch(), 135 and coalescing, 145

cuMemcpy(), 31, 166

cuMemcpy3D(),91,166

cuMemcpyDtoD(),164

cuMemcpyDtoH(), 164

cuMemcpyHtoD(),164

cuMemcpyHtoDAsync(), 165

cuMemFree(), 76, 133

cuMemGetAddressRange(), 141

cuMemGetInfo(), 76

cuMemHostAlloc(), 124-125, 135 and mapped pinned memory, 124 and write combining memory, 12

cuMemHostGetDevicePointer(),124

cuMemHostGetFlags(),80

cuMemHostRegister(), 81, 126 and UVA, 31, 126

cuMemset*(),139-140

cuModuleGetFunction(), 73

cuModuleGetGlobal(), 73, 139, 157

cuModuleGetTexRef(),73

cuModuleLoadDataEx(), 103-104

cuobjdump, 105-106, 275

cuPointerGetAttribute(),142,291

Current context stack, 69-70

cuStreamAddCallback(),77

cuStreamCreate(),89

cuStreamQuery(), and kernel thunks, 56

cuStreamSynchronize(),89

cuTexRefSetAddress(),85,155 and state changes,332

cuTexRefSetAddress2D(),85

cuTexRefSetArray(),85 and state changes,332

cuTexRefSetFormat(),316-317

D

_dadd_rn() intrinsic, 249
suppressing multiply-add, 253

Demand paging, 25

Device memory vs.CUDA arrays,313

Devices, 59-63

dim3 structure, 207

Direct memory access, 27-28,79-80

Direct3D, 3, 86-87

Divergence, 267-269

DMA, see Direct Memory Access

dmulrn() intrinsic,249

suppressing multiply-add, 253

double2float() intrinsic, 234

double2loint() intrinsic, 234

double as long long() intrinsic, 234

Driver API

vs.CUDA runtime,87-92

facilities, 474-475

memory copies, 169-171

Driver models

User mode client driver, 54-55

WDDM (Windows Display Driver Model), 55-56

XPDDM (Windows XP Driver Model), 55

Dynamic parallelism, xxii, 222-230

E

EBS, see Elastic Block Storage

EC2, see Elastic Compute Cloud

ECC, see Error correcting codes

Elastic Block Storage, 113

Elastic Compute Cloud, 109-117

Error correcting codes (ECC), 155-156

Events, 78-79

and CPU/CPU concurrency, 183

queries, 186

and timing, 186-187

Extreme values, floating point, 247-248

F

fadd_rn() intrinsic, 249

suppressing multiply-add, 251

False sharing, 15-16

fdivide_rn() intrinsic, 251

Fermi

comparison with Tesla, 43-46

instruction set, 279-285

ffs() intrinsic

float_as_int() intrinsic, 234, 251

float2 structure, 235

float4 structure, 235, 318

loat2half() intrinsic, 253

Floating point

conversion, 249-250

double precision, 253

extreme values, 247-248

formats, 245

half precision, 253

intrinsicsofconversion,250

intrinsicsofrounding,249

library, 259-265

representations, 245

rounding,248-249

single precision, 250-253

streaming multiprocessor support, 244-265

fmul_rn() intrinsic, 249

suppressing multiply-add, 251

Front-side bus, 12-13

Functions (Cufunction), 73-75

Funnel shift, 243-244

G

Gelsinger, Pat, 4

GL Utility Library, 335

Global memory

allocating, 132-137

and dynamic parallelism, 224

pointers, 131-132

querying total amount, 75-76

static allocations, 138-139

Glossary, 481-486

GLUT, see GL Utility Library

GPGPU (general-purpose GPU programming), 5

Graphics interoperability, 86-87

gridDim,213,275

H

halftofloat() intrinsic, 253

hiloint2double() intrinsic, 234

Hostinterface,39-41

Host memory

allocating, 122-123

mapped, 28-29, 81, 124, 127

pinned, 27-28, 80, 122-123

portable, 29-30, 81, 123-124, 287-288

registering, 81, 125-126

and UVA, 126-127

Host memory registration, see Registration

HT, see HyperTransport

Hyper-Q, 77

HyperTransport, 14-15

1

Integrated GPUs, 17-19

Interleaving, see Memory interleaving

Intra-GPU synchronization, 39-40

Inter-GPU synchronization, 41

Intrinsics

for block-level primitives, 272

for floating point conversion, 250

for rounding, 249

for SFU, 252

for warp shuffle, 271

int2 structure, 235

int4 structure, 235, 319

int_as_float() intrinsic, 234, 251

I/O hub, 14-17

isglobal() intrinsic, 142, 224

isochronousbandwidth,12

K

Kandrot, Edwards, xxi

Kepler

instruction set, 279-285

Kernel mode

vs.usermode,26

Kernelthunk,26

and stream and event queries, 186

and WDDM, 55-56

Kernels, 73-75

declaring, 73

launch overhead, 174-178

launching, 206-208

L

Lanes,PCIExpress,12

Lanes, thread, 213

Layered textures, 342-343

Lazy allocation, 25

Linux

driver model, 54-55

in EC2, 114

Local memory, 158-161

and context creation, 159

and dynamic parallelism, 224-225

long_as-double() intrinsic, 234

Loop unrolling, 430-431

M

make CUDAPitchedPtr function, 342

Mapped file 1/0, 25

Mapped pinned memory, 81, 124, 361-362

Math library, floating point, 259-265

Maximum, 269

Memset, see Memory set

Memorycopy,27-28,164-171

asynchronous,165-166

CUDA runtime v. driver API, 90-92

driver overhead, 179-180

functions, CUDA runtime, 166-169

functions, driver API, 169-170

pageable, 80, 183-184

peer-to-peer, 288-289, 293-296

Memory interleaving, 16

Memory set, 139-140

Microbenchmarks, 6

Kernel launch overhead, 174-178

Memory allocation, 135-137

Memory copy overhead (device@host), 181

Memory copy overhead (host@device), 179-180

Global memory bandwidth, 147-151

Microdemos, 7

Concurrency, CPU/GPU, 183-186

concurrency, inter-engine, 189-196

concurrency, intra-GPU, 189-196

concurrency, kernel execution, 199-201

float@half conversion, 253-258

pageablememcpy,183-186

peer-to-peer memcpy, 293-294

spin locks, 152-155

surface read/write, 1D, 333-335

surface read/write, 2D, 340

texturing: 9-bit interpolation, 329-331

texturing: addressing modes, 335-333

texturing: increasing address space coverage, 318-321

texturing:unnormized coordinates,325-328

threadID,216-220

Minimum, 269

Modules, 71-73

Moore's Law, 4

mul24() intrinsic, 44, 242

mul64hi() intrinsic, 242

mulhi() intrinsic, 242

Multiple GPU programming

with current context stack, 294-296

and multiple CPU threads, 299-303

and inter-GPU synchronization, 292-294

hardware, 19-22

and N-body, 296-302

scalability, 438

and single CPU thread, 294-299

Multithreading

and N-body, 442-444

N

name mangling, 74

N-body, 421-447

and constant memory, 434-436

and multiple GPUs, 296-302

and shared memory, 432-434

Nehalem [Intel i7], 15

Newton-Raphson iteration, 440

Nonblocking streams, 183, 225

Nonuniform memory access (NUMA)

hardware, 14-17

software, 128-130

Normalized cross-correlation, 449-452

Northbridge, 12-14

NULL stream, 77-78, 178-182 and concurrency breaks, 181, 196 and nonblocking streams, 183

NUMA, see Nonuniform memory access
nvcc, 57-58, 93-100

code generation options, 99-100

compilation trajectories, 94-95

compiler/linker options, 96-97

environment options, 95-96

miscellaneous options, 97-98

passthroughoptions,97

nvidia-smi, 106-109

0

Occupancy, 220-222

OpenGL, 86-87, 335-337

Open source, 7-8, 471

Opteron, 14

Optimization journeys, 7
N-body, 428-434
normalized cross-correlation, 452-464
reduction, 367-372
SAXPY (host memory), 358-363
Scan, 394-407

P

Page, memory, 23-24

Page table, 24-25

Page table entry (PTE), 23-25

Parallel prefix sum, see Scan

PCIe, see PCI Express

PCI Express, 12 integration with CPUs, 17

Peer-to-peer, 21, 143

mappings, 31-32

memory copies, 288-289

Performance counters, 272

Pinned memory, 27-28

registering, 125-126

Pitch, 133-135, 307-308

popc () intrinsic, 242

Pointers, 131-132

Pointer queries, 140-142

Population count, 242

Portable pinned memory, 81, 123-124, 288

__prof_trigger() intrinsic, 272

PTE, see page table entry

PTX (parallel thread execution), 57-59, 100-104, 411

ptxas, the PTX assembler, 100-104

command line options, 101-103

Q

QPI, see QuickPath Interconnect

Queries
amount of global memory, 75-76
device attributes, 60-63
event, 186
pointer, 140-142
stream, 56, 186

QuickPath Interconnect, 14-15

R

RDTSC instruction, 78

Reciprocal, 251

Reciprocal square root, 251-252, 440 accuracy by SFU, 252

Reduction, 365-383 of arbitrary data types, 378-381 with atomics, 376-377 of predicates, 382 single-pass 373-376 two-pass, 367-372 warps, 382-383

Registers, 233-234

Registration, host memory, 28, 31, 81, 125-126

Rotation (bitwise), 243-244

S

S3, see Simple Storage Service

_sad() intrinsic
_saturate() intrinsic, 253

Sanders, Jason, xxi

SASS, see Streaming Assemblysaturate () intrinsic, 253

SAXPY (scaled vector addition), 354-363

Scalable Link Interface (SLI), 19-21

Scan (parallel prefix sum), 385-419 and circuit design, 390-393

exclusive v. inclusive, 386, 391

reduce-then-scan (recursive), 400-403

reduce-then-scan (single pass), 403-407

scan-then-fan, 394-400

and stream compaction, 414-417

warp scan, 407-414

and warp shuffle, 410-414

SDK (Software Development Kit)

SFU, see Special Function Unit

Shared memory, 162-164

atomic operations, 239-240

and dynamic parallelism, 242

and N-body, 432-434

and normalized cross-correlation, 459-460

pointers, 164

and Scan, 395-396

un-sized declarations, 163

and the volatile keyword, 164

and warp synchronous code, 164

shfl() intrinsics, 271-272

Shmoo, 475-477

and kernel concurrency, 191

Shuffle instruction, 271

Simple Storage Service (S3), 112-113

SLI, see Scalable Link Interface

SOC, see System on a Chip

Software pipelining of streams, 76-77, 192-193

Special Function Unit, 251-252

Spin locks, 152-154

SSE, see Streaming SIMD Extensions

Stream callbacks, 77

Stream compaction, 414-417

Streaming Assembly (SASS), 105, 275-285

forwarp scan,412-414

Streaming Multiprocessors, (SMs), 46-50, 231-285

Streaming SIMD Extensions (SSE), 4

and N-body, 440-441

Streaming workloads, 353-363

in device memory, 355-357

and mapped pinned memory, 361-362

and streams, 359-361

Streams, 76-78

and software pipelining, 76-77, 359-361

NULL stream, 77-78, 181, 196

queries, 56, 186

string literals

to reference kernels and symbols, 74, 138-139

Structure of Arrays (S0A), 429

Surface load/store

1D,333-335

2D,340

SASS instructions, 283-284

Surface references, 85-86

1D,333-334

2D,340

Stream callbacks, 77

Streaming workloads, 353-363

Sum of absolute differences, 242

surf1Dread() intrinsic, 333

surf1Dwrite() intrinsic, 333-335

Synchronous operations

Memorycopy,165-166

syncthreads() intrinsic, 163, 240

avoiding - see warp synchronous code

and reduction, 368-369

and scan, 395-397

syncthreads_and() intrinsic, 272

syncthreads_count() intrinsic, 272, 365

syncthreads_or() intrinsic, 272

Symmetric multiprocessors, 13-14

System on a chip (SOC), 19

T

TCC, see Tesla Compute Cluster driver

TDR, see Timeout Detection and Recovery

Tesla

comparison with Fermi, 43-46

instruction set, 276-279

Tesla Compute Cluster driver, 57

Texture references, 82-85

tex1Dfetch() intrinsic, 318

Texturing, 305-349

1D,314-317

2D,335-339

3D, 340-342

and coalescing constraints, 317-318

and normalized cross-correlation, 452-456

from device memory, 155, 338-339

hardware capabilities, 345-347

from host memory, 321-323

from layered textures, 342-343

with normalized coordinates, 331-332

quick reference, 345-350

withunnormized coordinates,323-331

Thread affinity, 128-131

threadfence() intrinsic, 240

threadfence_block() intrinsic, 240

threadfence_system() intrinsic, 241

Thread ID, 216

threadIdx, 213, 275

Threads, CPU, and affinity, 128-129 library support, 472-474 Threads, GPU, 213
Timeout Detection and Recovery (TDR), 56-57
Timing, CPU-based, 471-472
Timing, GPU-based CUDA events, 78-79 hardware, 39
TLB, see Translation Lookaside Buffer
Translation Lookaside Buffer, 25

U

umul24 () intrinsic, 463 umul64hi () intrinsic, 463 umulhi () intrinsic, 463 Unified virtual addressing (UVA), xxii, 30-31, 55, 69, 126-127 and mapped pinned memory, 124, 125 and memcpy functions, 166 inferring device from address, 291-292 usad () intrinsic, 242 User mode v. kernel mode, 26 UVA, see unified virtual addressing

V

valloc(), 126
Video instructions, scalar, 272-274

Video instructions, vector, 273-274
VirtualAlloc(), 126
VirtualAllocExNuma(), 130
VirtualFreeEx(), 130
volatile keyword and shared memory, 164

W

Warp-level primitives, 270-272
Warp shuffle, xxii, 271-272 and N-body, 436-437 and reduction, 382-383 and scan, 410-412
Warp synchronous code, 164 for reduction, 369-372 and the volatile keyword, 174
Warps, 213 and occupancy, 220
WDDM, see Windows Display Driver Model
Width-in-bytes, see Pitch
Windows, 55-57, 64-67
Windows Display Driver Model, 55-56
Write combining memory, 18, 124-125

Z

Zero-copy, 19, 361-362

This page intentionally left blank

README

Index

A

B

C

D

E

F

G

H

1

K

L

M

N

0

P

Q

R

S

T

U

V

W

Z