OpenTech Books
Table of Contents
01_Chapter_1_Introduction
README
1.1_Approach
1.2_Code
1.3_Resources
1.4_Structure
01_Chapter_1_Background
1.1_Our_Approach
1.2_Code
1.3_Administrative_Items
1.4_Road_Map
02_Chapter_2_Hardware_Architecture
2.1_CPU_Configurations
2.2_Integrated_GPUs
2.3_Multiple_GPUs
2.4_Address_Spaces_in_CUDA
2.5_CPU_GPU_Interactions
2.6_GPU_Architecture
2.7_Further_Reading
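02_Chapter_2_Hardware_Architecture
README
2.1_CPU_Configurations
2.2_Integrated_GPUs
2.3_Multiple_GPUs
2.4_Address_Spaces_in_CUDA
2.5_CPU_GPU_Interactions
2.6_GPU_Architecture
2.7_Further_Reading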
03_Chapter_3_Software_Architecture
3.1_Software_Layers
3.2_Devices_and_Initialization
3.3_Contexts
3.4_Modules_and_Functions
3.5_Kernels_(Functions)
3.6_Device_Memory
3.7_Streams_and_Events
3.8_Host_Memory
3.9_CUDA_Arrays_and_Texturing
3.10_Graphics_Interoperability
3.11_The_CUDA_Runtime_and_CUDA_Driver_API
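03_Chapter_3_Software_Architecture
README
3.1_Software_Layers
3.2_Devices_and_Initialization
3.3_Contexts
3.4_Modules_and_Functions
3.5_Kernels_(Functions)
3.6_Device_Memory
3.7_Streams_and_Events
3.8_Host_Memory
3.9_CUDA_Arrays_and_Texturing
3.10_Graphics_Interoperability
3.11_The_CUDA_Runtime_and_CUDA_Driver_API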
04_Chapter_4_Software_Environment
README
4.1_nvcc-CUDA_Compiler_Driver
4.2_ptxas-PTX_Assembler
4.3_cuobjdump
4.4_nvidia-smi
4.5_Amazon_Web_Services
04_Chapter_4_Software_Environment
4.1_nvcc-CUDA_Compiler_Driver
4.3_cuobjdump
4.4_nvidia-smi
4.5_Amazon_Web_Services
05_Chapter_5_Memory
README
5.1_Host_Memory
5.2_Global_Memory
5.3_Constant_Memory
5.4_Local_Memory
5.5_Texture_Memory
5.6_Shared_Memory
5.7_Memory_Copy
05_Chapter_5_Memory
5.1_Host_Memory
5.2_Global_Memory
5.3_Constant_Memory
5.4_Local_Memory
5.5_Texture_Memory
5.6_Shared_Memory
5.7_Memory_Copy
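06_Chapter_6_Streams_and_Events
README
6.1_CPU_GPU_Concurrency_Covering_Driver_Overhead
6.2_Asynchronous_Memcpy
6.3_CUDA_Events_CPU_GPU_Synchronization
6.4_CUDA_Events_Timing
6.5_Concurrent_Copying_and_Kernel_Processing
6.6_Mapped_Pinned_Memory
6.7_Concurrent_Kernel_Processing
6.9_Source_Code_Reference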
06_Chapter_6_Streams_and_Events
6.1_CPU_GPU_Concurrency_Covering_Driver_Overhead
6.2_Asynchronous_Memcpy
6.3_CUDA_Events_CPU_GPU_Synchronization
6.4_CUDA_Events_Timing
6.5_Concurrent_Copying_and_Kernel_Processing
6.6_Mapped_Pinned_Memory
6.7_Concurrent_Kernel_Processing
6.9_Source_Code_Reference
07_Chapter_7_Kernel_Execution
README
7.1_Overview
7.2_Syntax
7.3_Blocks_Threads_Warps_and_Lanes
7.4_Occupancy
7.5_Dynamic_Parallelism
07_Chapter_7_Kernel_Execution
7.1_Overview
7.2_Syntax
7.3_Blocks_Threads_Warps_and_Lanes
7.4_Occupancy
7.5_Dynamic_Parallelism
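08_Chapter_8_Streaming_Multiprocessors
README
8.1_Memory
8.2_Integer_Support
8.3_Floating-Point_Support
8.4_Conditional_Code
8.5_Textures_and_Surfaces
8.6_Miscellaneous_Instructions
8.7_Instruction_Sets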
08_Chapter_8_Streaming_Multiprocessors
8.1_Memory
8.2_Integer_Support
8.3_Floating-Point_Support
8.4_Conditional_Code
8.5_Textures_and_Surfaces
8.6_Miscellaneous_Instructions
8.7_Instruction_Sets
09_Chapter_9_Multiple_GPUs
README
9.1_Overview
9.2_Peer-to-Peer
9.4_Inter-GPU_Synchronization
9.5_Single-Threaded_Multi-GPU
9.6_Multithreaded_Multi-GPU
09_Chapter_9_Multiple_GPUs
9.1_Overview
9.2_Peer-to-Peer
9.3_UVA_Inferring_Device_from_Address
9.4_Inter-GPU_Synchronization
9.5_Single-Threaded_Multi-GPU
9.6_Multithreaded_Multi-GPU
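10_Chapter_10_Texturing
README
10.1_Introduction
10.2_Texture_Memory
10.3_1D_Texturing
10.4_Texture_as_a_Read_Path
10.5_Texturing_with_Unnormalized_Coordinates
10.6_Texturing_with_Normalized_Coordinates
10.7_1D_Surface_Read_Write
10.8_2D_Texturing
10.9_2D_Texturing_Copy_Avoidance
10.10_3D_Texturing
10.11_Layered_Textures
10.12_Optimal_Block_Sizing_and_Performance
10.13_Texturing_Quick_References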
10_Chapter_10_Texturing
10.1_Overview
10.2_Texture_Memory
10.3_1D_Texturing
10.4_Texture_as_a_Read_Path
10.5_Texturing_with_Unnormalized_Coordinates
10.6_Texturing_with_Normalized_Coordinates
10.7_1D_Surface_Read_Write
10.8_2D_Texturing
10.9_2D_Texturing_Copy_Avoidance
10.10_3D_Texturing
10.11_Layered_Textures
10.12_Optimal_Block_Sizing_and_Performance
10.13_Texturing_Quick_References
11_Chapter_11_Streaming_Workloads
README
11.1_Device_Memory
11.2_Asynchronous_Memcpy
11.3_Streams
11.4_Mapped_Pinned_Memory
11.5_Performance_and_Summary
11_Chapter_11_Streaming_Workloads
11.1_Device_Memory
11.2_Asynchronous_Memcpy
11.3_Streams
11.4_Mapped_Pinned_Memory
11.5_Performance_and_Summary
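12_Chapter_12_Reduction
README
12.1_Overview
12.2_Two-Pass_Reduction
12.3_Single-Pass_Reduction
12.4_Reduction_with_Atomics
12.5_Arbitrary_Block_Sizes
12.6_Reduction_Using_Arbitrary_Data_Types
12.7_Predicate_Reduction
12.8_Warp_Reduction_with_Shuffle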
12_Chapter_12_Reduction
12.1_Overview
12.2_Two-Pass_Reduction
12.3_Single-Pass_Reduction
12.4_Reduction_with_Atomics
12.5_Arbitrary_Block_Sizes
12.6_Reduction_Using_Arbitrary_Data_Types
12.7_Predicate_Reduction
12.8_Warp_Reduction_with_Shuffle
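13_Chapter_13_Scan
README
13.1_Definition_and_Variations
13.2_Overview
13.3_Scan_and_Circuit_Design
13.4_CUDA_Implementations
13.5_Warp_Scans
13.6_Stream_Compaction
13.7_References_Parallel_Scan_Algorithms
13.8_Further_Reading_Parallel_Prefix_Sum_Circuits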
13_Chapter_13_Scan
13.1_Definition_and_Variations
13.2_Overview
13.3_Scan_and_Circuit_Design
13.4_CUDA_Implementations
13.5_Warp_Scans
13.6_Stream_Compaction
13.7_References_Parallel_Scan_Algorithms
13.8_Further_Reading_Parallel_Prefix_Sum_Circuits
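14_Chapter_14_N-Body
README
14.1_Overview
14.2_Naive_Implementation
14.3_Shared_Memory_Implementation
14.4_Constant_Memory_Implementation
14.5_Warp_Shuffle_Implementation
14.6_Multiple_GPUs_and_Scalability
14.7_CPU_Optimizations
14.8_Summary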
14_Chapter_14_N-Body
14.1_Introduction
14.2_Naive_Implementation
14.3_Shared_Memory
14.4_Constant_Memory
14.5_Warp_Shuffle
14.6_Multiple_GPUs_and_Scalability
14.7_CPU_Optimizations
14.8_Conclusion
14.9_References_and_Further_Reading
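15_Chapter_15_Image_Processing_Normalized_Correlation
README
15.1_Overview
15.2_Naive_Texture_Implementation
15.3_Template_in_Constant_Memory
15.4_Image_in_Shared_Memory
15.5_Further_Optimizations
15.6_Source_Code
15.7_Performance
15.8_Further_Reading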
15_Chapter_15_Image_Processing_Normalized_Correlation
15.1_Overview
15.3_Template_in_Constant_Memory
15.4_Image_in_Shared_Memory
15.5_Further_Optimizations
15.6_Source_Code
15.7_Performance_and_Further_Reading
15.8_Further_Reading
16_Appendix_A_The_CUDA_Handbook_Library
A.1_Timing
A.2_Threading
A.3_Driver_API_Facilities
A.4_Shmoos
A.5_Command_Line_Parsing
A.6_Error_Handling
17_Glossary_TLA_Decoder
README
18_Index
README