9.6 Multithreaded Multi-GPU
CUDA has supported multiple GPUs since the beginning, but until CUDA 4.0, each GPU had to be controlled by a separate CPU thread. For workloads that required a lot of CPU power, that requirement was never very onerous because the full power of modern multicore processors can be unlocked only through multithreading.
The multithreaded implementation of multi-GPU N-Body creates one CPU thread per GPU, and it delegates the dispatch and synchronization of the work for a given N-body pass to each thread. The main thread splits the work evenly between GPUs, delegates work to each worker thread by signaling an event (or a semaphore, on POSIX platforms such as Linux), and then waits for all of the worker threads to signal completion before proceeding. As the number of GPUs grows, synchronization overhead starts to chip away at the benefits from parallelism.
This implementation of N-body uses the same multithreading library as the multithreaded CPU implementation of N-body, described in Section 14.9. The workerThread class, described in Appendix A.2, enables the application thread to "delegate" work to CPU threads and then synchronize on the worker threads' completion of the delegated task.
Listing 9.5 gives the host code that creates and initializes the CPU threads. Two globals, g_numGPUs and g_GPUThreadPool, contain the GPU count and a worker thread for each. After each CPU thread is created, it is initialized by synchronously calling the initializeGPU() function, which affiliates the CPU thread with a given GPU—an affiliation that never changes during the course of the application's execution.
Listing 9.5 Multithreaded multi-GPU initialization code.
workerThread *g_CPUThreadPool;
int g_numCPUCores;
workerThread *g_GPUThreadPool;
int g_numGPUs;
struct gpuInit_struct {
    int iGPU;
    cudaError_t status;
};
void
initializeGPU( void *_p )
{
    cudaError_t status;
    gpuInit_struct *p = (gpuInit_struct *) _p;
    CUDART_CHECK( cudaSetDevice( p->iGPU ) );
    CUDART_CHECK( cudaSetDeviceFlags( cudaDeviceMapHost ) );
    CUDART_CHECK( cudaFree(0) );
Error:
    p->status = status;
}
// ... below is from main()
chCommandLineGet( &g_numGPUs, "numgpus", argc, argv );
if ( g_numGPUs ) {
    g_GPUThreadPool = new workerThread[g_numGPUs];
    for ( size_t i = 0; i < g_numGPUs; i++ ) {
        if ( ! g_GPUThreadPool[i].initialize() ) {
            fprintf( stderr, "Error initializing thread pool\n" );
            return 1;
        }
    }
    for ( int i = 0; i < g_numGPUs; i++ ) {
        gpuInit_struct initGPU = {i};
        g_GPUThreadPool[i].delegateSynchronous(
            initializeGPU,
            &initGPU );
        if ( cudaSuccess != initGPU.status ) {
            fprintf( stderr, "Initializing GPU %d failed "
                "with %d (%s)\n",
                i,
                initGPU.status,
                cudaGetErrorString( initGPU.status ) );
            return 1;
        }
    }
}

Once the worker threads are initialized, they suspend waiting on a thread synchronization primitive until the application thread dispatches work to them. Listing 9.6 shows the host code that dispatches work to the GPUs: The gpuDelegation structure encapsulates the work that a given GPU must do, and the gpuWorkerThread() function is invoked for each of the worker threads created by the code in Listing 9.5. The application thread code, shown in Listing 9.7, creates a gpuDelegation structure for each worker thread and calls the delegateAsynchronous() method to invoke the code in Listing 9.6. The waitAll() method then waits until all of the worker threads have finished. The performance and scaling results of the single-threaded and multithreaded versions of multi-GPU N-body are summarized in Section 14.7.
Listing 9.6 Host code (worker thread).
struct gpuDelegation {
size_t i; // base offset for this thread to process
size_t n; // size of this thread's problem
size_t N; // total number of bodies
float *hostPosMass;
float *hostForce;
float softeningSquared;
cudaError_t status;
};
void
gpuWorkerThread(void *_p)
{
cudaError_t status;
gpuDelegation *p = (gpuDelegation *) _p;
float *dptrPosMass = 0;
float *dptrForce = 0;
    // Each GPU has its own device pointer to the host pointer.
    CUDART_CHECK( cudaMalloc( &dptrPosMass, 4*p->N*sizeof(float) ) );
    CUDART_CHECK( cudaMalloc( &dptrForce, 3*p->n*sizeof(float) ) );
    CUDART_CHECK( cudaMemcpyAsync(
        dptrPosMass,
        p->hostPosMass,
        4*p->N*sizeof(float),
        cudaMemcpyHostToDevice ) );
    ComputeNBodyGravitation_multiGPU<<<300,256,256*sizeof(float4)>>>(
        dptrForce,
        dptrPosMass,
        p->softeningSquared,
        p->i,
        p->n,
        p->N );
    // NOTE: synchronous memcpy, so no need for further
    // synchronization with device
    CUDART_CHECK( cudaMemcpy(
        p->hostForce + 3*p->i,
        dptrForce,
        3*p->n*sizeof(float),
        cudaMemcpyDeviceToHost ) );
Error:
    cudaFree( dptrPosMass );
    cudaFree( dptrForce );
    p->status = status;
}
Listing 9.7 Host code (application thread).
float
ComputeGravitation_multiGPU_threaded(
    float *force,
    float *posMass,
    float softeningSquared,
    size_t N )
{
    chTimerTimestamp start, end;
    chTimerGetTime( &start );
    {
        gpuDelegation *pgpu = new gpuDelegation[g_numGPUs];
        size_t bodiesPerGPU = N / g_numGPUs;
        if ( N % g_numGPUs ) {
            return 0.0f;
        }
        size_t i;
        for ( i = 0; i < g_numGPUs; i++ ) {
            pgpu[i].hostPosMass = g_hostAOS_PosMass;
            pgpu[i].hostForce = g_hostAOS_Force;
            pgpu[i].softeningSquared = softeningSquared;
            pgpu[i].i = bodiesPerGPU*i;
            pgpu[i].n = bodiesPerGPU;
            pgpu[i].N = N;
            g_GPUThreadPool[i].delegateAsynchronous(
                gpuWorkerThread,
                &pgpu[i] );
        }
        workerThread::waitAll( g_GPUThreadPool, g_numGPUs );
        delete[] pgpu;
    }
    chTimerGetTime( &end );
    return chTimerElapsedTime( &start, &end ) * 1000.0f;
}