Parallel computing using general-purpose GPUs is really taking off with technology advances from Nvidia, AMD, and Intel. Nvidia in particular dominates the field with a variety of GPU offerings and its software infrastructure CUDA (initially an acronym for Compute Unified Device Architecture), a parallel computing platform and application programming interface (API) model. In this article, I share what GPU programming with CUDA looks like on a UCS server with an Nvidia GRID K1 GPU.
GPU Programming Model
First, let’s introduce two terms:
Host: the CPU (e.g. x86, ARM) and its memory (host memory).
Device: the GPU (e.g. an Nvidia GPU) and its memory (device memory).
There is host code, which executes on the host CPU (e.g. x86), and there is device code, which the host loads and pushes to the GPU to run. The following diagram shows the programming model and execution flow.
The basic hello world code is shown below:
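Here is a minimal sketch of such a program (the kernel name and messages are illustrative):

    #include <stdio.h>

    // Device code: the __global__ keyword marks a kernel that runs on
    // the GPU and is callable from host code.
    __global__ void hello(void) {
        printf("Hello World from the GPU!\n");
    }

    int main(void) {
        // Host code: launch the kernel on one block with one thread.
        hello<<<1, 1>>>();
        // Wait for the GPU to finish before the program exits.
        cudaDeviceSynchronize();
        printf("Hello World from the CPU!\n");
        return 0;
    }

Save it as hello.cu, compile with “nvcc hello.cu -o hello”, and run “./hello”.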
The host code is the same as usual, but the device code is marked with a new keyword, “__global__”.
The host code invokes the device code almost as usual, except that it adds “<<<1,1>>>”, which means using one block and one thread. That isn’t interesting at all, so let’s try something more interesting.
Running an example of vector addition using multiple blocks and multiple threads
The main advantage of GPU computing is running huge numbers of executions in parallel. For that, CUDA introduces:
Block: on the device, blocks execute in parallel; each block has the index “blockIdx.x”
Thread: a block can be split into parallel threads; each thread has the index “threadIdx.x”
Combining blocks with threads: the combined global index is “threadIdx.x + blockIdx.x * blockDim.x”, as the kernel sketch below shows
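For example, a vector-addition kernel can use that combined index to pick the one element each thread works on (a sketch; the bounds check matters when the array length is not an exact multiple of the block size):

    // Each of the many parallel threads adds exactly one pair of elements.
    __global__ void add(const int *a, const int *b, int *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)              // guard against out-of-range indices
            c[index] = a[index] + b[index];
    }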
The code below creates two input arrays holding random integers, and a third array to hold the result of the addition, which is performed by the GPU.
First, make sure we have the Nvidia GRID K1 GPU and the Nvidia compiler in place:
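On a typical Linux setup, the GPU can be listed with “lspci | grep -i nvidia” and the compiler checked with “nvcc --version” (the exact output depends on the driver and toolkit versions installed).

Putting it together with the add kernel sketched earlier, the full program might look like this (a sketch; the array size, threads per block, and variable names are illustrative, and error checking is omitted for brevity):

    #include <stdio.h>
    #include <stdlib.h>

    #define N (2048 * 2048)
    #define THREADS_PER_BLOCK 512

    __global__ void add(const int *a, const int *b, int *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }

    int main(void) {
        int size = N * sizeof(int);

        // Host copies of the two inputs and the result
        int *a = (int *)malloc(size);
        int *b = (int *)malloc(size);
        int *c = (int *)malloc(size);

        // Fill the input arrays with random integers
        for (int i = 0; i < N; i++) {
            a[i] = rand() % 100;
            b[i] = rand() % 100;
        }

        // Device copies of a, b, c
        int *d_a, *d_b, *d_c;
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Copy the inputs from host memory to device memory
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch enough blocks to cover all N elements
        int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
        add<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);

        // Copy the result back to host memory (this also waits for the GPU)
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Spot-check one element
        printf("c[0] = %d (expected %d)\n", c[0], a[0] + b[0]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(a); free(b); free(c);
        return 0;
    }

Compile with “nvcc add.cu -o add” and run “./add”; each of the N additions is performed by its own GPU thread.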