Parallel computing using general purpose GPU is really taking off with advancement of technology from Nvidia, AMD, Intel. Especially Nvidia is dominating the field with variety of GPU offering and also software infrastructure CUDA (initially called as Compute Unified Device Architecture), which is a parallel computing platform and application programming interface (API) model. In this article, I share how GPU programming with CUDA looks like using UCS server with Nvidia GPU GRID K1

GPU Programming with CUDA

GPU Programming model

First introducing two terminologies:

  • Host: The CPU (e.g. x86, ARM) and its memory (host memory)
  • Device: The GPU (e.g. Nvidia GPU) and its memory (device memory).

There will be host code, which is executed in host CPU (e.g. x86); and there will be device code, which is loaded in host and push into GPU to run. The following diagram shows programming model and execution flow.

GPU Programming with CUDA model

The basic hello world code is shown below:


__global__ void mykernel(void) {


int main(void) {
    printf("Hello World!\n");
    return 0;

The host code is same as usual, but the device code is marked with a new keyword “global”. host code invokes device code almost same as usual, except it adds «1,1» which means using one block and one thread, which isn’t interesting at all. Let’s trying something more interesting.

Running an example of vector additions using mutiple block and multiple threads

The main advantage of GPU computing is to have huge numbers of parallel executions. For that, CUDA introduces:

  • Block: On the device, each block can execute in parallel, each block has index of “blockIdx.x”
  • Thread: a block can be split into parallel threads, each thread has index of “threadIdx.x”
  • Combining block with thread: the index is “threadIdx.x + blockIdx.x * blockDim.x”

GPU block thread index access

The below code creates two input arrays, which holds random integers, and the third array to hold result of addition, which is to be done by GPU.

First make sure we have Nvidia GPU GRID K1 and Nvidia compiler in place:

iot@iotg-ml-1:~/cuda-ex$ nvidia-smi 
Fri Jun  2 06:41:22 2017       
| NVIDIA-SMI 367.124                Driver Version: 367.124                   |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GRID K1             Off  | 0000:85:00.0     Off |                  N/A |
| N/A   34C    P0    13W /  31W |      0MiB /  4036MiB |      0%      Default |
|   1  GRID K1             Off  | 0000:86:00.0     Off |                  N/A |
| N/A   35C    P0    13W /  31W |      0MiB /  4036MiB |      0%      Default |
|   2  GRID K1             Off  | 0000:87:00.0     Off |                  N/A |
| N/A   27C    P0    13W /  31W |      0MiB /  4036MiB |      0%      Default |
|   3  GRID K1             Off  | 0000:88:00.0     Off |                  N/A |
| N/A   30C    P0    12W /  31W |      0MiB /  4036MiB |      0%      Default |
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|  No running processes found                                                 |
iot@iotg-ml-1:~/cuda-ex$ which nvcc
iot@iotg-ml-1:~/cuda-ex$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Sep__4_22:14:01_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44

Now let’s try following example:

iot@iotg-ml-1:~/cuda-ex$ cat 
#include <stdio.h>

#define N (2048*2048)

__global__ void add(int *a, int *b, int *c, int n) {
	int index = threadIdx.x + blockIdx.x * blockDim.x;
	if (index < n)
		c[index] = a[index] + b[index];

int main(void) {
	int *a, *b, *c;	// host copies of a, b, c
	int *d_a, *d_b, *d_c; // device copies of a, b, c
	int size = N * sizeof(int);
	int i;

	// Alloc space for device copies of a, b, c
	cudaMalloc((void **)&d_a, size);
	cudaMalloc((void **)&d_b, size);
	cudaMalloc((void **)&d_c, size);
	// Alloc space for host copies of a, b, c and setup input values
	a = (int *)malloc(size);  for (i = 0; i < N; i++)  a[i] = rand()/10;
	b = (int *)malloc(size);  for (i = 0; i < N; i++)  b[i] = rand()/10;
	c = (int *)malloc(size);
	// Copy inputs to device
	cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
	cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
	// Launch add() kernel on GPU
	add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);
	// Copy result back to host
	cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
	// Verify the result
	printf("\n Verifying the result:");
        for (i = 0; i < N; i++) {
	  if ((a[i] + b[i]) != c[i]) {
	    printf("Failed at %d: a=%d, b=%d, c=%d \n", i, a[i], b[i], c[i]);
        if (i == N) printf("PASSED!\n\n");

	// Cleanup
	free(a); free(b); free(c);
	cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
	return (0);

iot@iotg-ml-1:~/cuda-ex$ nvcc -o gpu-add-vector
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
iot@iotg-ml-1:~/cuda-ex$ ./gpu-add-vector 

 Verifying the result:PASSED!