Numba supports Nvidia and AMD GPUs. In this course, however, we are going to focus on CUDA capable GPUs from Nvidia. There are two reasons for that: CUDA is better supported and we don't have any AMD system that can run Numba on the GPU right now.

If you want to use an already installed CUDA SDK, there are two environment variables that need to be set to let numba know where it can find the libraries it needs. On JURECA this can be done like this:

```
export NUMBAPRO_NVVM=/usr/local/software/jureca/Stages/2017a/software/CUDA/8.0.61/nvvm/lib64/libnvvm.so
export NUMBAPRO_LIBDEVICE=/usr/local/software/jureca/Stages/2017a/software/CUDA/8.0.61/nvvm/libdevice/
```

For other systems you will have to adjust the paths.

In [ ]:

```
```

In [ ]:

```
from numba import vectorize
def escape_time(p, maxtime):
"""Perform the Mandelbrot iteration until it's clear that p diverges
or the maximum number of iterations has been reached.
"""
z = 0j
for i in range(maxtime):
z = z ** 2 + p
if abs(z) > 2:
return i
return maxtime
```

In [ ]:

```
import numpy
x = numpy.linspace(-2, 2, 100)
y = numpy.linspace(-1.5, 1.5, 75)
zr, zc = numpy.meshgrid(x, 1j * y)
```

In [ ]:

```
escape_time_v = vectorize("int64(complex128, int64)",
target="parallel")(escape_time)
```

In [ ]:

```
t_cpu = %timeit -o escape_time_v(zr + zc, 10)
```

`target="parallel"`

with `target="cuda"`

the function runs on the GPU instead. Give it a try and compare the performance for different sizes:

In [ ]:

```
```

Numba allows you to write CUDA programs in Python.

While a complete introduction to CUDA is beyond the scope of this course---there are other courses for this, for example, GPU Programming with CUDA @ JSC and also many online resources available---here you'll get the nutshell version and some of the differences between CUDA C++ and CUDA Python.

CPUs are optimized for latency.

GPUs are optimized for throughput.

GPUs were (and are) made to display graphics on your screen. It doesn't matter how quickly a GPU can update a single pixel. It's important how quickly it can update all of the pixels on the screen (more than 2 million on an HD display). In addition it often must perform the same operation on a lot of vertices or pixels.

These two conditions let to a different execution model.

GPUs use *many* lightweight threads.

GPUs hide latency instead of avoiding it

GPUs work best if the problem can me mapped on a grid.

We can calculate the Mandelbrot set using `escape_time`

like this:

In [ ]:

```
maxiter = 20
rlim = (-2.2, 1.5)
ilim = (-1.5, 1.5)
nx = 100
ny = 75
dx = (rlim[1] - rlim[0]) / nx
dy = (ilim[1] - ilim[0]) / ny
M = numpy.zeros((ny, nx), dtype=int)
for i in range(ny):
for j in range(nx):
p = rlim[0] + j * dx + (ilim[0] + i * dy) * 1j
M[i, j] = escape_time(p, maxiter)
```

In [ ]:

```
%matplotlib inline
import matplotlib.pyplot as plt
plt.imshow(M, interpolation="nearest")
```

Notice that for every pair (i, j), we calculate the escape time. This makes

```
p = rlim[0] + j * dx + (ilim[0] + i * dy) * 1j
M[i, j] = escape_time(p, maxiter)
```

our kernel that we execute on the grid spanned by the two for loops. It's quite apparent on the image above, where the coordinates for each value are shown in index space.

The basic unit of execution is a thread. Each thread has an x, y, and z index that can be queried in a kernel.

```
from numba import cuda
@cuda.jit
def my_kernel:
x,y,z = cuda.grid(3) # use 2 if you only need x, y and 1 for x
```

`cuda.jit`

instead of `jit`

as decorator (see above). The kernel can also call other functions that have been decorated with `cuda.jit(device=True)`

. A Mandelbrot kernel could look like this:

In [ ]:

```
import numpy
from numba import cuda
escape_time_gpu = cuda.jit(device=True)(escape_time)
@cuda.jit
def mandelbrot_gpu(M, real_min, real_max, imag_min, imag_max):
"""Calculate the Mandelbrot set on the GPU.
Parameters
----------
M : numpy.ndarray
a two-dimensional integer array that will contain the
escape times for each point.
real_min: float
minimum value on the real axis
real_max: float
maximum value on the real axis
imag_min: float
minimum value on the imaginary axis
imag_max: float
maximum value on the imaginary axis
"""
ny, nx = M.shape
i, j = cuda.grid(2)
if i < ny and j < nx:
dx = (real_max - real_min) / nx
dy = (imag_max - imag_min) / ny
p = real_min + dx * i + (imag_min + dy * j) * 1j
M[i, j] = escape_time_gpu(p, 20)
```

CUDA needs to know how many blocks and how many threads per block you want to launch. This is called the launch configuration. To calculate the Mandelbrot set with of two 1024 by 1024 points with a block size of 32 by 32, we need 1024/32=32 by 1024/32=32 blocks.

If we want to calculate the set for 1032 by 1032 points, we have to add an extra block. The following lines take care of that:

In [ ]:

```
M = numpy.zeros((1024, 1024), dtype=numpy.int32)
block = (32, 32)
grid = (M.shape[0] // block[0] if M.shape[0] % block[0] == 0
else M.shape[0] // block[0] + 1,
int(M.shape[0] // block[1] if M.shape[1] % block[1] == 0
else M.shape[1] // block[1] + 1))
```

The launch configuration is passed in square brackets before the function arguments.

In [ ]:

```
mandelbrot_gpu[grid, block](M, -2.0, 2.0, -1.6, 1.6)
```

In [ ]:

```
plt.imshow(M, interpolation="nearest")
```

a) Use %timeit to measure the speed of the GPU implementation compared to the vectorized version.

In [ ]:

```
```

In [ ]:

```
```

Write a kernel, where each thread calculates one element of the result matrix $C=AB$, where A and B are matrices.

For the algorithm, you can take a look at Bottlenecks.

In [ ]:

```
```

As you learned in Bottlenecks, the matrix matrix multiplication tends to be memory bandwidth bound. This is true on the GPU, too.

The way to make it faster is to use faster memory. On a CPU this usually means, dividing the matrix into blocks that fit in cache and hope for the best. On a GPU at lease part of the fast memory is usually programmable. In CUDA this memory is called *shared memory*.

Shared memory is available to all *threads in a thread block*. Usually, each thread loads data from device memory into shared memory. This is followed by barrier, so that all threads are finished reading. Then the shared memory is reused as often as possible. Another barrier makes sure that all threads are done.

Let's look at an example:

The following function performs a matrix multiplication on the GPU using shared memory.

In [ ]:

```
from numba import cuda, float32
# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16
@cuda.jit
def fast_matmul(A, B, C):
# Define an array in the shared memory
# The size and type of the arrays must be known at compile time
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
x, y = cuda.grid(2)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x # blocks per grid
if x >= C.shape[0] and y >= C.shape[1]:
# Quit if (x, y) is outside of valid C boundary
return
# Each thread computes one element in the result matrix.
# The dot product is chunked into dot products of TPB-long vectors.
tmp = 0.
for i in range(bpg):
# Preload data into shared memory
sA[tx, ty] = A[x, ty + i * TPB]
sB[tx, ty] = B[tx + i * TPB, y]
# Wait until all threads finish preloading
cuda.syncthreads()
# Computes partial product on the shared memory
for j in range(TPB):
tmp += sA[tx, j] * sB[j, ty]
# Wait until all threads finish computing
cuda.syncthreads()
C[x, y] = tmp
```

The above example shows most special functions used in CUDA kernels:

All the cuda specific API is found in numbas cuda module (line 1)

CUDA kernels are defined like regular Python functions with the added decorator `@cuda.jit`

(line 7). The decorator makes sure that the function is compiled for the GPU.

As mentioned above, CUDA kernels are executed as a grid of blocks of threads. cuda.grid (line 14) returns the global indices of the current thread, e.g., `x, y = cuda.grid(2)`

for a two dimensional grid. The argument gives the number of dimensions (1, 2, 3). This function does not exist in CUDA C++.

In CUDA C++, the programmer usually calculates the global index from the thread index `threadIdx`

, the block index `blockIdx`

, and the size (dimension) of the block stored in `blockDim`

```
int x = blockIdx.x * blockDim.x + threadIdx.x;
```

In numba these are available through the cuda module, so that you can rewrite the C++ code above as

```
x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
```

In `fast_matmul`

these functions are used to get the local index of the thread, which is important to use the shared memory (lines 16--18).

Shared memory is a programmable cache accessible to all threads within a block. In C++ it is allocated within a kernel as

```
__shared__ float sA[TPB, TPB]; // Allocate a 2D array of floats of size TPB
```

The cuda module provides shared.array (line 11 & 12) to do the same thing.

The last function from the cuda module used in `fast_matmul`

is cuda.syncthreads(), which implements a barrier for the threads within a block. It corresponds to `__syncthreads()`

in CUDA C++.

These functions cover most kernels.