Numba supports Nvidia and AMD GPUs. In this course, however, we are going to focus on CUDA capable GPUs from Nvidia. There are two reasons for that: CUDA is better supported and we don't have any AMD system that can run Numba on the GPU right now.
If you want to use an already installed CUDA SDK, there are two environment variables that need to be set to let numba know where it can find the libraries it needs. On JURECA this can be done like this:
export NUMBAPRO_NVVM=/usr/local/software/jureca/Stages/2017a/software/CUDA/8.0.61/nvvm/lib64/libnvvm.so export NUMBAPRO_LIBDEVICE=/usr/local/software/jureca/Stages/2017a/software/CUDA/8.0.61/nvvm/libdevice/
For other systems you will have to adjust the paths.
We already learned how to vectorize a function. Remember the Mandelbrot set. We defined a function that returns the number of iterations needed to decide if the algorithm diverges.
from numba import vectorize def escape_time(p, maxtime): """Perform the Mandelbrot iteration until it's clear that p diverges or the maximum number of iterations has been reached. """ z = 0j for i in range(maxtime): z = z ** 2 + p if abs(z) > 2: return i return maxtime
import numpy x = numpy.linspace(-2, 2, 100) y = numpy.linspace(-1.5, 1.5, 75) zr, zc = numpy.meshgrid(x, 1j * y)
escape_time_v = vectorize("int64(complex128, int64)", target="parallel")(escape_time)
t_cpu = %timeit -o escape_time_v(zr + zc, 10)
If you replace
target="cuda" the function runs on the GPU instead. Give it a try and compare the performance for different sizes:
Numba allows you to write CUDA programs in Python.
While a complete introduction to CUDA is beyond the scope of this course---there are other courses for this, for example, GPU Programming with CUDA @ JSC and also many online resources available---here you'll get the nutshell version and some of the differences between CUDA C++ and CUDA Python.
CPUs are optimized for latency.
A CPU tries to execute a given instruction as quickly as possible, i.e., it tries to keep the latency (the time between issuing and executing an instruction) as short as possible. CPUs use caches and a lot of control logic to achieve this goal.
GPUs are optimized for throughput.
GPUs were (and are) made to display graphics on your screen. It doesn't matter how quickly a GPU can update a single pixel. It's important how quickly it can update all of the pixels on the screen (more than 2 million on an HD display). In addition it often must perform the same operation on a lot of vertices or pixels.
These two conditions let to a different execution model.
GPUs use many lightweight threads.
GPUs hide latency instead of avoiding it
GPUs work best if the problem can me mapped on a grid.
We can calculate the Mandelbrot set using
escape_time like this:
maxiter = 20 rlim = (-2.2, 1.5) ilim = (-1.5, 1.5) nx = 100 ny = 75 dx = (rlim - rlim) / nx dy = (ilim - ilim) / ny M = numpy.zeros((ny, nx), dtype=int) for i in range(ny): for j in range(nx): p = rlim + j * dx + (ilim + i * dy) * 1j M[i, j] = escape_time(p, maxiter)
%matplotlib inline import matplotlib.pyplot as plt plt.imshow(M, interpolation="nearest")
Notice that for every pair (i, j), we calculate the escape time. This makes
p = rlim + j * dx + (ilim + i * dy) * 1j M[i, j] = escape_time(p, maxiter)
our kernel that we execute on the grid spanned by the two for loops. It's quite apparent on the image above, where the coordinates for each value are shown in index space.
The basic unit of execution is a thread. Each thread has an x, y, and z index that can be queried in a kernel.
from numba import cuda @cuda.jit def my_kernel: x,y,z = cuda.grid(3) # use 2 if you only need x, y and 1 for x
Threads are organized in blocks. The maximum dimension of a thread block is 1024x1024x64, where the volume of the thread block (the number of threads in the block) must be 1024 or less.
Blocks are organized in a grid to cover all the threads needed to solve you problem. The maximum grid dimensions are 2147483647x65535x65535 blocks!
For writing a CUDA kernel, we use
cuda.jit instead of
jit as decorator (see above). The kernel can also call other functions that have been decorated with
cuda.jit(device=True). A Mandelbrot kernel could look like this:
import numpy from numba import cuda escape_time_gpu = cuda.jit(device=True)(escape_time) @cuda.jit def mandelbrot_gpu(M, real_min, real_max, imag_min, imag_max): """Calculate the Mandelbrot set on the GPU. Parameters ---------- M : numpy.ndarray a two-dimensional integer array that will contain the escape times for each point. real_min: float minimum value on the real axis real_max: float maximum value on the real axis imag_min: float minimum value on the imaginary axis imag_max: float maximum value on the imaginary axis """ ny, nx = M.shape i, j = cuda.grid(2) if i < ny and j < nx: dx = (real_max - real_min) / nx dy = (imag_max - imag_min) / ny p = real_min + dx * i + (imag_min + dy * j) * 1j M[i, j] = escape_time_gpu(p, 20)
CUDA needs to know how many blocks and how many threads per block you want to launch. This is called the launch configuration. To calculate the Mandelbrot set with of two 1024 by 1024 points with a block size of 32 by 32, we need 1024/32=32 by 1024/32=32 blocks.
If we want to calculate the set for 1032 by 1032 points, we have to add an extra block. The following lines take care of that:
M = numpy.zeros((1024, 1024), dtype=numpy.int32) block = (32, 32) grid = (M.shape // block if M.shape % block == 0 else M.shape // block + 1, int(M.shape // block if M.shape % block == 0 else M.shape // block + 1))
The launch configuration is passed in square brackets before the function arguments.
mandelbrot_gpu[grid, block](M, -2.0, 2.0, -1.6, 1.6)
a) Use %timeit to measure the speed of the GPU implementation compared to the vectorized version.
The kernel calculates dx and dy for every pixel although it is the same for all of them. Change the kernel so that it takes dx and dy as arguments and calculate dx and dy before you call the kernel. Does this improve the performance?
Write a kernel, where each thread calculates one element of the result matrix $C=AB$, where A and B are matrices.
For the algorithm, you can take a look at Bottlenecks.
As you learned in Bottlenecks, the matrix matrix multiplication tends to be memory bandwidth bound. This is true on the GPU, too.
The way to make it faster is to use faster memory. On a CPU this usually means, dividing the matrix into blocks that fit in cache and hope for the best. On a GPU at lease part of the fast memory is usually programmable. In CUDA this memory is called shared memory.
Shared memory is available to all threads in a thread block. Usually, each thread loads data from device memory into shared memory. This is followed by barrier, so that all threads are finished reading. Then the shared memory is reused as often as possible. Another barrier makes sure that all threads are done.
Let's look at an example:
The following function performs a matrix multiplication on the GPU using shared memory.
from numba import cuda, float32 # Controls threads per block and shared memory usage. # The computation will be done on blocks of TPBxTPB elements. TPB = 16 @cuda.jit def fast_matmul(A, B, C): # Define an array in the shared memory # The size and type of the arrays must be known at compile time sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32) sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32) x, y = cuda.grid(2) tx = cuda.threadIdx.x ty = cuda.threadIdx.y bpg = cuda.gridDim.x # blocks per grid if x >= C.shape and y >= C.shape: # Quit if (x, y) is outside of valid C boundary return # Each thread computes one element in the result matrix. # The dot product is chunked into dot products of TPB-long vectors. tmp = 0. for i in range(bpg): # Preload data into shared memory sA[tx, ty] = A[x, ty + i * TPB] sB[tx, ty] = B[tx + i * TPB, y] # Wait until all threads finish preloading cuda.syncthreads() # Computes partial product on the shared memory for j in range(TPB): tmp += sA[tx, j] * sB[j, ty] # Wait until all threads finish computing cuda.syncthreads() C[x, y] = tmp
The above example shows most special functions used in CUDA kernels:
All the cuda specific API is found in numbas cuda module (line 1)
CUDA kernels are defined like regular Python functions with the added decorator
@cuda.jit (line 7). The decorator makes sure that the function is compiled for the GPU.
As mentioned above, CUDA kernels are executed as a grid of blocks of threads. cuda.grid (line 14) returns the global indices of the current thread, e.g.,
x, y = cuda.grid(2) for a two dimensional grid. The argument gives the number of dimensions (1, 2, 3). This function does not exist in CUDA C++.
In CUDA C++, the programmer usually calculates the global index from the thread index
threadIdx, the block index
blockIdx, and the size (dimension) of the block stored in
int x = blockIdx.x * blockDim.x + threadIdx.x;
In numba these are available through the cuda module, so that you can rewrite the C++ code above as
x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
fast_matmul these functions are used to get the local index of the thread, which is important to use the shared memory (lines 16--18).
Shared memory is a programmable cache accessible to all threads within a block. In C++ it is allocated within a kernel as
__shared__ float sA[TPB, TPB]; // Allocate a 2D array of floats of size TPB
The cuda module provides shared.array (line 11 & 12) to do the same thing.
The last function from the cuda module used in
fast_matmul is cuda.syncthreads(), which implements a barrier for the threads within a block. It corresponds to
__syncthreads() in CUDA C++.
These functions cover most kernels.
Not all CUDA features are implemented in Numba. Some missing features are listed at http://numba.pydata.org/numba-doc/dev/cuda/overview.html#missing-cuda-features. Currently they include dynamic parallelism and texture memory.