Difference between revisions of "Python GPU programming"

From edegan.com
Jump to navigation Jump to search
Line 40: Line 40:
  
 
The key things that you need to know are:
 
The key things that you need to know are:
* One kernel is executed at a time on a device
+
* One '''kernel''' is executed at a time on a device
* Many threads execute each kernel - each thread runs the same code but on different data (based on its threadID)
+
* Many '''threads''' execute each kernel - each thread runs the same code but on different data (based on its threadID)
* Threads are grouped into blocks and a kernel runs on a grid of blocks
+
* Threads are grouped into '''blocks''' and a kernel runs on a '''grid''' of blocks
 
* Blocks can't synchronize. They can run concurrently or sequentially.
 
* Blocks can't synchronize. They can run concurrently or sequentially.
* Threads have local memory (registers ~ 1clock cycle), blocks share memory (~10 clock cycles), and kernels have per-device global memory(~100s/1000 clock cycles)
+
* Threads have local memory ('''registers''' ~ 1clock cycle), '''blocks share memory''' (~10 clock cycles), and kernels have '''per-device global memory''' (~100s/1000 clock cycles)
* Per device memory can transfer data to/from the CPU, and includes global, local (for consecutive access by a thread), constant (much faster than other per device), and some specialized memories for graphics (texture and surface).
+
* Per device memory can transfer data to/from the CPU, and includes '''global''', '''local''' (for consecutive access by a thread), '''constant''' (much faster than other per device), and some specialized memories for graphics ('''texture''' and surface).
 
* Transfers from global memory to local registers is in 4,8 or 16 byte units (or can incur a penalty, which slows things down). Threads can talk to constant and texture memory.
 
* Transfers from global memory to local registers is in 4,8 or 16 byte units (or can incur a penalty, which slows things down). Threads can talk to constant and texture memory.
* Blocks should have dimension >=32
+
* Blocks should have dimension >=32 (see warps below).
* A GPU device is a set of SIMT multiprocessors
+
* A GPU device is a set of '''[https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT multiprocessor]'''
* At each clock cycle, a multiprocessor executes the same instruction on a warp (the number of threads in a warp is the "warp size". It's usually 32. You can find yours by running the deviceQuery utility provided in the samples folder. See [[DIGITS DevBox#Test the installation]].
+
* The number of threads in a '''warp''' is the "warp size". It's usually 32. You can find yours by running the deviceQuery utility provided in the samples folder. See [[DIGITS DevBox#Test the installation]]. Warps are then grouped into blocks.
 +
* At each clock cycle, a multiprocessor executes the same instruction on a warp. Threads within a warp are executed physically in parallel. Warps and blocks are executed logically in parallel.
 +
* Kernel launches are asynchronous - the CPU hands off the kernel and moves on. The kernel only executes and all previous CUDA calls have completed.
  
 
==CUDA and Python==
 
==CUDA and Python==

Revision as of 16:26, 13 November 2020

There are many different ways to work with GPUs using Python. This page explores them!

Foundations

CUDA vs. OpenCL

At a fundamental level, using a GPU for computing means using [CUDA], [OpenCL], or some other interface (OpenGL compute, Microsoft's DirectCompute, etc.) The big trade-off between CUDA and OpenCL is proprietary performance vs. open-source generality. Usually, I favour the later. However, at this point, the nVIDIA chipsets dominate the market and CUDA (which only runs on nVIDIA) seems to be the obvious choice. There have also been some attempts to make CUDA run on CL.

CUDA for C++ or Fortran

If you are coding in C/C++ then you can compile CUDA code into PTX (a low-level virtual machine language that runs on GPUs) with the [nVIDIA CUDA Compiler (nvcc)]. The nvcc separates out CUDA code that will run on the GPU and compiles it to PTX, and leaves the rest to be compiled using your regular compiler (likely GCC or the Microsoft Visual C compiler). Likewise, nVIDIA provides a dedicated Fortran CUDA compiler (nVIDIA bought the Portland Group, Inc. -- PGI -- to this end).

Approaches for other languages

However, if you want to write GPU compute code in Python, Perl, Java, Matlab, or a host of other languages, you'll need to think carefully about which of the offered approaches is right for you. There are broadly four classes of approaches:

  1. Getting a language-specific compiler that isn't made by nVIDIA!
  2. Wrapping CUDA C++ (or Fortran) code directly into your code
  3. Using the low-level CUDA video driver API
  4. Using the higher-level CUDA Runtime API, which sits on top of the low-level CUDA video library (i.e., using both APIs)

The distinction between the last two is that only the Runtime API gives you access to the full set of libraries including:

  • cuBLAS – CUDA Basic Linear Algebra Subroutines library
  • cuSOLVER – CUDA based collection of dense and sparse direct solvers, see main and docs
  • cuSPARSE – CUDA Sparse Matrix library

If you're an economist, then these libraries are very likely what you're going to want! (If you're a physicist, or doing signal processing, then you'll probably want cuFFT and other libraries that are also in the Runtime API).

For the second option, you'll need to use SWIG (Simplified Wrapper and Interface Generator) or something that gives you equivalent functionality for your language, such as Cython for Python. (NPCUDA is a simple project to demo both of these methods in Python.) The major advantage of this option is that you aren't hitching your horse to the continued support of a whole series of intermediate APIs.

Notice for Perl Programmers
A quick look at the GPU support for Perl suggests that SWIG the way to go!

Compiling a Kernel

In the language of GPU computing, we need to compile a kernel to run on the GPU. Some packages (discussed later) abstract away how GPUs handle memory and processing, but you should be aware of the fundamentals as they are often very important to maximizing the code's performance: if you understand the hardware implementation, you can tune for it!

Chi Wei Cliburn Chan, an associate prof of Biostatistics and Bioinformatics at Duke, teaches lots of great classes, and provides a guide to Massively parallel programming with GPUs as a part of his Computational Statistics in Python class (note that the 2018 version of his STA 663: Computational Statistics and Statistical Computing (2018) class (under the same course number) has sections on Spark, Tensorflow, Cython, and more!). This guide has a pretty good walk-through of how a CUDA kernel runs, though it is missing some images. See also:

The key things that you need to know are:

  • One kernel is executed at a time on a device
  • Many threads execute each kernel - each thread runs the same code but on different data (based on its threadID)
  • Threads are grouped into blocks and a kernel runs on a grid of blocks
  • Blocks can't synchronize. They can run concurrently or sequentially.
  • Threads have local memory (registers ~ 1clock cycle), blocks share memory (~10 clock cycles), and kernels have per-device global memory (~100s/1000 clock cycles)
  • Per device memory can transfer data to/from the CPU, and includes global, local (for consecutive access by a thread), constant (much faster than other per device), and some specialized memories for graphics (texture and surface).
  • Transfers from global memory to local registers is in 4,8 or 16 byte units (or can incur a penalty, which slows things down). Threads can talk to constant and texture memory.
  • Blocks should have dimension >=32 (see warps below).
  • A GPU device is a set of SIMT multiprocessor
  • The number of threads in a warp is the "warp size". It's usually 32. You can find yours by running the deviceQuery utility provided in the samples folder. See DIGITS DevBox#Test the installation. Warps are then grouped into blocks.
  • At each clock cycle, a multiprocessor executes the same instruction on a warp. Threads within a warp are executed physically in parallel. Warps and blocks are executed logically in parallel.
  • Kernel launches are asynchronous - the CPU hands off the kernel and moves on. The kernel only executes and all previous CUDA calls have completed.

CUDA and Python