7,468 bytes added ,  18:17, 30 November 2020
There are many different ways to work with GPUs using Python. This page provides a discussion of the foundations behind working with GPUs, from the fundamental choice between CUDA and OpenCL to what it means to compile a kernel. It then covers the dominant approaches for using CUDA with Python.
==Foundations==
===CUDA vs. OpenCL===
At a fundamental level, using a GPU for computing means using [https://en.wikipedia.org/wiki/CUDA CUDA], [https://en.wikipedia.org/wiki/OpenCL OpenCL], or some other interface (OpenGL compute, Microsoft's DirectCompute, etc.). The big trade-off between CUDA and OpenCL is proprietary performance vs. open-standard generality. Usually, I favour the latter. However, at this point, NVIDIA chipsets dominate the market and CUDA (which only runs on NVIDIA hardware) seems to be the obvious choice. There have also been some attempts to make CUDA run on OpenCL.
===CUDA for C++ or Fortran===
===Compiling a Kernel===
In the language of GPU computing, we need to compile a [https://en.wikipedia.org/wiki/Compute_kernel kernel] to run on the GPU. Some packages (discussed later) abstract away how GPUs handle memory and processing, but you should be aware of the fundamentals as they are often very important to maximizing the code's performance: if you understand the hardware implementation, you can tune for it!
 
[https://scholars.duke.edu/person/cliburn.chan Chi Wei Cliburn Chan], an associate professor of Biostatistics and Bioinformatics at Duke, teaches [https://people.duke.edu/~ccc14/ lots of great classes] and provides a guide to [https://people.duke.edu/~ccc14/sta-663/CUDAPython.html Massively parallel programming with GPUs] as a part of his [https://people.duke.edu/~ccc14/sta-663/ Computational Statistics in Python] class. The 2018 version of the class, [http://people.duke.edu/~ccc14/sta-663-2018/ STA 663: Computational Statistics and Statistical Computing (2018)], runs under the same course number and has sections on Spark, Tensorflow, Cython, and more! The guide has a pretty good walk-through of how a CUDA kernel runs, though it is missing some images. See also:
*The streaming multiprocessor and the CUDA core: https://i.stack.imgur.com/kvu4M.jpg
*CUDA memory hierarchy: https://www.researchgate.net/profile/Marco_Nobile/publication/261069154/figure/fig1/AS:296718735298563@1447754667270/Schematization-of-CUDA-architecture-Schematic-representation-of-CUDA-threads-and-memory.png
*Various slides from the CUDA tutorial by Cyril Zeller (NVIDIA Developer Technology): https://www.slideshare.net/angelamm2012/nvidia-cuda-tutorialnondaapr08
 
The key things that you need to know are:
* By default, one '''kernel''' is executed at a time on a device (newer devices can run kernels from different streams concurrently)
* Many '''threads''' execute each kernel - each thread runs the same code but on different data (based on its threadID)
* Threads are grouped into '''blocks''' and a kernel runs on a '''grid''' of blocks
* Blocks can't synchronize with each other. They can run concurrently or sequentially, in any order.
* Threads have local memory ('''registers''', ~1 clock cycle), '''blocks share memory''' (~10 clock cycles), and kernels have '''per-device global memory''' (~100s to 1000 clock cycles)
* Per-device memory can transfer data to/from the CPU, and includes '''global''', '''local''' (for consecutive access by a thread), '''constant''' (much faster than other per-device memory), and some specialized memories for graphics ('''texture''' and surface).
* Transfers from global memory to local registers are in 4, 8, or 16 byte units (other sizes can incur a penalty, which slows things down). Threads can talk to constant and texture memory.
* Blocks should have dimension >=32, i.e. at least one full warp of threads (see warps below).
* A GPU device is a set of '''[https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT multiprocessors]'''
* The number of threads in a '''warp''' is the "warp size". It's usually 32. You can find yours by running the deviceQuery utility provided in the samples folder. See [[DIGITS DevBox#Test the installation]]. Warps are then grouped into blocks.
* At each clock cycle, a multiprocessor executes the same instruction on a warp. Threads within a warp are executed physically in parallel. Warps and blocks are executed logically in parallel.
* Kernel launches are asynchronous - the CPU hands off the kernel and moves on. The kernel only executes once all previous CUDA calls have completed.
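The relationship between threads, blocks, grids, and warps described above can be sketched in plain Python, with no GPU required. This is only a serial emulation under stated assumptions: the names <code>launch_config</code> and <code>emulate_kernel</code> are hypothetical, the warp size is assumed to be 32, and a real device would run the threads in parallel rather than in nested loops.

```python
import math

WARP_SIZE = 32  # assumed value; check deviceQuery for your own device


def launch_config(n_items, threads_per_block=128):
    """Return (blocks_per_grid, warps_per_block) needed to cover n_items."""
    blocks_per_grid = math.ceil(n_items / threads_per_block)
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return blocks_per_grid, warps_per_block


def emulate_kernel(data, threads_per_block=128):
    """Serially emulate a 1D kernel that doubles each element of data."""
    out = list(data)
    blocks_per_grid, _ = launch_config(len(data), threads_per_block)
    for block_idx in range(blocks_per_grid):         # blocks may run in any order
        for thread_idx in range(threads_per_block):  # threads within one block
            # Global thread ID: every thread runs the same code on different data.
            i = block_idx * threads_per_block + thread_idx
            if i < len(data):  # guard: the last block may overhang the data
                out[i] = 2 * data[i]
    return out


blocks, warps = launch_config(1000, threads_per_block=128)
print(blocks, warps)  # 8 4
print(emulate_kernel([1, 2, 3], threads_per_block=32))  # [2, 4, 6]
```

The bounds check matters because the grid is rounded up to a whole number of blocks, so some threads in the last block usually have no data to work on.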
==CUDA and Python==
 
There are lots of different ways to use GPUs with Python. Here's a partial list:
 
===Tensorflow===
 
Tensors are multi-dimensional arrays with a uniform type. Read all about them and TensorFlow here: https://www.tensorflow.org/guide/. The highlights are:
*Use it as a library in Python, importing methods.
*It plays well with NumPy!
*Keras sits on top of it as its high-level API, allowing custom ANN development.
*TensorFlow has both low-level methods (e.g., tf.Variable, tf.math, tf.GradientTape.jacobian, etc.) and pre-built estimators (e.g., tf.estimator.LinearClassifier).
*Crucially, TensorFlow lets users build graphs and tf.functions that exist and persist beyond the Python interpreter (analogous to kernels)
*You can define custom models and layers for machine learning
 
===CuPy===
 
CuPy is a NumPy-compatible array library (it's really a 'drop-in' replacement) accelerated by CUDA: https://cupy.dev/.
*It leverages CUDA-related libraries including cuBLAS, cuDNN, cuRAND, cuSOLVER, cuSPARSE, cuFFT and NCCL
*Lets users build elementwise or reduction kernels, or raw kernels that are defined using raw CUDA source
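As a rough illustration of the 'drop-in' idea, the sketch below writes array code once and runs it on whichever array module is available. The fallback to NumPy when CuPy or a CUDA device is missing is an assumption made so the example runs anywhere, and the helper names <code>standardize</code> and <code>to_host</code> are hypothetical.

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # probe: raises if no usable CUDA device is present
    xp = cp
except Exception:
    xp = np  # assumption for this sketch: fall back to NumPy on the CPU


def standardize(a):
    """Scale each column of a 2D array to zero mean and unit variance."""
    return (a - a.mean(axis=0)) / a.std(axis=0)


def to_host(a):
    """Return a NumPy array whether a lives on the GPU or the CPU."""
    return a.get() if hasattr(a, "get") else np.asarray(a)


data = xp.arange(12, dtype=xp.float64).reshape(4, 3)
result = to_host(standardize(data))
print(result.mean(axis=0))  # column means, close to zero
```

Because <code>standardize</code> only uses array methods that both libraries provide, the same function body works for NumPy and CuPy arrays. Only the final copy back to the host differs.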
 
===Numba===
 
Numba translates python functions to optimized machine code at runtime: https://numba.pydata.org/
*Define threads, grids, and blocks and manage them with 'facilities like those exposed by CUDA C'.
*CUDA kernels accept NumPy arrays as arguments, but most NumPy functions can't be called inside a kernel because they would need to allocate memory dynamically.
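A minimal sketch of a Numba CUDA kernel follows. It assumes that falling back to a plain NumPy implementation is acceptable when Numba or a CUDA device is unavailable, and the names <code>saxpy</code>, <code>saxpy_kernel</code>, and <code>saxpy_reference</code> are hypothetical. NumPy arrays passed to the kernel are copied to the device and back by Numba. The restriction on NumPy applies to calling NumPy functions inside the kernel body.

```python
import numpy as np


def saxpy_reference(a, x, y):
    """CPU reference: a * x + y, computed element by element by NumPy."""
    return a * x + y


try:
    from numba import cuda
    gpu_ok = cuda.is_available()
except ImportError:
    gpu_ok = False

if gpu_ok:
    @cuda.jit
    def saxpy_kernel(a, x, y, out):
        i = cuda.grid(1)   # global thread ID in a 1D grid
        if i < out.size:   # guard against threads past the end of the array
            out[i] = a * x[i] + y[i]

    def saxpy(a, x, y):
        out = np.empty_like(x)
        threads_per_block = 128
        blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block
        # NumPy arrays are transferred to the device and back automatically.
        saxpy_kernel[blocks_per_grid, threads_per_block](a, x, y, out)
        return out
else:
    saxpy = saxpy_reference  # assumption for this sketch: CPU fallback

x = np.arange(5, dtype=np.float32)
print(saxpy(2.0, x, x))
```

For repeated launches, explicit transfers with <code>cuda.to_device</code> and <code>copy_to_host</code> avoid copying the same arrays on every call.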
 
===PyCUDA===
 
PyCUDA is a wrapper of CUDA's API for Python: https://wiki.tiker.net/PyCuda/
*Essentially gives you access to CUDA's methods, and handles memory allocation and cleanup.
*Doesn't work with cuBLAS. PyCUDA is based on the CUDA driver API, which is mutually exclusive with the CUDA runtime API (used by cuBLAS).
*It does, however, reimplement a part of cuBLAS as GPUArray.
*Works with NumPy.
 
===Other options===
 
There's also [https://pypi.org/project/pyopencl/ PyOpenCL] (another wrapper, but of OpenCL instead), [https://pytorch.org/ PyTorch] (a tensor library that provides a replacement for NumPy for working with GPUs), [https://github.com/Theano/Theano Theano], [https://pypi.org/project/Hebel/ Hebel] (a deep-learning library on top of PyCUDA and NumPy), [https://caffe2.ai/docs/getting-started.html Caffe2] (which uses cuDNN C++ libraries to provide a deep learning framework) and likely some other things that I have forgotten about.
 
==Other things==
 
The NVIDIA CUDA Toolkit 7.5 is available for free on Amazon Linux: https://aws.amazon.com/marketplace/pp/B01LZMLK1K
 
When developing CUDA code you'll want (both are installed as a part of the CUDA Software Development Kit):
*NVIDIA Nsight Eclipse Edition for C/C++
*NVIDIA Visual Profiler
 
There's a (probably pretty bad) book which has chapters on working with CUDA in python: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781789341072/6/ch06lvl1sec55/configuring-pycuda-on-your-python-ide
 
===IDEs===
 
You'll likely want to write your Python in an IDE that supports your CUDA development. It looks like PyCharm CE (Community Edition) will work with a local environment:
https://medium.com/@ashkan.abbasi/quick-guide-for-installing-python-tensorflow-and-pycharm-on-windows-ed99ddd9598. But if we use a Docker container, then we'll need PyCharm PE (Professional Edition) or Visual Studio Code (which should work fine with Python):
https://www.analyticsvidhya.com/blog/2020/08/docker-based-python-development-with-cuda-support-on-pycharm-and-or-visual-studio-code/. Visual Studio Code looks like it allows remote development, using either Docker containers or SSH, which might be very nice!
https://devblogs.microsoft.com/python/remote-python-development-in-visual-studio-code/ I'm not sure whether PyCharm CE has this feature, though PE does: https://www.jetbrains.com/help/idea/configuring-remote-python-sdks.html.
 
===Matlab===
 
If you want to use Matlab then Garland's GPU workshop is a good place to start: http://www.quantosanalytics.org/calpoly/gpu_workshop/index.html
