Python in HPC
|
Python speed-up can be approached in two ways: on a single core and across multiple cores
Single-core speed-up techniques include Cython, Numba and CFFI
MPI is the true method of parallelisation in Python for multi-core applications
|
Timing Code and Simple Speed-up Techniques
|
Performance profiling is used to identify where an application spends its execution time and to analyse how it can be improved.
Never try to optimise your code on the first pass. Get the code correct first.
Most often, about 90% of the execution time is spent in 10% of the code.
The lru_cache() decorator from functools reduces the execution time of a function through memoisation, caching previous results and discarding the least recently used entries first, as sketched below.
|
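A minimal sketch of memoisation with lru_cache; the fibonacci function here is only an illustration:

```python
from functools import lru_cache

@lru_cache(maxsize=128)      # keep up to 128 of the most recently used results
def fibonacci(n):
    """Naive recursive Fibonacci; the cache avoids recomputing subproblems."""
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(35))         # fast, because repeated calls are answered from the cache
```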
Numba
|
Numba only compiles individual functions rather than entire scripts.
The recommended compilation mode is nopython mode, enabled with nopython=True or the equivalent njit decorator, as in the sketch below
Numba is constantly changing, so keep checking for new versions.
|
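A minimal sketch of nopython-mode compilation with the njit decorator; the pairwise_sum function is only an illustration:

```python
import numpy as np
from numba import njit       # njit is shorthand for jit(nopython=True)

@njit
def pairwise_sum(x):
    """Sum of all pairwise products, compiled to machine code on first call."""
    total = 0.0
    for i in range(x.shape[0]):
        for j in range(x.shape[0]):
            total += x[i] * x[j]
    return total

x = np.random.rand(1000)
print(pairwise_sum(x))       # first call triggers compilation; later calls run at native speed
```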
Cython
|
Cython IS Python, only with C datatypes
Working from the terminal, .pyx, main.py and setup.py files are required.
From the terminal the code can be run with python setup.py build_ext --inplace
Cython ‘typeness’ can be inspected using the %%cython -a cell magic, where lines are tinted yellow in proportion to how much Python interaction they contain.
The main methods of improving Cython speed-up include static type declarations, declaring functions with cdef, using cimport and utilising fast indexing of typed C/NumPy arrays, as in the sketch below.
Compiler directives can be used to turn off certain Python features, such as bounds checking and negative-index wraparound.
|
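A minimal sketch of the terminal workflow, assuming an illustrative module called example: a statically typed function lives in example.pyx and setup.py builds the extension.

```cython
# example.pyx  (module name is illustrative)
# cython: boundscheck=False, wraparound=False   # compiler directives turning off Python checks

def mean(double[:] data):
    """Mean of a 1-D array of doubles, using static types and a typed memoryview."""
    cdef double total = 0.0
    cdef Py_ssize_t i
    for i in range(data.shape[0]):
        total += data[i]
    return total / data.shape[0]
```

```python
# setup.py -- build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("example.pyx"))
```

main.py can then simply import example and call example.mean() on a NumPy array of doubles.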
C Foreign Function Interface for Python
|
CFFI is an external package that provides a C Foreign Function Interface for Python, allowing one to interact with almost any C code
The Application Binary Interface (ABI) mode is easier to use, but slower, as in the sketch below
The Application Programming Interface (API) mode is more complex, but faster
|
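A minimal ABI-mode sketch, calling sqrt from the system maths library; the library name libm.so.6 is Linux specific and would need adjusting on other platforms:

```python
from cffi import FFI

ffi = FFI()
ffi.cdef("double sqrt(double x);")   # declare the C signature we want to call

# ABI mode: open an existing shared library at run time, no compilation step needed.
libm = ffi.dlopen("libm.so.6")       # Linux-specific name

print(libm.sqrt(2.0))
```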
MPI with Python
|
MPI is the true way to achieve parallelism
mpi4py is an unofficial library that provides MPI bindings for Python
A communicator is a group containing all the processes that will participate in communication
A rank is the logical ID number given to a process, and querying the rank is how a process identifies itself within the communicator
Point-to-point communication is communication between two processes, where a source process sends a message to a destination process, which then has to receive it, as in the sketch below
|
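A minimal point-to-point sketch with mpi4py, assumed to be launched with something like mpirun -np 2 python script.py:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD        # communicator containing all launched processes
rank = comm.Get_rank()       # this process's logical ID within the communicator

# Point-to-point: rank 0 is the source, rank 1 the destination.
if rank == 0:
    comm.send({"greeting": "hello"}, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    print("rank 1 received", data)
```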
Non-blocking and collective communications
|
In some cases, serialisation of communication is worse than a deadlock: a deadlock is obvious, whereas you may not notice that serialised communication is degrading performance
Collective communication transmits data among all processes in a communicator, and must be called by all processes in a group, as in the sketch below
MPI is best used in C and Fortran, as in Python some function calls are either not available or suffer from poor performance
|
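A minimal sketch of a non-blocking exchange followed by a collective reduction, again assuming a launch such as mpirun -np 4 python script.py:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Non-blocking exchange of NumPy buffers with neighbouring ranks;
# useful work could be overlapped with the communication before the wait.
send_buf = np.full(10, rank, dtype=np.float64)
recv_buf = np.empty(10, dtype=np.float64)
requests = [
    comm.Isend(send_buf, dest=(rank + 1) % size),
    comm.Irecv(recv_buf, source=(rank - 1) % size),
]
MPI.Request.Waitall(requests)

# Collective communication: every rank must make the same call.
local = np.array([float(rank)])
total = np.empty(1)
comm.Allreduce(local, total, op=MPI.SUM)
if rank == 0:
    print("sum of ranks:", total[0])
```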
Dask
|
Dask allows task-based parallelisation of problems
Identifying the tasks and understanding the memory usage are crucial
Dask Array is a method to create a set of tasks automatically through operations on arrays, as in the sketch below
Tasks are created by splitting larger arrays into chunks, meaning problems larger than the available memory can be handled
|
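A minimal Dask Array sketch; each chunk of the array becomes a task in the graph:

```python
import dask.array as da

# A 10000 x 10000 array split into 1000 x 1000 chunks; each chunk is a separate task.
x = da.random.random((10000, 10000), chunks=(1000, 1000))

# Operations only build the task graph; nothing is computed yet.
y = (x + x.T).mean(axis=0)

# compute() executes the graph, scheduling chunks so the whole array
# never has to fit in memory at once.
result = y.compute()
print(result.shape)
```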
GPUs with Python
|
|