JIT-compiling a Python function
In this tutorial, we walk through how to rewrite a Python function operating on lists to a function that we can JIT-compile using ParPy. The code is available at examples/jit-compilation.py in the ParPy repository, and can be run using python examples/jit-compilation.py.
Matrix-vector multiplication
Consider the following implementation of a matrix-vector multiplication in Python operating on lists:
def mv(A, b):
    return [sum([A_val * b_val for (A_val, b_val) in zip(A_row, b)]) for A_row in A]
A = [[2.5, 3.5], [1.5, 0.5]]
b = [2.0, 1.0]
out = mv(A, b)
print("lists", out)
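As a sanity check before we start rewriting, note what the function computes for these inputs (values worked out by hand, not taken from the repository): the first entry is 2.5 * 2.0 + 3.5 * 1.0 = 8.5 and the second is 1.5 * 2.0 + 0.5 * 1.0 = 3.5. A minimal self-contained version:

```python
def mv(A, b):
    # Dot product of each matrix row with the vector b
    return [sum([A_val * b_val for (A_val, b_val) in zip(A_row, b)]) for A_row in A]

A = [[2.5, 3.5], [1.5, 0.5]]
b = [2.0, 1.0]
out = mv(A, b)
assert out == [8.5, 3.5]  # 2.5*2.0 + 3.5*1.0 = 8.5; 1.5*2.0 + 0.5*1.0 = 3.5
```

Keeping these expected values in mind makes it easy to verify that each rewrite below preserves the result.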
The mv function, as implemented above, cannot be JIT-compiled using ParPy. Below, we consider the issues one at a time, discuss how the code can be rewritten to eventually make it parallelizable, and motivate why certain rewrites are necessary.
Returning a value
A problem with the mv function implementation is that it returns its result. In ParPy, the main function (the function we call from Python) cannot return a value. Instead, the function takes output data as arguments and mutates them. ParPy is designed this way to avoid having to allocate data within a JIT-compiled function, as the ability to allocate data on the GPU is very limited.
We rewrite the function by having it mutate a pre-allocated output list (assuming N is the number of rows in A). In each iteration of the loop, we update an entry in out by summing the product of a matrix row A[row] with the corresponding values of the vector b.
def mv_inplace(A, b, out, N):
    for row in range(N):
        out[row] = sum([A_val * b_val for (A_val, b_val) in zip(A[row], b)])
N = 2
A = [[2.5, 3.5], [1.5, 0.5]]
b = [2.0, 1.0]
out = [0.0 for i in range(N)]
mv_inplace(A, b, out, N)
print("lists (no return)", out)
Python lists
The next problem with the function is that it uses Python lists. These are not supported in ParPy because they are difficult to use in native code (e.g., they can contain elements of distinct types). Similarly, ParPy does not support list comprehensions. Instead, we can rewrite the function to operate on NumPy arrays:
import numpy as np
def mv_numpy(A, b, out, N):
    for row in range(N):
        out[row] = sum(A[row] * b)
N = 2
M = 2
A = np.array([[2.5, 3.5], [1.5, 0.5]])
b = np.array([2.0, 1.0])
out = np.zeros((N,))
mv_numpy(A, b, out, N)
print("NumPy v2", out)
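To see why sum(A[row] * b) replaces the zip-based comprehension, note that NumPy's elementwise multiplication of a row with b produces the same per-element products that the comprehension built explicitly. A minimal standalone illustration using the same example data:

```python
import numpy as np

A = np.array([[2.5, 3.5], [1.5, 0.5]])
b = np.array([2.0, 1.0])

# Elementwise product of the first row with b; this replaces
# [A_val * b_val for (A_val, b_val) in zip(A[0], b)]
prod = A[0] * b
assert np.allclose(prod, [5.0, 3.5])
# Reducing the products gives the first output entry
assert float(prod.sum()) == 8.5
```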
ParPy supports arrays implementing the Array Interface Protocol and the similar CUDA Array Interface. This includes containers such as NumPy arrays, PyTorch tensors, and CuPy arrays. ParPy will use the data of a CUDA array (e.g., a PyTorch tensor placed on the GPU) without copying, while other kinds of data (e.g., NumPy arrays) are automatically copied between the CPU and the GPU.
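A quick way to check which category a container falls into is to look for the protocol attributes: arrays implementing the Array Interface Protocol expose an __array_interface__ attribute, while CUDA arrays expose __cuda_array_interface__. A small sketch using NumPy (which implements the CPU-side protocol only):

```python
import numpy as np

a = np.zeros((2, 2))
# NumPy arrays implement the Array Interface Protocol...
assert hasattr(a, '__array_interface__')
# ...but not the CUDA Array Interface, since their data lives in CPU memory
assert not hasattr(a, '__cuda_array_interface__')
```

A PyTorch tensor moved to a CUDA device would expose __cuda_array_interface__ instead, which is how ParPy can use its data without copying.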
Parallelization
At this stage, we can enable parallelization of the function by adding a decorator and labels. In addition, we rewrite the summation expression as A[row,:] * b[:] to explicitly list the dimensions we parallelize over (which we can associate with a label). The resulting code is:
import parpy
@parpy.jit
def mv_parpy(A, b, out, N):
    parpy.label('N')
    for row in range(N):
        parpy.label('M')
        out[row] = sum(A[row, :] * b[:])
When calling the JIT-compiled mv_parpy function, we have to provide the compiler options where we specify how to parallelize it (as we saw in the previous tutorial):
opts = parpy.par({'N': parpy.threads(N)})
mv_parpy(A, b, out, N, opts=opts)
print("ParPy", out)
Performance concerns
We use NumPy arrays in the call to mv_parpy. In this case, ParPy will implicitly copy data back and forth between the CPU (where the NumPy arrays are stored) and the GPU (where the parallel code executes). If we plan to perform more operations on the GPU, copying data back and forth like this is very wasteful. Instead of NumPy arrays, we could have used PyTorch tensors already stored on the GPU to avoid this overhead. However, this is only supported for the CUDA backend.
The parpy.buffer module defines ParPy buffers, which can be used to pre-allocate GPU data for any supported backend. This eliminates the need to implicitly copy data between the CPU and the GPU. They are particularly useful when benchmarking to minimize the overhead. We can pre-allocate data as:
backend = parpy.default_backend()
A = parpy.buffer.from_array(A, backend)
b = parpy.buffer.from_array(b, backend)
out = parpy.buffer.zeros((N,), parpy.types.F32, backend)
After allocating the buffers, we can call the mv_parpy function as before. Afterward, we can use the numpy method of a buffer to produce a new NumPy array containing its data. Note that the GPU executes asynchronously from the CPU. When we called mv_parpy with NumPy arrays, the call implicitly waited for the GPU execution to finish, as the results had to be copied back to the NumPy arrays before returning. With buffers, this synchronization instead happens implicitly when we invoke the numpy method. Therefore, when performing timing measurements with buffers, we have to explicitly ensure the GPU execution has finished. To do this, we use the sync method on the out buffer:
import time
t1 = time.time_ns()
mv_parpy(A, b, out, N, opts=opts)
out.sync()
t2 = time.time_ns()
print("Time:", (t2-t1)/1e9)
print("ParPy", out.numpy())