Assignments outside parallel code

Consider the following code, which performs a parallel summation of the values of a provided array x and stores the result in the first element of an output array y:

import parpy
@parpy.jit
def sum_values(x, y, N):
    y[0] = 0.0
    parpy.label('N')
    for i in range(N):
        y[0] += x[i]

We use the following code to initialize the input and specify the parallelization. Note that we explicitly mark the for-loop labeled 'N' as a reduction (the ParPy compiler does not identify this automatically), and we instruct the compiler to generate code for the CUDA backend.

import numpy as np
N = 100
x = np.random.randn(N).astype(np.float32)
y = np.empty((1,), dtype=np.float32)
opts = parpy.par({'N': parpy.threads(N).reduce()})  # parallelize the loop labeled 'N' as a reduction
opts.backend = parpy.CompileBackend.Cuda            # generate code for the CUDA backend
print(parpy.print_compiled(sum_values, [x, y, N], opts))

Running this code, the ParPy compiler fails with an error of the form:

RuntimeError: Assignments are not allowed outside parallel code.

On line 4 of file /<path>/<to>/<file>/example.py:
y[0] = 0.0
^^^^

Problem

The problem is that the initialization y[0] = 0.0 is placed outside parallel code. Such statements execute on the CPU. With the CUDA backend, we cannot assign to y on the CPU because the data is allocated on the GPU, which has a separate memory space. This is not an issue on the Metal backend, which uses memory shared between the CPU and the GPU, so no copying is needed. The ParPy compiler could insert copies automatically when targeting CUDA, but such copying is costly and may result in surprisingly poor performance.
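
For comparison, here is a minimal sketch of the same driver targeting the Metal backend, where the original function compiles without changes. The exact name of the Metal variant of parpy.CompileBackend is an assumption and should be checked against the ParPy documentation.

import numpy as np
import parpy

N = 100
x = np.random.randn(N).astype(np.float32)
y = np.empty((1,), dtype=np.float32)
opts = parpy.par({'N': parpy.threads(N).reduce()})
# Assumed enum variant; Metal uses memory shared between the CPU and GPU,
# so the CPU-side assignment to y[0] is accepted.
opts.backend = parpy.CompileBackend.Metal
print(parpy.print_compiled(sum_values, [x, y, N], opts))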

Solution

It is strongly recommended to use parallel reduction operators in place of manual for-loops where applicable. In this case, we could use parpy.operators.sum to implement the summation as shown below.

import parpy
@parpy.jit
def sum_values(x, y):
    parpy.label('N')
    y[0] = parpy.operators.sum(x[:])
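
The driver changes slightly because N is no longer a parameter of sum_values. The sketch below mirrors the earlier driver and assumes the same parallelization options still apply; whether the explicit .reduce() hint is needed for the built-in reduction operator is an assumption.

import numpy as np
import parpy

N = 100
x = np.random.randn(N).astype(np.float32)
y = np.empty((1,), dtype=np.float32)
# Mirrors the earlier specification; the exact options for the built-in
# reduction may differ, so consult the ParPy documentation.
opts = parpy.par({'N': parpy.threads(N).reduce()})
opts.backend = parpy.CompileBackend.Cuda
print(parpy.print_compiled(sum_values, [x, y], opts))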

However, there are many situations where this does not apply. In those cases, we can use the parpy.gpu context manager: the ParPy compiler ensures that all code within a parpy.gpu block runs on the GPU, regardless of whether it contains any parallelization. Using this approach, the resulting code looks as follows:

import parpy
@parpy.jit
def sum_values(x, y, N):
    with parpy.gpu:
        y[0] = 0.0
        parpy.label('N')
        for i in range(N):
            y[0] += x[i]
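
The function signature is unchanged, so the driver code from the beginning of this section can be reused as-is to compile the reworked function:

import numpy as np
import parpy

N = 100
x = np.random.randn(N).astype(np.float32)
y = np.empty((1,), dtype=np.float32)
opts = parpy.par({'N': parpy.threads(N).reduce()})
opts.backend = parpy.CompileBackend.Cuda
print(parpy.print_compiled(sum_values, [x, y, N], opts))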