
External functions

In this tutorial, we discuss how to declare and use external functions from ParPy. This is useful when we want to use specialized implementations in existing libraries (e.g., cuBLAS in CUDA), or when the functionality of ParPy is insufficient. We present two example uses of externals in Metal and CUDA. The example code is found in example/external.py in the main repository. For more examples, consider the parpy.math library implementation, found in python/parpy/math.py.

Directly calling an external function from ParPy (Metal)

Assume we want to compute the number of set bits in an array of 32-bit integers (e.g., to count the number of entries in a bitset). Counting the number of set bits is known as popcount, for which many systems have intrinsic functions. As there is no such built-in function in ParPy, we could implement a function to compute this for an integer in Python:

def bitcount_naive(v):
    counts = v & 1
    for i in range(1, 32):
        counts += (v >> i) & 1
    return counts

While we could use ParPy to convert this function to native code, it is not very efficient. Instead, we can implement this by referring to an external function. In the standard library of Metal, we can find an implementation of popcount (metal::popcount). In CUDA, we would use the __popc intrinsic. To use this in Metal, we first declare an external function:

import parpy
from parpy.types import I32

@parpy.external("metal::popcount", parpy.CompileBackend.Metal, parpy.Target.Device, header="<metal_stdlib>")
def popcount(v: I32) -> I32:
    # Naive implementation in Python
    counts = v & 1
    for i in range(1, 32):
        counts += (v >> i) & 1
    return counts

The external decorator parpy.external accepts three required arguments:

  1. The C name of the function to be called. This is the string that is used in the generated code, while the Python name of the function is used in ParPy to call the external.
  2. The ParPy backend in which this external is available.
  3. The target from which this external is called. In this case, we want to call the popcount function from parallel code, so we want it to run on the device (GPU). The next example shows why this distinction matters.

The decorator also accepts additional keyword arguments. We use the header argument to specify that the definition of the external is available in the "metal_stdlib" header. The parameter v is annotated with a type, and we also declare the return type of the function (using -> I32 after declaring all parameters of the function). Further, the body of the function is the naive implementation discussed earlier. This is optional; the code is not used in the external call, but it can be provided for documentation purposes, or to define its behavior if used outside ParPy functions (e.g., to perform validation, as we will see).

Following this declaration, we can use it like any other function in ParPy. Here, we use it to implement sum_bitcount, which sums the bitcounts of an array of integers in parallel:

import torch

N = parpy.types.shape_var()

@parpy.jit
def sum_bitcount(
        x: parpy.types.buffer(I32, [N]),
        out: parpy.types.buffer(I32, [parpy.types.literal(1)])):
    out[0] = parpy.reduce.sum(popcount(x[:N]))

N = 132
x = torch.randint(0, 2**31-1, (N,), dtype=torch.int32)
out = torch.zeros(1, dtype=torch.int32)
opts = parpy.par({'N': parpy.threads(32)})
sum_bitcount(x, out, opts=opts)
assert out[0] == sum([popcount(v) for v in x])

On the last line, we call the naive popcount implementation from Python to validate the output from ParPy.

Matrix multiplication using cuBLAS (CUDA)

In the previous tutorial, we saw a naive implementation of matrix multiplication in ParPy. However, as ParPy does not provide the ability to use advanced CUDA features like tensor cores or asynchronous shared memory, matrix multiplication cannot be implemented as efficiently in ParPy as in state-of-the-art libraries like cuBLAS. Instead, we provide the ability to use external function calls to directly rely on the implementations provided in libraries such as cuBLAS.

This tutorial walks through the steps required to implement 64-bit matrix multiplication via an external function calling cuBLAS (see GEMM in cuBLAS).

Defining the external function

We begin with the implementation of the external function that we intend to call. This function manages a cuBLAS handle, as required when calling cuBLAS functions, and invokes the appropriate matrix multiplication function with the arguments required to behave correctly for a row-major layout:

#include <cublas_v2.h>
#include <stdint.h>
#include <stdio.h>

void cublas_gemm_f64(
    int64_t M,
    int64_t N,
    int64_t K,
    double alpha,
    double *A,
    double *B,
    double beta,
    double *C
) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K, &alpha, B, N, A, K, &beta, C, N);
    cublasDestroy(handle);
}

The content shown above is stored in a file cublas_wrapper.h in the same directory as the Python file; we assume this layout throughout the tutorial. In general, the file could be placed anywhere, as long as we set the include path so that the CUDA compiler can find it.
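To see why the cublasDgemm call above swaps B and A and passes dimensions (N, M, K), consider a small pure-Python sketch (no cuBLAS required; the helper gemm_colmajor is ours, written to mimic column-major BLAS indexing). The same bytes that store a row-major M-by-K matrix also store its column-major K-by-M transpose, so computing the column-major product of B and A yields A·B in row-major layout:

```python
import random

def gemm_colmajor(m, n, k, a, lda, b, ldb, c, ldc):
    # Column-major GEMM (no transposes):
    # c[i + j*ldc] = sum_l a[i + l*lda] * b[l + j*ldb]
    for j in range(n):
        for i in range(m):
            c[i + j * ldc] = sum(a[i + l * lda] * b[l + j * ldb]
                                 for l in range(k))

random.seed(0)
M, N, K = 2, 3, 4
A = [random.random() for _ in range(M * K)]  # row-major M x K
B = [random.random() for _ in range(K * N)]  # row-major K x N
C = [0.0] * (M * N)                          # row-major M x N

# Mirror the cublasDgemm call: operands swapped, dimensions (N, M, K),
# leading dimensions N, K, and N.
gemm_colmajor(N, M, K, B, N, A, K, C, N)

# Reference: a straightforward row-major matrix product.
ref = [sum(A[i * K + l] * B[l * N + j] for l in range(K))
       for i in range(M) for j in range(N)]
assert all(abs(x - y) < 1e-12 for x, y in zip(C, ref))
```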

Declaring the external function in ParPy

The first step is to declare the external function in ParPy. To do this, we write a normal Python function definition, and add the parpy.external decorator with additional information. As explained earlier, the body of this function is ignored by ParPy, but can be used to document what the external code does (or to use it from Python).

import parpy
from parpy.types import F64, I64

M = parpy.types.shape_var()
N = parpy.types.shape_var()
K = parpy.types.shape_var()

@parpy.external("cublas_gemm_f64", parpy.CompileBackend.Cuda, parpy.Target.Host, header="<cublas_wrapper.h>")
def cublas_gemm_f64(
        M: I64,
        N: I64,
        K: I64,
        alpha: F64,
        A: parpy.types.buffer(F64, [M, K]),
        B: parpy.types.buffer(F64, [K, N]),
        beta: F64,
        C: parpy.types.buffer(F64, [M, N])):
    pass

This declaration is similar to the popcount declaration in the previous section. In this case, we specify that we want to use the CUDA backend, as cuBLAS is only available there. Further, while cuBLAS functions run on the GPU, they are called from the CPU; therefore, we declare the target as parpy.Target.Host. In addition, we specify the header in which we defined our wrapper function (cublas_wrapper.h). Importantly, the declared parameter types include ParPy buffers corresponding to the matrices. When calling the external function, the generated code passes each buffer's raw pointer.

Using the external function in ParPy

To use the declared external function from ParPy, we have to define a ParPy function that calls the external function. As a ParPy function must contain parallelism, we wrap the call to cuBLAS in a parpy.gpu block:

@parpy.jit
def cublas_gemm(
        alpha: F64,
        A: parpy.types.buffer(F64, [M, K]),
        B: parpy.types.buffer(F64, [K, N]),
        beta: F64,
        C: parpy.types.buffer(F64, [M, N])):
    with parpy.gpu:
        cublas_gemm_f64(M, N, K, alpha, A, B, beta, C)

import pathlib

import torch

M = 1024
N = 512
K = 256
A = torch.randn(M, K, dtype=torch.float64, device='cuda')
B = torch.randn(K, N, dtype=torch.float64, device='cuda')
C = torch.empty(M, N, dtype=torch.float64, device='cuda')
opts = parpy.par({})
# Add the directory of this file to the include path, so the header file is found.
opts.includes += [str(pathlib.Path(__file__).parent.resolve())]
# Add a flag to link the cuBLAS library when compiling.
opts.extra_flags += ["-lcublas"]
cublas_gemm(1.0, A, B, 0.0, C, opts=opts)
assert torch.allclose(C, A @ B, atol=1e-5)

Our wrapper function explicitly passes the shape variables as arguments to the external function, because the shapes are not available inside an external function. Under the hood, ParPy uses nvcc to compile the generated code. To be able to compile this example, we add the directory containing the header file to the include path (assuming it is stored in the same directory as this Python file). We also pass the -lcublas flag to link our code against the cuBLAS library. By passing alpha = 1.0 and beta = 0.0 as arguments to the cublas_gemm function, we get the standard matrix multiplication behavior, which we verify by comparing against the result of matrix multiplication in PyTorch (via the @ operator).
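The GEMM contract computes C ← alpha·(A·B) + beta·C. A small pure-Python illustration (the helper gemm_update is hypothetical, for exposition only) shows why alpha = 1 and beta = 0 reduce to a plain matrix product:

```python
def gemm_update(alpha, a, b, beta, c):
    # GEMM contract on row-major nested lists: C <- alpha * (A @ B) + beta * C.
    m, k, n = len(a), len(b), len(b[0])
    return [[alpha * sum(a[i][l] * b[l][j] for l in range(k)) + beta * c[i][j]
             for j in range(n)] for i in range(m)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[100.0, 100.0], [100.0, 100.0]]

# alpha = 1, beta = 0: the old contents of C are ignored entirely.
assert gemm_update(1.0, A, B, 0.0, C) == [[19.0, 22.0], [43.0, 50.0]]
# A nonzero beta would instead accumulate into the existing C.
assert gemm_update(1.0, A, B, 1.0, C) == [[119.0, 122.0], [143.0, 150.0]]
```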

Limitations

As a consequence of how ParPy is implemented, there are backend-specific limitations on how external functions can be used.

Metal

The code generated for Metal is compiled using Metal-cpp. This interface does not expose the ability to control the include path. As a consequence, device externals on Metal must directly refer to functions defined in the Metal standard library (including the namespace).

CUDA

CUDA fully supports both host and device externals. A device external must be explicitly annotated with __device__, while host externals need no annotation (though they may not be annotated with only __device__, or with __global__).

When calling a host external, buffer arguments are always provided as raw pointers to GPU data. If it is necessary to operate on the data on the CPU, the user must manually copy the data back and forth.