Python Bindings#

The Python module enables calling the Hpk library from Python code for transforms of NumPy or JAX arrays, PyTorch tensors, or TensorFlow EagerTensors.

The Python Download Page has wheels for Python 3.11 and later (on Linux only). These include all the libhpk_* shared libraries; the C++ packages do not need to be installed. After downloading, the wheel can be installed using pip. For example, using a virtual environment:

python3 -m venv --system-site-packages ./my-venv
source ./my-venv/bin/activate
python3 -m pip install hpk-*-linux_x86_64.whl

python3  # Copy the example below using the little clipboard icon
         # and paste it at the Python interactive mode prompt.

deactivate

An introductory, in-place example using NumPy is as follows:

>>> import hpk
>>> import numpy as np
>>> cfg = hpk.Configuration([(hpk.Parameter.threads, 1)])
>>> factory = hpk.fft.makeFactoryCC(np.float32, cfg)
>>> fft = factory.makeInplace([4])
>>> a = np.array([7.0, 0.0, 0.0, 0.0], dtype=np.complex64)
>>> a
array([7.+0.j, 0.+0.j, 0.+0.j, 0.+0.j], dtype=complex64)
>>> fft.forward(a)
>>> a
array([7.+0.j, 7.+0.j, 7.+0.j, 7.+0.j], dtype=complex64)

Or, using PyTorch:

>>> import hpk
>>> import torch
>>> cfg = hpk.Configuration([(hpk.Parameter.threads, 1)])
>>> factory = hpk.fft.makeFactoryCC(torch.float32, cfg)
>>> fft = factory.makeInplace([4])
>>> t = torch.tensor([7.0, 0.0, 0.0, 0.0], dtype=torch.complex64)
>>> t
tensor([7.+0.j, 0.+0.j, 0.+0.j, 0.+0.j])
>>> fft.forward(t)
>>> t
tensor([7.+0.j, 7.+0.j, 7.+0.j, 7.+0.j])

For the remainder of this page, we will use NumPy and omit the import commands from our examples. If you like, you can try the examples as written by starting an interactive python session with:

python3 -i -c "import hpk, numpy as np"

Naturally, this assumes NumPy has been installed as a system package.

Data types#

The functions hpk.fft.makeFactoryCC() and hpk.fft.makeFactoryRC() require a dtype object as their first argument. By convention, it is a real type, but the corresponding complex type is accepted as well. As seen in the introductory example, the data type objects provided by a framework such as NumPy or PyTorch are recognized.

A string such as "numpy.float64" or "torch.float16" is also correct. The string typically begins with a framework prefix:

Framework
`numpy`
`jax`
`torch`
`tensorflow`

and is then followed by one of:

DataType
Real	Complex
`float16`	`complex32`
`float32`	`complex64`
`float64` `double`	`complex128`

Note that not all combinations are possible. For example, "numpy.complex32" is not yet supported by NumPy. However, assuming float16 hardware support, either hpk.fft.makeFactoryCC(np.float16) or hpk.fft.makeFactoryCC("numpy.float16") will work as expected on arrays of half-precision data in which the real and imaginary parts are adjacent.

>>> hpk.detectArchitecture()
Architecture.avx512fp16
>>> factory = hpk.fft.makeFactoryCC(np.float16)
>>> fft = factory.makeOoplace([4])
>>> scratch = fft.allocateScratch()
>>> a = np.array([1., 2., 0., 0., 0., 0., 0., 0.], dtype=np.float16)
>>> y = fft.forward(a, scratch)
>>> y
array([1., 2., 1., 2., 1., 2., 1., 2.], dtype=float16)
>>> z = hpk.empty(8, np.float16)
>>> fft.forwardCopy(a, z, scratch)
>>> z
array([1., 2., 1., 2., 1., 2., 1., 2.], dtype=float16)

The example above illustrates a general policy: when an array having a real type is provided as an argument to an FFT expecting complex data, the data is reinterpreted as complex, i.e., as consisting of interleaved real and imaginary parts. The Ooplace compute methods that return an array respect the user’s choice regarding the type:

>>> factory = hpk.fft.makeFactoryCC(np.float64)
>>> fft = factory.makeOoplace([4])
>>> scratch = fft.allocateScratch()
>>> a = np.array([1.+2.j, 0., 0., 0.], dtype=np.complex128)
>>> y = fft.forward(a, scratch)
>>> y
array([1.+2.j, 1.+2.j, 1.+2.j, 1.+2.j])
>>> a = np.array([1., 2., 0., 0., 0., 0., 0., 0.], dtype=np.float64)
>>> y = fft.forward(a, scratch)
>>> y
array([1., 2., 1., 2., 1., 2., 1., 2.])
>>> y.view(dtype=np.complex128)
array([1.+2.j, 1.+2.j, 1.+2.j, 1.+2.j])

If the framework is omitted, the Ooplace compute methods forward(), scaleForward(), backward() and scaleBackward() will return an object implementing the DLPack protocol.

>>> factory = hpk.fft.makeFactoryCC("double")
>>> fft = factory.makeOoplace([4])
>>> a = np.array([1.0+2.0j, 0.0, 0.0, 0.0], dtype=np.complex128)
>>> d = fft.forward(a)
>>> d
<hpk.DlPack_complex128 object at 0x7fc6109fb0f0>
>>> z = np.from_dlpack(d)
>>> z
array([1.+2.j, 1.+2.j, 1.+2.j, 1.+2.j])

This is most useful if your preferred framework is not one of the four listed above.

Array allocation#

Note that NumPy does not align arrays to cacheline (64B) boundaries:

>>> a = np.empty(1024, dtype=np.complex128)
>>> hpk.inspect(a)
Array (writable) of complex128 at 0x2942570 on device='cpu' (id 0)
    ndim : 1    shape : [1024]    stride : [2]

Hpk does:

>>> b = hpk.empty(1024, dtype=np.complex128)
>>> hpk.inspect(b)
Array (writable) of complex128 at 0x2951380 on device='cpu' (id 0)
    ndim : 1    shape : [1024]    stride : [2]

Therefore, we recommended allocating arrays using hpk.empty().

Strides, batches#

Hpk measures strides in terms of real values, as seen in the previous example. For some background information regarding strides, please read Inplace and Ooplace.

Each dimension of an Inplace transform is described with an InplaceDim object, which is initialized with either an int or a 2-tuple of ints. In the former case, the stride is set to zero, a convenient way to specify contiguous data. The batch, if greater than one, is assumed to be the leftmost array index, i.e., the index having the largest stride. Otherwise, specify the batch’s stride.

>>> factory = hpk.fft.makeFactoryCC(np.float32)
>>> # Each transform has 3 rows and 5 columns.  Batch size is 7.
>>> fft = factory.makeInplace([ 3     ,  5    ],  7     )
>>> #           Equivalently: [(3, 10), (5, 2)], (7, 30)
>>> x = np.ones([7, 3, 5], dtype=np.complex64)
>>> fft.forward(x)

Each dimension of an Ooplace transform is described by an OoplaceDim object, which is initialized with either an int or a 3-tuple of ints. After the length n, two strides can be specified, corresponding to the time domain and the frequency domain. Again, zero stride (either explicitly or by omission) can be used to specify contiguous data.

>>> factory = hpk.fft.makeFactoryCC(np.float32)
>>> fft = factory.makeOoplace([ 3         ,  5       ],  7         )
>>> #           Equivalently: [(3, 10, 10), (5, 2, 2)], (7, 30, 30)
>>> x = np.ones([7, 3, 5], dtype=np.complex64)
>>> y = fft.forward(x)
>>> z = hpk.empty([7, 3, 5], dtype=np.complex64)
>>> fft.forwardCopy(x, z)
>>> np.allclose(y, z)
True

If the data is padded, the Ooplace compute methods that allocate a return array will return a 1D array rather than attempting to guess the output’s layout/shape. The padding will be uninitialized. It may be easier to use forwardCopy(), etc.

>>> factory = hpk.fft.makeFactoryCC(np.float32)
>>> fft = factory.makeOoplace([(3, 10, 10)], (5, 2, 2))  # Column FFT
>>> x = np.ones([3, 5], dtype=np.complex64)
>>> y = fft.forward(x)  # All five columns transformed, y is 3x5 array
>>> y
array([[3.+0.j, 3.+0.j, 3.+0.j, 3.+0.j, 3.+0.j],
       [0.-0.j, 0.-0.j, 0.-0.j, 0.-0.j, 0.-0.j],
       [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]], dtype=complex64)
>>> fft4 = factory.makeOoplace([(3, 10, 10)], (4, 2, 2))  # 4 column batch
>>> y = fft4.forward(x)  # First four columns transformed, y is 1D array
>>> y.shape
(15,)
>>> z = np.reshape(y, [3, 5])
>>> z  # The last column is uninitialized; your results may differ.
array([[3.+0.j, 3.+0.j, 3.+0.j, 3.+0.j, 0.+0.j],
       [0.-0.j, 0.-0.j, 0.-0.j, 0.-0.j, 0.+0.j],
       [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]], dtype=complex64)
>>> z.fill(np.nan + 7.0j)
>>> fft4.forwardCopy(x, z)
>>> z
array([[3.+0.j, 3.+0.j, 3.+0.j, 3.+0.j, nan+7.j],
       [0.-0.j, 0.-0.j, 0.-0.j, 0.-0.j, nan+7.j],
       [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j, nan+7.j]], dtype=complex64)

Scratch memory#

Temporary storage for intermediate results is generally needed for computing an FFT transform. This memory will be allocated automatically if not provided by the user. However, if scratch memory can be used more than once, it is more efficient for the user to do the allocation.

>>> cfg = hpk.Configuration([(hpk.Parameter.threads, 1)])
>>> factory = hpk.fft.makeFactoryCC(np.float32, cfg)
>>> a = np.ones([1080, 1920], dtype=np.complex64)
>>> fft = factory.makeOoplace(a.shape)
>>> scratch = fft.allocateScratch()
>>> y = fft.forward(a, scratch)
>>> x = fft.scaleBackward(1.0 / (1080*1920), y, scratch)
>>> np.allclose(a, x)
True

The memory provided as scratch space must be a contiguous array, and using the allocateScratch() method is a convenient way to obtain an array of the required size. Also, the array will be aligned to a cacheline boundary, which is beneficial for performance.

The number of real-valued elements that are needed for scratch can be obtained with the FFT compute object’s scratchSize() method. If the scratch memory provided to a compute function is too small, it will be ignored, and no harm is done.

Real time domain#

The following examples are Python versions of those found in the C++ section on Real Time Domain Data Layout.

Consider a 1-dimensional time sequence of six real numbers

>>> cfg = hpk.Configuration([(hpk.Parameter.threads, 1)])
>>> factory = hpk.fft.makeFactoryRC(np.float64, cfg)
>>> x = np.array([4.667, -2.643, 2.821, 1.667, 0.512, 1.976], np.float64)
>>> np.set_printoptions(precision=2)

Ooplace

>>> fft = factory.makeOoplace([6])  # equivalently, [(6, 1, 2)]
>>> X = hpk.empty(8, np.float64)
>>> fft.forwardCopy(x, X)
>>> X
array([9., 0., 1., 2., 5., 6., 7., 0.])
>>> X = hpk.empty(4, np.complex128)
>>> fft.forwardCopy(x, X)
>>> X
array([9.+0.j, 1.+2.j, 5.+6.j, 7.+0.j])
>>> X = fft.forward(x)
>>> X
array([9., 0., 1., 2., 5., 6., 7., 0.])
>>> X.view(np.complex128)
array([9.+0.j, 1.+2.j, 5.+6.j, 7.+0.j])

Inplace

>>> fft = factory.makeInplace([6])  # equivalently, [(6, 1)]
>>> x = np.concatenate((x, [0.0, 0.0]))
>>> fft.forward(x)
>>> x
array([9., 0., 1., 2., 5., 6., 7., 0.])

Consider a 1-dimensional time sequence of seven real numbers

>>> cfg = hpk.Configuration([(hpk.Parameter.threads, 1)])
>>> factory = hpk.fft.makeFactoryRC(np.float64, cfg)
>>> x = np.array([5.000, -3.766, 3.156, 0.338, 2.610, -0.792, 2.454], np.float64)
>>> np.set_printoptions(precision=2)

Ooplace

>>> fft = factory.makeOoplace([7])  # equivalently, [(7, 1, 2)]
>>> X = hpk.empty(8, np.float64)
>>> fft.forwardCopy(x, X)
>>> X
array([9., 0., 1., 2., 5., 6., 7., 8.])
>>> X = hpk.empty(4, np.complex128)
>>> fft.forwardCopy(x, X)
>>> X
array([9.+0.j, 1.+2.j, 5.+6.j, 7.+8.j])
>>> X = fft.forward(x)
>>> X
array([9., 0., 1., 2., 5., 6., 7., 8.])
>>> X.view(np.complex128)
array([9.+0.j, 1.+2.j, 5.+6.j, 7.+8.j])

Inplace

>>> fft = factory.makeInplace([7])  # equivalently, [(7, 1)]
>>> x = np.concatenate((x, [0.0]))
>>> fft.forward(x)
>>> x
array([9., 0., 1., 2., 5., 6., 7., 8.])

Consider a three-dimensional FFT of real data having 9 slabs, each of which has 7 rows and 6 columns.

>>> cfg = hpk.Configuration([(hpk.Parameter.threads, 1)])
>>> factory = hpk.fft.makeFactoryRC(np.float64, cfg)

Ooplace (no padding between rows, neither in time nor in frequency)

>>> fft = factory.makeOoplace([ 9         ,  7       ,  6       ])
>>> #           equivalently, [(9, 42, 56), (7, 6, 8), (6, 1, 2)]
>>> x = hpk.empty(378, dtype=np.float64)
>>> x.fill(0.0)
>>> x[0] = 1.0  # unit impulse
>>> X = hpk.empty(504, dtype=np.float64)
>>> # Hpk doesn't care about the actual shape of x or X.
>>> fft.forwardCopy(x, X)
>>> y = np.reshape(X.view(np.complex128), [9, 7, 4])

Inplace (2 real padding values at end of each time row)

>>> fft = factory.makeInplace([ 9     ,  7    ,  6    ])
>>> #           equivalently, [(9, 56), (7, 8), (6, 1)]
>>> x = hpk.empty([9, 7, 8], dtype=np.float64)  # includes padding
>>> x.fill(0.0)
>>> x[0, 0, 0] = 1.0  # unit impulse
>>> fft.forward(x)

OpenMP#

Hpk by default supports computing FFTs using multiple threads via OpenMP. To make instead a single-threaded factory, the cfg parameter should contain the entry (hpk.Parameter.threads, 1), as seen in the examples above. If so, the OpenMP-enabled FFT library will not be loaded.

The first time a multithreaded factory is made, Hpk will load its OpenMP-enabled FFT factory library, which depends on the following:

Architecture	OpenMP library
`x86_64`	`libiomp5.so`
`aarch64`	`libomp.so`

The OpenMP library from the table above must be installed on your system. For x86_64, one choice is intel-openmp. Depending on your system configuration, you may need to set the environment variable LD_LIBRARY_PATH to the installation directory.

Also, it may help performance to set the environment variables:

OMP_PLACES=cores
OMP_MAX_ACTIVE_LEVELS=2

To verify multithreaded capability, call the factory’s method maxThreads().

>>> factory = hpk.fft.makeFactoryCC(np.float32)
>>> factory.maxThreads()
12

The result should match the number of cores available to the Python interpreter.