Python Bindings#

The Python module enables calling the Hpk library from Python code for transforms of NumPy or JAX arrays, PyTorch tensors, or TensorFlow EagerTensors.

The Python Download Page has Linux/x86_64 wheels for Python 3.11 and later. These include all the libhpk_* shared libraries; the C++ packages do not need to be installed. After downloading, the wheel can be installed using pip. For example, using a virtual environment:

python3 -m venv --system-site-packages ./my-venv
source ./my-venv/bin/activate
python3 -m pip install hpk-*-linux_x86_64.whl

python3  # Copy the example below using the little clipboard icon
         # and paste it at the Python interactive mode prompt.
deactivate

The introductory, in-place example is as follows:

>>> import hpk
>>> import numpy as np
>>> cfg = hpk.Configuration([(hpk.Parameter.threads, 1)])
>>> factory = hpk.fft.makeFactoryCC(np.float32, cfg)
>>> fft = factory.makeInplace([4])
>>> a = np.array([7.0, 0.0, 0.0, 0.0], dtype=np.complex64)
>>> a
array([7.+0.j, 0.+0.j, 0.+0.j, 0.+0.j], dtype=complex64)
>>> fft.forward(a)
>>> a
array([7.+0.j, 7.+0.j, 7.+0.j, 7.+0.j], dtype=complex64)

For the remainder of this document, we will omit the import commands from our examples. If you like, you can try the examples as written by starting an interactive python session with:

python3 -i -c "import hpk, numpy as np, torch"

Naturally, this assumes NumPy and PyTorch have been installed as system packages.

Data types#

The functions hpk.fft.makeFactoryCC() and hpk.fft.makeFactoryRC() require a dtype object as their first argument. By convention, it is a real type, but the corresponding complex type is accepted as well. As seen in the introductory example, the data type objects provided by a framework such as NumPy or PyTorch are recognized. A string such as "numpy.float64" or "torch.float16" is also correct. The string typically begins with a framework prefix:

Framework

numpy

torch

jax

tensorflow

and is then followed by one of:

DataType

Real

Complex

float16

complex32

float32

complex64

float64

double

complex128

Note that not all combinations are possible. For example, "numpy.complex32" is not yet supported by NumPy. However, assuming float16 hardware support, either hpk.fft.makeFactoryCC(np.float16) or hpk.fft.makeFactoryCC("numpy.float16") will work as expected on arrays of half-precision data in which the real and imaginary parts are adjacent.

>>> hpk.detectArchitecture()
Architecture.avx512fp16
>>> factory = hpk.fft.makeFactoryCC(np.float16)
>>> fft = factory.makeOoplace([4])
>>> scr = fft.allocateScratch()
>>> a = np.array([1., 2., 0., 0., 0., 0., 0., 0.], dtype=np.float16)
>>> fft.forward(a, scratch=scr)
array([1., 2., 1., 2., 1., 2., 1., 2.], dtype=float16)
>>> b = hpk.empty(8, np.float16)
>>> fft.forwardCopy(a, b, scratch=scr)
>>> b
array([1., 2., 1., 2., 1., 2., 1., 2.], dtype=float16)

The example above illustrates a general policy: when an array having a real type is provided as an argument to an FFT expecting complex data, the data is reinterpreted as complex, i.e., as consisting of interleaved real and imaginary parts. The Ooplace compute methods that return an array respect the user’s choice regarding the type:

>>> factory = hpk.fft.makeFactoryCC(np.float64)
>>> fft = factory.makeOoplace([4])
>>> scr = fft.allocateScratch()
>>> z = np.array([1.+2.j, 0., 0., 0.], dtype=np.complex128)
>>> fft.forward(z, scratch=scr)
array([1.+2.j, 1.+2.j, 1.+2.j, 1.+2.j])
>>> a = np.array([1., 2., 0., 0., 0., 0., 0., 0.], dtype=np.float64)
>>> y = fft.forward(a, scratch=scr)
>>> y
array([1., 2., 1., 2., 1., 2., 1., 2.])
>>> y.view(dtype=np.complex128)
array([1.+2.j, 1.+2.j, 1.+2.j, 1.+2.j])

If the framework is omitted, the Ooplace compute methods forward(), scaleForward(), backward() and scaleBackward() will return an object implementing the DLPack protocol.

>>> factory = hpk.fft.makeFactoryCC("double")
>>> fft = factory.makeOoplace([4])
>>> t = torch.tensor([1.+2.j, 0., 0., 0.], dtype=torch.complex128)
>>> d = fft.forward(t)
>>> d
<hpk.DlPack_complex128 object at 0x7f71d7ebd2f0>
>>> y = torch.from_dlpack(d)
>>> y
tensor([1.+2.j, 1.+2.j, 1.+2.j, 1.+2.j], dtype=torch.complex128)

This is most useful if your preferred framework is not one of the four listed above.

Array allocation#

Note that NumPy does not align arrays to cacheline (64B) boundaries:

>>> a = np.empty(1024, dtype=np.complex128)
>>> hpk.inspect(a)
Array (writable) of complex128 at 0x2942570 on device::cpu (id 0)
    ndim : 1    shape : [1024]    stride : [2]

Hpk does:

>>> b = hpk.empty(1024, dtype=np.complex128)
>>> hpk.inspect(b)
Array (writable) of complex128 at 0x2951380 on device::cpu (id 0)
    ndim : 1    shape : [1024]    stride : [2]

Therefore, we recommended allocating arrays using hpk.empty().

Strides, batches#

Hpk measures strides in terms of real values, as seen in the previous example. For some background information regarding strides, please read Inplace and Ooplace.

Each dimension of an Inplace transform is described with an InplaceDim object, which is initialized with either an int or a 2-tuple of ints. In the former case, the stride is set to zero, a convenient way to specify contiguous data. The batch, if greater than one, is assumed to be the leftmost array index, i.e., the index having the largest stride. Otherwise, specify the batch’s stride.

>>> factory = hpk.fft.makeFactoryCC(np.float32)
>>> # Each transform has 3 rows and 5 columns.  Batch size is 7.
>>> fft = factory.makeInplace([ 3     ,  5    ],  7     )
>>> #           Equivalently: [(3, 10), (5, 2)], (7, 30)
>>> x = np.ones([7, 3, 5], dtype=np.complex64)
>>> fft.forward(x)

Each dimension of an Ooplace transform is described by an OoplaceDim object, which is initialized with either an int or a 3-tuple of ints. After the length n, two strides can be specified, corresponding to the time domain and the frequency domain. Again, zero stride (either explicitly or by omission) can be used to specify contiguous data.

>>> factory = hpk.fft.makeFactoryCC(np.float32)
>>> fft = factory.makeOoplace([ 3         ,  5       ],  7         )
>>> #           Equivalently: [(3, 10, 10), (5, 2, 2)], (7, 30, 30)
>>> x = np.ones([7, 3, 5], dtype=np.complex64)
>>> y = fft.forward(x)
>>> z = hpk.empty([7, 3, 5], dtype=np.complex64)
>>> fft.forwardCopy(x, z)
>>> np.allclose(y, z)
True

If the data is padded, the Ooplace compute methods that allocate a return array will return a 1D array rather than attempting to guess the output’s layout/shape. The padding will be uninitialized. It may be easier to use forwardCopy(), etc.

>>> factory = hpk.fft.makeFactoryCC(np.float32)
>>> fft = factory.makeOoplace([(3, 10, 10)], (5, 2, 2))  # Column FFT
>>> x = np.ones([3, 5], dtype=np.complex64)
>>> y = fft.forward(x)  # All five columns transformed, y is 3x5 array
>>> y
array([[3.+0.j, 3.+0.j, 3.+0.j, 3.+0.j, 3.+0.j],
       [0.-0.j, 0.-0.j, 0.-0.j, 0.-0.j, 0.-0.j],
       [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]], dtype=complex64)
>>> fft4 = factory.makeOoplace([(3, 10, 10)], (4, 2, 2))  # 4 column batch
>>> y = fft4.forward(x)  # Four columns transformed, y is 1D array
>>> y.shape
(15,)
>>> z = np.reshape(y, [3, 5])
>>> z  # The last column is uninitialized; your results may differ.
array([[3.+0.j, 3.+0.j, 3.+0.j, 3.+0.j, 0.+0.j],
       [0.-0.j, 0.-0.j, 0.-0.j, 0.-0.j, 0.+0.j],
       [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]], dtype=complex64)
>>> z.fill(np.nan)
>>> fft4.forwardCopy(x, z)
>>> z
array([[3.+0.j, 3.+0.j, 3.+0.j, 3.+0.j, nan+0.j],
       [0.-0.j, 0.-0.j, 0.-0.j, 0.-0.j, nan+0.j],
       [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j, nan+0.j]], dtype=complex64)

Scratch memory#

Temporary storage for intermediate results is generally needed for computing an FFT transform. This memory will be allocated automatically if not provided by the user. However, if scratch memory can be used more than once, it is more efficient for the user to do the allocation.

>>> cfg = hpk.Configuration([(hpk.Parameter.threads, 1)])
>>> factory = hpk.fft.makeFactoryCC(np.float32, cfg)
>>> a = np.ones([1080, 1920], dtype=np.complex64)
>>> fft = factory.makeOoplace(a.shape)
>>> scr = fft.allocateScratch()
>>> y = fft.forward(a, scratch=scr)
>>> x = fft.scaleBackward(1.0 / (1080*1920), y, scratch=scr)
>>> np.allclose(a, x)
True

The memory provided via the keyword argument scratch must be a contiguous array, and using the allocateScratch() method is a convenient way to obtain an array of the required size. Also, the array will be aligned to a cacheline boundary, which is beneficial for performance.

The number of scratch elements that are necessary can be obtained with the FFT compute object’s scratchSize() method. If the memory provided to a compute function is too small, it will be ignored, and no harm is done.

Real time domain#

The following examples are Python versions of those found in the C++ section on Real Time Domain Data Layout.

  1. Consider a 1-dimensional time sequence of six real numbers

    >>> cfg = hpk.Configuration([(hpk.Parameter.threads, 1)])
    >>> factory = hpk.fft.makeFactoryRC(np.float64, cfg)
    >>> x = np.array([4.667, -2.643, 2.821, 1.667, 0.512, 1.976], np.float64)
    >>> np.set_printoptions(precision=2)
    
    • Ooplace

      >>> fft = factory.makeOoplace([6])  # equivalently, [(6, 1, 2)]
      >>> X = hpk.empty(8, np.float64)
      >>> fft.forwardCopy(x, X)
      >>> X
      array([9., 0., 1., 2., 5., 6., 7., 0.])
      >>> X = hpk.empty(4, np.complex128)
      >>> fft.forwardCopy(x, X)
      >>> X
      array([9.+0.j, 1.+2.j, 5.+6.j, 7.+0.j])
      >>> X = fft.forward(x)
      >>> X
      array([9., 0., 1., 2., 5., 6., 7., 0.])
      >>> X.view(np.complex128)
      array([9.+0.j, 1.+2.j, 5.+6.j, 7.+0.j])
      
    • Inplace

      >>> fft = factory.makeInplace([6])  # equivalently, [(6, 1)]
      >>> x = np.concatenate((x, [0.0, 0.0]))
      >>> fft.forward(x)
      >>> x
      array([9., 0., 1., 2., 5., 6., 7., 0.])
      
  2. Consider a 1-dimensional time sequence of seven real numbers

    >>> cfg = hpk.Configuration([(hpk.Parameter.threads, 1)])
    >>> factory = hpk.fft.makeFactoryRC(np.float64, cfg)
    >>> x = np.array([5.000, -3.766, 3.156, 0.338, 2.610, -0.792, 2.454], np.float64)
    >>> np.set_printoptions(precision=2)
    
    • Ooplace

      >>> fft = factory.makeOoplace([7])  # equivalently, [(7, 1, 2)]
      >>> X = hpk.empty(8, np.float64)
      >>> fft.forwardCopy(x, X)
      >>> X
      array([9., 0., 1., 2., 5., 6., 7., 8.])
      >>> X = hpk.empty(4, np.complex128)
      >>> fft.forwardCopy(x, X)
      >>> X
      array([9.+0.j, 1.+2.j, 5.+6.j, 7.+8.j])
      >>> X = fft.forward(x)
      >>> X
      array([9., 0., 1., 2., 5., 6., 7., 8.])
      >>> X.view(np.complex128)
      array([9.+0.j, 1.+2.j, 5.+6.j, 7.+8.j])
      
    • Inplace

      >>> fft = factory.makeInplace([7])  # equivalently, [(7, 1)]
      >>> x = np.concatenate((x, [0.0]))
      >>> fft.forward(x)
      >>> x
      array([9., 0., 1., 2., 5., 6., 7., 8.])
      
  3. Consider a three-dimensional FFT of real data having 9 slabs, each of which has 7 rows and 6 columns.

    >>> cfg = hpk.Configuration([(hpk.Parameter.threads, 1)])
    >>> factory = hpk.fft.makeFactoryRC(np.float64, cfg)
    
    • Ooplace (no padding between rows, neither in time nor in frequency)

      >>> fft = factory.makeOoplace([ 9         ,  7       ,  6       ])
      >>> #           equivalently, [(9, 42, 56), (7, 6, 8), (6, 1, 2)]
      >>> x = hpk.empty(378, dtype=np.float64)
      >>> x.fill(0.0)
      >>> x[0] = 1.0  # unit impulse
      >>> X = hpk.empty(504, dtype=np.float64)
      >>> # Hpk doesn't care about the actual shape of x or X.
      >>> fft.forwardCopy(x, X)
      >>> y = np.reshape(X.view(np.complex128), [9, 7, 4])
      
    • Inplace (2 real padding values at end of each time row)

      >>> fft = factory.makeInplace([ 9     ,  7    ,  6    ])
      >>> #           equivalently, [(9, 56), (7, 8), (6, 1)]
      >>> x = hpk.empty(504, dtype=np.float64)
      >>> x.fill(0.0)
      >>> x[0] = 1.0  # unit impulse
      >>> fft.forward(x)
      

OpenMP#

Hpk by default supports computing FFTs using multiple threads via OpenMP. To make instead a single-threaded factory, the cfg parameter should contain the entry (hpk.Parameter.threads, 1), as seen in the examples above. If so, the OpenMP library will not be loaded.

The first time a multi-threaded factory is made, Hpk loads its OpenMP-enabled shared library, which depends on libiomp5.so. For this to succeed, this OpenMP library must be installed on your system. One choice is intel-openmp. Depending on your installation, you may need to set the environment variable LD_LIBRARY_PATH. Also, it may help performance to set

OMP_PLACES=cores
OMP_MAX_ACTIVE_LEVELS=2

To verify multi-threaded capability, use the factory’s method maxThreads():

>>> factory = hpk.fft.makeFactoryCC(np.float32)
>>> factory.maxThreads()
24

The result should match the number of cores available to the Python interpreter.