API Design#

Overview of the application programming interface

Factories#

A factory does not compute an FFT, but rather it makes the objects that do. Factories are specialized for both the data representation as well as for the hardware that will be used. For example, different factories are required for double precision, single precision, and half precision. Also, different factories are made when the time domain is complex (and so each data point is represented by two floating point numbers, such as a std::complex<float>) than when the time domain is real (and each data point is represented by one). Furthermore, a factory for AVX512 hardware is different from a factory for AVX2 hardware, and a factory that makes objects that compute with a single thread is different from a factory that make objects that utilize OpenMP parallelism.

The type of a factory is hpk::fft::Factory<fp_t, time_t, freq_t>. Note that the data representation is part of the type, but other attributes, such as the hardware architecture, are not. The first template type parameter is the real floating point type used for the computer arithmetic, e.g., double. If a scaled FFT will be performed, this will be the type of the scale factor. The second and third template type parameters are, respectively, the types of data in the time and frequency domains. So, for double precision complex values in the time domain, time_t would be std::complex<double>, and for a single precision real-valued time domain, time_t would be float.

For “complex-to-complex” FFTs, it is convenient to use the type alias hpk::fft::FactoryCC<fp_t>, which has std::complex<fp_t> for both time_t and freq_t. For “real-to-complex” transforms, the type alias hpk::fft::FactoryRC<fp_t> has fp_t for time_t and has std::complex<fp_t> for freq_t. These are defined in the header file factory.hpp.

A concrete instance of a Factory is obtained by calling hpk::fft::makeFactory(), which constructs a factory based on the function’s template type parameters and an hpk::Configuration object optionally passed as an argument. The function returns a std::unique_ptr that manages the lifetime of the constructed factory.

It sounds terribly confusing, with a function to make factories that make objects that compute FFTs, but it’s actually not bad. For example, for double precision complex time domain:

auto factory = hpk::fft::makeFactory<double>();

This example omitted the function’s second and third template type parameters, allowing them to default to std::complex<fp_t>. Also, the Configuration argument is omitted, so the hardware architecture will be detected at run time and the factory is for OpenMP using one thread per core.

To make a sequential factory (no OpenMP, single-threaded computation):

auto factory =
        hpk::fft::makeFactory<double>({{hpk::Parameter::threads, 1}});

A factory for single precision “real-to-complex” FFTs that are computed with a single thread can be made as follows:

hpk::Configuration cfg{{hpk::Parameter::threads, 1}};
auto factory =
        hpk::fft::makeFactory<float, float, std::complex<float>>(cfg);

Note that after having made the factory, the Configuration object is not again referenced. There is no “spooky action at a distance.” It is fine to end the lifetime of cfg in code such as this:

std::unique_ptr<hpk::fft::FactoryRC<float>> factory;
if ( /* some condition */ ) {
    hpk::Configuration cfg{{hpk::Parameter::threads, 1}};
    factory =
            hpk::fft::makeFactory<float, float, std::complex<float>>(cfg);
} else {
    // multithreaded
    factory = hpk::fft::makeFactory<float, float, std::complex<float>>();
}

If you enjoy generic programming (and who doesn’t), the following makes a factory for double precision “complex-to-complex” FFTs:

std::vector<std::complex<double>> v = {1.0, 2.0, 3.0, 4.0, 5.0};
using T = decltype(v)::value_type;
auto factory = hpk::fft::makeFactory<T, T>();

Note that, above, T is std::complex<double>.

So, makeFactory() constructs a Factory<remove_complex_t<T>, T, add_complex_t<T>> and returns a unique_ptr that owns it. That is, the variable factory above has type

std::unique_ptr<hpk::fft::Factory<double,
                                  std::complex<double>,
                                  std::complex<double>>>

or, using the type alias,

std::unique_ptr<hpk::fft::FactoryCC<double>>

If the requested factory cannot be constructed, an empty unique_ptr is returned. For example,

// Broken; specified architecture does not support half precision.
auto factory = hpk::fft::makeFactory<_Float16>({hpk::Architecture::avx2});

The std::unique_ptr named factory above is empty and dereferencing it will cause a run time segmentation fault.

Finally, note that all of the member functions of a factory are const. From this you may correctly infer that a factory can be shared among threads in a multithreaded environment without, for example, locking your own mutex. For further details, please see Advanced.

Inplace and Ooplace#

A factory makes hpk::fft::Inplace and hpk::fft::Ooplace objects. The first is used to compute an FFT in-place (so the output overwrites the input) and the second is used to compute out-of-place (i.e., retain the input and copy the output elsewhere). In the latter case, the input and output must be disjoint.

These two classes are each templated by the three template type parameters we have seen, and for convenience the aliases InplaceCC, OoplaceCC, InplaceRC, and OoplaceRC are defined in the header file fft.hpp. Objects of these types are made using the factory member functions makeInplace() and makeOoplace(). These functions return a unique_ptr that owns the constructed object. The factory can be destroyed at any time without affecting the objects it has already made. For example,

std::unique_ptr<hpk::fft::InplaceCC<double>> fft;
if ( /* some condition */ ) {
    auto factory = hpk::fft::makeFactory<double>();
    fft = factory->makeInplace({1024});
}

When calling these factory member functions, the number of points in each dimension of the FFT must be specified. Optionally, strides (the distance between points) may be specified as well. An FFT’s data is described by a layout array having one element for each dimension of the FFT. Each element of the layout array is an object of type hpk::fft::InplaceDim or hpk::fft::OoplaceDim, as appropriate. If a stride is not explicitly specified at construction, it defaults to zero. A stride of zero is taken to mean that points are contiguous in memory with no unnecessary padding. It is a convenient default for the most common case.

Important rules to know are:

InplaceDim objects have a time stride but do not have a frequency stride, since data in the time and frequency domains use the same storage.
For OoplaceDim objects, strides are specified for the time domain and for the frequency domain, not for a compute function’s input and output.
Strides are always specified using units of real values. For example, adjacent complex points have a stride of two.

Given that an Inplace FFT writes its output to the same memory location from which it read its input, it uses timeStride exclusively, for both input and output, and for both forward and backward directions.

The second rule allows a single Ooplace object to perform both a forward transform (having input with timeStride and output with freqStride) and also its inverse (which is a backward transform having input with freqStride and output with timeStride).

The third rule allows for flexibility regarding data alignment. For example, consider an array of 28 floating point numbers:

R0 I0 R1 I1 R2 I2 NaN R3 I3 R4 I4 R5 I5 NaN ...
R9 I9 Ra Ia Rb Ib NaN

where the first two values (R0, I0) represent a complex number, the same for the other numbered values, and NaN is a padding value, perhaps used to mark the end of a row. Logically, we have a 2D sequence of complex numbers with 4 rows and 3 columns:

   R0 I0  R1 I1  R2 I2  NaN
   R3 I3  R4 I4  R5 I5  NaN
   R6 I6  R7 I7  R8 I8  NaN
   R9 I9  Ra Ia  Rb Ib  NaN

This 4x3 layout in the time domain can be specified with the following array:

hpk::fft::InplaceDim inplaceLayout[2] = {{4, 7}, {3, 2}};

The stride between rows (from R0 to R3) is 7 real numbers and the stride from R0 to R1 is 2. Equivalently, since the points in each row are adjacent,

hpk::fft::InplaceDim inplaceLayout[2] = {{4, 7}, 3};

The observation here is that the third rule allows the stride to be an odd number of real values, even in the case of complex-valued data. This allows one to specify correctly the stride to access data in an array of structures that is not padded to an even number of floating point values. Note, for example, that although sizeof(std::complex<float>) is 8, the required alignment, alignof(std::complex<float>), is only 4.

The inplaceLayout array above can be used to make an Inplace object, returned via an owning std::unique_ptr, as follows:

auto factory = hpk::fft::makeFactory<float>();
auto fft_inp = factory->makeInplace<2>(inplaceLayout);

Note that a stride may be explicitly set to zero. For example, in the following:

hpk::fft::OoplaceDim ooplaceLayout[2] = {{4, 7, 0}, 3};
auto fft_oop = factory->makeOoplace<2>(ooplaceLayout);

the freqStride is zero for both dimensions. Therefore, the rows are contiguous and the points within each row are also contiguous. The data in the frequency domain for fft_oop is 24 consecutive floats, without any padding:

R0 I0 R1 I1 R2 I2 R3 I3 R4 I4 R5 I5 ... R9 I9 Ra Ia Rb Ib

Logically, this is:

   R0 I0  R1 I1  R2 I2
   R3 I3  R4 I4  R5 I5
   R6 I6  R7 I7  R8 I8
   R9 I9  Ra Ia  Rb Ib

For multidimensional FFTs, padding is not supported between consecutive points in a row. That is, the stride in the last dimension must be either 0, which is the default, or (equivalently) 1 if the domain is real and 2 if the domain is complex.

To specify that a batch of transforms is to be computed, makeInplace() and makeOoplace() accept an optional second argument. This is a single hpk::fft::InplaceDim or hpk::fft::OoplaceDim object (not an array), where n is the number of transforms in the batch, timeStride is the batch stride in the time domain, and freqStride (for OoplaceDim) is the batch stride in the frequency domain.

Continuing the previous in-place example for a batch of 10 transforms with no extra padding (i.e., the first data point of each transform in the batch follows 28 floats after the first point of the previous transform):

auto fft_inp_b10 = factory->makeInplace<2>(inplaceLayout, 10);

For a batch of 10 out-of-place transforms where the first points of each transform in the batch are 32 floats apart in the time domain and 40 floats apart in the frequency domain:

auto fft_oop_b10 = factory->makeOoplace<2>(ooplaceLayout, {10, 32, 40});

It might be observed that the rank (the number of dimensions, the 2) can be automatically deduced by the compiler. So, one could have written:

auto fft_oop_b10 = factory->makeOoplace(ooplaceLayout, {10, 32, 40});

But we recommend providing the rank explicitly. To illustrate why, consider the following:

auto fft3 = factory->makeOoplace({20, 2, 2});
auto fft1 = factory->makeOoplace({{20, 2, 2}});

The first is a 3-dimensional 20x2x2 FFT, and the second is 1-dimensional (20 points) with explicit strides. Not only are these more readable with explicit template parameters, but also the compiler confirms the rank of the layout array. So, prefer:

auto fft3 = factory->makeOoplace<3>({20, 2, 2});
auto fft1 = factory->makeOoplace<1>({{20, 2, 2}});

std::unique_ptr fft3 = factory->makeOoplace<3>({20, 2, 2});
std::unique_ptr fft1 = factory->makeOoplace<1>({{20, 2, 2}});

On the other hand, if the rank is not known at compile time,

std::vector<hpk::fft::OoplaceDim> layout;
layout.emplace_back(20);
layout.emplace_back(2);
layout.emplace_back(2);
std::unique_ptr fft3 = factory->makeOoplace(layout.size(), layout.data());
std::cout << *fft3 << '\n';

layout.clear();
layout.emplace_back(20, 2, 2);
std::unique_ptr fft1 = factory->makeOoplace(layout.size(), layout.data());
std::cout << *fft1 << '\n';

Finally, note that Inplace and Ooplace objects are immutable; all member functions are const. They may be shared in a multithreaded environment without any form of synchronization. If that is your intent, you may wish to initialize a shared_ptr as follows:

std::shared_ptr fft3 = factory->makeOoplace<3>({20, 2, 2});
std::shared_ptr fft1 = factory->makeOoplace<1>({{20, 2, 2}});

Computing the FFT#

An Inplace object has the following member functions:

forward(), scaleForward()
backward(), scaleBackward()
scratchSize(), scratchSizeBytes()
maxThreads()
toString()

An Ooplace object has:

forwardCopy(), scaleForwardCopy()
backwardCopy(), scaleBackwardCopy()
scratchSize(), scratchSizeBytes()
maxThreads()
toString()

The first four functions of each compute an FFT, either forwards or backwards, and either with or without a real-valued scaling factor. The mathematics is described in the Introduction. Note that the scale factor is a pointer-to-const and the data is read-only and never modified. Likewise, the input argument to the out-of-place compute functions is a pointer-to-const and the data is read-only and never modified.

The compute functions are templated with type names inout_t, in_t, and out_t to allow the use of either real or complex types as function arguments for complex data. Of course, the data itself must have the correct representation. In other words, supposing the FFT is constructed so as to compute on contiguous complex points, the data must be a contiguous sequence of floating point numbers (of the specified precision), first the real part of the first point, then its imaginary part, then the real part of the second point, then its imaginary part, and so on. But (taking single precision as an example) the C++ type of the pointer can be float* or std::complex<float>*.

In general, the compute functions require scratch memory to do their job. For performance and usability, each compute function has three overloads:

scratch memory is provided by an Allocator
- the Allocator argument is omitted and an AlignedAllocator is used, or
- the Allocator argument is specified in the call to the compute function
scratch memory is provided by a smart pointer such as std::unique_ptr<mathType>
scratch memory is provided by a raw pointer, mathType*

For the last two cases, it is the developer’s responsibility to ensure that sufficient memory is available at the pointed to location. The minimum amount can be determined by calling either scratchSize() or scratchSizeBytes(), however, the use of allocateScratch() is both convenient and recommended. Please see Advanced for further discussion on memory allocation.

The C++ Examples are fairly self-explanatory, so won’t be repeated here. However, let’s complete the generic programming example we started above:

std::vector<std::complex<double>> v = {1.0, 2.0, 3.0, 4.0, 5.0};
using T = decltype(v)::value_type;
auto factory = hpk::fft::makeFactory<T, T>();
auto fft = factory->makeInplace<1>({static_cast<long>(v.size())});
fft->forward(v.data());

Note that in C++20, static_cast<long>(v.size()) can be replaced with std::ssize(v).

Finally, if scratch memory can be reused later to avoid subsequent allocations, the last line can be replaced with:

auto scratch = allocateScratch(*fft);
fft->forward(v.data(), scratch);