API Design#
Overview of the application programming interface
Factories#
A factory does not compute an FFT, but rather it makes the objects that do.
Factories are specialized for both the data representation as well as for the
hardware that will be used. For example, different factories are required for
double precision, single precision, and half precision. Also, different
factories are made when the time domain is complex (and so each data point is
represented by two floating point numbers, such as a std::complex<float>
)
than when the time domain is real (and each data point is represented by one).
Furthermore, a factory for AVX512 hardware is different from a factory for AVX2
hardware, and a factory that makes objects that compute with a single thread
is different from a factory that make objects that utilize OpenMP parallelism.
The type of a factory is hpk::fft::Factory<fp_t, time_t, freq_t>
.
Note that the data representation is part of the type, but other attributes,
such as the hardware architecture, are not.
The first template type parameter is the real floating point type used for the
computer arithmetic, e.g., double
. If a scaled FFT will be performed, this
will be the type of the scale factor. The second and third template type
parameters are, respectively, the types of data in the time and frequency
domains. So, for double precision complex values in the time domain, time_t
would be std::complex<double>
, and for a single precision real-valued time
domain, time_t
would be float
.
For “complex-to-complex” FFTs, it is convenient to use the type alias
hpk::fft::FactoryCC<fp_t>
, which has std::complex<fp_t>
for both
time_t
and freq_t
.
For “real-to-complex” transforms, the type alias hpk::fft::FactoryRC<fp_t>
has fp_t
for time_t
and has std::complex<fp_t>
for freq_t
.
These are defined in the header file factory.hpp
.
A concrete instance of a Factory
is obtained by calling
hpk::fft::makeFactory()
, which constructs a factory based on the function’s
template type parameters and an hpk::Configuration
object optionally passed
as an argument. The function returns a std::unique_ptr
that manages the
lifetime of the constructed factory.
It sounds terribly confusing, with a function to make factories that make objects that compute FFTs, but it’s actually not bad. For example, for double precision complex time domain:
auto factory = hpk::fft::makeFactory<double>();
This example omitted the function’s second and third template type parameters,
allowing them to default to std::complex<fp_t>
. Also, the Configuration
argument is omitted, so the hardware architecture will be detected at run time
and the factory is for OpenMP using one thread per core.
To make a sequential factory (no OpenMP, single-threaded computation):
auto factory =
hpk::fft::makeFactory<double>({{hpk::Parameter::threads, 1}});
A factory for single precision “real-to-complex” FFTs that are computed with a single thread can be made as follows:
hpk::Configuration cfg{{hpk::Parameter::threads, 1}};
auto factory =
hpk::fft::makeFactory<float, float, std::complex<float>>(cfg);
Note that after having made the factory, the Configuration object is not again
referenced. There is no “spooky action at a distance.”
It is fine to end the lifetime of cfg
in code such as this:
std::unique_ptr<hpk::fft::FactoryRC<float>> factory;
if ( /* some condition */ ) {
hpk::Configuration cfg{{hpk::Parameter::threads, 1}};
factory =
hpk::fft::makeFactory<float, float, std::complex<float>>(cfg);
} else {
// multithreaded
factory = hpk::fft::makeFactory<float, float, std::complex<float>>();
}
If you enjoy generic programming (and who doesn’t), the following makes a factory for double precision “complex-to-complex” FFTs:
std::vector<std::complex<double>> v = {1.0, 2.0, 3.0, 4.0, 5.0};
using T = decltype(v)::value_type;
auto factory = hpk::fft::makeFactory<T, T>();
Note that, above, T
is std::complex<double>
.
So, makeFactory()
constructs a Factory<remove_complex_t<T>, T, add_complex_t<T>>
and returns a unique_ptr
that owns it.
That is, the variable factory
above has type
std::unique_ptr<hpk::fft::Factory<double,
std::complex<double>,
std::complex<double>>>
or, using the type alias,
std::unique_ptr<hpk::fft::FactoryCC<double>>
If the requested factory cannot be constructed, an empty unique_ptr
is
returned. For example,
// Broken; specified architecture does not support half precision.
auto factory = hpk::fft::makeFactory<_Float16>({hpk::Architecture::avx2});
The std::unique_ptr
named factory
above is empty and dereferencing it will
cause a run time segmentation fault.
Finally, note that all of the member functions of a factory are const
.
From this you may correctly infer that a factory can be shared among threads
in a multithreaded environment without, for example, locking your own mutex.
For further details, please see Advanced.
Inplace and Ooplace#
A factory makes hpk::fft::Inplace
and hpk::fft::Ooplace
objects. The
first is used to compute an FFT in-place (so the output overwrites the input)
and the second is used to compute out-of-place (i.e., retain the input and copy
the output elsewhere). In the latter case, the input and output must be
disjoint.
These two classes are each templated by the three template type parameters
we have seen, and for convenience the aliases InplaceCC
, OoplaceCC
,
InplaceRC
, and OoplaceRC
are defined in the header file fft.hpp
.
Objects of these types are made using the factory member functions
makeInplace()
and makeOoplace()
.
These functions return a unique_ptr
that owns the constructed object.
The factory can be destroyed at any time without affecting the objects it has
already made.
For example,
std::unique_ptr<hpk::fft::InplaceCC<double>> fft;
if ( /* some condition */ ) {
auto factory = hpk::fft::makeFactory<double>();
fft = factory->makeInplace({1024});
}
When calling these factory member functions, the number of points in each
dimension of the FFT must be specified. Optionally, strides (the distance
between points) may be specified as well.
An FFT’s data is described by a layout array having one element for each
dimension of the FFT.
Each element of the layout array is an object of type hpk::fft::InplaceDim
or hpk::fft::OoplaceDim
, as appropriate.
If a stride is not explicitly specified at construction, it defaults to zero.
A stride of zero is taken to mean that points are contiguous in memory with no
unnecessary padding. It is a convenient default for the most common case.
Important rules to know are:
InplaceDim
objects have a time stride but do not have a frequency stride, since data in the time and frequency domains use the same storage.For
OoplaceDim
objects, strides are specified for the time domain and for the frequency domain, not for a compute function’s input and output.Strides are always specified using units of real values. For example, adjacent complex points have a stride of two.
Given that an Inplace FFT writes its output to the same memory location from
which it read its input, it uses timeStride
exclusively, for both input and
output, and for both forward and backward directions.
The second rule allows a single Ooplace object to perform both a forward
transform (having input with timeStride
and output with freqStride
) and
also its inverse (which is a backward transform having input with freqStride
and output with timeStride
).
The third rule allows for flexibility regarding data alignment. For example, consider an array of 28 floating point numbers:
R0 I0 R1 I1 R2 I2 NaN R3 I3 R4 I4 R5 I5 NaN ...
R9 I9 Ra Ia Rb Ib NaN
where the first two values (R0, I0)
represent a complex number, the same for
the other numbered values, and NaN
is a padding value, perhaps used to mark
the end of a row.
Logically, we have a 2D sequence of complex numbers with 4 rows and 3 columns:
R0 I0 R1 I1 R2 I2 NaN
R3 I3 R4 I4 R5 I5 NaN
R6 I6 R7 I7 R8 I8 NaN
R9 I9 Ra Ia Rb Ib NaN
This 4x3 layout in the time domain can be specified with the following array:
hpk::fft::InplaceDim inplaceLayout[2] = {{4, 7}, {3, 2}};
The stride between rows (from R0
to R3
) is 7 real numbers and the stride
from R0
to R1
is 2.
Equivalently, since the points in each row are adjacent,
hpk::fft::InplaceDim inplaceLayout[2] = {{4, 7}, 3};
The observation here is that the third rule allows the stride to be an odd
number of real values, even in the case of complex-valued data.
This allows one to specify correctly the stride to access data in an array of
structures that is not padded to an even number of floating point values.
Note, for example, that although sizeof(std::complex<float>)
is 8,
the required alignment, alignof(std::complex<float>)
, is only 4.
The inplaceLayout
array above can be used to make an Inplace
object,
returned via an owning std::unique_ptr
, as follows:
auto factory = hpk::fft::makeFactory<float>();
auto fft_inp = factory->makeInplace<2>(inplaceLayout);
Note that a stride may be explicitly set to zero. For example, in the following:
hpk::fft::OoplaceDim ooplaceLayout[2] = {{4, 7, 0}, 3};
auto fft_oop = factory->makeOoplace<2>(ooplaceLayout);
the freqStride
is zero for both dimensions.
Therefore, the rows are contiguous and the points within each row are also
contiguous.
The data in the frequency domain for fft_oop
is 24 consecutive floats,
without any padding:
R0 I0 R1 I1 R2 I2 R3 I3 R4 I4 R5 I5 ... R9 I9 Ra Ia Rb Ib
Logically, this is:
R0 I0 R1 I1 R2 I2
R3 I3 R4 I4 R5 I5
R6 I6 R7 I7 R8 I8
R9 I9 Ra Ia Rb Ib
For multidimensional FFTs, padding is not supported between consecutive points in a row. That is, the stride in the last dimension must be either 0, which is the default, or (equivalently) 1 if the domain is real and 2 if the domain is complex.
To specify that a batch of transforms is to be computed, makeInplace()
and makeOoplace()
accept an optional second argument.
This is a single hpk::fft::InplaceDim
or hpk::fft::OoplaceDim
object
(not an array), where n
is the number of transforms in the batch,
timeStride
is the batch stride in the time domain, and
freqStride
(for OoplaceDim
) is the batch stride in the frequency domain.
Continuing the previous in-place example for a batch of 10 transforms with no extra padding (i.e., the first data point of each transform in the batch follows 28 floats after the first point of the previous transform):
auto fft_inp_b10 = factory->makeInplace<2>(inplaceLayout, 10);
For a batch of 10 out-of-place transforms where the first points of each transform in the batch are 32 floats apart in the time domain and 40 floats apart in the frequency domain:
auto fft_oop_b10 = factory->makeOoplace<2>(ooplaceLayout, {10, 32, 40});
It might be observed that the rank (the number of dimensions, the 2) can be automatically deduced by the compiler. So, one could have written:
auto fft_oop_b10 = factory->makeOoplace(ooplaceLayout, {10, 32, 40});
But we recommend providing the rank explicitly. To illustrate why, consider the following:
auto fft3 = factory->makeOoplace({20, 2, 2});
auto fft1 = factory->makeOoplace({{20, 2, 2}});
The first is a 3-dimensional 20x2x2 FFT, and the second is 1-dimensional (20 points) with explicit strides. Not only are these more readable with explicit template parameters, but also the compiler confirms the rank of the layout array. So, prefer:
auto fft3 = factory->makeOoplace<3>({20, 2, 2});
auto fft1 = factory->makeOoplace<1>({{20, 2, 2}});
or
std::unique_ptr fft3 = factory->makeOoplace<3>({20, 2, 2});
std::unique_ptr fft1 = factory->makeOoplace<1>({{20, 2, 2}});
On the other hand, if the rank is not known at compile time,
std::vector<hpk::fft::OoplaceDim> layout;
layout.emplace_back(20);
layout.emplace_back(2);
layout.emplace_back(2);
std::unique_ptr fft3 = factory->makeOoplace(layout.size(), layout.data());
std::cout << *fft3 << '\n';
layout.clear();
layout.emplace_back(20, 2, 2);
std::unique_ptr fft1 = factory->makeOoplace(layout.size(), layout.data());
std::cout << *fft1 << '\n';
Finally, note that Inplace
and Ooplace
objects are immutable; all member
functions are const
. They may be shared in a multithreaded environment
without any form of synchronization. If that is your intent, you may wish
to initialize a shared_ptr
as follows:
std::shared_ptr fft3 = factory->makeOoplace<3>({20, 2, 2});
std::shared_ptr fft1 = factory->makeOoplace<1>({{20, 2, 2}});
Computing the FFT#
An Inplace
object has the following member functions:
forward()
,scaleForward()
backward()
,scaleBackward()
scratchSize()
,scratchSizeBytes()
maxThreads()
toString()
An Ooplace
object has:
forwardCopy()
,scaleForwardCopy()
backwardCopy()
,scaleBackwardCopy()
scratchSize()
,scratchSizeBytes()
maxThreads()
toString()
The first four functions of each compute an FFT, either forwards or backwards,
and either with or without a real-valued scaling factor. The mathematics is
described in the Introduction.
Note that the scale factor is a pointer-to-const
and the data is read-only
and never modified.
Likewise, the input argument to the out-of-place compute functions is a
pointer-to-const
and the data is read-only and never modified.
The compute functions are templated with type names inout_t
, in_t
, and
out_t
to allow the use of either real or complex types as function arguments
for complex data.
Of course, the data itself must have the correct representation.
In other words, supposing the FFT is constructed so as to compute on contiguous
complex points, the data must be a contiguous sequence of floating point
numbers (of the specified precision), first the real part of the first point,
then its imaginary part, then the real part of the second point, then its
imaginary part, and so on.
But (taking single precision as an example) the C++ type of the pointer can be
float*
or std::complex<float>*
.
In general, the compute functions require scratch memory to do their job. For performance and usability, each compute function has three overloads:
scratch memory is provided by an Allocator
the Allocator argument is omitted and an AlignedAllocator is used, or
the Allocator argument is specified in the call to the compute function
scratch memory is provided by a smart pointer such as
std::unique_ptr<mathType>
scratch memory is provided by a raw pointer,
mathType*
For the last two cases, it is the developer’s responsibility to ensure that
sufficient memory is available at the pointed to location. The minimum amount
can be determined by calling either scratchSize()
or scratchSizeBytes()
,
however, the use of allocateScratch()
is both convenient and recommended.
Please see Advanced for further discussion on memory allocation.
The C++ Examples are fairly self-explanatory, so won’t be repeated here. However, let’s complete the generic programming example we started above:
std::vector<std::complex<double>> v = {1.0, 2.0, 3.0, 4.0, 5.0};
using T = decltype(v)::value_type;
auto factory = hpk::fft::makeFactory<T, T>();
auto fft = factory->makeInplace<1>({static_cast<long>(v.size())});
fft->forward(v.data());
Note that in C++20, static_cast<long>(v.size())
can be replaced with
std::ssize(v)
.
Finally, if scratch memory can be reused later to avoid subsequent allocations, the last line can be replaced with:
auto scratch = allocateScratch(*fft);
fft->forward(v.data(), scratch);