# API Design#

Overview of the application programming interface

## Factories#

A factory does not compute an FFT, but rather it makes the objects that do.
Factories are specialized for both the data representation as well as for the
hardware that will be used. For example, different factories are required for
double precision, single precision, and half precision. Also, different
factories are made when the time domain is complex (and so each data point is
represented by two floating point numbers, such as a `std::complex<float>`

)
than when the time domain is real (and each data point is represented by one).
Furthermore, a factory for AVX512 hardware is different from a factory for AVX2
hardware, and a factory that makes objects that compute with a single thread
is different from a factory that make objects that utilize OpenMP parallelism.

The type of a factory is `hpk::fft::Factory<fp_t, time_t, freq_t>`

.
Note that the data representation is part of the type, but other attributes,
such as the hardware architecture, are not.
The first template type parameter is the real floating point type used for the
computer arithmetic, e.g., `double`

. If a scaled FFT will be performed, this
will be the type of the scale factor. The second and third template type
parameters are, respectively, the types of data in the time and frequency
domains. So, for double precision complex values in the time domain, `time_t`

would be `std::complex<double>`

, and for a single precision real-valued time
domain, `time_t`

would be `float`

.

For “complex-to-complex” FFTs, it is convenient to use the type alias
`hpk::fft::FactoryCC<fp_t>`

, which has `std::complex<fp_t>`

for both
`time_t`

and `freq_t`

.
For “real-to-complex” transforms, the type alias `hpk::fft::FactoryRC<fp_t>`

has `fp_t`

for `time_t`

and has `std::complex<fp_t>`

for `freq_t`

.
These are defined in the header file `factory.hpp`

.

A concrete instance of a `Factory`

is obtained by calling
`hpk::fft::makeFactory()`

, which constructs a factory based on the function’s
template type parameters and an `hpk::Configuration`

object optionally passed
as an argument. The function returns a `std::unique_ptr`

that manages the
lifetime of the constructed factory.

It sounds terribly confusing, with a function to make factories that make objects that compute FFTs, but it’s actually not bad. For example, for double precision complex time domain:

```
auto factory = hpk::fft::makeFactory<double>();
```

This example omitted the function’s second and third template type parameters,
allowing them to default to `std::complex<fp_t>`

. Also, the `Configuration`

argument is omitted, so the hardware architecture will be detected at run time
and the factory is for OpenMP using one thread per core.

To make a sequential factory (no OpenMP, single-threaded computation):

```
auto factory =
hpk::fft::makeFactory<double>({{hpk::Parameter::threads, 1}});
```

A factory for single precision “real-to-complex” FFTs that are computed with a single thread can be made as follows:

```
hpk::Configuration cfg{{hpk::Parameter::threads, 1}};
auto factory =
hpk::fft::makeFactory<float, float, std::complex<float>>(cfg);
```

Note that after having made the factory, the Configuration object is not again
referenced. There is no “spooky action at a distance.”
It is fine to end the lifetime of `cfg`

in code such as this:

```
std::unique_ptr<hpk::fft::FactoryRC<float>> factory;
if ( /* some condition */ ) {
hpk::Configuration cfg{{hpk::Parameter::threads, 1}};
factory =
hpk::fft::makeFactory<float, float, std::complex<float>>(cfg);
} else {
// multithreaded
factory = hpk::fft::makeFactory<float, float, std::complex<float>>();
}
```

If you enjoy generic programming (and who doesn’t), the following makes a factory for double precision “complex-to-complex” FFTs:

```
std::vector<std::complex<double>> v = {1.0, 2.0, 3.0, 4.0, 5.0};
using T = decltype(v)::value_type;
auto factory = hpk::fft::makeFactory<T, T>();
```

Note that, above, `T`

is `std::complex<double>`

.

So, `makeFactory()`

constructs a `Factory<remove_complex_t<T>, T, add_complex_t<T>>`

and returns a `unique_ptr`

that owns it.
That is, the variable `factory`

above has type

```
std::unique_ptr<hpk::fft::Factory<double,
std::complex<double>,
std::complex<double>>>
```

or, using the type alias,

```
std::unique_ptr<hpk::fft::FactoryCC<double>>
```

If the requested factory cannot be constructed, an empty `unique_ptr`

is
returned. For example,

```
// Broken; specified architecture does not support half precision.
auto factory = hpk::fft::makeFactory<_Float16>({hpk::Architecture::avx2});
```

The `std::unique_ptr`

named `factory`

above is empty and dereferencing it will
cause a run time segmentation fault.

Finally, note that all of the member functions of a factory are `const`

.
From this you may correctly infer that a factory can be shared among threads
in a multithreaded environment without, for example, locking your own mutex.
For further details, please see Advanced.

## Inplace and Ooplace#

A factory makes `hpk::fft::Inplace`

and `hpk::fft::Ooplace`

objects. The
first is used to compute an FFT in-place (so the output overwrites the input)
and the second is used to compute out-of-place (i.e., retain the input and copy
the output elsewhere). In the latter case, the input and output must be
disjoint.

These two classes are each templated by the three template type parameters
we have seen, and for convenience the aliases `InplaceCC`

, `OoplaceCC`

,
`InplaceRC`

, and `OoplaceRC`

are defined in the header file `fft.hpp`

.
Objects of these types are made using the factory member functions
`makeInplace()`

and `makeOoplace()`

.
These functions return a `unique_ptr`

that owns the constructed object.
The factory can be destroyed at any time without affecting the objects it has
already made.
For example,

```
std::unique_ptr<hpk::fft::InplaceCC<double>> fft;
if ( /* some condition */ ) {
auto factory = hpk::fft::makeFactory<double>();
fft = factory->makeInplace({1024});
}
```

When calling these factory member functions, the number of points in each
dimension of the FFT must be specified. Optionally, strides (the distance
between points) may be specified as well.
An FFT’s data is described by a layout array having one element for each
dimension of the FFT.
Each element of the layout array is an object of type `hpk::fft::InplaceDim`

or `hpk::fft::OoplaceDim`

, as appropriate.
If a stride is not explicitly specified at construction, it defaults to zero.
A stride of zero is taken to mean that points are contiguous in memory with no
unnecessary padding. It is a convenient default for the most common case.

Important rules to know are:

`InplaceDim`

objects have a time stride but do not have a frequency stride, since data in the time and frequency domains use the same storage.For

`OoplaceDim`

objects, strides are specified for the time domain and for the frequency domain,*not*for a compute function’s input and output.Strides are always specified using units of real values. For example, adjacent complex points have a stride of two.

Given that an Inplace FFT writes its output to the same memory location from
which it read its input, it uses `timeStride`

exclusively, for both input and
output, and for both forward and backward directions.

The second rule allows a single Ooplace object to perform both a forward
transform (having input with `timeStride`

and output with `freqStride`

) and
also its inverse (which is a backward transform having input with `freqStride`

and output with `timeStride`

).

The third rule allows for flexibility regarding data alignment. For example, consider an array of 28 floating point numbers:

```
R0 I0 R1 I1 R2 I2 NaN R3 I3 R4 I4 R5 I5 NaN ...
R9 I9 Ra Ia Rb Ib NaN
```

where the first two values `(R0, I0)`

represent a complex number, the same for
the other numbered values, and `NaN`

is a padding value, perhaps used to mark
the end of a row.
Logically, we have a 2D sequence of complex numbers with 4 rows and 3 columns:

`R0 I0 R1 I1 R2 I2 NaN`

`R3 I3 R4 I4 R5 I5 NaN`

`R6 I6 R7 I7 R8 I8 NaN`

`R9 I9 Ra Ia Rb Ib NaN`

This 4x3 layout in the time domain can be specified with the following array:

```
hpk::fft::InplaceDim inplaceLayout[2] = {{4, 7}, {3, 2}};
```

The stride between rows (from `R0`

to `R3`

) is 7 real numbers and the stride
from `R0`

to `R1`

is 2.
Equivalently, since the points in each row are adjacent,

```
hpk::fft::InplaceDim inplaceLayout[2] = {{4, 7}, 3};
```

The observation here is that the third rule allows the stride to be an odd
number of real values, even in the case of complex-valued data.
This allows one to specify correctly the stride to access data in an array of
structures that is not padded to an even number of floating point values.
Note, for example, that although `sizeof(std::complex<float>)`

is 8,
the required alignment, `alignof(std::complex<float>)`

, is only 4.

The `inplaceLayout`

array above can be used to make an `Inplace`

object,
returned via an owning `std::unique_ptr`

, as follows:

```
auto factory = hpk::fft::makeFactory<float>();
auto fft_inp = factory->makeInplace<2>(inplaceLayout);
```

Note that a stride may be explicitly set to zero. For example, in the following:

```
hpk::fft::OoplaceDim ooplaceLayout[2] = {{4, 7, 0}, 3};
auto fft_oop = factory->makeOoplace<2>(ooplaceLayout);
```

the `freqStride`

is zero for both dimensions.
Therefore, the rows are contiguous and the points within each row are also
contiguous.
The data in the frequency domain for `fft_oop`

is 24 consecutive floats,
without any padding:

`R0 I0 R1 I1 R2 I2 R3 I3 R4 I4 R5 I5 ... R9 I9 Ra Ia Rb Ib`

Logically, this is:

`R0 I0 R1 I1 R2 I2`

`R3 I3 R4 I4 R5 I5`

`R6 I6 R7 I7 R8 I8`

`R9 I9 Ra Ia Rb Ib`

For multidimensional FFTs, padding is not supported between consecutive points in a row. That is, the stride in the last dimension must be either 0, which is the default, or (equivalently) 1 if the domain is real and 2 if the domain is complex.

To specify that a batch of transforms is to be computed, `makeInplace()`

and `makeOoplace()`

accept an optional second argument.
This is a single `hpk::fft::InplaceDim`

or `hpk::fft::OoplaceDim`

object
(not an array), where `n`

is the number of transforms in the batch,
`timeStride`

is the batch stride in the time domain, and
`freqStride`

(for `OoplaceDim`

) is the batch stride in the frequency domain.

Continuing the previous in-place example for a batch of 10 transforms with no extra padding (i.e., the first data point of each transform in the batch follows 28 floats after the first point of the previous transform):

```
auto fft_inp_b10 = factory->makeInplace<2>(inplaceLayout, 10);
```

For a batch of 10 out-of-place transforms where the first points of each transform in the batch are 32 floats apart in the time domain and 40 floats apart in the frequency domain:

```
auto fft_oop_b10 = factory->makeOoplace<2>(ooplaceLayout, {10, 32, 40});
```

It might be observed that the rank (the number of dimensions, the 2) can be automatically deduced by the compiler. So, one could have written:

```
auto fft_oop_b10 = factory->makeOoplace(ooplaceLayout, {10, 32, 40});
```

But we recommend providing the rank explicitly. To illustrate why, consider the following:

```
auto fft3 = factory->makeOoplace({20, 2, 2});
auto fft1 = factory->makeOoplace({{20, 2, 2}});
```

The first is a 3-dimensional 20x2x2 FFT, and the second is 1-dimensional (20 points) with explicit strides. Not only are these more readable with explicit template parameters, but also the compiler confirms the rank of the layout array. So, prefer:

```
auto fft3 = factory->makeOoplace<3>({20, 2, 2});
auto fft1 = factory->makeOoplace<1>({{20, 2, 2}});
```

or

```
std::unique_ptr fft3 = factory->makeOoplace<3>({20, 2, 2});
std::unique_ptr fft1 = factory->makeOoplace<1>({{20, 2, 2}});
```

On the other hand, if the rank is not known at compile time,

```
std::vector<hpk::fft::OoplaceDim> layout;
layout.emplace_back(20);
layout.emplace_back(2);
layout.emplace_back(2);
std::unique_ptr fft3 = factory->makeOoplace(layout.size(), layout.data());
std::cout << *fft3 << '\n';
layout.clear();
layout.emplace_back(20, 2, 2);
std::unique_ptr fft1 = factory->makeOoplace(layout.size(), layout.data());
std::cout << *fft1 << '\n';
```

Finally, note that `Inplace`

and `Ooplace`

objects are immutable; all member
functions are `const`

. They may be shared in a multithreaded environment
without any form of synchronization. If that is your intent, you may wish
to initialize a `shared_ptr`

as follows:

```
std::shared_ptr fft3 = factory->makeOoplace<3>({20, 2, 2});
std::shared_ptr fft1 = factory->makeOoplace<1>({{20, 2, 2}});
```

## Computing the FFT#

An `Inplace`

object has the following member functions:

`forward()`

,`scaleForward()`

`backward()`

,`scaleBackward()`

`scratchSize()`

,`scratchSizeBytes()`

`maxThreads()`

`toString()`

An `Ooplace`

object has:

`forwardCopy()`

,`scaleForwardCopy()`

`backwardCopy()`

,`scaleBackwardCopy()`

`scratchSize()`

,`scratchSizeBytes()`

`maxThreads()`

`toString()`

The first four functions of each compute an FFT, either forwards or backwards,
and either with or without a real-valued scaling factor. The mathematics is
described in the Introduction.
Note that the scale factor is a pointer-to-`const`

and the data is read-only
and never modified.
Likewise, the input argument to the out-of-place compute functions is a
pointer-to-`const`

and the data is read-only and never modified.

The compute functions are templated with type names `inout_t`

, `in_t`

, and
`out_t`

to allow the use of either real or complex types as function arguments
for complex data.
Of course, the data itself must have the correct representation.
In other words, supposing the FFT is constructed so as to compute on contiguous
complex points, the data must be a contiguous sequence of floating point
numbers (of the specified precision), first the real part of the first point,
then its imaginary part, then the real part of the second point, then its
imaginary part, and so on.
But (taking single precision as an example) the C++ type of the pointer can be
`float*`

or `std::complex<float>*`

.

In general, the compute functions require scratch memory to do their job. For performance and usability, each compute function has three overloads:

scratch memory is provided by an Allocator

the Allocator argument is omitted and an AlignedAllocator is used, or

the Allocator argument is specified in the call to the compute function

scratch memory is provided by a smart pointer such as

`std::unique_ptr<mathType>`

scratch memory is provided by a raw pointer,

`mathType*`

For the last two cases, it is the developer’s responsibility to ensure that
sufficient memory is available at the pointed to location. The minimum amount
can be determined by calling either `scratchSize()`

or `scratchSizeBytes()`

,
however, the use of `allocateScratch()`

is both convenient and recommended.
Please see Advanced for further discussion on memory allocation.

The C++ Examples are fairly self-explanatory, so won’t be repeated here. However, let’s complete the generic programming example we started above:

```
std::vector<std::complex<double>> v = {1.0, 2.0, 3.0, 4.0, 5.0};
using T = decltype(v)::value_type;
auto factory = hpk::fft::makeFactory<T, T>();
auto fft = factory->makeInplace<1>({static_cast<long>(v.size())});
fft->forward(v.data());
```

Note that in C++20, `static_cast<long>(v.size())`

can be replaced with
`std::ssize(v)`

.

Finally, if scratch memory can be reused later to avoid subsequent allocations, the last line can be replaced with:

```
auto scratch = allocateScratch(*fft);
fft->forward(v.data(), scratch);
```