Advanced#

Memory allocation, constant caching, multithreading considerations, and OpenMP factories.

Memory Allocation#

The two main uses of memory in FFT libraries are for storing trigonometric constants (“twiddle factors”) and for storing temporary “scratch” values. Twiddle factors are generated when an Inplace or Ooplace object is made by a factory, and scratch memory is used during the computation of an FFT. A custom Allocator can be provided for either or both of these purposes, and the two do not have to be the same. For example, the first Allocator can allocate memory in a system-specific manner that is optimized for write-once, read-many data that is long-lived and widely shared among threads, and the second can be optimized for data that is written, read, and then discarded.

It is strongly recommended that memory be aligned to the width of both CPU vector registers and cache lines. For AVX2, the SIMD vector registers are 32 bytes wide, and for AVX-512 they are 64 bytes. If a custom Allocator is not provided, an instance of hpk::AlignedAllocator is used, which, by default, allocates 64-byte-aligned memory. This class is a header-only implementation of an Allocator that uses std::aligned_alloc() internally to obtain uninitialized storage. (This code can also serve as a useful starting point for writing your own custom Allocator, or you may prefer std::pmr::polymorphic_allocator.)
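To make the requirements concrete, here is a minimal sketch of a 64-byte-aligned allocator in the spirit of hpk::AlignedAllocator. This is an illustrative stand-in, not the library's actual header-only implementation, which may differ in details and defaults:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Sketch only: a minimal Allocator that hands out 64-byte-aligned storage.
template <class T, std::size_t Alignment = 64>
struct AlignedAllocator {
    using value_type = T;

    AlignedAllocator() noexcept = default;
    template <class U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        // std::aligned_alloc requires the size to be a multiple of the alignment.
        std::size_t bytes = (n * sizeof(T) + Alignment - 1) / Alignment * Alignment;
        if (void* p = std::aligned_alloc(Alignment, bytes)) {
            return static_cast<T*>(p);
        }
        throw std::bad_alloc{};
    }

    void deallocate(T* p, std::size_t) noexcept { std::free(p); }

    template <class U>
    bool operator==(const AlignedAllocator<U, Alignment>&) const noexcept { return true; }
    template <class U>
    bool operator!=(const AlignedAllocator<U, Alignment>&) const noexcept { return false; }
};
```

Note the rounding of the byte count: std::aligned_alloc requires the requested size to be a multiple of the alignment, a detail any custom Allocator built on it must handle.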

An Allocator can be passed as the optional last function argument to makeInplace() or makeOoplace(). It will then be used if needed to allocate memory for the twiddle factors. (Small-sized FFTs may not require a twiddle table.)

An Allocator can be passed as the optional last function argument to an FFT compute function. It will then be used to allocate scratch memory if needed. (Small-sized FFTs may not require any scratch memory.) Note that, if there are multiple computations, it is generally advisable to allocate scratch memory once and reuse it for each function call to avoid repeatedly allocating and freeing memory.

The function hpk::allocateMemory() allocates memory and returns it via a std::unique_ptr that owns it, which has the advantage that the memory will be freed automatically when the smart pointer goes out of scope. Even easier, the function allocateScratch() does the same after first calling scratchSize() to determine the number of elements to allocate. These convenient allocation functions also accept an Allocator as an optional last argument. The allocateScratch() function is demonstrated in Example #3 in fft_cc.cpp. Later examples in that file demonstrate using a raw pointer to provide stack-based scratch memory.
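The ownership pattern behind hpk::allocateMemory() can be sketched in portable C++. This is a hypothetical stand-in, not the library's actual implementation; the real function also accepts a custom Allocator, whereas this sketch hard-codes std::aligned_alloc with 64-byte alignment:

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Sketch: return aligned storage owned by a std::unique_ptr so the memory
// is freed automatically when the smart pointer goes out of scope.
template <class T>
std::unique_ptr<T[], void (*)(void*)> allocateAligned(std::size_t count) {
    // Round the byte count up to a multiple of 64, as std::aligned_alloc requires.
    std::size_t bytes = (count * sizeof(T) + 63) / 64 * 64;
    if (void* p = std::aligned_alloc(64, bytes)) {
        return {static_cast<T*>(p), std::free};
    }
    throw std::bad_alloc{};
}
```

A caller might write `auto scratch = allocateAligned<float>(n);`, pass `scratch.get()` wherever a raw pointer is needed, and rely on the destructor to release the memory, which is the same convenience the library's allocation functions provide.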

Caching#

Each factory object maintains a single-entry cache that contains the last twiddle table it generated. If the next function call to makeInplace() or makeOoplace() from the same factory requires the same twiddle constants, then the twiddle table is reused. For example, in the following:

auto fft1k_inp = factory->makeInplace({1024});
auto fft1k_oop = factory->makeOoplace({1024});

both FFT objects will share the same twiddle table. However,

auto fft1k_inp = factory->makeInplace({1024});
auto fft2k_inp = factory->makeInplace({2048});
auto fft1k_oop = factory->makeOoplace({1024});
auto fft2k_oop = factory->makeOoplace({2048});

will generate four separate tables. Had both fft1k objects been made first, then both fft2k objects, then only two tables would have been generated.
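That grouped ordering, using the same calls as the example above, looks like this:

```
auto fft1k_inp = factory->makeInplace({1024});
auto fft1k_oop = factory->makeOoplace({1024});  // cache hit: reuses the 1024 table
auto fft2k_inp = factory->makeInplace({2048});
auto fft2k_oop = factory->makeOoplace({2048});  // cache hit: reuses the 2048 table
```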

Note that, from the outside, a factory is conceptually an immutable object; all of its member functions are const. While this is true from a functional perspective, the twiddle cache means it is not quite true once memory usage and performance are considered.

Multithreading#

Inplace, Ooplace, and Factory objects are all thread safe. They may be shared among threads and their member functions called without locking or synchronization. This is true for the FFT compute objects because they are immutable and for the factory objects because, internally, the twiddle table cache is protected by a mutex. This is something to consider when deciding whether to share a factory object between threads.

The downside of sharing a factory object is that only one thread at a time can make an Inplace or Ooplace object. If each thread were to make and use its own factory, this would not be the case.

On the other hand, if multiple threads will be using a shared factory to make the same Inplace or Ooplace objects, then those objects will share the same twiddle table. This is especially beneficial if the threads share the same CPU hardware cache, since the twiddle constants are not replicated needlessly in the processor’s data cache.

In a NUMA system, one could make a factory for each NUMA domain. Then, assuming the same size FFTs are made, all the threads in one NUMA domain would share one twiddle table, but threads in a different domain would have a separate copy of the twiddle constants.

Of course, reusing an existing twiddle table (a “hit” in the factory’s twiddle cache) is faster than allocating memory and generating trigonometric constants. Under the right circumstances, this can have a performance-smoothing effect: the fastest thread calls makeInplace() first and computes the complex exponentials, then the slower threads call makeInplace() and find the results ready to use. The slower threads have less work to do.

OpenMP Factory#

Hpk supports computing FFTs using multiple threads via OpenMP. Note that one-dimensional FFTs are not currently parallelized when the batch size is one, but a batch of one-dimensional FFTs is parallelized. Multidimensional FFTs are parallelized for any batch size, and for larger transforms, Hpk can use nested parallel regions to enhance performance.

Hpk itself does not read any environment variables, but the OpenMP library does. Setting the following two environment variables is generally recommended:

OMP_PLACES=cores
OMP_MAX_ACTIVE_LEVELS=2

Hpk sets the number of threads programmatically, which effectively overrides the OMP_NUM_THREADS environment variable otherwise consulted by OpenMP. If runtime configuration is desirable, application code may wish to set the threads parameter of a Configuration based on an environment variable of one’s choosing (perhaps HPK_NUM_THREADS or some application-specific name). For example:

// Requires <cassert>, <charconv>, <cstdlib>, and <string_view>.
int numThreads = 0;  // Zero ==> detect the number of hardware cores.
if (const char* env = std::getenv("HPK_NUM_THREADS")) {
    std::string_view sv{env};
    // std::from_chars requires const char* arguments, so use data() rather
    // than begin()/end(), whose iterator type is implementation-defined.
    auto [ptr, ec] = std::from_chars(sv.data(), sv.data() + sv.size(), numThreads);
    assert(ec == std::errc());  // Better: report error to user.
}
hpk::Configuration cfg{{hpk::Parameter::threads, numThreads}};

Of course, one could have chosen to write std::getenv("OMP_NUM_THREADS") above.

Please see Configuration for more information.


Note

We always welcome feedback as to what FFT problems are of greatest interest to you (floating point precision, real or complex time domain, Inplace or Ooplace, rank, size of each dimension, and strides (if non-contiguous)), your preferred hardware platform (e.g., output of /usr/bin/lscpu), Linux version (/etc/os-release), compiler, and C++ version.

In particular, C++23 has fixed-width floating-point types, and we plan to replace _Float16 with std::float16_t in a future release.

Please don’t hesitate to share your thoughts.

support@hpkfft.com