Faster Fast Fourier Transforms in Python¶

A little while ago, we were approached by a researcher from the School of Mathematical Sciences with the classic request of "I'd like my code to run more quickly". They were simulating a ball bouncing around a billiard table over the course of millions of collisions and analysing patterns in the path of the ball (this type of problem is known generally as dynamical billiards).

Fortunately for me, as is often the case when researchers are already HPC users, I didn't need to write, or even greatly rewrite, very much of the code. What landed in my lap already worked, had been partially compiled with Cython and profiled to find the bottlenecks. The main culprits in the long run times were the frequent fast Fourier transforms and inverse fast Fourier transforms that were a key part of the analysis (to calculate auto-correlation via the Wiener-Kinchin theorem).

Background¶

As a quick reminder, a Fourier transform takes a signal that varies with time (or space) and shows you the temporal (or spatial) frequencies contained in that signal. If, as with our case, the signal is an array of samples (rather than a truly continuous function) then we use a discrete Fourier transform (DFT), though you will often find the "discrete" dropped when it is obvious from the context. The DFT formula is

\[ X_{k} = \sum _{n=0} ^{N-1}x_{n}e^{-i2\pi kn /N} \hspace{1cm} k = 0, ..., N - 1 \]

A direct/naive approach to calculating a DFT will give you an O(N²) algorithm (doubling the number of samples will quadruple the length of time taken to complete the algorithm). Fortunately, there are a family of algorithms known as fast Fourier transforms (FFTs) that are O(N log N), which make DFTs on very long samples computationally feasible.

Since FFTs are so useful and ubiquitous, they are included in the major Python numerical and scientific libraries, Numpy and Scipy. The billiard codebase already used Numpy for FFTs so that will be our basis for comparison when we try to improve upon it, below.

A simplified model¶

We shall imagine, for the sake of argument, that our temporal signal is an array of values representing the direction of travel of an ideal billiard, bouncing around a confined space, sampled at regular intervals. To keep things very simple, we could map the points of the compass to the values 0-1 so that [0, 0, 0.25, 0.25, 0.88] would be two samples where the billiard was moving due west, two samples where the billiard was moving due north and a sample moving south-east.

In reality, the array of samples came from Cython code that simulated the billiard's movement. For our examples, though, we will use Numpy's rand() function as it is a more convenient source of input data.

# In practice, we may have called
from functions import simulate_billiard
sample_array = simulate_billiard(time=10**7)

# For speed comparisons, it is simpler to get 10^7 samples from 0.0 to 1.0 with
import numpy as np
sample_array = np.random.rand(10**7)

Note that the speed of our Fourier transform shouldn't be affected by the values themselves, though the number and precision of values do matter (as we shall see later).

Array length¶

The most commonly used FFT is the Cooley-Tukey algorithm, which recursively breaks down an input of size N into smaller FFTs. The most straightforward case is when N is a power of 2. This first leads to two smaller FFTs on arrays of length N/2, which can be decomposed in the same fashion, leading to four FFTs on inputs of length N/4 and so on to arrays of length 1.

Though FFTs don't require power-of-two inputs, Numpy does perform much better with them. If you run the following, you will find that the slightly longer array is much quicker than the slightly shorter one because it is an exact power of 2 in length.

import numpy as np
import time

for array_length in (2**22, (2**22)-1):
    total_time = 0

    # Do the FFT 10 times, to allow for some random error
    for i in range(10):
        in_array = np.random.rand(array_length)
        start = time.perf_counter()
        out_array = np.fft.fft(in_array)
        total_time += time.perf_counter() - start

    print("array_length:", array_length,
          "total_time:", total_time)

This is easy to exploit in a situation like ours where we are conducting a simulation and are free to choose the sample length.

This brings us more or less to the code as it was when I came across it: Numpy's fft function was being applied to arrays of length a power of 2. What else can we do to speed things up?

Real vs complex data¶

FFTs can be done with complex numbers as input data, but you will find that they are faster if you can get away with using only real data. In our example case, we are using a single real-valued number to represent the direction of travel of our billiard so can happily switch to using a real FFT.

If you run this snippet, you will notice two things:

there is a significant speed up using rfft
the rfft output array is still complex but is half as long as the fft output (the values in the missing half are redundant as they would duplicate those in the first half)

import numpy as np
import time

for fft_func in (np.fft.fft, np.fft.rfft):
    total_time = 0

    for i in range(10):
        in_array = np.random.rand(2**22)
        start = time.perf_counter()
        out_array = fft_func(in_array)
        total_time += time.perf_counter() - start

    print("fft/rfft:", fft_func.__name__,
          "total_time:", total_time,
          "length(f):", len(out_array))

pyFFTW interfaces¶

At this point, we have taken all of the reasonable steps that I can think of with Numpy. The next thing we can do is to look for a quicker library. Enter pyFFTW, a Python interface to the FFTW library, written in C. FFTW is already installed on Apocrita but you may need to install it first on any other machine.

The pyFFTW interfaces API provides a drop-in replacement to Numpy's FFT functions. I have found them to be marginally quicker for power-of-two cases and much quicker than Numpy for non-power-of-two cases.

import numpy as np
import pyfftw
import time

def np_rfft(*args):
    return np.fft.rfft(*args)

def interfaces_rfft(*args):
    return pyfftw.interfaces.numpy_fft.rfft(*args)

# Use our wrapper functions else function names will be "rfft" for both
for fft_func in (np_rfft, interfaces_rfft):

    for array_length in (2**22, (2**22)-1):
        total_time = 0

        for i in range(10):
            in_array = np.random.rand(array_length)
            start = time.perf_counter()
            out_array = fft_func(in_array)
            total_time += time.perf_counter() - start

        print("fft_func:", fft_func.__name__,
              "array_length:", array_length,
              "total_time:", total_time)

After re-reading the manual, I noticed there was a caching option to keep the FFTW objects alive for longer. This saves them from being rebuilt if, as in our case, inputs of the same type and size are reused.

This gives us our final pyFFTW interfaces option, which is about as quick as I have been able to get with this super-convenient (almost) drop-in Numpy replacement.

import numpy as np
import pyfftw
import time

array_length = 2**22

for caching in (False, True):
    if caching:
        pyfftw.interfaces.cache.enable()
        pyfftw.interfaces.cache.set_keepalive_time(60)
    total_time = 0

    for i in range(10):
        in_array = np.random.rand(array_length)
        start = time.perf_counter()
        out_array = pyfftw.interfaces.numpy_fft.rfft(in_array)
        total_time += time.perf_counter() - start

    print("total_time:", total_time,
          "caching:", caching)

The FFTW class¶

At this point, there is still plenty that can be done to speed up our real-valued FFT by switching to the less convenient but more customisable FFTW class.

The basic idea is to create an object that knows details of the input and output vectors and of the machine it is running on. Below, we tell the FFTW object to estimate the best FFT method with "FFTW_ESTIMATE" and compare it with Numpy's rfft.

import numpy as np
import pyfftw
import time

array_length = 2**22

# Byte-aligned inputs are supposedly faster
in_array = pyfftw.empty_aligned(array_length, dtype=np.float64)
print("size:", round(in_array.nbytes / 1000 / 1000), "MB")

# Required, even if we don't explicitly reference it again
out_array = pyfftw.empty_aligned(array_length//2 + 1, dtype=np.complex128)

fftw_object = pyfftw.FFTW(in_array,
                          out_array,
                          direction="FFTW_FORWARD",
                          flags=("FFTW_ESTIMATE", ),
                          threads=1)
np_total = 0
fftw_total = 0

for i in range(10):
    # Copy random data into our empty input array
    in_array[:] = np.random.rand(array_length)

    # Numpy
    np_start = time.perf_counter()
    np_output = np.fft.rfft(in_array)
    np_total += time.perf_counter() - np_start

    # FFTW
    fftw_start = time.perf_counter()
    fftw_output = fftw_object(in_array)
    fftw_total += time.perf_counter() - fftw_start

    # Show that Numpy and FFTW give the same answers
    np.testing.assert_allclose(np_output, fftw_output)

    # Remind ourselves that FFTW has modified and returned out_array
    assert fftw_output is out_array

print("np total:", np_total)
print("fftw total:", fftw_total)

Of particular note here:

there is now a fair bit more setup
there is some opportunity for confusion as the output array, which you have to manually create with the correct dimensions, is modified and returned by the call to the FFTW object
though we haven't taken into account the extra CPU or memory overhead for creating input/output arrays and copying random values into the input, you do get a satisfying speed up over Numpy (~x2) for your extra work

The extra work becomes more worthwhile when you can dedicate more cores to the FFT. In my case using threads=4 gives a ~x2 speedup over threads=1. There is no guarantee that any speed up will outweigh the extra multi-threading overhead (or that your FFTW is capable of it), so you should perform tests with representative data and on your machine before following suit.

We already have some appreciable speed increases over Numpy. Does that leave us anywhere else to go? Well, this next approach comes at a cost. Instead of "FFTW_ESTIMATE", we can tell FFTW to make actual performance measurements during the planning stage. Doing so means that the planning with take a long time (routinely up to 5 minutes for me) so is probably only worth it for long-running programs (those with very large arrays or very many FFTs).

import numpy as np
import pyfftw
import time

array_length = 2**22

# Byte-aligned inputs are supposedly faster
in_array = pyfftw.empty_aligned(array_length, dtype=np.float64)

# Required, even if we don't explicitly reference it again
out_array = pyfftw.empty_aligned(array_length//2 + 1, dtype=np.complex128)

# Estimate before measuring as wisdom is cached
for wisdom in ("FFTW_ESTIMATE", "FFTW_MEASURE"):
    start = time.perf_counter()
    fftw_object = pyfftw.FFTW(in_array,
                              out_array,
                              direction="FFTW_FORWARD",
                              flags=(wisdom, ),
                              threads=1)
    print("planning took", time.perf_counter() - start)
    total = 0

    for i in range(10):
        # Copy random data into our empty input array
        in_array[:] = np.random.rand(array_length)

        start = time.perf_counter()
        fftw_object(in_array)
        total += time.perf_counter() - start

    print("wisdom:", wisdom,
          "total:", total)

The information gained about your system during "FFTW_MEASURE" is known as wisdom. A nice feature of this wisdom is that it can be inexpensively written to disk. If you find that you are doing many runs on the same machine with the same length and type of input, you might want to write the wisdom to disk at the end of the first run and read it in subsequent runs.

import numpy as np
import pickle
import pyfftw
import time

array_length = 2**22

in_array = pyfftw.empty_aligned(array_length, dtype=np.float64)
out_array = pyfftw.empty_aligned(array_length//2 + 1, dtype=np.complex128)

# Try to load FFTW wisdom but don't panic if we can't
try:
    with open("fft.wisdom", "rb") as the_file:
        wisdom = pickle.load(the_file)
        pyfftw.import_wisdom(wisdom)
        print("Wisdom imported")
except FileNotFoundError:
    print("Warning: wisdom could not be imported")

try:
    # Try to plan our transforms with the wisdom we have already
    fftw_object = pyfftw.FFTW(in_array,
                              out_array,
                              direction="FFTW_FORWARD",
                              flags=("FFTW_WISDOM_ONLY",))

except RuntimeError as e:
    # If we don't have enough wisdom, print a warning and proceed.
    print(e)
    start = time.perf_counter()
    fftw_object = pyfftw.FFTW(in_array,
                              out_array,
                              direction="FFTW_FORWARD",
                              flags=("FFTW_MEASURE",))
    print("Generating wisdom took {}s".format(time.perf_counter() - start))

# Use the fftw_object as per usual
# ...

# Save the wisdom for next time
with open("fft.wisdom", "wb") as the_file:
    wisdom = pyfftw.export_wisdom()
    pickle.dump(wisdom, the_file)

Note that the wisdom gained on one Apocrita node type is probably not applicable on another. One option would be to save node-specific wisdom, another is to restrict job submissions to one node type.

Our final trick saves on memory and time but should be used with more caution than the others. By default, Numpy and Python default to using 64-bit floating point numbers so, for example, we see:

>>> np.random.rand(1).dtype
dtype('float64')

However, we can specify different precisions when declaring arrays with Numpy or pyFFTW. In my tests, pyFFTW is quicker with the less precise 32-bit floating point type than with the default 64-bit. We might presume that this is because there is half as much data to move between RAM and the CPU or because the CPU we tested with has more optimal instructions for the shorter type. The shorter type was slower when using Numpy's rfft functions because Numpy converts them to 64-bit internally (see Type Promotion).

import numpy as np
import pyfftw
import time

array_length = 2**22

for in_type, out_type in [(np.float32, np.complex64),
                          (np.float64, np.complex128)]:
    in_array = pyfftw.empty_aligned(array_length, in_type)
    out_array = pyfftw.empty_aligned(array_length//2 + 1, out_type)

    fftw_object = pyfftw.FFTW(in_array,
                              out_array,
                              direction="FFTW_FORWARD",
                              flags=("FFTW_ESTIMATE", ),
                              threads=1)

    for fft_callable, callable_name in [(np.fft.rfft, "Numpy rfft"),
                                        (fftw_object, "FFTW object")]:
        total_time = 0

        for i in range(10):
            in_array[:] = np.random.rand(array_length)
            start = time.perf_counter()
            fft_callable(in_array)
            total_time += time.perf_counter() - start

        print("total_time:", total_time,
              "in_type:", in_type,
              "callable:", callable_name,
              "array size (MB):", round(in_array.nbytes / 1000 / 1000))

Ultimately, we didn't use this approach because the nature of the experiment meant that it was sensitive to initial conditions and accumulated rounding errors.

pyFFTW, For The Win¶

Overall, pyFFTW makes you think a little harder about your FFTs since you have to pre-allocate arrays, avoid overwriting the output array before you've used it and handle the wisdom files. In return, you get decent speedups without having to switch to a lower level language.

This isn't, by any means, a criticism of Numpy's implementation, and we haven't done a rigorous comparison. We have seen, though, that continuous incremental improvements are to be had if you want to look for them.

I have deliberately avoided giving timings for the code snippets thus far. Performance measurements are hard, and they're liable to change with different library versions, operating systems and hardware so what works for me might not for you. What I can say is that, for our dynamical billiards problem, using FFTW objects (with real input values, multi-threading and the "FFTW_MEASURE" flag) was about 10x quicker than Numpy's FFT function. Since the simulation wasn't solely doing FFTs, this meant that it was about 2x quicker overall using pyFFTW instead of Numpy. This is a worthwhile improvement if it means you can do two experiments in a day instead of one.

All of the code snippets should run on Python 3.8 as long as pyFFTW and Numpy are installed.

Credit to George Datseris for the title image, available under the Creative Commons Attribution-Share Alike 4.0 International license.