Any FFT libraries in active development?

Just one year into Zig, and I still love this language, mainly for how explicitly it manages heap memory and for defer. That makes it an ideal language for programming audio stuff and addresses my greatest concerns in developing software.

Speaking of audio applications, the FFT plays a crucial role in extracting the frequency spectrum of an audio signal, which is useful for visualizers and reverb, so I searched around GitHub for good libraries that do this; sadly, the most popular FFT repos seem to have come to a halt or are not usable as a dependency:

  • zfft-orchard (Development stopped 2 years ago)
  • fftiny.zig (Still in WIP, zig version stuck at 0.14.0)
  • cliFFTop (Couldn’t find a way to consume it as a package dependency)

Besides, these repos currently look more like benchmarks of a collection of algorithms than libraries. I know I could just create a binding for FFTW, or even write my own FFT library based on the algorithms above, but before writing my own solution, I want to know if anyone has made a production-ready FFT library I could use.
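
For anyone who does end up rolling their own: a minimal radix-2 Cooley–Tukey FFT fits in a screenful of Zig. This is just an illustrative sketch (the function and variable names are mine, not from any of the repos above), with no SIMD or comptime specialization, so don't expect FFTW-class performance:

```zig
const std = @import("std");
const Complex = std.math.Complex(f64);

/// In-place iterative radix-2 Cooley–Tukey FFT.
/// `data.len` must be a power of two.
fn fft(data: []Complex) void {
    const n = data.len;
    // Bit-reversal permutation.
    var j: usize = 0;
    for (1..n) |i| {
        var bit = n >> 1;
        while (j & bit != 0) : (bit >>= 1) j &= ~bit;
        j |= bit;
        if (i < j) std.mem.swap(Complex, &data[i], &data[j]);
    }
    // Butterfly stages of size 2, 4, 8, ..., n.
    var len: usize = 2;
    while (len <= n) : (len <<= 1) {
        const ang = -2.0 * std.math.pi / @as(f64, @floatFromInt(len));
        const wlen = Complex.init(@cos(ang), @sin(ang));
        var i: usize = 0;
        while (i < n) : (i += len) {
            var w = Complex.init(1, 0);
            for (0..len / 2) |k| {
                const u = data[i + k];
                const v = data[i + k + len / 2].mul(w);
                data[i + k] = u.add(v);
                data[i + k + len / 2] = u.sub(v);
                w = w.mul(wlen);
            }
        }
    }
}
```

For a visualizer, you would fill the buffer with windowed samples (imaginary parts zero), call fft, and take the magnitude of each bin to get the spectrum.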

I think I have found a way forward with cliFFTop:

To import the dependency: since the repo doesn’t have a build.zig.zon file, I need to specify the dependency name explicitly with --save:

zig fetch --save=cliFFTop git+https://github.com/BlueAlmost/cliFFTop

And after importing the dependency, instead of directly calling addImport(), I need to construct a module for the specific algorithm of my choice:

    const cliFFTop = b.dependency("cliFFTop", .{});
    const ffts = b.addModule("ffts", .{
        .root_source_file = cliFFTop.path("cliFFTop/src/ffts.zig"),
        .target = target,
        .optimize = optimize,
    });
    exe.root_module.addImport("ffts", ffts);

With this setup, ZLS recognizes the functions in ffts.zig.

Update:

However, ffts alone is not enough, since it depends on other modules; so, as a quick hack just to mess around with the library, I manually wired up the modules:

    const cliFFTop = b.dependency("cliFFTop", .{});
    const ffts = b.addModule("ffts", .{
        .root_source_file = cliFFTop.path("cliFFTop/src/ffts.zig"),
        .target = target,
        .optimize = optimize,
    });
    const utils = b.addModule("utils", .{
        .root_source_file = cliFFTop.path("cliFFTop/src/utils.zig"),
        .target = target,
        .optimize = optimize,
    });
    const complex_math = b.addModule("complex_math", .{
        .root_source_file = cliFFTop.path("cliFFTop/src/complex_math.zig"),
        .target = target,
        .optimize = optimize,
    });
    const luts = b.addModule("luts", .{
        .root_source_file = cliFFTop.path("cliFFTop/src/luts.zig"),
        .target = target,
        .optimize = optimize,
    });

    ffts.addImport("utils", utils);
    ffts.addImport("complex_math", complex_math);
    ffts.addImport("luts", luts);
    luts.addImport("utils", utils);

    exe.root_module.addImport("ffts", ffts);
    exe.root_module.addImport("utils", utils);
    exe.root_module.addImport("complex_math", complex_math);
    exe.root_module.addImport("luts", luts);

I know there are better solutions, but this is good enough for now.


Hey there,

I created the fftiny library in Zig, and would highly advise against using it :laughing:. It’s more for me to test comptime :slight_smile:; compile times are long and the FFT is slow compared to FFTW, whereas cliFFTop looks better + cooler. That being said, I pushed my local changes, so it should build with 0.15.1.

Haha, it is no easy task to beat a library that has been developed for many years with all the SIMD work, backed by MIT, but it is still really cool that you have built one yourself. Since you have updated the repo, perhaps I will pull it and mess around with it for fun. One thing I don’t understand from the benchmark, though: are all three algorithms your implementation, or are some of them from the original FFTW library as a control?

Speaking of cliFFTop, it is still not perfect, since the way to build the library is a bit indirect; it’s not something we can use out of the box by calling addModule(), so it definitely needs some changes for a better experience. :thinking:

Hey,

Indeed! I’ve been trying to beat or get close to FFTW for a while now. Especially considering its somewhat restrictive licensing, I would love to have my own FFT to use without feeling like I’m compromising on performance in a huge way.

So my implementation is based on this paper: click me.

Essentially, I benchmark my two FFT implementations (recursive and iterative) against an FFTW reference. I use my flake.nix to build an AVX512 version of it. One thing I was thinking at the time when I wrote this (trying to be smart) was to expose the FULL recursion of the FFT to the compiler by comptime-specializing all the butterflies. This was, in hindsight, super naive, as you completely thrash the instruction cache. Consider the following comparison:

My implementation, AVX512 @ 512 size’d rfft:

(dev) niels@work-laptop:~/repositories/personal/fftiny.zig$ perf stat -e cycles,instructions,L1-icache-load-misses,L1-dcache-load-misses,LLC-load-misses,branches,branch-misses ./zig-out/bin/fftiny
Starting size 512...
->       fwFFTIterative: +2345130861 ns, mean:    2345.13 ns


 Performance counter stats for './zig-out/bin/fftiny':

    10,601,456,574      cycles:u
    20,699,380,590      instructions:u                   #    1.95  insn per cycle
     1,963,095,943      L1-icache-load-misses:u
           272,936      L1-dcache-load-misses:u
               552      LLC-load-misses:u
         3,055,386      branches:u
             2,793      branch-misses:u                  #    0.09% of all branches

       2.346229572 seconds time elapsed

       2.331690000 seconds user
       0.010899000 seconds sys

FFTW:

(dev) niels@work-laptop:~/repositories/personal/fftiny.zig$ perf stat -e cycles,instructions,L1-icache-load-misses,L1-dcache-load-misses,LLC-load-misses,branches,branch-misses ./zig-out/bin/fftiny
Starting size 512...
->               fwFFTW: +249144992 ns, mean:     249.14 ns


 Performance counter stats for './zig-out/bin/fftiny':

     3,234,655,398      cycles:u
    11,742,609,110      instructions:u                   #    3.63  insn per cycle
         1,840,124      L1-icache-load-misses:u
         4,758,020      L1-dcache-load-misses:u
               892      LLC-load-misses:u
       667,656,740      branches:u
         2,498,043      branch-misses:u                  #    0.37% of all branches

       0.740388690 seconds time elapsed

       0.737367000 seconds user
       0.001999000 seconds sys

It’s clear I have over 1,000x more L1-icache-load-misses than FFTW. This is (to me) 1) why my lib is slow and 2) why my compile times are long. We do, however, have far fewer branches :laughing:.

Oh, those cache misses really hurt performance, but after seeing your implementation, I am wondering:

Is it possible the problem is caused by the inline loops? What got overwhelmed is the instruction cache, and you seem to perform an inline for loop at the very beginning, which unrolls into N repetitions of the loop body:

In other words, if you do this:

inline for (0..4) |i| {
    std.debug.print("{d}", .{i});
}

It could become:

// Not literally what happens (the unrolling occurs in the generated
// machine code, not in Zig source), but the idea is there
std.debug.print("{d}", .{0});
std.debug.print("{d}", .{1});
std.debug.print("{d}", .{2});
std.debug.print("{d}", .{3});

Because of this unrolling, the code may lose temporal locality of instructions: instead of re-running the same small block of code, the CPU has to fetch all the unrolled instructions into the cache. Unrolling both the for and while loops seems to produce a huge instruction block, especially when N is large, so the cache can’t hold all the instructions of the unrolled loop at once, resulting in cache misses.

However, this is just my speculation and I might be wrong; the Zig compiler is definitely smarter than me, and possibly there are compiler optimization tricks for handling such situations that I don’t know about.

Especially considering its somewhat restrictive licensing

Me too, seeing the GPL license… There are reasons why I am searching for an FFT library written in Zig, or something with better licensing. It is a bit hilarious that a library developed at MIT is not under the MIT license, though.

I agree with you completely. I also attribute this to the inlining I do, and I also thought the compiler might have done something smarter (than me). Now, I’d be keen to do a bit less unrolling and see if we improve (some compromise between all runtime and all comptime). This would require some rearchitecting, though.
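
One possible shape for that compromise (just a sketch, not fftiny's actual structure; `butterfly` here is a hypothetical stand-in kernel): fully unroll only the small stages at comptime and keep a runtime loop for the large ones, so the amount of generated code stays bounded:

```zig
// Hypothetical stand-in for a real butterfly kernel.
fn butterfly(data: []f32, i: usize) void {
    data[i] += 1.0; // placeholder work
}

// Unroll small stages only; large stages keep a runtime loop,
// so the generated code still fits in the instruction cache.
fn stage(data: []f32, comptime len: usize) void {
    if (comptime len <= 8) {
        inline for (0..len) |i| butterfly(data, i);
    } else {
        for (0..len) |i| butterfly(data, i);
    }
}
```

The cutoff (8 here) would need tuning against perf's icache counters rather than guessing.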

There aren’t zig bindings for it afaik but you can use FFTW no?

That’s pretty standard

Oh, I guess you missed part of the conversation, so here is a summary of everything above regarding your question:

It is true that we can create a binding for FFTW or load its shared library directly, but before doing that, it is a good idea to look for an existing Zig solution, since FFTW seems a bit complex and its license is undesirable, making it a last resort. Since I found cliFFTop, which is written purely in Zig and MIT-licensed, I ended up using it because it is good enough for my use case.


Ah, I apologize for missing that. Thank you for summarising!

Update: While waiting for a response to my suggestions, I found a way to make the library work as a dependency by adding a `build.zig.zon` and relocating the library build file to the root level. With this configuration, we can now load the library using:

zig fetch --save git+https://github.com/Logickin-Lambda/cliFFTop#main

    const cliFFTop = b.dependency("cliFFTop", .{});
    exe.root_module.addImport("cliFFTop", cliFFTop.module("cliFFTop_build_name"));

Thus, no more manual module imports.

Edit: Here is the link to the old solution, in case anyone needs to manually link their dependency’s modules one by one on their end.

A bit off topic - can you recommend any good tools for investigating cache misses in the Zig world?

I’m also looking into Zig for audio processing just like OP :slight_smile:

Interesting question. I have also just gotten into this field; although I learnt some simple DSP back in my bachelor’s degree, I sadly don’t have an answer for that.

My current approach is a bit manual: I do some planning before implementation, considering what access patterns are most likely at run time, and I try to avoid any dynamic allocation in the audio thread. Currently, I take inspiration from the book on Data-Oriented Design and Andrew Kelley’s talk to carefully analyze how structs are laid out in memory:

I am also interested in whether there are tools for that, or whether it has to be done by intuition.


tools for investigating cache misses

You can look to the “perf” tool on Linux to quantify L1 cache misses. Any LLM can give you detailed examples. For instance,

$ sudo perf stat -e L1-dcache-loads,L1-dcache-load-misses ./your-program
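
To go from aggregate counts to the offending functions, perf can also sample and attribute events per symbol (build with debug symbols so the report is readable; exact event names vary by CPU, check `perf list`):

```shell
# Sample cache-miss events while the program runs.
sudo perf record -e L1-dcache-load-misses ./your-program

# Break the samples down by function (drop --stdio for the interactive TUI).
sudo perf report --stdio
```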

+1, learning perf has been such a huge benefit. I recommend this course and the accompanying book. If you’re a novice like me, it’s really helpful!
