Hey,
Indeed! I’ve been trying to beat or get close to FFTW for a while now. Especially considering it’s somewhat restrictive licensing, I would love to have my own FFT to use without feeling like I’m compromising on performance in a huge way.
So my implementation is based on this paper: click me.
Essentially, I benchmark my two fft implementations (recursive and iterative) versus an fftw reference. I use my flake.nix to build an avx512 version of it. One thing I was thinking at the time when I wrote this (trying to be smart) was to expose the FULL recursive of the FFT to the compiler by comptime specializing all the butterflies. This was, in hindsight, super naive as you completely thrash the instruction cache. Consider the following comparision:
My implementation, AVX512 @ 512 size’d rfft:
(dev) niels@work-laptop:~/repositories/personal/fftiny.zig$ perf stat -e cycles,instructions,L1-icache-load-misses,L1-dcache-load-misses,LLC-load-misses,branches,branch-misses ./zig-out/bin/fftiny
Starting size 512...
-> fwFFTIterative: +2345130861 ns, mean: 2345.13 ns
Performance counter stats for './zig-out/bin/fftiny':
10,601,456,574 cycles:u
20,699,380,590 instructions:u # 1.95 insn per cycle
1,963,095,943 L1-icache-load-misses:u
272,936 L1-dcache-load-misses:u
552 LLC-load-misses:u
3,055,386 branches:u
2,793 branch-misses:u # 0.09% of all branches
2.346229572 seconds time elapsed
2.331690000 seconds user
0.010899000 seconds sys
FFTW:
(dev) niels@work-laptop:~/repositories/personal/fftiny.zig$ perf stat -e cycles,instructions,L1-icache-load-misses,L1-dcache-load-misses,LLC-load-misses,branches,branch-misses ./zig-out/bin/fftiny
Starting size 512...
-> fwFFTW: +249144992 ns, mean: 249.14 ns
Performance counter stats for './zig-out/bin/fftiny':
3,234,655,398 cycles:u
11,742,609,110 instructions:u # 3.63 insn per cycle
1,840,124 L1-icache-load-misses:u
4,758,020 L1-dcache-load-misses:u
892 LLC-load-misses:u
667,656,740 branches:u
2,498,043 branch-misses:u # 0.37% of all branches
0.740388690 seconds time elapsed
0.737367000 seconds user
0.001999000 seconds sys
It’s clear I have almost 2000x more L1-icache-load-misses than FFTW. This is (to me) 1) why my lib is slow 2) why my compile times are long. We do however have much less branches
.