Optimizing the Goertzel Algorithm

Hi everyone,

For some years now I have had a small tool that uses the Goertzel algorithm for spectral analysis.

Since I moved the project to Zig 0.16.0, performance has dropped a lot. The reason is that loop vectorization is disabled.

So I started to vectorize it manually, which helped a bit, but performance is still not what it was. Maybe someone can give me some hints on how to improve it further (given that I do not just want to wait for the LLVM fix to arrive). I compared my original implementation with the manually optimized one on compiler explorer.

  • One thing that stuck with me was that in the manually vectorized version the vfmadd132ps instruction has twice the latency, because in that case it loads data from memory, whereas the original version uses only registers.
  • This adds one more load/store operation in the manual version for some reason.
  • It also causes the FMA operation to share a port with the load/store operations around it.
  • Note that the Zig version in the compiler explorer link is set to 0.15.2 to show the vectorized original version, but you can switch it to 0.16.0.
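
For reference, a manually vectorized Goertzel loop of the kind discussed here could look roughly like the sketch below. This is not the code from the compiler explorer link. The names `goertzelPower`, `binCoeff`, `lanes` and `Vec`, the lane width of 8, and the choice to vectorize across frequency bins (one bin per lane) are all assumptions made for illustration.

```zig
const std = @import("std");

// Assumption: 8 x f32 lanes, i.e. one 256-bit register per vector.
const lanes = 8;
const Vec = @Vector(lanes, f32);

/// Hypothetical helper: Goertzel coefficient 2*cos(2*pi*k/n) for bin k of n samples.
fn binCoeff(k: f32, n: f32) f32 {
    return 2.0 * @cos(2.0 * std.math.pi * k / n);
}

/// Runs the Goertzel recurrence for `lanes` frequency bins at once and
/// returns the squared magnitude of each bin.
fn goertzelPower(samples: []const f32, coeffs: Vec) Vec {
    var s1: Vec = @splat(0.0);
    var s2: Vec = @splat(0.0);
    for (samples) |x| {
        // Broadcast the current sample to every lane.
        const xv: Vec = @splat(x);
        // s0 = coeffs * s1 + (x - s2), requested as a fused multiply-add.
        const s0 = @mulAdd(Vec, coeffs, s1, xv - s2);
        s2 = s1;
        s1 = s0;
    }
    // |X(k)|^2 = s1^2 + s2^2 - coeff * s1 * s2, per lane.
    return s1 * s1 + s2 * s2 - coeffs * s1 * s2;
}
```

In this shape the coefficients and both state vectors are loop-invariant or loop-carried locals, so one would expect them to be able to stay in registers. Whether the compiler actually emits a register-only FMA still has to be checked in the generated assembly.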

Is there a way to implement the manual vectorization so that the FMA instruction does not take one of its operands from memory (without resorting to inline assembly)?

Many thanks for your help!

EDIT: As always, right after posting, I found the critical bug. Performance of the manually vectorized loop is now about half of what the original one was before. So for now, only one question remains.