New Zigger exploring performance

WeeBull · November 23, 2025, 2:43pm

Just to complete the tale on this one, since I put some more effort into working it out.

I ran perf on the various versions of the code and discovered that the whole benchmark seemed to be bottlenecked by two structure assignments, which together took about 60% of all cycles. A good run tended to have much better cache hit rates. The structure assignment itself was only 4 AVX instructions (2 loads and 2 stores, 32 bytes each), but the first one seemed to be taking a high penalty.

The 56-byte structures were allocated in the .data region of the executable (i.e. they’re static globals), and the performance depended on where the compiler and linker happened to place them. Good placement meant each one fitted in a single cache line, but at 56 bytes (with a natural 8-byte alignment) it was much more likely that they’d straddle two. Being static, their placement is determined by their location in the ELF file. Add or remove code, and that placement changed. With them being so critical to the overall number, straddling two cache lines had a big impact.
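
To put a rough number on "much more likely", here is a minimal sketch of the arithmetic. It is in C rather than Zig purely because the calculation is language-independent. The 64-byte cache line, the `Payload` layout (seven 8-byte fields standing in for the real structure) and the function names are all assumptions made for illustration, not taken from the benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on most current x86-64 CPUs. */
enum { CACHE_LINE = 64 };

/* Hypothetical stand-in for the 56-byte structure: seven 8-byte fields,
 * giving a size of 56 bytes and a natural alignment of 8 bytes. */
typedef struct {
    uint64_t fields[7];
} Payload;

/* Returns 1 if an object of `size` bytes starting at `addr` touches more
 * than one cache line, 0 otherwise. */
int straddles_line(uintptr_t addr, size_t size, size_t line_size)
{
    assert(size > 0 && line_size > 0);
    return (addr / line_size) != ((addr + size - 1) / line_size);
}

/* Counts how many start offsets within one cache line (stepping by `align`)
 * would make an object of `size` bytes straddle two lines. */
unsigned count_straddling_offsets(size_t size, size_t align, size_t line_size)
{
    assert(align > 0);
    unsigned n = 0;
    for (size_t off = 0; off < line_size; off += align)
        n += (unsigned)straddles_line(off, size, line_size);
    return n;
}

/* Requesting cache-line alignment explicitly removes the dependence on
 * wherever the object happens to land in .data. */
_Alignas(64) Payload aligned_payload;
```

Under those assumptions, `count_straddling_offsets(56, 8, 64)` finds that 6 of the 8 possible 8-byte-aligned offsets straddle two lines; only offsets 0 and 8 keep the structure inside one. Forcing the alignment, as with `_Alignas(64)` here or `align(64)` on the variable declaration in Zig, should take the layout lottery out of the measurement, though I have not verified that against this benchmark.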

It’s a bit of a working theory, but it seems to fit, and it also shows how terrible the benchmark is.