I write this kind of code daily (I do trading systems for prop firms), and a correctly predicted branch gets pipelined away in the rest of the code. That predicted branch won’t even cost an instruction slot. If the loop is that tight, you are going to be limited by running out of the uop cache.
And in code like this you are trying to avoid hash lookups as much as possible and instead tend to do a lot of bulk operations (e.g. take everything from the last recv, put the keys in a 32-entry array, create a u32 where each bit is an add or del flag, make one call into the ht to do all of that, and get a 32-entry result vector back). These are the things that can make code very fast.
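To make that call shape concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `applyBatch` and `batch_len` are made-up names, and `std.AutoHashMap` stands in for whatever table you actually use. It only shows the interface (32 keys and a u32 mask in, 32 results out). A real batched table would do its work differently inside the call.

```zig
const std = @import("std");

// Hypothetical batch size, matching the 32-entry array described above.
const batch_len = 32;

/// Hypothetical batched entry point (not a std API).
/// Bit i of `add_mask` set   -> insert keys[i]
/// Bit i of `add_mask` clear -> delete keys[i]
/// result[i] is true when that operation changed the table.
fn applyBatch(
    set: *std.AutoHashMap(u64, void),
    keys: *const [batch_len]u64,
    add_mask: u32,
) ![batch_len]bool {
    var result: [batch_len]bool = undefined;
    var mask = add_mask;
    for (keys, &result) |key, *out| {
        if ((mask & 1) != 0) {
            // Add: report true only if the key was not already present.
            const gop = try set.getOrPut(key);
            out.* = !gop.found_existing;
        } else {
            // Del: remove() returns true if the key was present.
            out.* = set.remove(key);
        }
        mask >>= 1;
    }
    return result;
}
```

The caller fills `keys` from the last recv, builds the mask, and makes one call per batch instead of 32 separate lookups.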
The capacity checks that the assume capacity calls skip get pipelined away, and OOO execution makes them irrelevant. I can’t stand how bloated they make an API and how they detract from true high performance code: they give people a good feeling that they are writing high performance code when really it isn’t making any difference, and maybe if they knew that they would actually learn to write faster code.
But I also work on very high end Intel machines and don’t have to deal with low power chips and in-order execution (old Intel Atom), so my view of performance can be overly narrowed by my target platforms. (This is one case, though, where it really doesn’t matter.)
There are also ways to take out the assume capacity calls incorrectly. E.g., this is very wrong: the work function always calls checkCap unless the compiler inlines it, and if it does inline it, work itself might then be inlined and you’ve polluted the hot path with a bunch of cold code.
pub fn work() void {
    checkCap();
    // ... do my work
}
fn checkCap() void {
    // the check and the grow code live in the same function
    if (new_size <= capacity) return;
    // ... realloc and grow
}
Instead you want to do the capacity check either by hand or in a tiny function that I force inline, and keep all the growing code in another function (often marked noinline to make sure the compiler doesn’t get any funny ideas):
pub fn work() void {
    checkCap();
    // ... do my work
}
inline fn checkCap() void {
    // only the compare and branch end up in the caller
    if (new_size > capacity)
        growCap();
}
noinline fn growCap() void {
    // ... realloc and grow
}
(Usually the check will just be done at the top of the call by hand instead of having a function, unless it is an unintuitive check that is easy to screw up.)
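For a fuller picture, here is a minimal sketch of the same split with the check written by hand at the top of the call. `Buf`, `push` and `grow` are hypothetical names for illustration, not any std type, and the doubling growth policy is just an assumption.

```zig
const std = @import("std");

/// Hypothetical growable buffer, for illustration only.
const Buf = struct {
    allocator: std.mem.Allocator,
    items: []u64 = &[_]u64{},
    len: usize = 0,

    /// Hot path: the capacity check is one compare plus a branch that is
    /// almost never taken, written by hand at the top of the call.
    pub fn push(self: *Buf, value: u64) !void {
        if (self.len == self.items.len) try self.grow();
        self.items[self.len] = value;
        self.len += 1;
    }

    /// Cold path: all the realloc code lives here, and noinline keeps it
    /// out of push and out of anything push gets inlined into.
    noinline fn grow(self: *Buf) !void {
        const new_cap: usize = if (self.items.len == 0) 8 else self.items.len * 2;
        self.items = try self.allocator.realloc(self.items, new_cap);
    }

    pub fn deinit(self: *Buf) void {
        self.allocator.free(self.items);
    }
};
```

Even if `push` gets inlined into a tight loop, the loop body only picks up the compare, the branch and a call site; the grow code stays in its own function.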
That’s how you are supposed to do it. In that code path work is as straight as possible, and you don’t want any complex methods getting inlined: you only get around 30 uops to fit in the loop stream buffer, which is the optimal case. (In highly, highly optimized code it might matter to elide that check, since you would be using one uop of the 30 or so you get. The test and branch get fused and I think only count as 1 uop, but I’m not sure. That would be a very special case, though, and everything else would need to be optimized perfectly for it to even matter.)
I’m just not a fan of how bloated the Zig APIs are. There are a lot of false optimizations and pessimizations in the code too that I have a constant battle over, but I do understand that is the way the Zig community does things.