I measured it, the randomization part takes the same amount of time in the C and Zig versions, so it’s fine for the benchmark. It’s the loop lowering that’s different here.
The binary example clearly shows why the for loops are so much slower: The for loops with dynamic ranges create a complex control flow with 2 counter and more branches, while the while loops result in only one counter with fewer branch misses.
Yeah, in ReleaseFast but effect is not as severe in ReleaseSmall. The LLVM opt pipeline is fun.
Yep: ReleaseFast ~80 lines of assembly with complex loop unrolling, ReleaseSmall ~30 lines, no more loop unrolling!
Can one draw general-purpose conclusions from this? (At the zig-language level, that is; I can see the binary / LLVM-opt-level conclusions, or observations, at least.)
Good question - I noticed some projects are now currently doing stuff like #pragma clang loop unroll(disable) due to the aggressive unrolling.
I see @loopHint has been proposed for Zig
i’m also curious: given that my code works correctly in Debug mode, is its behavior in ReleaseFast a miscompilation?
Are you referring to the comment “fwiw I ran the latest fixed version and there’s no difference under ReleaseFast” ? What I meant there was that there’s no difference in performance under ReleaseFast, not that there’s a problem with the output after your fix (afaict)
I don’t think so, in this particular instance it wasn’t beneficial, because the algorithm is branch heavy, and since it’s about sorting, random numbers, assuming the distribution is random, you literally have a 50% chance of being wrong. This text book example worst case scenario for the speculative execution engine, it keeps guessing and being wrong most of the time, and needs to stall, which is further increased by the unrolling. But most of the time loop unrolling is usually beneficial, and can help especially if you don’t have much branch, or if they are cold enough that you’ll still benefit from the unrolling. So it’s worth keeping in mind and double checking, and investigating, but no need to draw conclusion i think ![]()
i’m not sure how to understand this reply. i mean that the code i wrote way way above works in Debug mode, producing a sorted list. according to you, it does not work in ReleaseFast mode, producing garbage. That difference was the source of my question.
No, your provided fix does work in ReleaseFast which is why I originally said fixed version - it really was fixed
What I am saying (poorly) is simply that there’s no difference in performance between your corrected version and the original nested-for version. The correction being ar[i+1..]