In my chess program, pop_square is called billions of times. Square is just a simple enum(u6).
Explicit inlining is (I think) not needed in Zig: I have never measured any speed difference between using inline and not.
So the question is: is there any way I can squeeze more speed out of pop_square?
I don't think so, but you never know.
/// Unsafe: returns the lsb as a Square.
pub fn first_square(bitboard: u64) Square {
    assert(bitboard != 0);
    const lsb: u6 = @truncate(@ctz(bitboard));
    return Square.from(lsb);
}

/// Unsafe: pops the lsb, i.e. returns it as a Square and clears it from the bitboard.
pub fn pop_square(bitboard: *u64) Square {
    assert(bitboard.* != 0);
    defer bitboard.* &= (bitboard.* - 1);
    return first_square(bitboard.*);
}
Depending on how fast your CPU is clocked, that means somewhere between 2 and 5 clock cycles per move. Tbh, I don't think there's a lot more wiggle room for improvement, unless there's some data dependency on to_sq between loop iterations that prevents pipelining.
I would check the generated assembly output in release-fast mode for any obvious problems, e.g. whether everything is properly inlined.
Also, maybe try whether manually unrolling the loop is worth it: the loop condition looks dynamic, but you may know more about it than the compiler can figure out on its own. And try whether swapping the outer and inner loop makes a difference (e.g. if bb_from is usually much bigger or smaller than bb_to).
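A two-way manual unroll of such a pop loop could be sketched like this (in C, with a hypothetical squares_unrolled helper; whether it actually helps is CPU- and compiler-dependent):

```c
#include <stdint.h>

/* Sketch of a 2x manually unrolled lsb iteration. sq_out must have room
   for popcount(bb) entries; returns the number of squares written. */
static int squares_unrolled(uint64_t bb, int sq_out[]) {
    int n = 0;
    while (bb != 0) {
        sq_out[n++] = __builtin_ctzll(bb);
        bb &= bb - 1;
        if (bb == 0) break; /* odd popcount: second slot is empty */
        sq_out[n++] = __builtin_ctzll(bb);
        bb &= bb - 1;
    }
    return n;
}
```

Whether the extra branch in the middle pays for itself depends entirely on the branch predictor and typical popcounts, so this is only worth keeping if a benchmark confirms it.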
Also, instead of focusing on the pop_square() operation itself, I would probably focus on the memory access patterns in self.get_piece_attacks_from() and store_move(). E.g. if those accesses are currently very random, is there a way to bring follow-up accesses closer together in memory (ideally within the same cache line)?
But from my experience trying to optimize my home-computer emulators, at some point it gets really hard to push performance further once you're down to a couple of clock cycles per "action", since a small improvement for one specific CPU type may mean a slight slowdown on another.
You're probably already deep in diminishing-returns territory.
I found Zig's -femit-asm feature quite useless tbh (IIRC it currently doesn't work at all). Instead I usually run a 3rd-party disassembler over the entire executable and try to make sense of the output.
At least on Mac you get (some) function entry points as labels even in release mode; for instance, I can run otool -tV [exe] on a Zig executable compiled with release-fast and see the start of Zig functions like this: