LSB bits: can we go faster?

In my chess program, pop_square is called billions of times.
Square is just a simple enum(u6).
Inlining is (I think) not needed in Zig: I have never measured any speed difference between using inline and not.

So my question: is there any way I can squeeze more speed out of pop_square?
I don’t think so, but you never know.

/// Unsafe lsb
pub fn first_square(bitboard: u64) Square
{
    assert(bitboard != 0);
    const lsb: u6 = @truncate(@ctz(bitboard));
    return Square.from(lsb);
}

/// Unsafe: pops the lsb square and clears that bit from the bitboard.
pub fn pop_square(bitboard: *u64) Square
{
    assert(bitboard.* != 0);
    defer bitboard.* &= (bitboard.* - 1);
    return first_square(bitboard.*);
}
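
One micro-variation that may be worth benchmarking (a sketch, not a guaranteed win): load the bitboard into a local once, so the pop is expressed as a single read and a single write of the pointer. The optimizer quite possibly emits identical code for both versions, so only a measurement or a look at the disassembly can tell.

```zig
/// Unsafe pop lsb, written with a single load through the pointer.
/// Semantically identical to pop_square above.
pub fn pop_square_one_load(bitboard: *u64) Square
{
    const bb = bitboard.*;
    assert(bb != 0);
    bitboard.* = bb & (bb - 1); // clear the lowest set bit
    const lsb: u6 = @truncate(@ctz(bb));
    return Square.from(lsb);
}
```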

It is used in tight loops like:

// Bishop.
bb_from = self.bishops(us);
while (bb_from != 0)
{
    from_sq = pop_square(&bb_from);
    bb_to = self.get_piece_attacks_from(PieceType.BISHOP, from_sq) & target;
    while (bb_to != 0)
    {
        to_sq = pop_square(&bb_to);
        store_move(from_sq, to_sq, .normal, .no_prom, storage);
    }
}

My Zig move generator is already quite fast: around 730 million moves per second.
I want to get to 1 billion :slight_smile:

Depending how fast your CPU is clocked that means somewhere between 2 and 5 clock cycles per move. Tbh, I don’t think there’s a lot more wiggle room for improvement unless there’s some data-dependency on to_sq between loop iterations which prevents pipelining.
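
To make that estimate concrete (clock rates below are assumed, only the 730 million moves/s figure comes from the post above): cycles per move is simply clock rate divided by move rate.

```zig
const std = @import("std");

pub fn main() void
{
    const moves_per_sec: f64 = 730e6; // measured throughput from above
    // Two assumed clock rates, just to bracket the estimate.
    for ([_]f64{ 1.5e9, 3.5e9 }) |clock|
    {
        const cycles = clock / moves_per_sec;
        std.debug.print("{d:.1} GHz -> {d:.1} cycles per move\n", .{ clock / 1e9, cycles });
    }
}
```

At 1.5 GHz that works out to roughly 2 cycles per move, at 3.5 GHz roughly 5.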

I would check the generated assembly output in release-fast mode for any obvious problems - e.g. whether everything is properly inlined.

Also, maybe test whether it’s worth manually unrolling the loop: the loop condition looks dynamic, but you may know more about it than the compiler can figure out on its own. And try whether swapping the outer and inner loop makes a difference (if bb_from is usually much bigger or smaller than bb_to).

Also instead of focusing on the pop_square() operation I would probably focus on the memory access patterns in self.get_piece_attacks_from() and store_move(). E.g. if those are currently very random, is there a way to get followup accesses closer together in memory (ideally within the same cache line).

But from my experience with trying to optimize my home computer emulators, at some point it gets really hard to push performance further once you’re down to a couple of clock cycles per “action”, since it may be that a small improvement for one specific CPU type may mean a slight slowdown on another.

In other words, you’re probably already deep in diminishing-returns territory :wink:


Tell me how to do that in some readable format!

Yeah. For kings and knights we have a maximum of 8 moves. I was already wondering if we can tell the compiler “Hey, can you unroll this loop?”
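
Zig has no unroll pragma, but for a comptime-known bound you can force unrolling with inline while. A hypothetical sketch for the knight/king case (at most 8 target squares; bb_to, from_sq, pop_square and store_move as in the loop earlier in the thread). Each unrolled copy still needs a runtime guard, and the extra branches could just as well make this slower, so it's strictly something to benchmark:

```zig
// Emit 8 unrolled copies of the body; the actual number of set
// bits is only known at runtime, hence the guard in each copy.
comptime var step = 0;
inline while (step < 8) : (step += 1)
{
    if (bb_to != 0)
    {
        const to_sq = pop_square(&bb_to);
        store_move(from_sq, to_sq, .normal, .no_prom, storage);
    }
}
```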

True!

I found the Zig -emit-asm feature quite useless tbh (IIRC it currently doesn’t work at all). Instead I usually run some 3rd-party disassembler over the entire executable and try to make sense of the output.

At least on Mac you get (some) function entry points as labels also in release mode, for instance I can run otool -tV [exe] on a zig executable compiled with release-fast, and I can see the start of Zig functions like this:

_kc85.Args.parseModuleArgs:
0000000100012bb4        sub     sp, sp, #0xb0
0000000100012bb8        stp     x28, x27, [sp, #0x50]
0000000100012bbc        stp     x26, x25, [sp, #0x60]

…of course in release-fast mode there’s a lot of aggressive inlining, so you may only see a couple of top-level functions with huge bodies.


This is my usual setup for looking at assembly output:

const mod = b.addModule("my_mod", .{
    .root_source_file = b.path("src/main.zig"),
    .target = target,
    .optimize = optimize,
    .strip = true,
});

const lib = b.addLibrary(.{
    .name = "my_lib",
    .root_module = mod,
});

const asm_step = b.step("asm", "Generate asm");
asm_step.dependOn(&b.addInstallFile(lib.getEmittedAsm(), "asm/output.asm").step);