Fixing Zig benchmark where std.mem.doNotOptimizeAway was ignored


I was wondering if using @call() with .never_inline would do the same trick of preventing the overly aggressive constant folding, but interestingly, .never_inline doesn’t seem to work (at least in 0.15.2):
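Roughly this shape, for the record (a sketch; the surrounding benchmark scaffolding is assumed):

// .never_inline keeps fib from being inlined into the caller, but LLVM
// can still constant-fold the call when the argument is comptime-known:
const result = @call(.never_inline, fib, .{@as(u64, 30)});
std.mem.doNotOptimizeAway(result);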

Unfortunately, according to Andrew, using volatile in benchmarks is still not an appropriate use: Benchmarking isDigit - #20 by andrewrk

I wonder if it works if you DNOA a variable with the value 30 instead of a whole array/tuple of args. I'm on my phone, so I can't test it.

I don't think Andrew is right in this case, at least given the information I have and what others in that thread responded to him with.

Unless Zig's volatile differs in some way that means this should not work, or std.mem.doNotOptimizeAway is supposed to work in this case, using volatile seems fine. If the latter is true, then volatile is a workaround for a bug in std.

Something is needed to prevent constant folding, and volatile works; that is the first priority here. doNotOptimizeAway doesn't work, so it's not an option.
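For concreteness, the volatile workaround in question looks roughly like this (a sketch; fib stands in for whatever is being measured):

var n: u64 = 30;
const p: *volatile u64 = &n; // every read of p.* is a load the optimizer must keep
const result = fib(p.*);
std.mem.doNotOptimizeAway(result);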

An alternative presented in that thread is using a PRNG to cipher the data. While I agree LLVM probably won't constant-fold a PRNG, I think it could still see through a simple cipher, depending on how the random data is used. So I don't think that would work, though it remains to be tested.
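To make that concrete, I read the proposal as something like the following (a hypothetical sketch, untested as noted):

var prng = std.Random.DefaultPrng.init(0);
const key = prng.random().int(u64);
const masked: u64 = 30 ^ key; // "ciphered" input, prepared outside the timed region
// ... inside the timed region:
const result = fib(masked ^ key); // unmask right before use
// (the result still needs to be observed, e.g. printed)
// With a constant seed, all of this is foldable in principle, which is
// exactly the concern above.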

Guys, I tried to reduce this constant folding vs. std.mem.doNotOptimizeAway question to a minimal example.

Check this out:

const std = @import("std");

export fn fib(n: u64) u64 {
    if (n == 0) return 0;
    var a: u64 = 0;
    var b: u64 = 1;
    for (2..n + 1) |_| {
        const c = a + b;
        a = b;
        b = c;
    }
    return b;
}

export fn run_with_comptime_int() void {
    const n = 30; 
    std.mem.doNotOptimizeAway(n);
    const result = fib(n);
    std.mem.doNotOptimizeAway(result);
}

export fn run_with_u64() void {
    const n: u64 = 30; 
    std.mem.doNotOptimizeAway(n);
    const result = fib(n);
    std.mem.doNotOptimizeAway(result);
}

export fn run_with_immutable_pointer() void {
    const n: u64 = 30; 
    std.mem.doNotOptimizeAway(&n);
    const result = fib(n);
    std.mem.doNotOptimizeAway(result);
}

export fn run_with_mutable_pointer() void {
    const n: u64 = 30; 
    var x = n;
    std.mem.doNotOptimizeAway(&x);
    const result = fib(x);
    std.mem.doNotOptimizeAway(result);
}

Godbolt: Compiler Explorer

The output shows the first three functions are constant folded:

run_with_comptime_int:
        push    rbp
        mov     rbp, rsp
        mov     eax, 832040       ; <=== Answer hardcoded. 0ns execution.
        pop     rbp
        ret

run_with_u64:
        push    rbp
        mov     rbp, rsp
        mov     eax, 30           ; DNOA(n) kept this instruction alive
        mov     eax, 832040       ; <=== however the compiler still hardcoded the answer (0ns)
        pop     rbp
        ret

run_with_immutable_pointer:
        push    rbp
        mov     rbp, rsp
        mov     qword ptr [rbp - 16], offset __anon_237
        lea     rax, [rbp - 16]
        mov     qword ptr [rbp - 8], rax
        mov     eax, 832040       ; <=== Answer hardcoded (0ns)
        pop     rbp
        ret

But the last function is not:

run_with_mutable_pointer:
        push    rbp
        mov     rbp, rsp
        mov     qword ptr [rbp - 8], 30
        lea     rax, [rbp - 8]
        mov     qword ptr [rbp - 24], rax
        lea     rax, [rbp - 24]
        mov     qword ptr [rbp - 16], rax
        mov     rcx, qword ptr [rbp - 8]
        test    rcx, rcx
        je      .LBB4_1
        mov     rsi, rcx
        dec     rsi
        je      .LBB4_3
        add     rcx, -2
        mov     eax, esi
        and     eax, 7
        cmp     rcx, 7
        jae     .LBB4_6
        mov     ecx, 1
        xor     edx, edx
        jmp     .LBB4_8

You do not see 832040 anywhere.

I think I can use this mutable pointer trick to replace the volatile solution.
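Applied to an actual timed loop, that would look something like this (a sketch; the iteration count and timing scaffolding are my assumptions, with fib from the snippet above):

const std = @import("std");

pub fn main() !void {
    const iterations = 1_000;
    var n: u64 = 30;
    var timer = try std.time.Timer.start();
    for (0..iterations) |_| {
        std.mem.doNotOptimizeAway(&n); // mutable pointer: compiler must assume n can change
        const result = fib(n);
        std.mem.doNotOptimizeAway(result); // result is "observed", so fib must actually run
    }
    const elapsed = timer.read();
    std.debug.print("{d} ns/iter\n", .{elapsed / iterations});
}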

1 Like

I probably need to write a blog post about this, but volatile, noinline, and doNotOptimizeAway have always felt snake-oily to me when applied to benchmarking. I prefer natural remedies!

The way I solve this class of problems is:

On the input side, make parameters runtime overridable:

const search_count = bench.parameter("search_count", 1_000, 1_000_000);

By default, search_count is one million, a value chosen by the benchmark author to give reasonable results. But you can override it at runtime with the search_count environment variable, which a) makes it easy to tweak the parameter without recompilation and b) naturally prevents the compiler from optimizing the code away. (The second parameter, 1_000, is the value used in tests to prevent bitrot.)
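For illustration, a hypothetical sketch of what such a parameter helper could look like (the real bench.parameter is the author's own harness code; names and exact behavior here are my assumptions):

const std = @import("std");
const builtin = @import("builtin");

fn parameter(name: []const u8, test_value: u64, default: u64) u64 {
    // In `zig test` builds, use the small value so the benchmark still
    // runs (preventing bitrot) without taking forever.
    if (builtin.is_test) return test_value;
    // An environment-variable override is only known at runtime, so the
    // compiler cannot constant-fold anything derived from it.
    const gpa = std.heap.page_allocator;
    const text = std.process.getEnvVarOwned(gpa, name) catch return default;
    defer gpa.free(text);
    return std.fmt.parseInt(u64, text, 10) catch default;
}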

On the output side, print a “hash” of the run. For Fibonacci, it'll just be the actual result, fib(n)=13. For something like binary search, you can do something like

var hash: usize = 0;
for (...) {
    hash +%= insertion_index(xs, x);
}

print("hash={}", .{hash});

This prevents the compiler from optimizing the code away, but, more importantly, it prevents the human from optimizing the code away when trying to make it faster and messing up. It's very helpful to see both that your change made the code 10x faster, and that it also changed the hash that was supposed to stay the same.
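Put together, the pattern might look like this for the fib example from earlier in the thread (a sketch; the loop bound is arbitrary and fib is assumed to be in scope):

const std = @import("std");

pub fn main() void {
    var hash: u64 = 0;
    for (0..1_000) |i| {
        hash +%= fib(i % 40); // wrapping add: a cheap checksum of all results
    }
    // Printing the hash makes the result observable: the compiler cannot
    // delete the loop, and a human notices if a "speedup" changed the output.
    std.debug.print("hash={}\n", .{hash});
}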

EDIT: this is now also in the form of a blog article: Do Not Optimize Away

14 Likes

I've been thinking about this more deeply. Even though the volatile workaround works, I believe it comes with a high cost.

A volatile cast forces a hardware memory read on every access. This means the benchmark measures the implementation plus the additional memory latency.

In contrast, std.mem.doNotOptimizeAway(&var) uses an inline assembly memory clobber constraint. This effectively blinds the compiler without inserting actual hardware load instructions into the benchmark loop.

        .pointer => {
            if (builtin.zig_backend == .stage2_c) {
                doNotOptimizeAwayC(val);
            } else {
                asm volatile (""
                    :
                    : [_] "m" (val),
                    : .{ .memory = true });
            }
        },

reference: std.mem.doNotOptimizeAway

2 Likes

Computing a hash of the benchmark result is a really good idea. One could also put the benchmark itself in a separate compilation unit with an exported function, to get some guarantee that the compiler won't optimize away the inputs.
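Something like this, I imagine (a sketch with assumed file names; it only helps if the two units are actually compiled to separate objects and linked without LTO, since within one compilation LLVM can still fold through an export fn, as the Godbolt output above shows):

// fib_unit.zig, built on its own with e.g. `zig build-obj -O ReleaseFast fib_unit.zig`
export fn fib_export(n: u64) u64 {
    if (n == 0) return 0;
    var a: u64 = 0;
    var b: u64 = 1;
    for (2..n + 1) |_| {
        const c = a + b;
        a = b;
        b = c;
    }
    return b;
}

// main.zig sees only the prototype, so the optimizer has no body to
// constant-fold through:
extern fn fib_export(n: u64) u64;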

2 Likes