I was wondering if using @call() with .never_inline would do the same trick of preventing the overly aggressive constant folding, but interestingly, .never_inline doesn’t seem to work (at least in 0.15.2):
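Roughly, the attempt looks like this (a sketch; work is just a stand-in loop, the point is the call modifier):

const std = @import("std");

fn work(n: u64) u64 {
    var acc: u64 = 0;
    for (0..n) |i| acc +%= i;
    return acc;
}

export fn run_never_inline() void {
    // The call stays out-of-line, yet on 0.15.2 the result still gets
    // hard-coded into the caller.
    const result = @call(.never_inline, work, .{@as(u64, 1000)});
    std.mem.doNotOptimizeAway(result);
}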
Unfortunately, using volatile in benchmarks is still not an appropriate use according to Andrew: Benchmarking isDigit - #20 by andrewrk
I wonder if it works if you DNOA a variable with the value 30, instead of a whole array/tuple of args. I’m on my phone, so I can’t test it.
I don’t think Andrew is right in this case, at least given the information I have and what others in that thread responded to him with.
Unless Zig’s volatile differs in some way that makes it unsuitable here, or std.mem.doNotOptimizeAway is actually supposed to work in this case, using volatile seems fine. And if the latter is true, then volatile is a workaround for a bug in std.
Something is needed to prevent constant folding, and volatile works; that is the first priority here. doNotOptimizeAway doesn’t work, so it’s not an option.
An alternative presented in that thread is to use a PRNG to cipher the data. While I agree LLVM probably won’t constant-fold a PRNG, I think it could still see through a simple cipher, depending on how the random data is used. So I don’t think that would work, though it remains to be tested.
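To make the worry concrete, here is one reading of the idea as a sketch (untested, and the seed and xor round-trip are my own illustration, not the exact proposal from that thread):

const std = @import("std");

export fn ciphered_input() u64 {
    // A comptime-known seed means the whole random stream is, in
    // principle, knowable to the compiler.
    var prng = std.Random.DefaultPrng.init(0);
    const mask = prng.random().int(u64);
    // If LLVM can trace mask through the xor round-trip, this is still
    // a compile-time constant, which is exactly the concern above.
    return (30 ^ mask) ^ mask;
}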
Guys, I tried to simplify this constant folding vs std.mem.doNotOptimizeAway question.
Check this out:
const std = @import("std");

export fn fib(n: u64) u64 {
    if (n == 0) return 0;
    var a: u64 = 0;
    var b: u64 = 1;
    for (2..n + 1) |_| {
        const c = a + b;
        a = b;
        b = c;
    }
    return b;
}

export fn run_with_comptime_int() void {
    const n = 30;
    std.mem.doNotOptimizeAway(n);
    const result = fib(n);
    std.mem.doNotOptimizeAway(result);
}

export fn run_with_u64() void {
    const n: u64 = 30;
    std.mem.doNotOptimizeAway(n);
    const result = fib(n);
    std.mem.doNotOptimizeAway(result);
}

export fn run_with_immutable_pointer() void {
    const n: u64 = 30;
    std.mem.doNotOptimizeAway(&n);
    const result = fib(n);
    std.mem.doNotOptimizeAway(result);
}

export fn run_with_mutable_pointer() void {
    const n: u64 = 30;
    var x = n;
    std.mem.doNotOptimizeAway(&x);
    const result = fib(x);
    std.mem.doNotOptimizeAway(result);
}
Godbolt: Compiler Explorer
The output shows the first three functions are constant folded:
run_with_comptime_int:
        push    rbp
        mov     rbp, rsp
        mov     eax, 832040        ; <=== Answer hardcoded. 0ns execution.
        pop     rbp
        ret

run_with_u64:
        push    rbp
        mov     rbp, rsp
        mov     eax, 30            ; DNOA(n) kept this instruction alive
        mov     eax, 832040        ; <=== however the compiler still hardcoded the answer (0ns)
        pop     rbp
        ret

run_with_immutable_pointer:
        push    rbp
        mov     rbp, rsp
        mov     qword ptr [rbp - 16], offset __anon_237
        lea     rax, [rbp - 16]
        mov     qword ptr [rbp - 8], rax
        mov     eax, 832040        ; <=== Answer hardcoded (0ns)
        pop     rbp
        ret
But the last function is not:
run_with_mutable_pointer:
        push    rbp
        mov     rbp, rsp
        mov     qword ptr [rbp - 8], 30
        lea     rax, [rbp - 8]
        mov     qword ptr [rbp - 24], rax
        lea     rax, [rbp - 24]
        mov     qword ptr [rbp - 16], rax
        mov     rcx, qword ptr [rbp - 8]
        test    rcx, rcx
        je      .LBB4_1
        mov     rsi, rcx
        dec     rsi
        je      .LBB4_3
        add     rcx, -2
        mov     eax, esi
        and     eax, 7
        cmp     rcx, 7
        jae     .LBB4_6
        mov     ecx, 1
        xor     edx, edx
        jmp     .LBB4_8
You do not see 832040 anywhere.
I think I can use this mutable pointer trick to replace the volatile solution.
I probably need to write a blog post about this, but volatile, noinline, and doNotOptimizeAway have always felt snake-oily to me when applied to benchmarking. I prefer natural remedies!
The way I solve this class of problems is:
On the input side, make parameters runtime overridable:
const search_count = bench.parameter("search_count", 1_000, 1_000_000);
By default, the search count is one million, a value chosen by the benchmark author to give reasonable results. But you can override it at runtime with the search_count environment variable, which a) makes it easy to tweak the parameter without recompilation and b) naturally prevents the compiler from optimizing the code away (the second argument, 1_000, is what tests use to prevent bitrot). A sketch of such a helper is below.
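Here is a minimal sketch of what such a helper could look like (the parameter function, its signature, and the env-var parsing are my assumptions, not an existing API):

const std = @import("std");
const builtin = @import("builtin");

// Hypothetical bench.parameter: tests get a small fixed value to prevent
// bitrot; otherwise an environment variable overrides the author-chosen
// default at runtime.
fn parameter(name: []const u8, test_value: u64, default: u64) u64 {
    if (builtin.is_test) return test_value;
    // A value read from the environment is only known at runtime, so the
    // compiler cannot fold computations that depend on it.
    // (POSIX-only; Windows would need std.process.getEnvVarOwned.)
    const raw = std.posix.getenv(name) orelse return default;
    return std.fmt.parseInt(u64, raw, 10) catch default;
}

// Usage inside a benchmark function:
// const search_count = parameter("search_count", 1_000, 1_000_000);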
On the output side, print a “hash” of the run. For Fibonacci, it’ll just be the actual result, fib(n)=13. For something like binary search, you can do something like:
var hash: usize = 0;
for (...) {
    hash +%= insertion_index(xs, x); // wrapping add, so overflow can't panic
}
print("hash={}", .{hash});
This prevents the compiler from optimizing the code away, but, more importantly, it prevents the human from optimizing the code away while trying to make it faster and messing up. It’s very helpful to see both that your change made the code 10x faster and that it also changed the hash that was supposed to stay the same.
EDIT: now also available in the form of a blog article: Do Not Optimize Away
I’ve been thinking about this more deeply. Even though the volatile workaround works, I believe it comes with a high cost.
A volatile cast forces a hardware memory read on every access. This means the benchmark measures the implementation plus the additional memory latency.
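For reference, this is the shape of the volatile workaround in question, as a sketch using the fib from the Godbolt snippet above (the function name is mine):

export fn run_with_volatile() u64 {
    var n: u64 = 30;
    // Reading through a *volatile pointer forces an actual memory load,
    // so the optimizer cannot treat the value as a known constant...
    const p: *volatile u64 = &n;
    // ...but the benchmark now also measures the latency of that load.
    return fib(p.*);
}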
In contrast, std.mem.doNotOptimizeAway(&var) uses an inline assembly memory clobber constraint. This effectively blinds the compiler without inserting actual hardware load instructions into the benchmark loop.
.pointer => {
    if (builtin.zig_backend == .stage2_c) {
        doNotOptimizeAwayC(val);
    } else {
        asm volatile (""
            :
            : [_] "m" (val),
            : .{ .memory = true });
    }
},
reference: std.mem.doNotOptimizeAway
Computing a hash of the benchmark result is a really good idea. One could also put the benchmark itself in a separate compilation unit with an exported function, to get some guarantee that the compiler won’t optimize away the inputs.
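A minimal sketch of that setup (file names are illustrative, and it assumes the two objects are linked without LTO, which could otherwise fold across the boundary):

// fib.zig: built separately, e.g. with `zig build-obj fib.zig`
export fn fib(n: u64) u64 {
    if (n == 0) return 0;
    var a: u64 = 0;
    var b: u64 = 1;
    for (2..n + 1) |_| {
        const c = a + b;
        a = b;
        b = c;
    }
    return b;
}

// bench.zig: sees only the extern declaration, so the call site
// cannot be inlined or constant-folded at compile time.
const std = @import("std");

extern fn fib(n: u64) u64;

pub fn main() void {
    const result = fib(30);
    std.debug.print("hash={}\n", .{result});
}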