I’m trying to figure out whether `@mulAdd` is really slow on CPUs without FMA instructions, as the zmath comments claim, so I decided to benchmark it myself (the i5-12500 apparently has no FMA. Update: it does, just not inside a VirtualBox VM, but never mind):
```zig
const std = @import("std");

pub fn main() !void {
    var prng = std.Random.DefaultPrng.init(42); // std.rand.DefaultPrng on older Zig versions
    const random = prng.random();

    const a = @Vector(4, f32){ random.float(f32), random.float(f32), random.float(f32), random.float(f32) };
    const b = @Vector(4, f32){ 5.0, 6.0, 7.0, 8.0 };
    const c = @Vector(4, f32){ 9.0, 10.0, 11.0, 12.0 };
    var d: @Vector(4, f32) = undefined;

    var timer = try std.time.Timer.start();
    const start = timer.read();
    for (0..1_000_000_000) |_|
        d = @mulAdd(@Vector(4, f32), a, b, c);
    const end = timer.read();

    std.debug.print("\n1_000_000_000 iterations of @mulAdd()\n", .{});
    const elapsed_s = @as(f64, @floatFromInt(end - start)) / std.time.ns_per_s;
    std.debug.print("{d} ns, {d:.4} s\n", .{ end - start, elapsed_s });
    std.debug.print("d = @mulAdd(a, b, c) = {d}\n", .{d});
}
```
But I strongly suspect my `for` loop gets optimized away: in ReleaseSafe, for example, I get the following result for 1_000_000_000 iterations:
```
1_000_000_000 iterations of @mulAdd()
67 ns, 0.0000 s
```
And it takes only 4 ns less to run 1_000_000 iterations:
```
1_000_000 iterations of @mulAdd()
63 ns, 0.0000 s
```
Also, 67 nanoseconds to run `@mulAdd` ONE BILLION times? That works out to about 0.07 ns per iteration, well under a single clock cycle, so the loop body can’t actually be executing. Sounds fishy to me.
So how do I make sure the code I want to benchmark is not thrown away, while at the same time not introducing unwanted instructions, e.g. by trying to use `d` in the `for` loop, generating random inputs, or something like that?
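One thing I’ve considered (I’m not sure it’s the right tool, hence the question) is `std.mem.doNotOptimizeAway` together with feeding each result back into the next iteration, so the iterations form a dependency chain the optimizer can’t collapse. A minimal sketch of what I mean, assuming a recent Zig std:

```zig
const std = @import("std");

pub fn main() !void {
    const a = @Vector(4, f32){ 1.0, 2.0, 3.0, 4.0 };
    const b = @Vector(4, f32){ 5.0, 6.0, 7.0, 8.0 };

    // Start from the addend and reuse the previous result as the new
    // addend: each @mulAdd now depends on the one before it.
    var d = @Vector(4, f32){ 9.0, 10.0, 11.0, 12.0 };

    var timer = try std.time.Timer.start();
    for (0..1_000_000_000) |_| {
        d = @mulAdd(@Vector(4, f32), a, b, d);
    }
    // Tell the optimizer the final result is observed, so the whole
    // chain can't be declared dead.
    std.mem.doNotOptimizeAway(d);
    const elapsed = timer.read();

    std.debug.print("{d} ns\n", .{elapsed});
}
```

But I don’t know whether the dependency chain itself skews the measurement (it serializes the FMAs, so it measures latency rather than throughput), which is part of what I’m asking.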
Also, is my `std.time.Timer` approach sound in the first place?