How to benchmark @mulAdd, or anything else

I’m trying to figure out whether @mulAdd is really slow on CPUs without FMA instructions, as the comments in zmath claim, so I decided to benchmark it myself (the i5-12500 apparently has no FMA. Update: it actually does, but the instruction isn’t available inside a VirtualBox VM; never mind):

    const a = @Vector(4, f32){ random.float(f32), random.float(f32), random.float(f32), random.float(f32) };
    const b = @Vector(4, f32){ 5.0, 6.0, 7.0, 8.0 };
    const c = @Vector(4, f32){ 9.0, 10.0, 11.0, 12.0 };

    var d: @Vector(4, f32) = undefined;
    var timer = try std.time.Timer.start();
    var start = timer.read();

    for (0..1_000_000_000) |_|
        d = @mulAdd(@Vector(4, f32), a, b, c);

    var end = timer.read();
    std.debug.print("\n1_000_000_000 iterations of @mulAdd()\n", .{});
    var elapsed_s = @as(f64, @floatFromInt(end - start)) / std.time.ns_per_s;
    std.debug.print("{d} ns, {d:.4} s\n", .{end - start, elapsed_s});

    std.debug.print("d = @mulAdd(a, b, c) = {d}\n", .{d});

But I strongly suspect my for loop gets optimized away, because, for example, in ReleaseSafe I get the following result for 1_000_000_000 iterations:

1_000_000_000 iterations of @mulAdd()
67 ns, 0.0000 s

And it takes only 4 ns less to run 1_000_000 iterations:

1_000_000 iterations of @mulAdd()
63 ns, 0.0000 s

Also, 67 nanoseconds to run @mulAdd ONE BILLION times? I dunno, sounds fishy to me.

So how do I make sure the code I want to benchmark is not thrown away, while at the same time not introducing unwanted instructions, e.g. by trying to use d in the for loop, generating random inputs, or something like that?

Also, is my std.time.Timer approach sound in the first place?

Try putting the upper limit in a var variable.
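I take this to mean making the iteration count a runtime value rather than a comptime-known literal, so the compiler can’t fold the trip count away, something like this (the variable name is just an example):

var iterations: usize = 1_000_000_000;
_ = &iterations; // keep it a runtime var even though it is never mutated

for (0..iterations) |_|
    d = @mulAdd(@Vector(4, f32), a, b, c);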

Tried it, doesn’t look like it helps:

1000000 iterations of @mulAdd()
61 ns, 0.0000 s
. . .
1000000000 iterations of @mulAdd()
56 ns, 0.0000 s

Meanwhile, I took another look at zmath’s own benchmarks and modified my for loops:

for (0..1_000_000_000) |_| {
    d = @mulAdd(@Vector(4, f32), a, b, c);
    std.mem.doNotOptimizeAway(&d);
}
. . .
for (0..1_000_000_000) |_| {
    e = a * b + c;
    std.mem.doNotOptimizeAway(&e);
}

Now my results do look like they could represent reality:

1000000000 iterations of @mulAdd()
692039805 ns, 0.6920 s
. . .
1000000 iterations of @mulAdd()
468167 ns, 0.0005 s

But I still don’t know if adding std.mem.doNotOptimizeAway to the code I’m trying to benchmark means I’m benchmarking more than just @mulAdd / a * b + c.
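One thing I can think of (just a sketch of my own, not something from zmath) is to also time an “empty” loop whose body is only the doNotOptimizeAway call, to see how much of the total is loop plus anti-optimization overhead rather than the operation itself:

var sink: @Vector(4, f32) = undefined;
var baseline_timer = try std.time.Timer.start();

for (0..1_000_000_000) |_| {
    std.mem.doNotOptimizeAway(&sink); // same call as above, but no @mulAdd in the body
}

const baseline_ns = baseline_timer.read();
std.debug.print("empty loop baseline: {d} ns\n", .{baseline_ns});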

BTW, it does look like @mulAdd is about as fast as a * b + c without FMA instructions:

Target CPU: cpu_arch = target.Target.Cpu.Arch.x86_64, has_avx = true, has_avx512f = false, has_fma = false

(1)
1000000000 iterations of @mulAdd()
682609462 ns, 0.6826 s

1000000000 iterations of a * b + c
676349734 ns, 0.6763 s

(2)
1000000000 iterations of @mulAdd()
461343973 ns, 0.4613 s

1000000000 iterations of a * b + c
468194431 ns, 0.4682 s

(3)
1000000000 iterations of @mulAdd()
680894601 ns, 0.6809 s

1000000000 iterations of a * b + c
674716668 ns, 0.6747 s

(4)
1000000000 iterations of @mulAdd()
471182902 ns, 0.4712 s

1000000000 iterations of a * b + c
469416114 ns, 0.4694 s

(5)
1000000000 iterations of @mulAdd()
484429649 ns, 0.4844 s

1000000000 iterations of a * b + c
489647188 ns, 0.4896 s

So that comment in zmath’s code must be obsolete, or just plain wrong - provided my benchmarks are not horribly messed up.

I was wondering recently how comparing strings would benefit from SIMD versus normal looping through the bytes. I followed your lead and wrote this:

const re = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz";
const corpus = "abcdefghijklmnopqrstuvwxyz0123456789abcdefghijklmnopqrstuvwxyz0123456789abcdefghijklmnopqrstuvwxyz0123456789";

// SIMD ==
const re_vec: @Vector(re.len, u8) = re.*;
const corpus_vec: @Vector(re.len, u8) = corpus[0..re.len].*;

var timer = try std.time.Timer.start();
var start = timer.read();
var result: bool = undefined;

for (0..1_000_000_000) |_| {
    const truth = re_vec == corpus_vec;
    result = @reduce(.And, truth);
    std.mem.doNotOptimizeAway(&result);
}

var end = timer.read();
var elapsed: f64 = @floatFromInt(end - start);

std.debug.print("\n1_000_000_000 iterations of Vector ==\n", .{});
std.debug.print("{d:.1} ns, {}\n", .{ elapsed, result });

// std.mem.startsWith (uses loop)
timer.reset();
start = timer.read();

for (0..1_000_000_000) |_| {
    result = std.mem.startsWith(u8, corpus, re);
    std.mem.doNotOptimizeAway(&result);
}

end = timer.read();
elapsed = @floatFromInt(end - start);

std.debug.print("\n1_000_000_000 iterations of std.mem.startsWith\n", .{});
std.debug.print("{d:.1} ns, {}\n", .{ elapsed, result });

But I get pretty much the same results for both SIMD Vector == and std.mem.startsWith, so maybe there’s not much benefit for this operation? Or maybe the sample size is too small.

I think there is some confusion here about AVX and the length argument (I made this same mistake and I think we need to have a better understanding of what this parameter even means).

So for AVX 512, if you google the definition, you’ll get this:

AVX-512 is a set of 512-bit SIMD extensions that allow programs to pack
sixteen single-precision or eight double-precision floating-point numbers,
or eight 64-bit or sixteen 32-bit integers within 512-bit vectors.

So if I am adding up f64 values with AVX-512, I can do eight of them at a time; with f32, sixteen at a time. So there has to be logic behind @Vector that handles lengths beyond what the hardware can do in one instruction. I think the API doesn’t make this clear.

So for instance, if I made a @Vector(8, f64), that should be able to do one AVX instruction to add them all up. What does @Vector(16, f64) mean then? It’s twice as many, so are we doing two operations? I need to look at the disassembly more carefully.
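A minimal way to do that (the file and function names here are arbitrary) is to put the operation in an exported function, build just that object, and disassemble it:

// vec.zig - build with `zig build-obj -O ReleaseFast vec.zig`,
// then run `objdump -d vec.o` and check whether the adds come out as packed
// vector instructions (e.g. vaddpd) or as a scalar loop.
export fn addVec16(a: *const [16]f64, b: *const [16]f64, out: *[16]f64) void {
    const va: @Vector(16, f64) = a.*;
    const vb: @Vector(16, f64) = b.*;
    out.* = va + vb;
}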

– edit –

From Zig’s documentation:

“Operations on vectors shorter than the target machine’s native SIMD size will typically compile to single SIMD instructions, while vectors longer than the target machine’s native SIMD size will compile to multiple SIMD instructions. If a given operation doesn’t have SIMD support on the target architecture, the compiler will default to operating on each vector element one at a time. Zig supports any comptime-known vector length up to 2^32-1, although small powers of two (2-64) are most typical. Note that excessively long vector lengths (e.g. 2^20) may result in compiler crashes on current versions of Zig.”

So make sure you actually have SIMD working, otherwise it’ll just be a loop; we could be pessimizing down to scalar code depending on the platform.
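One way to check what the build target actually has (a sketch assuming an x86-64 target; the feature names come from std.Target.x86, and other architectures have their own namespaces):

const std = @import("std");
const builtin = @import("builtin");

// Comptime-known facts about the CPU the compiler is building for.
const has_avx = std.Target.x86.featureSetHas(builtin.cpu.features, .avx);
const has_fma = std.Target.x86.featureSetHas(builtin.cpu.features, .fma);

pub fn main() void {
    std.debug.print("has_avx = {}, has_fma = {}\n", .{ has_avx, has_fma });
}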


Try changing the number of iterations, say, to 1_000_000.
If end - start does not change significantly, I think that would be an indication that what you’re measuring gets optimized away.

Which actually makes sense, considering your strings are comptime-known.

Before doing the benchmark, I had these two alternative methods (Vector == vs startsWith) behind this if to make sure I was using SIMD when I thought I was:

if (std.simd.suggestVectorSize(u8)) |size| {
    if (size >= re.len) {
        // use Vector ==
    } else {
        // needle longer than the suggested vector size, use startsWith
    }
} else {
    // no SIMD suggestion for this target, use startsWith
}

and indeed it was (I’m on a Mac M1). But even so, the difference between the two methods is negligible (usually less than 10 ns, and startsWith is always faster).


Yeah initially I used std.mem.doNotOptimizeAway like you did, but I removed it to see if it made a difference and the results were the same, so it seems this is not being optimized away.

– EDIT: Wait, my bad. I just added doNotOptimizeAway back in to make sure, and now the results went from ~84 ns to 469289875 ns, so there is a big difference; the previous version probably was being optimized away. Interestingly, startsWith is still faster, now by ~30,000 ns. (Updated the code sample above.)

Ok, so I significantly increased the re and corpus sizes:

const re =
    \\abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
    \\abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
    \\abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
    \\abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
;
const corpus =
    \\abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
    \\abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
    \\abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
    \\abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
    \\0123456789
;

And I had to lower the iteration count to 10 million. Now SIMD Vector == blazes through at 13 ms vs startsWith taking 1726 ms!


Hmm… I wonder if comparing random, comptime-unknown strings would be different.

Yep, I modified the benchmark to use std.rand.DefaultPrng to generate n random bytes, and the results are the same. I started testing different values for n (the total number of bytes to compare), and the cutoff point where startsWith stops being faster than Vector == is 64 bytes. From 64 bytes upward, startsWith gets slower as the number of bytes to compare grows, whereas Vector == stays roughly constant at around 15 ms (at least on my machine).
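This is roughly the setup I mean (a sketch; the sizes and the seed are arbitrary): fill the needle and haystack with random bytes at runtime so their contents can’t be comptime-known.

var prng = std.rand.DefaultPrng.init(0x1234);
const random = prng.random();

var needle: [64]u8 = undefined;
var haystack: [1024]u8 = undefined;
random.bytes(&needle);
random.bytes(&haystack);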


Yeah, and your results actually make sense. There is some setup that has to be done to declare the vectors (SIMD is famously full of boilerplate), so in the time it takes to set up, the plain loop may have already finished. It’s analogous to binary search - faster in theory, but with enough machinery to make it slower than linear search for small sizes.


I remember seeing an example of exactly what you mention about binary search in Go’s implementation of Unicode functions. If I remember correctly, they have several instances where they check if a search is below 15 items, and if so they go linear, else they do the binary search. I suspect there’s a lot of benchmarking and tweaking involved in the process of determining these kinds of thresholds.
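Translated to the string comparison above, a hypothetical version of that idea could look something like this (the function name is made up, and the cutoff of 64 is only taken from the benchmark numbers mentioned earlier; it would need re-tuning per machine):

const std = @import("std");

// Below the cutoff, use the plain byte loop; at or above it, do one vector compare.
const simd_cutoff = 64;

fn startsWithHybrid(comptime n: usize, haystack: []const u8, needle: *const [n]u8) bool {
    if (haystack.len < n) return false;
    if (n < simd_cutoff) return std.mem.startsWith(u8, haystack, needle);
    const hv: @Vector(n, u8) = haystack[0..n].*;
    const nv: @Vector(n, u8) = needle.*;
    return @reduce(.And, hv == nv);
}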


Which is exactly why I want to be able to benchmark my own code, and to be able to trust the results.

How to benchmark @mulAdd, or anything else

There’s no single universal approach to benchmarking; micro, macro, and systems benchmarks vary greatly in technique. If we go down to the super-micro level, where the thing we are interested in totals several assembly instructions, I wouldn’t even want to measure its runtime. Rather, I’d look at the raw assembly and try to understand whether it is what I want to see there.

One interesting tool in this space is llvm-mca:

https://llvm.org/docs/CommandGuide/llvm-mca.html

This is a CPU simulator, which can tell you how the assembly would behave on various CPUs.
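For example (the file names and the -mcpu value are only placeholders), you can have Zig emit the assembly for a small exported function and feed it to the simulator:

// mul_add.zig - one possible workflow, assuming an llvm-mca binary is installed:
//   zig build-obj -O ReleaseFast -femit-asm=mul_add.s mul_add.zig
//   llvm-mca -mcpu=alderlake mul_add.s
export fn mulAdd4(a: *const [4]f32, b: *const [4]f32, c: *const [4]f32, out: *[4]f32) void {
    const va: @Vector(4, f32) = a.*;
    const vb: @Vector(4, f32) = b.*;
    const vc: @Vector(4, f32) = c.*;
    out.* = @mulAdd(@Vector(4, f32), va, vb, vc);
}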


Thank you for the link.

Of course, environment plays a huge role, and benchmarks are (almost?) always artificial to a degree. However, I still think they can play a part when comparing several approaches to a task. The goal, of course, is to give users the best performance possible.

Looking at assembly instructions generated by the compiler is a valid approach, but I don’t see how it conflicts with benchmarking. I, for one, learned x86 assembly way back, and I can certainly say my assembly knowledge is rusty.