Benchmarking isDigit

AndrewCodeDev · June 6, 2024, 7:32am

Edit: moved from previous thread.

Hey, @endlessly_amused, I’d recommend trying it out on godbolt before going to github. In terms of branches, it’s not doing quite what I’d expect.

First, let’s reduce the function you’ve provided by directly returning the results:

fn isDigit_new(c: u8) bool {
    return switch (c>>3) {
        6 => true,
        7 => c & 6 == 0,
        else => false,
    };
}

Either way, the generated assembly has quite a few jumps/labels (here’s your original version on release fast):

isDigit_new:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 3
        mov     al, dil
        mov     byte ptr [rbp - 3], al
        mov     byte ptr [rbp - 1], al
        shr     al, 3
        mov     byte ptr [rbp - 2], al
        sub     al, 6
        je      .LBB0_2
        jmp     .LBB0_6
.LBB0_6:
        mov     al, byte ptr [rbp - 2]
        sub     al, 7
        je      .LBB0_3
        jmp     .LBB0_1
.LBB0_1:
        xor     eax, eax
        and     al, 1
        movzx   eax, al
        add     rsp, 3
        pop     rbp
        ret
.LBB0_2:
        mov     al, 1
        and     al, 1
        movzx   eax, al
        add     rsp, 3
        pop     rbp
        ret
.LBB0_3:
        mov     al, byte ptr [rbp - 3]
        and     al, 6
        cmp     al, 0
        jne     .LBB0_5
        mov     al, 1
        and     al, 1
        movzx   eax, al
        add     rsp, 3
        pop     rbp
        ret
.LBB0_5:
        xor     eax, eax
        and     al, 1
        movzx   eax, al
        add     rsp, 3
        pop     rbp
        ret

Meanwhile, here’s what’s getting generated for the original version:

isDigit:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 3
        mov     al, dil
        mov     byte ptr [rbp - 2], al
        mov     byte ptr [rbp - 1], al
        jmp     .LBB0_2
.LBB0_1:
        mov     al, byte ptr [rbp - 3]
        and     al, 1
        movzx   eax, al
        add     rsp, 3
        pop     rbp
        ret
.LBB0_2:
        mov     cl, byte ptr [rbp - 2]
        cmp     cl, 48
        setae   al
        cmp     cl, 57
        setbe   cl
        and     al, cl
        test    al, 1
        jne     .LBB0_3
        jmp     .LBB0_4
.LBB0_3:
        mov     al, 1
        mov     byte ptr [rbp - 3], al
        jmp     .LBB0_1
.LBB0_4:
        xor     eax, eax
        mov     byte ptr [rbp - 3], al
        jmp     .LBB0_1

The one in the standard is generating less lables and jumps. Is that really indicative of performance? I’m happy to hear other takes on this, but instruction count does matter for simple things. The one in the standard generates 33 total while the one you’re proposing generates 49… we’d have to look at those extra instructions and the operations being generated to see if they’re worth it.

I’d recommend going one lower than this actually:

export fn isDigit_new(c: u8) bool {
    return ('0' <= c and c <= '9');
}

Gets about 28 instructions

Now, again… is that actually better? I’m just counting instructions and paying attention to the number of labels and jumps but eh… probably not that big of a difference.

Edit: See @Eisenhauer’s answer for the results on ReleaseFast: I am a noob. How do I submit new code to the standard library of zig? :) - #5

pierrelgol · June 6, 2024, 8:31am

The thing with compilers is that they are really smart, we are way past the days where you had to do some bitwise magic everywhere to get the proper assembly generation. It’s nice that you gave it a shot, but truth is, when it comes to good code gen, a good rule of thumb is to keep it as simple as possible, many compilers nowadays are shaped in a way to optimize the common case.

for example in C I read a 15yo blog post explaining how to write more optimized loops. etc. This was the sort of example he gave, the logic was that not using a loop counter, was more optimized because you just needed to increment the pointer (address + sizeof(type)) instead of doing the (address + sizeof(type) * counter).

#include <stdint.h>

int basic_strlen(const char *const string)
{
    int i;

    if (string == (void*)0)
        return (0);
    for (i = 0; string[i]; i++);
    return (i);
}

int optimized_strlen(const char *string)
{
    const char *ptr = string;

    if (string == (void*)0)
        return (0);
    while (*ptr)
        ++ptr;
    return ptr - string;
}

int main(void)
{
    char *string = "This is a string";
    volatile int len1 = basic_strlen(string);
    volatile int len2 = optimized_strlen(string);
}

This was supposed to lead to better code generation. Now if you go on godbolt and look at the generated assembly for each functions, you are going to see that it’s not the case, in -O3 for example the compiler is able to recognize that the first loop is a strlen, and will straight up call it. This is because compilers are made to optimize the general case, and most people in C use the same for loops.

All of that just to say that keeping the code simple, is often the best course of action to make something fast. Of course there is some more knowledgeable people that can still outperform their compilers, but more often than not it’s easier to just be the compiler best friend and try to keep everything simple for him.

EDIT : if you want to learn more about this stuff, I’d advice you watch this talk, it’s a really cool dive in how compiler optimization really works : Understanding Compiler Optimization

Eisenhauer · June 6, 2024, 9:02am

Compiler Explorer might have silently ignored your intentions.

The usual -Doptimize=ReleaseFast known from invocations of zig build is not what you want on Compiler Explorer. There you have to use -O ReleaseFast.

Actually using ReleaseFast gets all simple examples down to 4 assembly instructions:

        add     dil, -48
        cmp     dil, 10
        setb    al
        ret

FObersteiner · June 6, 2024, 9:03am

Here’s another example I stumbled across: Compiler Explorer

That’s two implementations to test if a year is a leap year in the Proleptic Gregorian calendar. The first one uses some fancy bit-twiddling, the second one is from the Zig std lib. The interesting thing is that the first implementation generates less assembly / labels overall, BUT: in real life, most years aren’t leap years, so the std lib implementation will be better in terms of performance on average since it can return after 3 instructions (vs. 8) - (I benchmarked this…).

Long story short: measure, don’t guess.

AndrewCodeDev · June 6, 2024, 9:05am

Thanks for the correction - it was looking extremely verbose so I’ll just edit my answer and link to your answer for the release fast version

dude_the_builder · June 6, 2024, 12:06pm

isDigit1 is the std version. This just calls the function 1 billion times. (time in ms) M1 Mac Mini.

❯ zb
❯ ./zig-out/bin/main
isDigit1: result: true, took: 2853.338041
isDigit2: result: true, took: 2189.926667
❯ zb -Doptimize=ReleaseFast
❯ ./zig-out/bin/main
isDigit1: result: true, took: 0.000167
isDigit2: result: true, took: 0.073625
❯ zb -Doptimize=ReleaseSafe
❯ ./zig-out/bin/main
isDigit1: result: true, took: 0.000083
isDigit2: result: true, took: 0.094708
❯ zb -Doptimize=ReleaseSmall
❯ ./zig-out/bin/main
isDigit1: result: true, took: 0.000208
isDigit2: result: true, took: 0.040375

Edit: Just noticed that ReleaseSafe is faster than ReleaseFast in this case. Interesting…

dimdin · June 6, 2024, 12:50pm

All the releases generate the same code. The differences must be time spend in the benchmark code. How do you measure the elapsed time?

I am seeking for a good way to benchmark.
My last benchmark code runs for at least 10K operations and at least for 10ms, and displays nanoseconds per operation:

    var n: usize = 0;
    var total_ns: usize = 0;
    var timer = try Timer.start();
    while (n < 10240 and total_ns < 10 * std.time.ns_per_ms) {
        n += 1;
        timer.reset();
        ...
        total_ns += timer.read();
    }
    std.debug.print("{d} ns/op", .{ total_ns / n });

dude_the_builder · June 6, 2024, 1:12pm

Yes, when I run it several itmes in the different release modes, timings change slightly and ReleaseFast is consistently the fastest. Here’s the heart of the benchmark code:

    var timer = try std.time.Timer.start();

    for (0..1_000_000_000) |_| {
        result = isDigit1(x);
    }

    std.debug.print("isDigit1: result: {}, took: {d}\n", .{ 
        result, 
        @as(f64, @floatFromInt(timer.lap())) / 1_000_000.0,
    });

FObersteiner · June 6, 2024, 1:30pm

you might need to call std.mem.doNotOptimizeAway(result), to prevent the compiler from removing the loop since it’s a no-op after all.

dude_the_builder · June 6, 2024, 1:38pm

Just tried it, same results. I think I achieved the same by making x a var:

var x: u8 = '3';
_ = &x;

So I’m guessing thatr since the compiler can’t assume that x will never change, it can’t optimize away result = isDigit1(x).

FObersteiner · June 6, 2024, 1:44pm

Another thing to consider is that you’re benchmarking something that presumably runs in a couple of nanoseconds. On the other hand, time.Timer has to use the timer the OS gives it, which is not exactly optimized for micro-benchmarks.
My experience with this is, if you want reliable results, your total runtime of the benchmark should at least reach the milliseconds range. That way, statistics hopefully even out the inaccuracies that the OS timer brings with it.

nyc · June 6, 2024, 9:20pm

export fn is_digit(x: u8) bool {
    return (x -% '0') <= 9;
}

produces

is_digit:
        add     dil, -48
        cmp     dil, 10
        setb    al
        ret

why such complexities? What am I missing?

AndrewCodeDev · June 6, 2024, 9:34pm

Edit: thread was successfully split.

nyc · June 6, 2024, 9:54pm

unread values are easily optmized away, and _ doesn’t prevent that. std.mem.doNotOptimize doesn’t prevent constant folding either, so you need to:

send the value to an asm block (doesn’t have to do anything, just have the value an in input since asm is basically a black box.
write it to a volatile.

The same applies to reading values. they have to either come from a volatile or from an asm block. even then it can be very dificult to prevent optimizations you dont want from ruining tghe results.

The compiler will constant fold the entire function away if you are not careful. even though noinline functions (it will emit the function body then proceed to evaluate the call site away)

I have a micro-bench mark harness that produces output tables and gnuplot results if wanted that does a lot of this for you. also there’s mutliple ways to time and it uses rdtsc if available instead of clock_gettime.

part of it is on github (but i dont think the gnuplot and rdtsc code is checkin right now) as jnordwick/znh if you want to see the tricks you kind of have to do to get the compiler to behave.

dude_the_builder · June 6, 2024, 10:41pm

Thanks, I’ll check it out. Without changing my benchmark and adding your version as isDigit3, its slow in Debug but in 2nd place in the other release modes behind std:

❯ zb
❯ ./zig-out/bin/main
isDigit1: result: true, took: 2854.131541
isDigit2: result: true, took: 2188.585125
isDigit3: result: true, took: 3446.471959
❯ zb -Doptimize=ReleaseFast
❯ ./zig-out/bin/main
isDigit1: result: true, took: 0.000166
isDigit2: result: true, took: 0.068334
isDigit3: result: true, took: 0.01025
❯ zb -Doptimize=ReleaseSafe
❯ ./zig-out/bin/main
isDigit1: result: true, took: 0.000125
isDigit2: result: true, took: 0.176959
isDigit3: result: true, took: 0.012125
❯ zb -Doptimize=ReleaseSmall
❯ ./zig-out/bin/main
isDigit1: result: true, took: 0.000291
isDigit2: result: true, took: 0.031209
isDigit3: result: true, took: 0.007166

dude_the_builder · June 6, 2024, 10:51pm

@nyc , you’re right; I made volatile pointers to x and result and used those instead and got very different results. Your version is fastest in release modes but slowest in debug:

❯ zb
❯ ./zig-out/bin/main
isDigit1: result: true, took: 2861.885708
isDigit2: result: true, took: 2188.977459
isDigit3: result: true, took: 2988.589791
❯ zb -Doptimize=ReleaseFast
❯ ./zig-out/bin/main
isDigit1: result: true, took: 386.18725
isDigit2: result: true, took: 742.757
isDigit3: result: true, took: 343.441708
❯ zb -Doptimize=ReleaseSafe
❯ ./zig-out/bin/main
isDigit1: result: true, took: 387.953041
isDigit2: result: true, took: 788.654709
isDigit3: result: true, took: 345.536125
❯ zb -Doptimize=ReleaseSmall
❯ ./zig-out/bin/main
isDigit1: result: true, took: 384.575584
isDigit2: result: true, took: 729.611791
isDigit3: result: true, took: 342.576417

andrewrk · June 6, 2024, 11:27pm

Nobody should be using volatile for anything except memory-mapped I/O.

This is in the language reference:

Note that volatile is unrelated to concurrency and Atomics. If you see code that is using volatile for something other than Memory Mapped Input/Output, it is probably a bug.

Volatile is unrelated to benchmarking an “is digit” function. There is already enough confusion in the wild about volatile. Please don’t spread it among the Zig community.

I’ll say it again: do not use volatile for anything except MMIO.

nyc · June 6, 2024, 11:30pm

Its the saturating minus - there are multiple versions that all compiled to the same release fast or small code, but they differ in the debug implementations. This should be the fastest for both:

export fn is_digit_bit(x: u8) bool {
    return (@intFromBool(x >= '0') & @intFromBool(x <= '9')) == 1;
}

note: using an and compiles to the same release code too. I have four versions that all compile to the same release code, but vary in their debug code gen.

nyc · June 6, 2024, 11:38pm

Its just being used to force a load and precent constant folding or the dead store from being optimized away - nothing to do with atomics or concurrency. That or an asm block is needed - pretty common in other languages.

_ and doNotOptimize didn’t prevent that in 0.11/12 era. Is there a way I should be doing it?

In the benchmarker I use an empty voltile asm block. Is there a different way I should be doing it?

andrewrk · June 6, 2024, 11:40pm

Please notice that the implementation of std.mem.doNotOptimizeAway uses an asm block, contrary to your comment above.