Benchmarking isDigit

It is the constant folding that becomes a problem, especially on small microbenchmarks. That is where the volatile pointers come in: the compiler can't propagate the static value through the call.

I can’t seem to get doNotOptimizeAway to work:

The noinline functions aren't generated, and you can see that main just does some pointer shuffling and doesn't call into anything (edit: made a dumb mistake passing in 1 instead of '1', but same result. The functions are just constant folded away. Updated godbolt link.)

The DNOA function takes a copy of the value and sends that into the asm block, not the value itself, but maybe that only keeps the pointer from being optimized away? The workload seems gone.

Ok, so this simplified version is working with doNotOptimizeAway. At first I had the call to DNOA right after declaring result, outside the loop, and it didn't work. Putting it inside the loop made it work:

const std = @import("std");

fn withDnoa() !void {
    var result: bool = false;
    var timer = try std.time.Timer.start();

    for (0..1_000_000_000) |_| {
        result = std.ascii.isDigit('3');
        std.mem.doNotOptimizeAway(result);
    }

    std.debug.print("withDnoa: result: {}, took: {}\n", .{
        result,
        timer.read(),
    });
}

It appears to be keeping the loop, but optimizing the actual workload away. I had to make some changes to get it running on godbolt with a main, but here it is:

opr:
        mov     eax, 1000000000
        mov     cl, 1
.LBB2_1:
        dec     rax
        jne     .LBB2_1
        mov     al, 1
        ret

But it is still just constant folding the function away because it knows the static input.

There are two parts to it: stop the output from being thrown away (the copy sent to DNOA should work, but I'm not sure it does in all cases), and prevent the function from being constant folded so part of the workload isn't done at compile time. I don't see how DNOA prevents that second part.
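For example, something like this sketch (bothParts and input are just names I made up) should cover both halves: a volatile load so the input isn't a compile-time constant, and doNotOptimizeAway so the output isn't discarded:

const std = @import("std");

// A file-scope variable so '3' isn't a literal at the call site.
var input: u8 = '3';

fn bothParts() !void {
    // Part 2: every read through a *volatile pointer is a side effect,
    // so the optimizer can't substitute '3' and fold isDigit away.
    const p: *volatile u8 = &input;
    var timer = try std.time.Timer.start();
    for (0..1_000_000_000) |_| {
        const result = std.ascii.isDigit(p.*);
        // Part 1: make the optimizer treat the result as used.
        std.mem.doNotOptimizeAway(result);
    }
    std.debug.print("bothParts took: {}\n", .{timer.read()});
}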

As a preamble, I want to make it clear that I know that Zig is not C, nor is it beholden to C’s semantics.

That said, some quotes from the C standard:

An access to an object through the use of an lvalue of volatile-qualified type is a volatile access. A volatile access to an object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment.

next:

In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or through volatile access to an object).

and finally:

Volatile accesses to objects are evaluated strictly according to the rules of the abstract machine.

Accordingly, using volatile in C to prevent optimization from interfering with benchmarks is fairly commonplace. Naively, I would have done the same.

The Zig language reference on volatile likewise treats loads and stores through a pointer so qualified as side effects.

It also says that volatile is unrelated to concurrency or atomics, and that use for anything other than MMIO is probably a bug.

It's said that every safety regulation is written in blood, and I'm sure there are good reasons to take pains to make this point. There's a good article about ways to screw yourself over with volatile, which I managed to find again; it is especially acute about using volatile to try and order atomic operations, which just won't work. One of its other points is that volatile is buggy on some older compilers when used to prevent an optimization. But not LLVM.

So the questions which prompted this: is volatile in Zig meant to deviate from what it means in C (and C++, if it matters)? Does it currently? Is the wording meant to retain the option to make it differ?

Relatedly, is your intention with this warning to steer users toward doNotOptimizeAway for this sort of purpose? Should benchmarks which get deoptimized using volatile, but not with doNotOptimizeAway, be considered a bug in the latter?

Because the “probably” in the statement about volatile probably being a bug if it isn’t MMIO is doing some heavy lifting. If the goal is to deoptimize a microbenchmark, and volatile succeeds in that goal, then the program is performing to requirements.

2 Likes

That looks like a version of the Cassio Neri paper (I did a Zig implementation about a year ago). The original paper uses 100 instead of 25, I believe, and it gets its speedup because the branch prediction on it is much better than on the mod 4.

I'm not sure if the static mod 25 is enough faster than the mod 100 (the compiler is going to turn it into a reciprocal multiplication) to outweigh the more easily predicted branch. (The 100 will be mispredicted 4 times less often than the 25, obviously.) That sounds like an empirical question.

Weird. For what it's worth, here's the whole program. It's now showing all versions pretty much neck and neck, about 40ms faster than std on my machine (ReleaseFast).

const std = @import("std");

fn isDigit1(c: u8) bool {
    return switch (c) {
        '0'...'9' => true,
        else => false,
    };
}

fn isDigit2(c: u8) bool {
    return switch (c >> 3) {
        6 => true,
        7 => c & 6 == 0,
        else => false,
    };
}

fn isDigit3(x: u8) bool {
    return x -% '0' <= 9;
}

fn isDigit4(x: u8) bool {
    return '0' <= x and x <= '9';
}

const digits = init: {
    var a = [_]bool{false} ** 256;
    for (0..256) |i| {
        if (48 <= i and i <= 57) a[i] = true;
    }
    break :init a;
};

fn isDigit5(x: u8) bool {
    return digits[x];
}

fn isDigit6(x: u8) bool {
    return (@intFromBool(x >= '0') & @intFromBool(x <= '9')) == 1;
}

fn bench(name: []const u8, comptime func: fn (u8) bool) !void {
    var result: bool = false;
    var timer = try std.time.Timer.start();

    for (0..1_000_000_000) |_| {
        result = func('3');
        std.mem.doNotOptimizeAway(result);
    }

    std.debug.print("{s}: result: {}, took: {d}\n", .{
        name,
        result,
        @as(f64, @floatFromInt(timer.lap())) / 1_000_000.0,
    });
}

pub fn main() !void {
    try bench("isDigit1", isDigit1);
    try bench("isDigit2", isDigit2);
    try bench("isDigit3", isDigit3);
    try bench("isDigit4", isDigit4);
    try bench("isDigit5", isDigit5);
    try bench("isDigit6", isDigit6);
}
1 Like

In ReleaseFast those are all the same machine code. Any differences will be from loop body alignment or getting context switched out. Sometimes the compiler will align one loop and not another, and when you move the loops around it will change the results. lol. Sucks when you're like "oh my god this is so much faster" and then you switch the loop order and realize it's just the order they were compiled in.

In gcc you prevent this by annotating the loop with an align directive to force the loop's instruction code to sit on a 32-byte boundary. I don't think there is a way to do that in Zig.

THE MORAL OF THE STORY: micro-benchmarking is really fucking hard.

3 Likes

What about filling some buffer with randomized memory via a random number generator before starting the test, and then actually iterating over that and xor-ing it (or whatever) into the inputs of the actual benchmarked calculations (for whatever is being tested)?

The reasoning being that it is unlikely the compiler will reverse engineer a pseudo-random number generator with a dynamic seed that is actually fed into the inputs in some manner.

I guess ideally there would be easy ways to benchmark without jumping through hoops, but making inputs as dynamic as possible seems like a way to prevent the compiler from pre-computing.
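Something like this sketch, maybe (assuming a std where the default PRNG is std.Random.DefaultPrng; the buffer size and iteration count are arbitrary):

const std = @import("std");

pub fn main() !void {
    // Seed from the clock so the values aren't compile-time known; print the
    // seed so a slow run can be reproduced later.
    const seed: u64 = @intCast(std.time.timestamp());
    std.debug.print("seed: {}\n", .{seed});
    var prng = std.Random.DefaultPrng.init(seed);

    var inputs: [4096]u8 = undefined;
    prng.random().bytes(&inputs);

    var hits: u64 = 0;
    var timer = try std.time.Timer.start();
    for (0..250_000) |_| {
        for (inputs) |c| {
            hits += @intFromBool(std.ascii.isDigit(c));
        }
    }
    std.mem.doNotOptimizeAway(hits);
    std.debug.print("hits: {}, took: {}\n", .{ hits, timer.read() });
}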

Pregenerating a set of data is really useful when you have a range of values you want to test against (e.g., days_in_month: there are 4,800 months in a leap year cycle, so you can generate Jan to Dec for a 400-year period and it will have an even balance, then loop over that a bunch. Very useful, since you want to make sure every algo you are trialing gets the same input with an even distribution of inputs).
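E.g., a quick sketch (the names and the 2000-2399 window are made up):

const std = @import("std");

const Input = struct { year: u16, month: u8 };

// One full 400-year Gregorian cycle: 400 * 12 = 4800 (year, month) pairs.
var inputs: [400 * 12]Input = undefined;

fn buildInputs() void {
    var i: usize = 0;
    var year: u16 = 2000;
    while (year < 2400) : (year += 1) {
        var month: u8 = 1;
        while (month <= 12) : (month += 1) {
            inputs[i] = .{ .year = year, .month = month };
            i += 1;
        }
    }
}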

However, then you are possibly introducing some cache issues: you need to worry about priming the data cache, and then hope you aren't being limited by cache memory bandwidth or winding up timing the L2 prefetcher by accident. It introduces other variables that can make the results sus.

But does work if you’re careful.

But you would need to save the seed and input it next time somehow, or run-to-run results can have different workloads, and if you hit a slowness you'd have to have a way to regenerate your inputs so you could try and fix the issue.

It just introduces more issues, and volatile or volatile asm are easier to deal with than cache issues.

2 Likes

Yes, that’s the Cassio-Neri variant; in their talk they presented

fn isLeapYear(year: u16) bool {
    const d: u16 = if (@mod(year, 100) != 0) 4 else 16;
    return (@mod(year, d) == 0);
}

which compiles to the same assembly I think - not sure what gcc does though. Their algorithms to calculate (date ↔ rd) are in fact faster than the C++ date / Howard Hinnant version according to my benchmarks. There was a bit of discussion about this over here; Travis Staloch also has a Zig implementation of the Cassio-Neri algos on his github.

1 Like

I took the time to read DNOA, and here is what it expects:

result doesn't need to escape the loop; it just needs to survive inlining enough to be sent through the asm block. And the arg needs to go through it too, so you take its address and send that through the asm block as well.

Because asm blocks are black boxes, the compiler can't constant fold the x, since the block could be doing anything. (It just needs to not be able to see into the DNOA function; e.g., an extern function whose body it doesn't know should work too.)

    for (0..1_000_000_000) |_| {
        var x: u8 = '3';
        // The black box may have written to x, so '3' can't be folded through.
        std.mem.doNotOptimizeAway(&x);
        const result = std.ascii.isDigit(x);
        // The black box may read result, so the call can't be discarded.
        std.mem.doNotOptimizeAway(result);
    }
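To make the black-box point concrete, a hand-rolled equivalent could look something like this (just a sketch, not the actual std.mem.doNotOptimizeAway implementation):

// An empty asm block that claims to read `value`; since the optimizer can't
// see inside it, it must materialize the value, and the "memory" clobber
// makes it assume any memory the block can reach may have changed.
fn blackBox(value: anytype) void {
    asm volatile (""
        :
        : [value] "r" (value),
        : "memory"
    );
}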

Somebody should document that function, because the usage isn't obvious from its call signature, and when it doesn't work, people will just write their own instead of reading the std code every time.

I'll change my benchmarker to remove the volatiles and see if this always works. Defeating compiler optimizations is often a real pain in the ass, so you just tend to write your own, because at least then you know what to fix.

2 Likes

The other ironic thing is that even though the std lib implementation of isLeapYear might execute fewer instructions on average, it might actually make the program run slower by virtue of decreasing instruction cache locality (since it takes up more space in the icache). Just another example of your point "measure, don't guess". And I would amend that with "measure real code, not just benchmarks". Benchmarks often aren't able to emulate factors that can have the biggest impact on the overall performance of real software, such as cache locality. This is where profilers really come into their own.

My strategy has become: prioritize simplicity over "theoretical performance". Once your program is running, profile your code and only add complexity in the name of performance if the profiler demonstrates actual gains. This strategy results in simpler code, more productivity, and faster programs. It's one of those things I wish I had known when I was a much younger programmer.

9 Likes

I couldn’t agree more @marler8997. I look at my work as iterations of the well-known mantra make it work, make it right, make it fast. In the “work” phase I will hack stuff together until it looks close to what I need; in the “right” phase, I convert the code into a fully functioning implementation, with clean interfaces and design – I highly value simplicity and readability; only at the “fast” phase, if it is really necessary, will I profile and optimize the code, perhaps sacrificing some of the simplicity.

5 Likes

That's kind of a simplistic notion that might hold for 80% of code, but it doesn't hold in all fields, standard library code being one of them.

Yes, microbenchmarks are always sus. The code around a call after inlining can make an enormous difference to the performance characteristics. But while standard library code has to strike a proper balance between simple code, general usability, and performance, you don't want to leave any performance gains on the floor, because that limits the breadth of areas it is usable in and can lead to flat profiles where every function is slightly slow and there are no good wins anywhere, trapping people in local optimums. I'd even rather have a code base that has some very well done parts and some badly thrown together parts than everything average. At least then you have a plan of attack.

Sometimes performance is a part of correctness, and the idea that you can piecemeal it into working shape never works well. It has to be part of memory layout, structural design, algorithm choice, API choice, and more. It is part of every piece of a program. While you can get away with "make it work, make it correct, make it fast" on a large amount of code, I don't think it is a proper way of thinking about high-performance code, standard libraries, or even libraries in general.

4 Likes

Yes, this is a good point. Some specific areas require every bit of performance, to the point where you need to go back to square one (zero?) more than once, until you get the overall design giving you the performance you need. A standard library might very well be in this category.

Just to clarify, the work I usually do does not benefit from this sort of approach – hence the approach I mentioned in my comment.

1 Like

You may have measured, but did you really understand? y & 3 is the same as y % 4. The difference is that the “fancy” version factors 25 out. The ability to return after 3 instructions can be had by the “fancy” version too:

fn IsLeapYear(y: u16) bool {
    return y % 4 == 0 and (y % 25 != 0 or y % 16 == 0);
}

The mod 100 version (the original Neri et al version) will probably perform slightly better, I think, because its branch will be mispredicted 4x less often.

That's where its big performance gain comes from, too. In the naive version the branch is only taken 1 in 4 times, and the rest should be compiled down to cmovs and not matter too much.

ofc that assumes:

  1. mod 25 and mod 100 cost the same (since each is likely turned into a reciprocal multiply, I think they would);

  2. the compiler doesn't reduce them to the same thing somehow.

Maybe I'll try this weekend. Seems like an interesting question.
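If I do, it'd probably look something like this, reusing the DNOA pattern from earlier in the thread (benchLeap and the counts are made up; years would be the pregenerated 400-year cycle):

const std = @import("std");

fn benchLeap(name: []const u8, comptime func: fn (u16) bool, years: []const u16) !void {
    var hits: u64 = 0;
    var timer = try std.time.Timer.start();
    for (0..10_000) |_| {
        for (years) |y| {
            // Accumulate so every call's result is live.
            hits += @intFromBool(func(y));
        }
    }
    std.mem.doNotOptimizeAway(hits);
    std.debug.print("{s}: hits: {}, took: {}\n", .{ name, hits, timer.read() });
}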

I want to know what world you live in that this is your honest evaluation of current compilers. There are so many holes in compiler optimizations, so many facilities that either don’t work well or don’t exist at all. And the optimizations that do exist sometimes just aren’t applied. I submit multiple issues to LLVM pretty much every time I look at assembly. Have you spent a lot of time looking at assembly, or are you just repeating this sentiment because someone else told you this?

2 Likes

I agree that people oversell the “magic” of compiler optimization. It frustrates me when I read stuff that essentially says “you can’t do better than an optimizing compiler” because that’s blatantly false.

Likewise, there are many beginners who have no idea how much work has gone into compilers. They make all kinds of wrong assumptions in the opposite direction (this thread was started precisely because of this and was split from its initial context).

I read @pierrelgol’s statement as if he was referring to the second group of people. The thing is, I know @pierrelgol personally and I know he actually agrees with you - but knowing him personally also biased how I read his post.

Thus, it’s important to point out that compilers are not magic but they are also better than beginners often realize.

3 Likes

Fair enough, I'd accept the idea that compilers are better than beginners. However, I still don't think the best advice to give them is "don't worry, compilers are smarter than you". Instead, I'd encourage people (who are interested in this kind of thing) to read assembly, learn what's going on, and try to understand what the instructions are doing BEFORE trying to do better or suggest improvements. And people should be taught how to benchmark correctly (I think we need help from the compiler, to randomize memory layout and such). If people are curious about assembly, I wouldn't shut them out or tell them they aren't smart enough to think anything worthwhile on the topic. I might caution them against thinking they understand what's going on without adequate study, but that's applicable to any topic with complicated or little-known factors at play.

3 Likes