Zig build run is 10x faster than compiled binary?

I’m running into strange behavior with zig build run, and I’m hoping others might help explain what’s happening.

I have a simple Zig app that I can run with zig build run. But if I run the compiled binary afterwards, the performance is 10x worse than with zig build run.

src/main.zig

const std = @import("std");

pub fn countBytes(reader: anytype) !u32 {
    var count: u32 = 0;
    while (true) {
        _ = reader.readByte() catch |err| switch (err) {
            error.EndOfStream => {
                return count;
            },
            else => {
                return err;
            },
        };
        count += 1;
    }
}

pub fn main() !void {
    var reader = std.io.getStdIn().reader();

    var timer = try std.time.Timer.start();
    const start = timer.lap();
    const count = try countBytes(&reader);
    const end = timer.read();
    const elapsed_micros = @as(f64, @floatFromInt(end - start)) / std.time.ns_per_us;

    const output = std.io.getStdOut().writer();
    try output.print("bytes:           {}\n", .{count});
    try output.print("execution time:  {d:.3}µs\n", .{elapsed_micros});
}

build.zig

const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "count-bytes",
        .root_source_file = .{ .path = "src/main.zig" },
        .target = target,
        .optimize = optimize,
    });

    b.installArtifact(exe);

    const run_cmd = b.addRunArtifact(exe);
    run_cmd.step.dependOn(b.getInstallStep());

    if (b.args) |args| {
        run_cmd.addArgs(args);
    }

    const run_step = b.step("run", "Run the app");
    run_step.dependOn(&run_cmd.step);
}

$ zig version
0.11.0

Measuring runtime

If I run with zig build run, I see performance of 13µs:

$ echo '00010203040506070809' | xxd -r -p | zig build run -Doptimize=ReleaseFast
bytes:           10
execution time:  13.549µs

That zig build run command compiles a binary, but if I run that binary, performance is over 10x worse:

$ echo '00010203040506070809' | xxd -r -p | ./zig-out/bin/count-bytes 
bytes:           10
execution time:  162.195µs

I see the same thing if I compile the binary explicitly and run it:

$ rm -rf zig-out && zig build -Doptimize=ReleaseFast && echo '00010203040506070809' | xxd -r -p | ./zig-out/bin/count-bytes 
bytes:           10
execution time:  191.093µs

Does anyone know why zig build run seems to be outperforming the compiled binary by such a large margin?

1 Like

Try using poop for measuring: Performance Optimizer Observation Platform

The answer must be Input Buffering.

Maybe I am wrong; it could be Output Buffering, or both input and output buffering.
The relevant source is in lib/zig/std/Build/Step/Run.zig, in the runCommand and spawnChildAndCollect functions.

1 Like

Interesting. I just tried this on the program I am working on, using hyperfine for the benchmark, and I got similar results.

zig build run command:

hyperfine "zig build run" -i
Benchmark 1: zig build run
  Time (mean ± σ):      77.6 ms ±   5.4 ms    [User: 45.3 ms, System: 30.8 ms]
  Range (min … max):    75.4 ms … 109.6 ms    37 runs

running the binary:

hyperfine "zig-out/bin/.wordle" -i
Benchmark 1: zig-out/bin/.wordle
  Time (mean ± σ):      70.9 µs ±  39.9 µs    [User: 380.2 µs, System: 385.6 µs]
  Range (min … max):     0.0 µs … 482.5 µs    1463 runs

These results are the opposite of what the OP got.
In your benchmark, zig build run is ~1000 times slower than running the executable directly.

1 Like

Also note that zig build run just compiles the executable and launches it; it’s not interpreting Zig or anything like that.

Looks like you’re doing 1 syscall per byte read? That’s going to perform extremely poorly. My guess is that the extra steps of using the build system incidentally introduced some buffering. Not sure why though. The build system is making the child process inherit the file descriptors directly.

5 Likes

I’d say readByte is a flawed API to have on a Reader. While you technically can read a byte at a time from something like a TCP socket, it just doesn’t make sense. The reader should only allow reading into a slice.

A byte-oriented API belongs on a buffered reader.

8 Likes

Thanks! I’m still a Zig beginner, and I hadn’t realized I was performing a syscall for each readByte call, but now that you point that out, it makes sense.

I tried switching to std.io.bufferedReader, and it makes the performance gap between zig build run and the compiled binary even wider, but it’s good to know that it’s not something I’m misunderstanding about Zig.
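
For reference, the buffered read loop looks roughly like this (a minimal sketch of wrapping stdin in std.io.bufferedReader on 0.11, not necessarily the exact code I ran):

const std = @import("std");

pub fn main() !void {
    // Wrap stdin in std.io.bufferedReader so readByte() is served from a
    // 4 KiB in-memory buffer instead of issuing one read() syscall per byte.
    var buffered = std.io.bufferedReader(std.io.getStdIn().reader());
    const reader = buffered.reader();

    var count: u32 = 0;
    while (true) {
        _ = reader.readByte() catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        count += 1;
    }

    try std.io.getStdOut().writer().print("bytes: {}\n", .{count});
}

The std buffer defaults to 4096 bytes, so the loop only touches the kernel once per buffer fill rather than once per byte.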

If I run the commands with hyperfine instead, it reports that running the compiled binary is 23x faster than zig build run (even when the build is cached).

1 Like

This never occurred to me, but now that you mention it, it makes perfect sense.

3 Likes

tl;dr: The zig build run version appeared to run faster because it wasn’t waiting on echo or xxd to feed input to stdin.

To close the loop on this, Andrew Ayer explained the phenomenon I was observing.

My mental model was that echo ran, then piped its results to xxd, then xxd finished and piped its results to my Zig program.

But actually in a bash pipeline, all the commands in the pipeline start at the same time.

The reason I measured slower performance with the ./count-bytes invocation is that it actually started sooner. The binary started the timer immediately and then had to block in the countBytes call waiting for input from xxd, which hadn’t started producing output yet.

In the zig build run version, the overhead of zig build probably adds 100+ ms before count-bytes launches, so its timer starts later. Instead of count-bytes starting immediately and then waiting ~100 ms for echo and xxd to get data into stdin, by the time zig build run actually starts the program, echo and xxd have already finished their work, so count-bytes never waits to read from stdin.
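
For anyone who wants to time the program itself rather than the whole pipeline, a small tweak (an untested sketch, not the code I posted above) is to read one byte before starting the timer, so the wait for echo and xxd is excluded:

const std = @import("std");

pub fn main() !void {
    var reader = std.io.getStdIn().reader();
    const out = std.io.getStdOut().writer();

    // Block until the upstream pipeline (echo | xxd) produces its first byte;
    // only start the clock afterwards, so the pipe startup wait is excluded.
    _ = reader.readByte() catch |err| switch (err) {
        error.EndOfStream => {
            try out.print("bytes:           0\n", .{});
            return;
        },
        else => return err,
    };

    var timer = try std.time.Timer.start();
    var count: u32 = 1; // the byte consumed above
    while (true) {
        _ = reader.readByte() catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        count += 1;
    }
    const elapsed_micros = @as(f64, @floatFromInt(timer.read())) / std.time.ns_per_us;

    try out.print("bytes:           {}\n", .{count});
    try out.print("execution time:  {d:.3}µs\n", .{elapsed_micros});
}

In principle both invocations should then report comparable numbers, since neither would include the pipe startup latency.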

13 Likes

Apparently I either can’t read or had too much coffee when I made that post. I apologize for being an idiot.

Sorry for the bump, but I found this case so interesting that I did a detailed writeup on my blog with visual explanations for why the two commands had such different timings.

9 Likes

Thanks for sharing! That is a well-written writeup.

2 Likes

Great. You may be interested in the comments here about bash pipes: Pipexec – Handling pipe of commands like a single command | Hacker News

I just read this:

anonymous-panda:

My hunch is that if you added the buffered reader and kept the original xxd in the pipe you’d see similar timings.

The amount of input data is just laughably small here to result in a huge timing discrepancy.

I wonder if there’s an added element where the constant syscalls are reading on a contended mutex and that contention disappears if you delay the start of the program.

vlovich123:

Good hunch. On my machine (13900k) & zig 0.11, the latest version of the code:

INFILE="$(mktemp)" && echo $INFILE && \ echo '60016000526001601ff3' | xxd -r -p > "${INFILE}" && \ zig build run -Doptimize=ReleaseFast < "${INFILE}" > execution time: 27.742µs

vs

echo '60016000526001601ff3' | xxd -r -p > "${INFILE}" | zig build run -Doptimize=ReleaseFast < "${INFILE}"
EVM gas used: 18
execution time: 27.999µs

The idea that the overlap of execution here by itself plays a role is nonsensical. The overlap of execution + reading a byte at a time causing kernel mutex contention seems like a more plausible explanation although I would expect someone better knowledgeable (& more motivated) about capturing kernel perf measurements to confirm. If this is the explanation, I’m kind of surprised that there isn’t a lock-free path for pipes in the kernel.

There aren’t enough details in that post to really be sure how those measurements were taken. If I take it at face value, their test is invalid: in test 2, INFILE still contains the data from test 1 (the file was never removed), so it is available for reading before xxd repopulates it.

Regardless, my results are public, so if you have other explanations for why the pipelining change fixed the performance discrepancy, I’m interested to hear.

1 Like