[Fixed] Base overhead increased in 0.16

Hi folks, no criticism here, just trying to understand stuff - have I done something wrong when porting to 0.16 or are those numbers among the expected baseline?

So, I have a collection of small apps that I use on my day-to-day that I’ve written in zig and I’m in the process of porting them to 0.16. One of such tools is true, which actually mimics the coreutils true only for the sake of having a binary I can point to - and for the fun of writing zig as well.

The same source (see below, but nothing very interesting) when compiled in 0.15.2 would take ~100μs whereas in 0.16 it takes ~300μs. I’ve consistently noticed this ~200μs increase throughout other binaries and went on to true for its simplicity. Turns out strace tells me there’s a lot more going on now wrt syscalls:

# zig 0.15.2
strace -fc ~/.local/bin/true
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0,00    0,000000           0         1           rt_sigaction
  0,00    0,000000           0         1           execve
  0,00    0,000000           0         1           arch_prctl
  0,00    0,000000           0         1           prlimit64
------ ----------- ----------- --------- --------- ----------------
100,00    0,000000           0         4           total

# zig 0.16
strace -fc ./zig-out/bin/true
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0,00    0,000000           0         1           read
  0,00    0,000000           0         2           close
  0,00    0,000000           0         2           fstat
  0,00    0,000000           0         8           mmap
  0,00    0,000000           0         3           mprotect
  0,00    0,000000           0         1           munmap
  0,00    0,000000           0         1           brk
  0,00    0,000000           0         2           pread64
  0,00    0,000000           0         1         1 access
  0,00    0,000000           0         1           execve
  0,00    0,000000           0         1           sigaltstack
  0,00    0,000000           0         1           arch_prctl
  0,00    0,000000           0         1           set_tid_address
  0,00    0,000000           0         2           openat
  0,00    0,000000           0         1           set_robust_list
  0,00    0,000000           0         2           prlimit64
  0,00    0,000000           0         1           getrandom
  0,00    0,000000           0         1           rseq
------ ----------- ----------- --------- --------- ----------------
100,00    0,000000           0        32         1 total

Both were compiled w/ --release=fast as usual.

The code:

const std = @import("std");
pub fn main() !void {
    std.process.exit(0);
}

Note that this doesn’t use std.process.Init, so one could argue that having the eager std.process.Init struct injected would justify the "overhead”, but that wouldn’t exactly be the case here.

Again, don’t take this as a criticism as ~200μs is hardly a delay, let alone a problem, at least in my use case, but my curiosity got the best of me and I had to nerdsnipe investigate.

Thanks in advance!

5 Likes

This seems worth investigating.

You can compare the two versions of start.zig to see if more is happening.
Maybe they do some extra syscall, in order to setup the threaded io, but since you don’t use it, you pay the cost for it anyway ?

I think the problem is that there is still the default panic handler installed and that one uses std.Io.Threaded (unless configured otherwise).

3 Likes

In Zig 0.16 lib/std/start.zig:

https://codeberg.org/ziglang/zig/src/tag/0.16.0/lib/std/start.zig#L696-L729

Looking at the main() path, callMain() seems to return early before std.Io.Threaded initialization happens, so I don’t think std.Io.Threaded is responsible for the startup regression here.

1 Like

FWIW, I can’t replicate your results. I copied your true program, and got this:

~ $ zig version
0.15.2
~ $ zig build-exe -OReleaseSmall true.zig
~ $ strace -fc ./true
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 95.30    0.000709         709         1           execve
  2.28    0.000017           8         2           prlimit64
  1.34    0.000010          10         1           arch_prctl
  1.08    0.000008           8         1           rt_sigaction
------ ----------- ----------- --------- --------- ----------------
100.00    0.000744         148         5           total



~ $ zig version
0.16.0
~ $ zig build-exe -OReleaseSmall true.zig
~ $ strace -fc ./true
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 92.41    0.000670         670         1           execve
  2.90    0.000021          21         1           mmap
  2.48    0.000018           9         2           prlimit64
  1.52    0.000011          11         1           arch_prctl
  0.69    0.000005           5         1           sigaltstack
------ ----------- ----------- --------- --------- ----------------
100.00    0.000725         120         6           total
~ $

Still got an extra mmap from somewhere, but that’s it

1 Like

Ah, but there is a separate threaded instance that is used by the default panic handler, default log function and std.debug.print.

That instance is a constant so does not need to be initialised, only referenced. But it does get referenced in start.zig to provide the environ and arg0 which are needed for various functionality used by the above.

That happens in callMainWithArgs (posix platforms), or mainWithoutEnv (when linking libc). Other platforms either don’t support it, or have it globally available so
Both of those then call callMain.

But that alone only results in Threaded being compiled, it does not account for any of the syscalls.


https://codeberg.org/ziglang/zig/src/tag/0.16.0/lib/std/start.zig#L629-L639
https://codeberg.org/ziglang/zig/src/tag/0.16.0/lib/std/start.zig#L666-L684

2 Likes

Ah, good point.

I was only looking at the main() startup path in start.zig, where it returns before the normal std.Io.Threaded initialization.

But I forgot that the default panic handler / default logger / std.debug.print use a separate threaded instance, so that machinery could still get pulled in independently of callMain().

std.debug also has segfault handler, which needs to allocate alt stack, so that’s probably the mmap.

1 Like

I could replicate your results by running zig build-exe -OReleaseFast ./true.zig, but at the same time then the output of zig build --release=fast differs from that of zig build-exe -OReleaseFast

# built with `zig build-exe -OReleaseFast ./tools/true.zig`
strace -fc ./true
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0,00    0,000000           0         1           mmap
  0,00    0,000000           0         1           execve
  0,00    0,000000           0         1           sigaltstack
  0,00    0,000000           0         1           arch_prctl
  0,00    0,000000           0         1           prlimit64
------ ----------- ----------- --------- --------- ----------------
100,00    0,000000           0         5           total

# build alongside the whole project with `zig build --release=fast`
strace -fc ./zig-out/bin/true
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54,17    0,000104         104         1           execve
 19,79    0,000038           4         8           mmap
  4,69    0,000009           3         3           mprotect
  3,65    0,000007           7         1           munmap
  2,60    0,000005           2         2           openat
  2,08    0,000004           2         2           fstat
  1,56    0,000003           1         2           pread64
  1,56    0,000003           3         1         1 access
  1,56    0,000003           1         2           prlimit64
  1,04    0,000002           2         1           read
  1,04    0,000002           1         2           close
  1,04    0,000002           2         1           brk
  1,04    0,000002           2         1           sigaltstack
  1,04    0,000002           2         1           set_tid_address
  1,04    0,000002           2         1           set_robust_list
  1,04    0,000002           2         1           getrandom
  0,52    0,000001           1         1           arch_prctl
  0,52    0,000001           1         1           rseq
------ ----------- ----------- --------- --------- ----------------
100,00    0,000192           6        32         1 total

I found the issue. build.zig poorly configured, mindlessly adding a dependency module to all the binaries this project is producing and that module wasn’t setting up the optimization, tainting all the modules. I can’t really say that I found any log line warning about that though, which might be something to look into, but otherwise a rather stupid f*ck up from my end :sweat_smile:

Just for completeness, I’ll share the benchmark results:

poop ./zig-out/bin/true ~/.local/bin/true true
Benchmark 1 (10000 runs): ./zig-out/bin/true
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           175us ± 46.3us     148us … 1.69ms        675 ( 7%)        0%
  peak_rss            807KB ± 1.07KB     717KB …  807KB        168 ( 2%)        0%
  cpu_cycles            0   ±    0         0   …    0            0 ( 0%)        0%
  instructions          0   ±    0         0   …    0            0 ( 0%)        0%
  cache_references      0   ±    0         0   …    0            0 ( 0%)        0%
  cache_misses          0   ±    0         0   …    0            0 ( 0%)        0%
  branch_misses         0   ±    0         0   …    0            0 ( 0%)        0%
Benchmark 2 (10000 runs): /home/hkupty/.local/bin/true
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          97.0us ± 37.8us    76.6us …  894us        644 ( 6%)        ⚡- 44.7% ±  0.7%
  peak_rss            831KB ± 1.05KB     750KB …  831KB          2 ( 0%)        💩+  3.1% ±  0.0%
  cpu_cycles            0   ±    0         0   …    0            0 ( 0%)          -  nan% ± -nan%
  instructions          0   ±    0         0   …    0            0 ( 0%)          -  nan% ± -nan%
  cache_references      0   ±    0         0   …    0            0 ( 0%)          -  nan% ± -nan%
  cache_misses          0   ±    0         0   …    0            0 ( 0%)          -  nan% ± -nan%
  branch_misses         0   ±    0         0   …    0            0 ( 0%)          -  nan% ± -nan%
Benchmark 3 (10000 runs): true
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           303us ±  125us     208us … 2.07ms       1217 (12%)        💩+ 72.7% ±  1.5%
  peak_rss           1.43MB ± 9.19KB    1.40MB … 1.50MB        984 (10%)        💩+ 77.8% ±  0.0%
  cpu_cycles            0   ±    0         0   …    0            0 ( 0%)          -  nan% ± -nan%
  instructions          0   ±    0         0   …    0            0 ( 0%)          -  nan% ± -nan%
  cache_references      0   ±    0         0   …    0            0 ( 0%)          -  nan% ± -nan%
  cache_misses          0   ±    0         0   …    0            0 ( 0%)          -  nan% ± -nan%
  branch_misses         0   ±    0         0   …    0            0 ( 0%)          -  nan% ± -nan%

Note that there’s still some ~80μs that consistently shows up, which might be related to mmap.

6 Likes

Having some deps running in optimized mode and some running in debug is a perfectly legitimate way of setting up a build. In awebo (discord alternative) I build opus and rnnoise always in release fast, even in debug builds of the application, because otherwise the audio code would be too slow and make it impossible to test voice calls in debug builds of awebo.

1 Like

What did you use to build?

I commented yesterday about the unexepected (for me) difference in behaviour of

zig build -Doptimize=ReleaseFast

and

zig build --release=fast

Did you use one of those options?

I understand, but the dependency wasn’t optimized while the binary, supposedly, was, but it did behave as if it was in debug. This means that binary is now “tainted” with debug mode. Even worse is that this was clearly a mistake since it doesn’t actually use the module, so the nominal module(debug)->exe(release) in practice becomes exe(debug).

I used the latter, zig build --release=fast

1 Like

When I compiled zine that was the right one to use, the top one did compile some dependencies (from build.zig.zon) in debug mode.

Your problem seems unrelated to the --release=fast vs -Doptimize=ReleaseFast differenence I encountered.