Zig complexity as discussed by Drew DeVault

Hi guys,
I read the post on Drew DeVault’s blog comparing language complexity. His test was basically to see how many levels of abstraction (measured in syscalls) each language needs to print Hello World. His reference was a simple x86 assembly program calling the syscalls directly. He tested on Linux, but I ran the same test on NetBSD with C, Zig, Rust, C3, C++ and Go to check for myself, and got 69, 102, 146, 156, 171 and 172 syscalls respectively, which shows Zig is the nearest to C.
In his case Zig made the same number of syscalls as the assembly code, which was impressive. That post is from 2020, so it is not up to date. I would like to know whether you think this metric is useful for anything, and whether you have tested it yourselves on your own computers recently.

4 Likes

I had not seen this metric, and thus have not tried that myself.
What interests me is the relative change in syscall counts. Do you have a more detailed description/table of the tests you did? I have a few questions that are not really answered by the numbers as given:

  • Which version of Zig did you use? Which optimisation mode? Was libc linked?
  • Which library did you use for C (I am not sure if you can use musl on NetBSD), and was it linked statically? Your result of 69 for C is close to Drew’s “C (glibc, dynamic)” result of 65 syscalls, but much higher than his lowest result (“musl, static”, 5).
  • Zig going up from 2 (3) syscalls to 102 is a much greater increase than e.g. Rust going from 123 to 146. Is there some standard overhead on NetBSD? Any other explanation for the increase?

Zig 0.5.0 is available for FreeBSD, it might be interesting to have the numbers for that and compare to Zig 0.16.0. (assuming that FreeBSD is more like NetBSD).

In general I agree with Drew, that knowing that your program has “shit happening that you didn’t ask for” is interesting, but without some insight as to why, that brings you relatively little.
And we should not forget that you are seldom required to write a hello world program for a client and/or run it multiple times.

Being able to optimize a hello world program is impressive, but if you need an additional 50 syscalls to set up a runtime environment for a program that does anything realistic (just guessing a number here), then I am not particularly interested in a compiler developer spending much time on always producing a 2-syscall hello world program (or even on getting that down to just 49 additional syscalls). What I am interested in is that the tasks a program may execute repeatedly are (or can be) optimised.

1 Like

Interestingly, I just tried to reproduce Drew’s experiment on 0.14.1 and got this:

user:/tmp$ zig version
0.14.1
user:/tmp$ cat test.zig
const std = @import("std");

pub fn main() !void {
    const stdout = std.io.getStdOut();
    _ = try stdout.write("hello world\n");
}
user:/tmp$ zig build-exe test.zig -DOptimize=ReleaseSmall
user:/tmp$ strip test
user:/tmp$ strace --summary ./test
execve("./test", ["./test"], 0x7ffe65c2f458 /* 69 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x1105010)      = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
prlimit64(0, RLIMIT_STACK, {rlim_cur=16384*1024, rlim_max=RLIM64_INFINITY}, NULL) = 0
rt_sigaction(SIGSEGV, {sa_handler=0x10dfd60, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x10807f0}, NULL, 8) = 0
rt_sigaction(SIGILL, {sa_handler=0x10dfd60, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x10807f0}, NULL, 8) = 0
rt_sigaction(SIGBUS, {sa_handler=0x10dfd60, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x10807f0}, NULL, 8) = 0
rt_sigaction(SIGFPE, {sa_handler=0x10dfd60, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x10807f0}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=0x10dfcd0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x10807f0}, NULL, 8) = 0
write(1, "hello world\n", 12hello world
)           = 12
exit_group(0)                           = ?
+++ exited with 0 +++
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0,00    0,000000           0         1           write
  0,00    0,000000           0         5           rt_sigaction
  0,00    0,000000           0         1           execve
  0,00    0,000000           0         1           arch_prctl
  0,00    0,000000           0         2           prlimit64
------ ----------- ----------- --------- --------- ----------------
100,00    0,000000           0        10           total

So 9 syscalls instead of 2 on 0.5.0.

And I tried to reproduce on 0.5.0 too and got 2 syscalls:

user:/tmp$ zig version
0.5.0
user:/tmp$ zig build-exe test.zig -DOptimize=ReleaseSmall
user:/tmp$ strace --summary ./test
execve("./test", ["./test"], 0x7fff0f5f6338 /* 69 vars */) = 0
rt_sigaction(SIGSEGV, {sa_handler=0x2283d0, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x21bdd0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
write(1, "hello world\n", 12hello world
)           = 12
exit_group(0)                           = ?
+++ exited with 0 +++
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0,00    0,000000           0         1           write
  0,00    0,000000           0         1           rt_sigaction
  0,00    0,000000           0         1           execve
------ ----------- ----------- --------- --------- ----------------
100,00    0,000000           0         3           total

Edit: just realized one was the execve from the shell.

1 Like

On what OS did you do that? OpenBSD?

1 Like

Linux 6.8.0-110-generic

On “Hello World!” what you’re really testing is how many syscalls are in the process start-up and shutdown code of the library. Back then it was one (exit), and adding the call to write() gave two. Now with “JuicyMain” it’s likely to be much higher, but you’re also getting a lot more provided for you.

It’s an interesting metric, but it will soon become dwarfed by any real work that the process does.

5 Likes

“juicy main” is also opt-in for those who care.

1 Like

Indeed, and it might be interesting to measure the syscall overhead of each style of main.

Semi-related aside: I do enjoy how close you can get to assembly in Zig, thanks to its single unit of compilation model. When working at a low level, it really does feel like a high-level DSL just above assembly at times.

If we drop down to raw syscalls:

const std = @import("std");

const msg = "hello, world\n";

pub export fn _start() noreturn {
    _ = std.os.linux.write(1, msg.ptr, msg.len);
    std.os.linux.exit(0);
}

The compiler output with -OReleaseFast -fomit-frame-pointer is:

00000000010011c0 <_start>:
 10011c0:	b8 01 00 00 00       	mov    eax,0x1
 10011c5:	bf 01 00 00 00       	mov    edi,0x1
 10011ca:	be 58 01 00 01       	mov    esi,0x1000158
 10011cf:	ba 0d 00 00 00       	mov    edx,0xd
 10011d4:	0f 05                	syscall
 10011d6:	b8 3c 00 00 00       	mov    eax,0x3c
 10011db:	31 ff                	xor    edi,edi
 10011dd:	0f 05                	syscall

Compare Drew’s assembly hello:

_start:
	mov rdx, len
	mov rsi, msg
	mov rdi, 1
	mov rax, 1
	syscall

	mov rdi, 0
	mov rax, 60
	syscall

And it’s near identical except for the order that registers get set.

If we change to -OReleaseSmall -fomit-frame-pointer, then we get a .text section that’s 8 bytes shorter, going by the two listings (23 bytes versus 31; _start symbol omitted due to stripping):

00000000010011c4 <.text>:
 10011c4:	6a 01                	push   0x1
 10011c6:	58                   	pop    rax
 10011c7:	6a 0d                	push   0xd
 10011c9:	5a                   	pop    rdx
 10011ca:	be 58 01 00 01       	mov    esi,0x1000158
 10011cf:	48 89 c7             	mov    rdi,rax
 10011d2:	0f 05                	syscall
 10011d4:	6a 3c                	push   0x3c
 10011d6:	58                   	pop    rax
 10011d7:	31 ff                	xor    edi,edi
 10011d9:	0f 05                	syscall

And a binary size significantly smaller than the ASM version (after stripping the ASM version too, I should add):

$ wc -c hello
1000 hello
$ wc -c hello-asm 
8488 a.out

Of course I’m sure you could do the same with C by compiling for freestanding, and manually implementing syscall wrappers in inline assembler. It’s just nice how easy this is in Zig, with the stdlib, and that I can be confident that I’m not introducing any undefined behaviour that’d wreak havoc on the expected assembly output.

13 Likes

I did the syscall wrapper approach in Zig to get an 1808 byte webserver back in the day. Would be interesting to see how small you can get the std.os version of that. Ideally a match.

1 Like

Possibly slightly smaller! The std.os.linux versions are thin wrappers around syscall0, syscall1, syscall2, depending on the arity of the syscall, whereas yours always uses the equivalent of syscall6 and sets registers unnecessarily in many cases.
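To make the arity point concrete, here is a minimal sketch that calls the arity-specific wrappers directly instead of going through a one-size-fits-all wrapper. It assumes x86_64 Linux and a Zig version (roughly 0.14/0.15) where std.os.linux exposes syscall1 and syscall3 taking a SYS enum value first; rawWrite and rawExit are hypothetical helper names, not std APIs.

```zig
const std = @import("std");
const linux = std.os.linux;

// write(2) takes three arguments, so syscall3 only needs the syscall number
// plus three argument registers (rdi, rsi, rdx on x86_64).
fn rawWrite(fd: usize, bytes: []const u8) usize {
    return linux.syscall3(.write, fd, @intFromPtr(bytes.ptr), bytes.len);
}

// exit(2) takes one argument, so syscall1 only needs one argument register.
fn rawExit(status: usize) noreturn {
    _ = linux.syscall1(.exit, status);
    unreachable; // exit never returns
}

pub fn main() void {
    _ = rawWrite(1, "hello, world\n");
    rawExit(0);
}
```

A wrapper that always behaves like syscall6 would have to load three more argument registers for write and five more for exit, which is where the extra bytes come from. For a size-golfed binary you would call the same helpers from an exported _start, as in the earlier example, rather than from main.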

2 Likes

Yes! Tempting to try.

That’s a sweet improvement actually, adding multiple syscall wrappers amounts to 256 saved bytes. Thanks for pointing this out :slight_smile:

2 Likes

I think following through how to reproduce this is a really interesting case study, so do indulge me writing a really long answer here. Here’s my test.zig program under 0.15.2:

pub fn main() void {
    _ = std.fs.File.stdout().write("Hello, world!") catch unreachable;
}

const std = @import("std"); // bottom gang represent

Excuse the poor style. I wanted something as close to “write, exit” without calling into std.os.linux, since getting the syscalls directly seemed like cheating. The “perfect” strace ./test output for this source code looks something like this:

write(1, "Hello, world!", 13) = 13
exit(0) = ?

My output looks like this:

> zig build-exe test.zig

> strace ./test

execve("./test", ["./test"], 0x7ffd0a9a99e0 /* 57 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x1188010) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
prlimit64(0, RLIMIT_STACK, {rlim_cur=16384*1024, rlim_max=RLIM64_INFINITY}, NULL) = 0
rt_sigaction(SIGSEGV, {sa_handler=0x113fbb0, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x10c1f30}, NULL, 8) = 0
rt_sigaction(SIGILL, {sa_handler=0x113fbb0, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x10c1f30}, NULL, 8) = 0
rt_sigaction(SIGBUS, {sa_handler=0x113fbb0, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x10c1f30}, NULL, 8) = 0
rt_sigaction(SIGFPE, {sa_handler=0x113fbb0, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x10c1f30}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=0x113faf0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x10c1f30}, NULL, 8) = 0
write(1, "Hello, world!", 13) = 13
exit_group(0) = ?

There’s a lot more going on there than just “write, exit”. Let’s step through it:

execve("./test", ...)

This is the exec that launches the program. I’ll make strace ignore it from here on, since it’s a product of the environment and not the binary. It shows up because strace forks a child and traces it from the execve that starts the program, so it is the first line of every trace.

arch_prctl(ARCH_SET_FS, ...)

This sets the x86_64 FS-register, which on Linux is used for thread-local storage.

prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY})

The program queries the stack limit, which is 8MB.

prlimit64(0, RLIMIT_STACK, {rlim_cur=16384*1024, rlim_max=RLIM64_INFINITY}, NULL)

The program increases the stack limit to 16MB.
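For anyone who wants to see that query-then-raise pattern from user code, here is a minimal sketch. It assumes Zig 0.15.x on Linux, where std.posix.getrlimit and std.posix.setrlimit take a resource enum such as .STACK; raiseStackLimit is a hypothetical helper, and the 32 MiB target is an arbitrary value picked so that it differs from the 16 MiB the startup code already requested.

```zig
const std = @import("std");

// Hypothetical helper: raise the soft stack limit to `wanted` bytes if it is
// currently lower, never exceeding the hard limit. Returns the soft limit
// that is in effect afterwards.
fn raiseStackLimit(wanted: u64) !u64 {
    const limits = try std.posix.getrlimit(.STACK);
    const target = @min(wanted, limits.max);
    if (target > limits.cur) {
        try std.posix.setrlimit(.STACK, .{ .cur = target, .max = limits.max });
    }
    return (try std.posix.getrlimit(.STACK)).cur;
}

pub fn main() !void {
    const soft = try raiseStackLimit(32 * 1024 * 1024);
    std.debug.print("soft stack limit is now {d} bytes\n", .{soft});
}
```

Running this under strace should show the program's own limit calls after the pair made during startup, which makes it easier to tell the runtime's work apart from your own.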

rt_sigaction(SIGSEGV, 0x113fbb0, ..., SIGINFO, ...)
rt_sigaction(SIGILL,  0x113fbb0, ..., SIGINFO, ...)
rt_sigaction(SIGBUS,  0x113fbb0, ..., SIGINFO, ...)
rt_sigaction(SIGFPE,  0x113fbb0, ..., SIGINFO, ...)

This block registers a signal handler for segmentation faults (SIGSEGV), illegal instructions (SIGILL), wrongly aligned memory access (SIGBUS), and math exceptions (SIGFPE).

This is one of the places where your runtime protection lives in the executable. The SA_SIGINFO flag (abbreviated to SIGINFO above) means the handler gets more information than “it happened”. That’s part of how you get descriptive crashes in safe build modes.

rt_sigaction(SIGPIPE, 0x113faf0, ...)

Registers a different signal handler for SIGPIPE.

It’s almost certainly a handler that does nothing.

SIGPIPE fires when you try to write to a pipe/socket that has been closed. The default action is “terminate the program”. Helpful if you’re piping stuff from one program to another in a terminal and you want both programs to stop. Not helpful if you’re a web server and your client closes a socket.

write(1, "Hello, world!", 13) = 13

One of two syscalls we actually care about.

exit_group(0)

The other of the two syscalls we actually care about, almost. Again, this is threading support. exit_group terminates all threads, whereas exit only terminates the calling one.
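The difference is small enough to sketch. This is not the standard library's code, just a minimal illustration of the choice, assuming a Zig version where std.os.linux exposes the SYS enum and syscall1; exitSyscall is a hypothetical helper.

```zig
const std = @import("std");
const builtin = @import("builtin");
const linux = std.os.linux;

// Hypothetical helper: pick the exit syscall that matches the threading model.
// exit ends only the calling thread, exit_group ends every thread in the process.
fn exitSyscall(single_threaded: bool) linux.SYS {
    return if (single_threaded) .exit else .exit_group;
}

pub fn main() void {
    // builtin.single_threaded reflects whether the build was single-threaded.
    _ = linux.syscall1(exitSyscall(builtin.single_threaded), 0);
    unreachable; // neither syscall returns
}
```

In a process with only one thread the two behave the same, so seeing exit_group in a hello world trace mostly tells you the binary was built with threading support available.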

To summarize:

  1. Set up thread local storage
  2. Increase the stack size
  3. Set up runtime safety
  4. Make sure we don’t suddenly exit if we’re communicating on a socket that gets closed
  5. Write “Hello, world!” to stdout
  6. Exit

Can we make zig not do that extra stuff? Well… -O ReleaseSmall builds without runtime safety, and -fsingle-threaded disables threading support, so…

> zig build-exe -O ReleaseSmall -fsingle-threaded test.zig

> strace --trace='!execve' ./test 

With that we get:

// 1. Get the stack size
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
// 2. Increase the stack size
prlimit64(0, RLIMIT_STACK, {rlim_cur=16384*1024, rlim_max=RLIM64_INFINITY}, NULL) = 0
// 3. SIGPIPE protection
rt_sigaction(SIGPIPE, {sa_handler=0x1001508, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x100150e}, NULL, 8) = 0
// 4. Write and exit
write(1, "Hello, world!", 13) = 13
exit(0) = ?

As expected, no threading and no guard rails. Zig does exactly what you need.

Running zig build-exe -h | grep stack helpfully tells me about --stack for setting the stack limit. Apparently zig build-exe defaults to 16 MB instead of 8? Alright, I guess.

What if I request the system default size?

> zig build-exe -O ReleaseSmall -fsingle-threaded --stack 8388608 test.zig

> strace --trace='!execve' ./test

Well, almost:

// 1. Get the stack size
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
// 2. SIGPIPE protection
rt_sigaction(SIGPIPE, {sa_handler=0x1001508, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x100150e}, NULL, 8) = 0
// 3. Write and exit
write(1, "Hello, world!", 13) = 13
exit(0) = ?

Poking around with the CLI a little I arrived at --stack 0 as the solution here. If that’s intended, documenting it somewhere would be nice. Otherwise, a --stack default or similar would be nice.

> zig build-exe -O ReleaseSmall -fsingle-threaded --stack 0 test.zig

> strace --trace='!execve' ./test

Now we get no more stack-related syscalls:

// 1. SIGPIPE protection
rt_sigaction(SIGPIPE, {sa_handler=0x1001508, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x100150e}, NULL, 8) = 0
// 2. Write and exit
write(1, "Hello, world!", 13) = 13
exit(0) = ?

What about that SIGPIPE handler, though?

This isn’t a web server, I’m fine with exiting on closed pipes.

Well, I couldn’t find anything in zig build-exe -h so I grepped the codebase instead.

It turns out this is configured in std.Options:

pub const Options = struct {
    /// By default Zig disables SIGPIPE by setting a "no-op" handler for it.  Set this option
    /// to `true` to prevent that.
    ///
    /// Note that we use a "no-op" handler instead of SIG_IGN because it will not be inherited by
    /// any child process.
    ///
    /// SIGPIPE is triggered when a process attempts to write to a broken pipe. By default, SIGPIPE
    /// will terminate the process instead of exiting.  It doesn't trigger the panic handler so in many
    /// cases it's unclear why the process was terminated.  By capturing SIGPIPE instead, functions that
    /// write to broken pipes will return the EPIPE error (error.BrokenPipe) and the program can handle
    /// it like any other error.
    keep_sigpipe: bool = false,
}

What a nice, helpful docstring. Gotta admit, this is way more effort in communication than I expected. It’s one of those “Linux was designed in ancient times” quirks, and the stdlib both fixing it by default and including the explanation is really nice.
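The behaviour the docstring describes is easy to poke at. Here is a minimal sketch, assuming Zig 0.15.x on Linux, where std.posix still provides pipe, close and write; writeToClosedPipe is a hypothetical helper name.

```zig
const std = @import("std");

// Hypothetical helper: write one byte into a pipe whose read end has
// already been closed, which is exactly the situation that raises SIGPIPE.
fn writeToClosedPipe() !usize {
    const fds = try std.posix.pipe();
    std.posix.close(fds[0]); // no reader left
    defer std.posix.close(fds[1]);
    return std.posix.write(fds[1], "x");
}

pub fn main() void {
    if (writeToClosedPipe()) |n| {
        std.debug.print("wrote {d} byte(s)\n", .{n});
    } else |err| {
        std.debug.print("write failed: {s}\n", .{@errorName(err)});
    }
}
```

With the default options the write should come back as error.BrokenPipe and the program carries on. With keep_sigpipe set to true, I would expect the process to be killed by the signal before main gets a chance to print anything.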

With a slight change to the code:

pub const std_options: std.Options = .{
    .keep_sigpipe = true,
};

pub fn main() void {
    _ = std.fs.File.stdout().write("Hello, world!") catch unreachable;
}

const std = @import("std");

We arrive at perfection:

> zig build-exe -O ReleaseSmall -fsingle-threaded --stack 0 test.zig

> strace --trace='!execve' ./test

write(1, "Hello, world!", 13) = 13
exit(0) = ?

If you know how to wrangle it, Zig is still the best game in town for this kind of thing without dropping down to assembly. :wink:

Does it matter? Honestly, I don’t think so. As a benchmark for “language overhead” it’s pretty interesting, but I think anyone experienced in low level stuff who cares enough can get their language of choice to a place where the overhead is practically zero.

15 Likes

The culprit lies here: https://codeberg.org/ziglang/zig/src/commit/5cc281e7232b9f1bc5f4d732e4a37fb5df02f780/lib/std/start.zig#L598

fn expandStackSize(phdrs: []elf.Phdr) void {
    @disableInstrumentation();
    for (phdrs) |*phdr| {
        switch (phdr.p_type) {
            elf.PT_GNU_STACK => {
                if (phdr.p_memsz == 0) break;

Any PT_GNU_STACK segments of size zero are ignored by start.zig. I’m not sure what the use case for multiple PT_GNU_STACK segments is.

1 Like

I think it’s an interesting thing to think about, but I imagine that playing golf with the number of syscalls is very rarely anything you need to worry about.

I do like that I can drop down to inline assembly (or use the syscall functions in std.os.linux) to write a program where there’s nothing between my code and the operating system. There was a long time where I had only ever used high level languages and never understood there was this line, let alone where the line is.

1 Like

You’d be surprised. For a short program like this, no. For something in the hot path, it can be quite a significant performance improvement.

One of the main drivers (if not the main driver) of Writergate was to minimize the number of syscalls made by programs doing IO. Even the standard library collections encourage you toward reserving memory up-front (thereby minimizing the number of calls to mmap) by including infallible xxxAssumeCapacity functions.
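As a minimal sketch of the reserve-then-fill pattern: this assumes Zig 0.15.x, where std.ArrayList is the variant that takes an allocator on each call (older releases spell it std.ArrayListUnmanaged), and filled is a hypothetical helper name.

```zig
const std = @import("std");

// Hypothetical helper: reserve room for `n` items with a single allocation
// request, then append without any further allocation or error handling.
fn filled(gpa: std.mem.Allocator, n: u32) !std.ArrayList(u32) {
    var list: std.ArrayList(u32) = .empty;
    errdefer list.deinit(gpa);
    try list.ensureTotalCapacity(gpa, n); // the only fallible step
    var i: u32 = 0;
    while (i < n) : (i += 1) {
        list.appendAssumeCapacity(i); // infallible: capacity is already there
    }
    return list;
}

pub fn main() !void {
    // page_allocator goes more or less straight to the OS, so the number of
    // allocation requests is visible under strace.
    const gpa = std.heap.page_allocator;
    var list = try filled(gpa, 1024);
    defer list.deinit(gpa);
    std.debug.print("len={d}\n", .{list.items.len});
}
```

Appending one item at a time with plain append would grow the buffer several times on the way to 1024 items, and with an allocator that maps pages directly each growth can turn into its own trip to the kernel.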

1 Like

That makes sense. I guess what I meant is more that if the basic patterns are efficient and effective, you won’t have a ton to worry about in this department.

1 Like