Fixing errno returns from syscalls

nyc · June 6, 2024, 7:54pm

Better syscall errno handling in Zig

Zig’s error handling from syscalls has a few very large correctness problem, some performance isues, and a code size issue in release small. They aren’t difficult to fix either.

(There’s a godbolt link at the bottom to play with how the switch statement generated code.)

Currently at the layer the posix layer where all the arguments are still the form of u32 flags and other Os primatives, zig returns zig errors with a switch on the syscall return value:

const rc = system.some_syscall(...);
switch(errno(rc)) {
  // ...
};

There are multiple options for this to be compiled to:

A jump table where an offset into a table can be computed from the switched on value. An address is loaded from the table, and an indirect call is made. This can only be done when the cases are densely packed.
An if-else chain that just does comparison one after the other. The compiler can reorder them (and often does) since the switch statment doesn’t imply any ordering or importance of the branches. This is done for shorter number of arms.
A comparison tree that is like a binary search through the arm values. It picks a value to break the space in half, then jumps to the next level, and does this same. Instead of having 1-8 conditionals to evaluate, you would evaluate 3 every time.

(There are a few other potential ways like a radix tree, but I haven’t seen them in practice.)

These can be mixed together, for example if you had two compact ranges a compiler will often branch on one of the bounds to either to a look up table or a short if-else chain.

Notice how your order isn’t cosidered. It assumes every branch is equally likely. But the cost to your runtime is rarely ever symmetric for erros and success. You almost always want the success case to be as fast as possible, and you are willing to pay up on the errors.

Fix 1:
Put an if (rc == 0) return ret_val before going into the switch.

Even in the best case it is going to be slower since it turns this into an indirect call with a complex addressing mode and that involves more CPU resources and is less likely to be predicted or have correct target prediction (both are needed). The simple if will not even involve a jump on the fast path if done properly (and the switch can be put in a function and set cold to streamline it even further and prevent L1i pollution).

But in the other cases, success will be multiple branches away as it has to evaluate to a leaf of search tree it built (you can see this in the godbolt code gen).

Fix 2:
Don’t use unreachable on syscall error

An error return is not a bug. It can be in buggy code, but sometimes it is working as expected. There have been numerous github issues (some have been open for years), that point out places where zig makes incorrect assumptions about what is a programmer error and what isn’t.

Sometimes it isn’t possible to keep track of everything to prevent an error from being returned by the syscall. A couple weeks ago a stress test for a long running process crashed. It was not easy to track down because in release more unreachable forces UB, and that is a terrible debugging experience. It cannot be logged or observed in any way, you just corrypt the process and it bails somewhere else. That shit sucks.

It was tracked down to munmap returning ENOMEM. This is a valid return, and there is even a bug report from 2 years ago on it. On mmap linux tries to extend current maps. This keeps the number of VM structs small and is probabably helpful for huge page coalescing. But if you now try to munmap a page, you can split that region in two parts and the os now has to created a second mapping for the split off region. (you map page 1, then 2, then 3, but linux keeps a single 1-3 VM mapping, then you unmap 2, so linux now needs two VM mapping 1 and 3).

Without recreating the internal VM logic (that will be os and kernel verison specific), there is no way to stop this. Zig just forces UB in production on this. There is no reason this should causes such abberant behavior.

This extends to a number of other calls and values:

Sometime these are application dependant (eg, EBADF on close doesn’t always mean a bug - it can just be single threaded code where keeping track of the last failure state isn’t worth it so you just reset the resources and keep on going). The cases where it is an error can have code written in a more zig-ified layer without loss (even then I think this is a terrible decision).
Sometimes you can’t even predict the returns from syscalls. With user created code handling them now (such as FUSE and BPF), often the exact semantics of an error value aren’t knowable).
Sometimes there is just no way to probe for certain features (eg, a network filesystem that doesn’t sopport copy_file or ACLs, or about FUSE mount you can think of). The best choice is try it and handle the error.
Zig ignores time outs in syscalls when it retries (the one exception is nanosleep, because it returns much time was left if it was a spurious wakeup or interrupted). That means if your highly efficent event loop gets interrupted, it cycles back with the same timeout, and there is nothing you can do about it. So when it does finally return any time-based events you was goign to run are probably well expired.
Some programs cant crash, and I can’t stres this enough. Financial trading system, control systems, medical devices, and others are often written to attempt to recover from error - even programmer error that might not have been uncovered in testing. You reset to a known good state and keep it moving. There migth be 20 other threads doing thing and you are willing to accept one task not completing, because the other 19 are critically important or the crash might cause hours of downtime losing millions of dollars or even just a minute and losing lives. So you don’t crash, you exit gracefully if possible, tying up loose ends, but Zig takes that decision away from you. We’re not all writing command line tools or web apps where the crash isn’t as costly.
OS bugs that might only be expressed in certain versions. You can’t expect to code around that, keeping track of different kernel version (even if zig were to) is just a waste of effort. (eg, io_uring is notorious for adding and changing behavior from version to version).

All these case’s arent just theoretical. There are bug reports in zig for some of them, and they happen in other languages too. If zig wants to have a strong system level, it should account for this. And as more people use zig, these will get more.

Zig’s job is not to pass judgement on people’s architecture, and that’s what it feels like when core decides they think your program should crash on a particular error.

Fix 3: remove the switch, and for errno returning systems, generate the error from an offset into an error type.

While you may not have exact bounds on what errors can be returned, those bounds are not always correct, and the code gen is far superior in both ReleaseFast and Small (in ReleaseSmall the difference is huge).

In the current method, in a realtively normal case, ReleaseFast compiled into about 55 instructions and two jump tables totaling about 45 8-byte entries (360 bytes). The propsed methos is 12 instructions with a single branch on the success fast path.
In ReleaseSmall, the example code is compiled into a long if-else chain of 8 conditional branches and about 40 instructions. The better method is the same 12 instruction including a branch on success.
The proposed method can shared code between all syscalls since it is the the same test for zero, compute the error code, then return in every case making ReleaseSmall even better. But the current method every jump table is unique to that syscall wasting a KB per 3 or 4 unique calls used.
It allows more modular handling of errors. The current semantics can be built on top of this at a Zig interface level instead of distoring the posix level interface. The choice to cause UB, crash, or try to recover should be in the hands of the programmer. Nobody should have to rewrite the entire networking stack just so they can make small change to handle an epoll error differently.
The switch stament makes handling the error in the surrounding code more expensive. Errno values are in a fairly constrained range, but since errors are not local there can be a large difference between when error.NoMemory and error.NotSupported are first used exploding out the range so a jump table can no longer be used. The better way here keeps the same range as they are all introduced at once. In the current method, the code generated depends on non-local definitions that might even depend on import order and change your switch from a table to a tree.

The godbolt below is a fairly average scenario. The integer constants are just a random assortment of values that could come from networking syscalls.

(edit: removetthe volatile from the signatures to make things more clear and updated the godbolt link.)

LucasSantos91 · June 7, 2024, 9:52pm

The volatiles in the code you linked are hurting the generated code. Remove them and the story gets very different. In ReleaseFast:

example.current:
        mov     qword ptr [rdi], 2
        ret

example.better:
        mov     dword ptr [rdi], 2
        mov     word ptr [rdi + 4], 0
        ret

The std library’s code saves one instruction.

The options you listed are what a switch statement gets compiled down to. The code in the std library is necessarily compiling to one of these, because that’s how a switch statement works. Just let the compiler pick whichever one is fastest.

Returning an error is marked cold:

pub noinline fn returnError(st: *StackTrace) void {
    @setCold(true);
    @setRuntimeSafety(false);
    addErrRetTraceAddr(st, @returnAddress());
}

The compiler is supposed to use this information to make the success path faster.

These APIs are well documented. Informing the compiler about assumptions that are enforced by the API helps the compiler generated faster and smaller code, which was the main point of your post. If the OS isn’t following the API, we can’t blame Zig. If Zig is making the wrong assumptions, the correct approach is to fix Zig’s assumption. I haven’t looked at the Github issues about this, but I’m sure they’re not so many. It’s definitely not worth giving up performance because of some punctual bugs here and there, that can be easily fixed in a one line pull request.

That’s why you debug in Debug mode.

Zig’ std library main job is to support the Zig toolchain. Almost everything in there is used by the toolchain directly. The decisions they made are the decisions that better fit this use case. The toolchain is not a medical device or a missile, and that is the case for the vast majority of programs written in Zig. If your use case matches Zig’s, you can benefit from the work that that they already did. If it doesn’t, then just call the underlying OS functions directly, it’s super easy.

nyc · June 7, 2024, 10:13pm

The volatile was just there to prevent stuff from being constant folded away, it isn’t meant be to an implementation, that is just the scafolding. Look inside the noinline functions to see the important part.

The code in the std library is necessarily compiling to one of these, because that’s how a switch statement works. Just let the compiler pick whichever one is fastest.

That/s the point. The switch in this case doesn’t produce very good code.

These APIs are well documented.

You are just not undestanding the whole point. Zig can’t make the right assumptions because there are none that cover all very legitimate cases.

It’s definitely not worth giving up performance because of some punctual bugs here and there, that can be easily fixed in a one line pull request.

something isn’t clear here. the way zig handles it now is slow AF. It is give up performance now.

I’ll just be very blunt. The APIs suck and are trash. No matter how well you document trash, it is still trash. And zig shouldn’t be making decisions on how you should handle errors esp when they are recoverable. That I can’t prevent munmap from causing UB or a number of other thigns that aren’t really errors from crashing my program, and that I need to convince zig core to please let me handle that errore even though they have zero clue as to the surrounding context is absolutely insane. The arrogance of zig is just astounding.

you cant get every bug out of every program, so you can’t just force UB because you think it should be handled differently. This absolutism will kill the language and make it entirely inappropriate except for trivial progress trees.

Zig’ std library main job is to support the Zig toolchain

WHAT?!?!?!?!? This a very new view to me. This can’t can’t be true.

AndrewCodeDev · June 7, 2024, 10:30pm

Alright, let’s tone it back a bit here. I get that you have strong feelings about this, but I’d like to see the temperature brought down a few kelvin.

Specifically, please lighten up on the profanity a bit here.

From the top…

It has been stated a few times that what will persist in the standard library are things that support the Zig toolchain. That doesn’t mean it’s the permanent plan of the Zig core team, but it’s the current plan until 1.0. We’ve had a few posts about that so yes, @LucasSantos91 is accurate.

@nyc…

Examples, please. It may be non-trivial, but I’d like to see an example (with code) of the kind of errors you are talking about and how the API does not allow them. I can think of a few but remember that we’re not the only people reading this - it’s for the good of the community to see examples so they may judge for themselves.

gonzo · June 7, 2024, 10:46pm

I would suggest that if you really want to have a productive discussion, you should not use this kind of assertions. If you want to be heard, you have to respect others first.

LucasSantos91 · June 7, 2024, 10:46pm

volatile forces the compiler to actually issue all loads and stores. It can’t keep anything in register, not even between two consecutive instructions. It can’t fold intermediary operations. This forces the code to be compiled in a way that is worse than Debug mode. Using volatile makes the comparison to real code completely meaningless.

nyc · June 7, 2024, 10:52pm

here. instead of arguing. I moved them completely out of the signature. Here. It’s the same function body just without the the memory read at top.

Now we can stop arguing about that pointless triviality.

To show more specifically:

here is the old top of the current function:

example.current:
        mov     eax, dword ptr [rip + example.gx]
        mov     ecx, dword ptr [rip + example.gy]
        cmp     rcx, 32
        ja      .LBB3_1
        jmp     qword ptr [8*rcx + .LJTI3_0]

and here is the new:

example.current:
        cmp     edx, 32
        ja      .LBB3_1
        mov     eax, edx
        jmp     qword ptr [8*rax + .LJTI3_0]

the rest is the same. the same transformation was done to the better implementation. That code had nothing to do with the implementation. It was just there to give the compiler a workload that couldn’t be constant folded or optimized away.

LucasSantos91 · June 8, 2024, 12:57am

The function you proposed is not equivalent to the current one:

pub fn main() void {
    const ys = [_]u8{ 11, 101, 23, 104, 32, 111, 9, 113 };
    std.debug.print("Current:\n", .{});
    for (ys) |y| {
        const r = current(1, y);
        std.debug.print("{any}\n", .{r});
    }
    std.debug.print("Better:\n", .{});
    for (ys) |y| {
        const rr = better(1, y);
        std.debug.print("{any}\n", .{rr});
    }
}

Output:

Current:
error.ab
error.df
error.ja
error.jg
error.ed
error.bh
error.ca
error.ij
Better:
error.bb
thread 1 panic: invalid error code
/app/example.zig:20:50: 0x103a6bd in to_error (output.s)
    const e: Err = @errorCast(@errorFromInt(base + uu));

Godbolt.
It gives the wrong result on the first value and crashes already on the second value.

nyc · June 8, 2024, 3:58am

sry didn’t make the error table large enough (only put a 100 entries in it. This wasn’t mean to be a real implementation just enough so you can see the generated assembly, but works now, and and asm didnt change (I just decremented the err number to be in the table bounds. I pulled the errno values from errno -l so they are a real world test).

The fixes proposed are all independant (eg, adding the success fastpast, the errno from it calculation, or removing the assumption of what is valid error to handle all have independent value and can done separately or together.

Please try to take the ideas from this, not a nailed down implemention.

Current:
error.bb
error.jf
error.cd
error.jg
error.dc
error.jh
error.aj
error.ji
Better:
11 error.bb
95 error.jf
23 error.cd
96 error.jg
32 error.dc
97 error.jh
9 error.aj
98 error.ji

LucasSantos91 · June 8, 2024, 1:30pm

You convinced me. You version produces much better machine code. I think the next step should be to take a function from the std library and demonstrate that it generates suboptimal machine code, and show how your approach could improve it. With that, you could open a very compelling issue on Github. It’s been stated that the current goal of Zig development is to make the compiler faster, and since these calls happen a lot in the compiler, there could be real gains there.
It is disappointing to see that the compiler didn’t do this transformation on its own. It seems like it had all the information available to it. We can assume that all the code out there that is switching on errors is suffering the same inefficiencies. Perhaps an even better approach would be to find why the compiler is generating suboptimal code, but that would require digging through LLVM, which might be out of reach for Zig. A focused approach, manually fixing the code that is suboptimal, might be more feasible from the maintainers perspective.

I changed the code slightly and found something:

noinline fn current(x: u32, y: u32) Err!u32 {
    const u = x / 2;
    const v = y;

    const base = @intFromError(Err.aa);

    return switch(v) {
        0 => u,
        else => @errorCast(@errorFromInt(base + @as(u16, @intCast(x)))),
    };
}

This version improves the machine code significantly:

example.current:
        test    esi, esi
        je      .LBB10_3
        lea     rax, [rsp - 8]
        mov     word ptr [rsp - 8], 12
        mov     rax, qword ptr [rax]
        mov     qword ptr [rdi], rax
        ret
.LBB10_3:
        mov     word ptr [rsp - 12], 0
        mov     dword ptr [rsp - 16], 0
        lea     rax, [rsp - 16]
        mov     rax, qword ptr [rax]
        mov     qword ptr [rdi], rax
        ret

So it seems to me that the compiler should have a taken the else => unreachable as a permission to treat the errors as packed set, which would have lead to the same code that you manually wrote, but the compiler is failing to take the hint.

nyc · June 8, 2024, 2:57pm

LLVM does a pretty decent job given the information it is given (given the arms, it used to be thought of being the knapsack problem, but now I think with certain constraints, finding an efficient way to pack the arms into tables is down to n^2 (but the general case is still hard).

It just doesn’t have enough info and there isn’t a way to give priorities to arms (that would almost definitely put the optimizing it back in the NP domain), and that would definitely not be a good solution – way too complex. glibc does the check for 0 before into any errno handling code to fast path out success too. There is just no way for LLVM to know 0 is a special case.

As for the rest of the code gen, the example I used as the errno codes are in two general blocks (and this seems to be common, most syscalls will return some general errros like EINVAL that are all clustered in the lower numbers then some area specific calls that are clustered together like network calls returning ENOROUTE or ECONNREFUSED). And there might be one of two outiers.

LLVM does a good job that. It handles outliers, branches on one of the cluster bounds, and then goes in one of two jump tables depending on the cluster. Pretty good knowing what it knows.

Thanks for the code fix up - I’ll look at it later today.