Better syscall errno handling in Zig
Zig’s error handling for syscalls has a few very large correctness problems, some performance issues, and a code size issue in ReleaseSmall. None of them are difficult to fix.
(There’s a godbolt link at the bottom to play with the code generated for the switch statement.)
Currently, at the posix layer, where all the arguments are still in the form of u32 flags and other OS primitives, Zig returns Zig errors with a switch on the syscall return value:
const rc = system.some_syscall(...);
switch(errno(rc)) {
// ...
};
There are multiple options for this to be compiled to:
- A jump table, where an offset into a table is computed from the switched-on value. An address is loaded from the table, and an indirect jump is made. This can only be done when the cases are densely packed.
- An if-else chain that just does comparisons one after the other. The compiler can reorder them (and often does), since the switch statement doesn’t imply any ordering or importance of the branches. This is done when there are only a few arms.
- A comparison tree that works like a binary search through the arm values. It picks a value that splits the space in half, jumps to the next level, and does the same again. With 8 arms, for example, instead of evaluating anywhere from 1 to 8 conditionals, you evaluate 3 every time.
(There are a few other potential ways like a radix tree, but I haven’t seen them in practice.)
These can be mixed together. For example, if you have two compact ranges, a compiler will often branch on one of the bounds to either a lookup table or a short if-else chain.
Notice how your ordering isn’t considered. The compiler assumes every branch is equally likely. But the cost to your runtime is rarely symmetric for errors and success. You almost always want the success case to be as fast as possible, and you are willing to pay more on the errors.
Fix 1: put an if (rc == 0) return ret_val before going into the switch.
Even in the best case (a jump table), the switch is going to be slower, since it becomes an indirect jump with a complex addressing mode. That involves more CPU resources, and it is less likely to be predicted correctly, because both the branch and its target have to be predicted. The simple if will not even involve a jump on the fast path if done properly (and the switch can be put in a function and marked cold to streamline it even further and prevent L1i pollution).
In the other cases, success will be multiple branches away, because it has to evaluate down to a leaf of the search tree the compiler built (you can see this in the godbolt code gen).
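A minimal sketch of what this could look like, assuming Linux-style raw return values (where -4095..-1 encodes -errno) and the builtin syntax of Zig 0.11 or later. The names errnoFromRc, slowPath, and checkedRead, the error set, and the particular arms are hypothetical and chosen for illustration; they are not taken from the standard library.

```zig
const std = @import("std");

// Linux errno numbers, used here for illustration only.
const EINTR: usize = 4;
const EAGAIN: usize = 11;
const ENOMEM: usize = 12;

// Hypothetical error set for this sketch.
const ReadError = error{ Interrupted, WouldBlock, OutOfMemory, Unexpected };

/// Linux-style decoding: raw values in -4095..-1 carry -errno, anything else is success.
fn errnoFromRc(rc: usize) usize {
    const signed: isize = @bitCast(rc);
    if (signed > -4096 and signed < 0) {
        const e: usize = @intCast(-signed);
        return e;
    }
    return 0;
}

/// The error mapping is kept out of line so the caller's fast path stays small.
/// It could also be marked cold; the builtin for that differs between Zig
/// versions, so it is left out of this sketch.
noinline fn slowPath(rc: usize) ReadError {
    return switch (errnoFromRc(rc)) {
        EINTR => error.Interrupted,
        EAGAIN => error.WouldBlock,
        ENOMEM => error.OutOfMemory,
        else => error.Unexpected,
    };
}

fn checkedRead(rc: usize) ReadError!usize {
    // Success is tested first, before any switch is entered.
    if (errnoFromRc(rc) == 0) return rc;
    return slowPath(rc);
}

test "success skips the switch, errors still map" {
    try std.testing.expectEqual(@as(usize, 42), try checkedRead(42));
    const eagain: usize = @bitCast(@as(isize, -11));
    try std.testing.expectError(error.WouldBlock, checkedRead(eagain));
}
```

Whether the compiler actually lays the success path out as straight-line code still depends on the optimizer, so it is worth checking the generated assembly.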
Fix 2: don’t use unreachable on syscall errors.
An error return is not a bug. It can be a sign of buggy code, but sometimes everything is working as expected. There have been numerous GitHub issues (some have been open for years) that point out places where Zig makes incorrect assumptions about what is a programmer error and what isn’t.
Sometimes it isn’t possible to keep track of everything needed to prevent an error from being returned by the syscall. A couple of weeks ago a stress test for a long-running process crashed. It was not easy to track down, because in release mode unreachable forces UB, and that is a terrible debugging experience. It cannot be logged or observed in any way; you just corrupt the process and it bails somewhere else. That shit sucks.
It was tracked down to munmap returning ENOMEM. This is a valid return, and there is even a bug report from 2 years ago about it. On mmap, Linux tries to extend current maps. This keeps the number of VM structs small and is probably helpful for huge page coalescing. But if you now munmap a page, you can split that region into two parts, and the OS has to create a second mapping for the split-off region. (You map page 1, then 2, then 3, but Linux keeps a single 1-3 VM mapping. Then you unmap 2, so Linux now needs two VM mappings, one for 1 and one for 3.)
Without recreating the internal VM logic (which will be OS and kernel version specific), there is no way to prevent this. Zig just forces UB in production here. There is no reason this should cause such aberrant behavior.
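A sketch of what handling this could look like if the posix layer returned the error instead of hitting unreachable. The function names, the Outcome enum, and the policy of deferring the unmap are all hypothetical; the errno numbers are the Linux ones.

```zig
const std = @import("std");

// Linux errno numbers, used here for illustration only.
const ENOMEM: usize = 12;
const EINVAL: usize = 22;

// Hypothetical error set: ENOMEM is an ordinary, reportable outcome.
const UnmapError = error{ OutOfMemory, InvalidArgument, Unexpected };

/// Maps the errno of an munmap-style call to an error without using unreachable.
fn unmapResult(errno_value: usize) UnmapError!void {
    switch (errno_value) {
        0 => return,
        ENOMEM => return error.OutOfMemory,
        EINVAL => return error.InvalidArgument,
        else => return error.Unexpected,
    }
}

// What the application decided to do; this policy belongs to the caller.
const Outcome = enum { released, deferred, failed };

fn releaseRegion(errno_value: usize) Outcome {
    unmapResult(errno_value) catch |err| switch (err) {
        // The kernel could not split the mapping: keep it and retry later.
        error.OutOfMemory => return .deferred,
        error.InvalidArgument, error.Unexpected => return .failed,
    };
    return .released;
}

test "ENOMEM from munmap is recoverable" {
    try std.testing.expectEqual(Outcome.released, releaseRegion(0));
    try std.testing.expectEqual(Outcome.deferred, releaseRegion(ENOMEM));
}
```

The point is only that the caller gets to pick between logging, retrying, leaking the mapping, or crashing.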
This extends to a number of other calls and values:
- Sometimes these are application dependent (e.g., EBADF on close doesn’t always mean a bug - it can just be single-threaded code where keeping track of the last failure state isn’t worth it, so you just reset the resources and keep on going). The cases where it is an error can have code written in a more Zig-ified layer without loss (even then I think this is a terrible decision).
- Sometimes you can’t even predict the returns from syscalls. With user-created code handling them now (such as FUSE and BPF), the exact semantics of an error value often aren’t knowable.
- Sometimes there is just no way to probe for certain features (e.g., a network filesystem that doesn’t support copy_file or ACLs, or just about any FUSE mount you can think of). The best choice is to try it and handle the error.
- Zig ignores timeouts in syscalls when it retries (the one exception is nanosleep, because it returns how much time was left if it was a spurious wakeup or was interrupted). That means if your highly efficient event loop gets interrupted, it cycles back with the same timeout, and there is nothing you can do about it. So when it does finally return, any time-based events you were going to run are probably well expired.
- Some programs can’t crash, and I can’t stress this enough. Financial trading systems, control systems, medical devices, and others are often written to attempt to recover from errors - even programmer errors that might not have been uncovered in testing. You reset to a known good state and keep it moving. There might be 20 other threads doing things, and you are willing to accept one task not completing, because the other 19 are critically important, or because a crash might cause hours of downtime and lose millions of dollars, or cause just a minute of downtime and cost lives. So you don’t crash; you exit gracefully if possible, tying up loose ends. But Zig takes that decision away from you. We’re not all writing command line tools or web apps where a crash isn’t as costly.
- OS bugs that might only be expressed in certain versions. You can’t expect to code around that; keeping track of different kernel versions (even if Zig were to) is just a waste of effort. (E.g., io_uring is notorious for adding and changing behavior from version to version.)
These cases aren’t just theoretical. There are bug reports in Zig for some of them, and they happen in other languages too. If Zig wants to have a strong system level, it should account for this. And as more people use Zig, these will come up more often.
Zig’s job is not to pass judgement on people’s architecture, and that’s what it feels like when core decides they think your program should crash on a particular error.
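For the timeout point in the list above, a sketch of retrying against a fixed deadline instead of restarting with the original timeout. Everything here is hypothetical: waitUntilDeadline, nowFn, and waitFn stand in for a monotonic clock and an interruptible blocking call, and the fakes exist only to make the example testable.

```zig
const std = @import("std");

const WaitError = error{ Interrupted, TimedOut };

/// Time left until `deadline_ms`, or 0 if it has already passed.
fn remainingMs(deadline_ms: u64, now_ms: u64) u64 {
    return if (now_ms >= deadline_ms) 0 else deadline_ms - now_ms;
}

/// Retries an interruptible wait against a fixed deadline.
/// `nowFn` and `waitFn` are stand-ins (assumptions of this sketch) for a
/// monotonic clock and a blocking call that takes a timeout in milliseconds.
fn waitUntilDeadline(
    timeout_ms: u64,
    nowFn: *const fn () u64,
    waitFn: *const fn (u64) WaitError!usize,
) WaitError!usize {
    const deadline_ms = nowFn() + timeout_ms;
    var remaining = timeout_ms;
    while (true) {
        return waitFn(remaining) catch |err| switch (err) {
            error.Interrupted => {
                // Shrink the timeout instead of starting over from scratch.
                remaining = remainingMs(deadline_ms, nowFn());
                if (remaining == 0) return error.TimedOut;
                continue;
            },
            error.TimedOut => return error.TimedOut,
        };
    }
}

// Fakes for the demonstration below.
var fake_clock_ms: u64 = 0;
var last_timeout_ms: u64 = 0;

fn fakeNow() u64 {
    return fake_clock_ms;
}

// First call: 30 ms pass and the wait is interrupted. Later calls succeed.
fn fakeWait(timeout: u64) WaitError!usize {
    last_timeout_ms = timeout;
    if (fake_clock_ms == 0) {
        fake_clock_ms = 30;
        return error.Interrupted;
    }
    return 1;
}

test "interrupted wait resumes with the remaining time" {
    fake_clock_ms = 0;
    const n = try waitUntilDeadline(100, &fakeNow, &fakeWait);
    try std.testing.expectEqual(@as(usize, 1), n);
    try std.testing.expectEqual(@as(u64, 70), last_timeout_ms);
}
```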
Fix 3: remove the switch, and for errno returning systems, generate the error from an offset into an error type.
While you may lose the exact bounds on which errors can be returned, those bounds are not always correct anyway, and the code gen is far superior in both ReleaseFast and ReleaseSmall (in ReleaseSmall the difference is huge).
- With the current method, in a relatively normal case, ReleaseFast compiles to about 55 instructions and two jump tables totaling about 45 8-byte entries (360 bytes). The proposed method is 12 instructions with a single branch on the success fast path.
- In ReleaseSmall, the example code compiles to a long if-else chain of 8 conditional branches and about 40 instructions. The better method is the same 12 instructions, including a branch on success.
- The proposed method can share code between all syscalls, since it is the same sequence in every case (test for zero, compute the error code, return), which makes ReleaseSmall even better. With the current method, every jump table is unique to its syscall, wasting a KB per 3 or 4 unique calls used.
- It allows more modular handling of errors. The current semantics can be built on top of this at a Zig interface level instead of distorting the posix-level interface. The choice to cause UB, crash, or try to recover should be in the hands of the programmer. Nobody should have to rewrite the entire networking stack just so they can make a small change to handle an epoll error differently.
- The switch statement makes handling the error in the surrounding code more expensive. Errno values are in a fairly constrained range, but since errors are not local, there can be a large gap between the values assigned to error.NoMemory and error.NotSupported depending on where each is first used, which explodes the range so a jump table can no longer be used. The better way keeps them in the same range, since they are all introduced at once. With the current method, the code generated depends on non-local definitions that might even depend on import order and can change your switch from a table to a tree.
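One way to sketch a mapping without a per-syscall switch is a single shared table indexed by errno. This is a variant of the idea rather than the exact offset computation: it uses an explicit comptime table instead of relying on how the compiler numbers error values. The names SysError, errno_table, toError, and check are made up for illustration, only a handful of Linux errno values are filled in, and the syntax assumes Zig 0.11 or later.

```zig
const std = @import("std");

// Hypothetical error set that mirrors errno names one to one.
const SysError = error{ EPERM, ENOENT, EINTR, EAGAIN, ENOMEM, EUNKNOWN };

// Highest errno covered by this sketch; a real table would cover the full range.
const max_errno = 12;

// One table shared by every syscall wrapper, built at compile time.
const errno_table: [max_errno + 1]SysError = blk: {
    var t = [_]SysError{error.EUNKNOWN} ** (max_errno + 1);
    t[1] = error.EPERM;
    t[2] = error.ENOENT;
    t[4] = error.EINTR;
    t[11] = error.EAGAIN;
    t[12] = error.ENOMEM;
    break :blk t;
};

/// Bounds check plus one load; no per-syscall switch.
fn toError(errno_value: usize) SysError {
    return if (errno_value < errno_table.len) errno_table[errno_value] else error.EUNKNOWN;
}

/// Linux-style raw return: values in -4095..-1 encode -errno.
fn check(rc: usize) SysError!usize {
    const signed: isize = @bitCast(rc);
    // Everything outside the error range is success and returns immediately.
    if (signed >= 0 or signed < -4095) return rc;
    const errno_value: usize = @intCast(-signed);
    return toError(errno_value);
}

test "shared errno mapping" {
    try std.testing.expectEqual(@as(usize, 7), try check(7));
    const enomem: usize = @bitCast(@as(isize, -12));
    try std.testing.expectError(error.ENOMEM, check(enomem));
}
```

Callers that want the current semantics can still switch on the result in a higher layer and decide for themselves which values they consider impossible.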
The godbolt below is a fairly average scenario. The integer constants are just a random assortment of values that could come from networking syscalls.
(edit: removed the volatile from the signatures to make things clearer, and updated the godbolt link.)