Zig-aio: lightweight abstraction over io_uring and coroutines

I lately got annoyed by the racy nature of traditional event loop code and decided to finally try coroutines for asynchronous IO. The result is zig-aio. It’s not battle-tested yet, but I’m currently refactoring one of my projects to use it, and I hope the building blocks mature from there.

It currently only supports linux (and only io_uring), but I plan on adding other backends: kqueue (bsd/mac), IOCP (windows), and epoll (as a linux fallback), which will emulate the io_uring-like interface. The IO operations currently exposed are only the ones that can be supported on all three platforms, but later I might add platform-specific stuff, like fixed / registered buffers which allow io_uring to do buffer operations completely inside the kernel. I might specialize some stuff like multishot_accept, which is generally useful, but to properly support it on all platforms I think it needs a separate interface from the main one.

The coroutines api allows you to write code in a traditional blocking fashion, except it won’t actually block your whole program. It’s designed so that whenever coroutines start doing IO, the IO operations get batched together for the next scheduler tick. This means that if you have two running coroutines that both start blocking on IO, their IO operations actually get merged into one io_uring submit. I expect the coroutines api to still change a bit, and I may add some basic synchronization primitives and perhaps channels, but we’ll see.
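To make the batching idea concrete, here is a minimal sketch in plain Zig. It is not zig-aio's code: `Op`, `Batch`, `queue` and `flush` are hypothetical names, and `flush` only counts where a real backend would make one io_uring submit for everything queued during the tick.

```zig
const std = @import("std");

/// Hypothetical operation record; a real backend would carry fds, buffers and so on.
const Op = struct { task: u8, kind: enum { read, write } };

/// Hypothetical batch: tasks queue ops here instead of issuing one syscall each.
const Batch = struct {
    ops: [16]Op = undefined,
    len: usize = 0,
    submits: usize = 0,

    fn queue(self: *Batch, op: Op) error{BatchFull}!void {
        if (self.len == self.ops.len) return error.BatchFull;
        self.ops[self.len] = op;
        self.len += 1;
    }

    /// Stand-in for a single submit covering everything queued this tick.
    /// Returns how many operations went out together.
    fn flush(self: *Batch) usize {
        const n = self.len;
        if (n > 0) self.submits += 1;
        self.len = 0;
        return n;
    }
};

pub fn main() !void {
    var batch = Batch{};
    // Two different tasks start IO during the same scheduler tick.
    try batch.queue(.{ .task = 0, .kind = .read });
    try batch.queue(.{ .task = 1, .kind = .write });
    const n = batch.flush();
    std.debug.print("submitted {d} ops in {d} submit call(s)\n", .{ n, batch.submits });
}
```

The point of the sketch is only the shape: the cost of entering the kernel is paid once per tick, not once per operation.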

Once it’s more mature I’ll write up docs and a proper readme.

Added some crude docs (I’m not good at writing)
https://cloudef.github.io/zig-aio

Added ThreadPool so the coro code can be mixed with blocking code, similar to Tokio’s spawn_blocking.

https://cloudef.github.io/zig-aio/coro-blocking-code

There’s now a fallback backend. It’s poll based and only tested on linux right now; I doubt it works elsewhere yet because it uses some linux-specific things. I also want to make it use either epoll or kqueue if available, because poll requires you to iterate all the fds (granted, it’s not a big issue here, because the fds are short lived). I gotta say the readiness-based poll/epoll/kqueue model really sucks, and io_uring really is the better way to do async io. I’ve heard Windows IOCP is similar to io_uring, so I’m hoping I can fit it into this model well.

Results of strace -c for the coro example with log prints disabled for both io_uring and fallback backends.

io_uring

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
  0.00    0.000000           0         2           close
  0.00    0.000000           0        11           mmap
  0.00    0.000000           0        11           munmap
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0        84           msync
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         2           setsockopt
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           gettid
  0.00    0.000000           0         2           prlimit64
  0.00    0.000000           0         2           io_uring_setup
  0.00    0.000000           0         6           io_uring_enter
  0.00    0.000000           0         1           io_uring_register
------ ----------- ----------- --------- --------- ------------------
100.00    0.000000           0       131           total

fallback

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 70.41    0.000138           4        32           munmap
 27.55    0.000054           0       494           msync
  2.04    0.000004           4         1           close
  0.00    0.000000           0        14           read
  0.00    0.000000           0        23           poll
  0.00    0.000000           0        32           mmap
  0.00    0.000000           0        16           mprotect
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         2           setsockopt
  0.00    0.000000           0        16           clone
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           gettid
  0.00    0.000000           0        33         5 futex
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1           eventfd2
  0.00    0.000000           0         2           prlimit64
------ ----------- ----------- --------- --------- ------------------
100.00    0.000196           0       677         5 total

io_uring without GeneralPurposeAllocator:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 45.31    0.000087          14         6           io_uring_enter
 24.48    0.000047          23         2           io_uring_setup
  9.38    0.000018           4         4           mmap
  6.77    0.000013           3         4           munmap
  5.73    0.000011           5         2           close
  2.60    0.000005           5         1           bind
  2.08    0.000004           4         1           listen
  1.56    0.000003           1         2           setsockopt
  1.56    0.000003           3         1           io_uring_register
  0.52    0.000001           1         1           gettid
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         2           prlimit64
------ ----------- ----------- --------- --------- ------------------
100.00    0.000192           5        33           total

Are you using io_uring submission queue polling? That reduces syscalls quite a bit.

stupid question - is it possible to watch io_uring completion events via epoll readiness events?

related topic

some OS kernels were designed to force “application programs”
to use “operation complete notifications”
(kinda i have done that for you already, look into your blue mail box)

This works well for data storage devices.
They are fast and are becoming faster and faster, nvme and who knows what will come next.

… but some OS kernels (and communication protocols)
were designed with another purpose in mind -
they imply that an “application program” waits until the OS wakes up a process
and it will copy data either to or from user/kernel address space

this works quite well for… ok, for “internet”, for communications between “chips”.

so

  • “async” i/o basically works well for local data storage devices
  • readiness notifications work better over “internet”

“what is wrong? what is right? I don’t know” :slight_smile:

fuck me blue! // pardon my french :slight_smile:
did not know about -c :frowning: rtfm, as it is

i’ve just strace-d some of my “24x7” services (i mean systemd services) with -c and made many-many “discoveries”

I’m not using that. While it reduces syscalls, it also requires the application to be designed around it. If the application is not designed around it, it simply causes a kernel thread to spin a lot for no reason. I may make it a comptime tunable at some point, but right now that’s not a goal, and I think it’s not possible to replicate on the fallback backend.

The io_uring instance does use the flags SINGLE_ISSUER | COOP_TASKRUN though. I’m not using DEFER_TASKRUN because I noticed it actually increases the number of syscalls, but I need to investigate whether it’s still a good idea from a latency perspective.

You need to use the io_uring-specific api to wait for it; you can specify the minimum number of events you want to wait for.

It’s possible to make io_uring post events to an eventfd; you can’t poll io_uring itself because it’s not a fd.

In zig-aio, I’ll add io_uring specific operations later, right now the operations are something that’s portable across platforms.

Though currently the portable way to do this is to use aio.EventSource and the aio.NotifyEventSource operation, or simply to call Dynamic.complete in a while loop with .blocking mode if you only need to wait locally.

Readiness only works ok if you do it on a small scale or have a static set of fds that you wait for. If you can use it, io_uring is simply better in every scenario right now.

The readiness itself is not the problem, but rather the fact that when something is reported ready, you have a potential race condition where the fd will still block when you use it. Now there’s a bunch of apis that let you create NONBLOCK fds, but sometimes you have to fcntl, and sometimes it’s simply not supported. Waiting for readiness and doing the operation are also each a syscall. It’s honestly a mess, so people often cop out to a thread pool for a general purpose async event loop. High performance servers have specific things they need to do, so these issues probably don’t affect them so much. The readiness-based model also does not let you do async operations that don’t have a notion of readiness, like fsync.
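The race described above can be reproduced with a pipe. This is a minimal sketch, assuming a posix target and a Zig version where these wrappers live under `std.posix` (roughly 0.12 to 0.14); `readinessRace` is a made-up name, and the "other consumer" is simulated inline rather than running on a second thread.

```zig
const std = @import("std");
const posix = std.posix;

/// Returns true if a read that poll() reported as ready would have blocked.
fn readinessRace() !bool {
    // A non-blocking pipe stands in for an fd shared by two consumers.
    const fds = try posix.pipe2(.{ .NONBLOCK = true });
    defer posix.close(fds[0]);
    defer posix.close(fds[1]);

    _ = try posix.write(fds[1], "x");

    // Syscall 1: ask whether the read end is ready.
    var pfds = [_]posix.pollfd{.{ .fd = fds[0], .events = posix.POLL.IN, .revents = 0 }};
    const ready = try posix.poll(&pfds, 0);
    std.debug.assert(ready == 1);

    // Another consumer drains the byte between our poll and our read.
    var buf: [1]u8 = undefined;
    _ = try posix.read(fds[0], &buf);

    // Syscall 2: the fd was "ready", yet there is nothing left to read.
    _ = posix.read(fds[0], &buf) catch |err| switch (err) {
        error.WouldBlock => return true, // a blocking fd would hang here instead
        else => return err,
    };
    return false;
}

pub fn main() !void {
    const raced = try readinessRace();
    std.debug.print("ready fd would have blocked: {}\n", .{raced});
}
```

With a completion-based interface the kernel performs the read itself and reports the result, so there is no window between "ready" and "done" for another consumer to slip into.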

It’s not in the kernel yet, but I’m hoping we get a clone operation in io_uring soon too, so we can spawn child processes using it as well.

I’m not proud of it, but I actually added a fallback backend recently which emulates io_uring using the readiness model. If you are interested, the code is here:

https://github.com/Cloudef/zig-aio/blob/master/src/aio/Fallback.zig

All tests pass, and darwin and bsd are now supported. Do note, though, that both are supported through the fallback backend, which is not optimal for those specific systems. It’s fine for common asynchronous code and such, but I would not write a web server for bsd/darwin using it.

That said, it’s possible to combine a designated event loop with zig-aio using either aio.EventSource or coro.ThreadPool.yieldForCompletition :wink:

In fact I think I might support multishot operations such as tcp socket accept loop this way, we’ll see.

Added some broken initial windows code.

The plan is to first get the fallback backend to work at least with non-socket operations.

Then later combine RIO (sockets), IoRing (disk IO) and the current fallback backend for anything that the former two don’t support. Most likely this needs 3 threads which multiplex with 3 handles using WaitForMultipleObjects. Ideally we may not need the fallback backend at all for windows, but we’ll see.

It would be nice if IoRing was more general purpose just like io_uring. The interface is pretty much identical to io_uring, but it’s only for disk IO. Still the forever File vs Socket vs Handle dilemma in windows land, I guess.

Added a new example

The examples should now compile and work on windows. The tests don’t pass under wine, but I think I might be hitting wine-specific bugs similar to this issue: https://github.com/ziglang/zig/issues/5988

Windows right now uses the same fallback backend as the posix platforms that are not linux, so it’s not optimal and has some serious limitations, such as a limit of 64 pollable handles (WaitForMultipleObjects), and each socket operation will block on a separate “kludge” thread pool.

Oh, what is this “kludge” thread pool? The fallback backend has to support many platforms, and some platforms have peculiar issues: some resources simply can’t be polled, or to be able to poll them you have to add a bunch of hairy code that basically mixes two completely different poll models into one. The simplest way out is to not handle these corner cases directly. Instead, whenever an operation is known to be finicky on some platform, it gets performed on this kludge thread pool, which has a higher max thread limit (fallback_max_kludge_threads: usize = 1024 in aio.Options as of the time of writing). The thread pools in zig-aio only use as many threads as necessary, and threads time out if they are inactive for 5 seconds or more, but the main thread pool in the fallback backend maxes out at the CPU core count by default.
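The routing decision can be pictured roughly like this. It is a sketch, not the actual Fallback.zig logic: `OpKind`, `Route` and `routeFor` are hypothetical names, and only the two cases mentioned in this thread (windows sockets and tty reads on macos) are encoded.

```zig
const std = @import("std");
const builtin = @import("builtin");

/// Hypothetical operation kinds, not zig-aio's real operation list.
const OpKind = enum { read, read_tty, socket_io };

/// Where the fallback backend would run the operation.
const Route = enum { readiness_poll, kludge_pool };

fn routeFor(os: std.Target.Os.Tag, op: OpKind) Route {
    return switch (op) {
        // macos cannot poll /dev/tty, so tty reads block on a kludge thread.
        .read_tty => if (os == .macos) Route.kludge_pool else Route.readiness_poll,
        // On windows each socket operation blocks on the kludge pool.
        .socket_io => if (os == .windows) Route.kludge_pool else Route.readiness_poll,
        // Everything else goes through the normal readiness loop.
        .read => Route.readiness_poll,
    };
}

pub fn main() void {
    const route = routeFor(builtin.target.os.tag, .read_tty);
    std.debug.print("read_tty on this platform: {s}\n", .{@tagName(route)});
}
```

Keeping the exceptions in one table-like function is what makes the kludge cheap: the poll loop itself never has to learn about the odd platforms.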

Besides windows, this kludge thread pool is used on macos for reading /dev/tty. On MacOS, polling /dev/tty has been broken since OS X Tiger and has never been fixed. You can ask for readiness of /dev/tty with pselect and select, but those apis are completely different from poll and would need a dedicated thread to multiplex with the main one, with its own event signal and fd bookkeeping system. I did write this code for a bit, until I decided to nuke it all and came up with this kludge thread pool solution in the end.

Speaking of /dev/tty, I added a new operation, aio.ReadTty, which you should use instead of aio.Read when reading a tty. The aio.Read operation on MacOS won’t work on /dev/tty anymore, and instead throws that pesky EINVAL at you, so the kludge hack is only applied when you use aio.ReadTty. aio.ReadTty also works on windows, but what it returns is different depending on the mode you use.

The aio.ReadTty operation currently looks like this:

pub const ReadTty = struct {
    pub const TranslationState = switch (builtin.target.os.tag) {
        .windows => struct {
            /// Needed for accurate resize information
            stdout: std.fs.File,
            last_mouse_button_press: u16 = 0,

            pub fn init(stdout: std.fs.File) @This() {
                return .{ .stdout = stdout };
            }
        },
        else => struct {
            pub fn init(_: std.fs.File) @This() {
                return .{};
            }
        },
    };

    pub const Mode = union(enum) {
        /// On windows buffer will contain INPUT_RECORD structs.
        /// The length of the buffer must be able to hold at least one such struct.
        direct: void,
        /// Translate windows console input into ANSI/VT/Kitty compatible input.
        /// Pass reference of the TranslationState, for correct translation a unique reference per stdin handle must be used.
        translation: *TranslationState,
    };

    pub const Error = std.posix.PReadError || error{NoSpaceLeft} || SharedError;
    tty: std.fs.File,
    buffer: []u8,
    out_read: *usize,
    mode: Mode = .direct,
    out_id: ?*Id = null,
    out_error: ?*Error = null,
    link: Link = .unlinked,
    userdata: usize = 0,
};

It’s not implemented yet, but the plan is to allow aio.ReadTty to translate the console input on windows into the VT escape sequences that are normally used outside windows land, lessening the platform-specific code needed by the consumer.
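As an illustration of what such a translation step looks like, here is a small sketch. `Key` and `toVt` are hypothetical and far simpler than a real INPUT_RECORD; the only facts relied on are the standard VT sequences for the arrow keys (ESC [ A/B/C/D) and that a too-small buffer maps naturally to the NoSpaceLeft error already in ReadTty's error set.

```zig
const std = @import("std");

/// Hypothetical, heavily simplified key event.
const Key = union(enum) {
    char: u8,
    up: void,
    down: void,
    right: void,
    left: void,
};

/// Writes the VT byte sequence for `key` into `buf` and returns the written part.
fn toVt(key: Key, buf: []u8) error{NoSpaceLeft}![]const u8 {
    var one: [1]u8 = undefined;
    const seq: []const u8 = switch (key) {
        // Plain characters pass through unchanged.
        .char => |c| blk: {
            one[0] = c;
            break :blk one[0..1];
        },
        // Cursor keys become the usual CSI sequences.
        .up => "\x1b[A",
        .down => "\x1b[B",
        .right => "\x1b[C",
        .left => "\x1b[D",
    };
    if (seq.len > buf.len) return error.NoSpaceLeft;
    @memcpy(buf[0..seq.len], seq);
    return buf[0..seq.len];
}

pub fn main() !void {
    var buf: [8]u8 = undefined;
    const seq = try toVt(.left, &buf);
    std.debug.print("left arrow translates to {d} bytes\n", .{seq.len});
}
```

A consumer that already parses VT input on linux and macos could then read the same bytes on windows without a separate code path.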

Next up, I think, is a proper backend for windows, now that I have something to reference against. I might have to set up a windows VM to test on a real system rather than wine, as it seems I can’t rely on wine for many things.

I also want to start adding more examples and benchmarks soon.
Recently I cleaned up the platform abstractions a bit, and put commonly used things like data structures into their own mini module.

The BSD coverage should be good, but I haven’t tested BSD at all
https://github.com/Cloudef/zig-aio/commit/7df88b4982e7239413d7c5faaab46e91414760d8

I made some changes for WASI
https://github.com/Cloudef/zig-aio/commit/a91ca0320b23212661afa9f773cf533547c179cb
But I’m not sure yet what’s the WASI equivalent for some things that zig-aio requires.

Also, I need to add the rest of the io_uring-specific ops, flags and modes, as I do want people to be able to use the io_uring backend and expect most io_uring stuff to work the same. The io_uring backend is quite a direct translation and honestly does not do much.
https://github.com/Cloudef/zig-aio/blob/master/src/aio/IoUring.zig

Aside from the aio module stuff above, the coro module has also gone through some mass refactoring. It’s now split into smaller pieces, the code is much smaller (!) https://github.com/Cloudef/zig-aio/tree/master/src/coro and the API has had some major changes, making the overall abstraction slightly simpler. I want to keep the coroutine part simple and instead give a reliable foundation to build on top of. I’m quite happy with the current state. The refactors were mainly encouraged by me adding an aio backend and example to libvaxis, where I noticed some hairy bits and corner cases that weren’t looking good.

Now coro tasks won’t die on their own; instead their results always have to be collected, or they have to be canceled. It’s also possible to ignore this and deinit the scheduler that spawned the tasks, and it will try to cancel all the tasks in the end. But it’s still a good idea to always collect or cancel the tasks, because the scheduler has no idea what the tasks are doing and simply tries to cancel them from newest to oldest, hoping it will go through. If it doesn’t, it will simply shut down the IO and deallocate the managed stacks for the tasks. If the tasks needed cleanup, that’s no longer possible, and memory may thus leak.
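The teardown order can be sketched in a few lines. This is a toy model, not the coro scheduler: `Task`, `State` and `cancelRemaining` are hypothetical, and "cancel" here is just a state change where the real scheduler has to unwind a coroutine.

```zig
const std = @import("std");

const State = enum { running, collected, canceled };

/// Hypothetical task record; slice order is spawn order (oldest first).
const Task = struct { id: u32, state: State = .running };

/// Cancels whatever is still running, newest first.
/// Returns how many tasks the scheduler had to cancel on its own.
fn cancelRemaining(tasks: []Task) usize {
    var canceled: usize = 0;
    var i = tasks.len;
    while (i > 0) {
        i -= 1;
        if (tasks[i].state == .running) {
            std.debug.print("canceling task {d}\n", .{tasks[i].id});
            tasks[i].state = .canceled;
            canceled += 1;
        }
    }
    return canceled;
}

pub fn main() void {
    var tasks = [_]Task{ .{ .id = 1 }, .{ .id = 2 }, .{ .id = 3 } };
    tasks[1].state = .collected; // the caller collected this result itself
    const n = cancelRemaining(&tasks);
    std.debug.print("scheduler had to cancel {d} task(s)\n", .{n});
}
```

Anything the caller collects or cancels explicitly never reaches this last-resort path, which is why doing so is the safer habit.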

Vaxis aio integration and example are in this commit:
https://github.com/rockorager/libvaxis/commit/b84f9e58a6fd71328f55f994f8775f81f9849a08

The Windows IOCP backend is now the default for Windows. It’s not yet complete, but it has feature parity with the fallback backend (for what that was on windows, at least).

The other thing is that I got annoyed by non-linux platforms having lackluster timer facilities compared to io_uring’s timeout or timerfd, so there’s now a TimerQueue in the minilib directory whose job is to provide timerfd feature parity on all the supported platforms.
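The general idea behind a user-space timer queue is to keep all deadlines in one structure and derive a single wait timeout from the earliest one. The sketch below shows that idea only; it is not the minilib TimerQueue, and `Timer`, `schedule`, `nextTimeout` and `popExpired` are hypothetical names. Deadlines are absolute nanoseconds on whatever monotonic clock the platform offers.

```zig
const std = @import("std");

/// Hypothetical timer entry.
const Timer = struct { deadline_ns: u64, userdata: usize };

const TimerQueue = struct {
    timers: [32]Timer = undefined,
    len: usize = 0,

    fn schedule(self: *TimerQueue, timer: Timer) error{NoSpaceLeft}!void {
        if (self.len == self.timers.len) return error.NoSpaceLeft;
        self.timers[self.len] = timer;
        self.len += 1;
    }

    /// Index of the earliest deadline, or null when the queue is empty.
    fn earliest(self: *const TimerQueue) ?usize {
        if (self.len == 0) return null;
        var best: usize = 0;
        for (self.timers[0..self.len], 0..) |t, i| {
            if (t.deadline_ns < self.timers[best].deadline_ns) best = i;
        }
        return best;
    }

    /// How long one platform wait call may sleep, in nanoseconds.
    fn nextTimeout(self: *const TimerQueue, now_ns: u64) ?u64 {
        const i = self.earliest() orelse return null;
        const deadline = self.timers[i].deadline_ns;
        return if (deadline > now_ns) deadline - now_ns else 0;
    }

    /// Removes and returns one expired timer, earliest first.
    fn popExpired(self: *TimerQueue, now_ns: u64) ?Timer {
        const i = self.earliest() orelse return null;
        if (self.timers[i].deadline_ns > now_ns) return null;
        const t = self.timers[i];
        self.len -= 1;
        self.timers[i] = self.timers[self.len]; // swap-remove
        return t;
    }
};

pub fn main() !void {
    var q = TimerQueue{};
    try q.schedule(.{ .deadline_ns = 500, .userdata = 1 });
    try q.schedule(.{ .deadline_ns = 200, .userdata = 2 });
    std.debug.print("sleep for {?d} ns\n", .{q.nextTimeout(100)});
}
```

A linear scan keeps the sketch short; a real queue with many timers would more likely use a heap.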

The posix CLOCK_ constants being wildly different between POSIX platforms is kinda funny. MONOTONIC / BOOTTIME having swapped meanings, darwin MONOTONIC not actually being monotonic … :sweat_smile:

There’s a saying about standards that states the good thing about them is that there are so many to choose from. I guess you can extend that to: “even if you choose just one, you have so many interpretations to choose from.” lol

The Windows backend now implements everything except aio.ChildExit. It also supports canceling IO ops (those supported by IOCP anyway). Tests still don’t pass fully, but I think most of the failures are still wine bugs, and I haven’t gotten around to trying on a VM yet.

Since windows requires sockets to be opened in the special OVERLAPPED mode, you need to open sockets either with the aio.Socket operation or using aio.socket instead of std.posix.socket. Unfortunately there is no API to dup a socket handle into one that’s in OVERLAPPED mode, like you can for regular files with ReOpenFile.
https://devblogs.microsoft.com/oldnewthing/20130812-00/?p=3533

I think what I’m going to do next is include tracing facilities in both aio and coro, so I can implement something like tokio’s console https://github.com/tokio-rs/console. Having these early on is a good idea, I think. Of course, tracing will by default only be included in debug builds, and is thus a comptime toggle.
