Zig-aio: lightweight abstraction over io_uring and coroutines

Cloudef · June 17, 2024, 7:29am

I recently got annoyed by the racy nature of traditional event loop code and decided to finally try coroutines for asynchronous IO. The result is zig-aio. It's not battle tested yet, but I'm currently refactoring one of my projects to use it, and I hope the building blocks mature from there.

It currently only supports Linux (and only io_uring), but I plan on adding other backends: kqueue (BSD/Mac), IOCP (Windows), and epoll (as a Linux fallback), which will emulate the io_uring-like interface. The IO operations currently exposed are only the ones that can be supported on all three platforms, but later I might add platform-specific stuff, like fixed / registered buffers, which allow io_uring to do buffer operations completely inside the kernel. I might specialize some stuff like multishot_accept, which is generally useful, but to properly support it on all platforms I think it needs an interface separate from the main one.

The coroutines API allows you to write code in the traditional blocking fashion, except it won't actually block your whole program. It's designed so that whenever coroutines start doing IO, the IO operations get batched together for the next scheduler tick. This means that if you have two coroutines that are running and start blocking on IO, the IO operations actually get merged into one io_uring submit. I expect the coroutines API to still change a bit, and I may add some basic synchronization primitives and perhaps channels, but we'll see.
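To make the batching idea concrete, here is a minimal sketch that uses Zig's standard library wrapper (std.os.linux.IoUring, as of Zig 0.12/0.13) directly rather than zig-aio's own API. The helper name submitBatch and the use of no-op operations are assumptions for illustration only; the point is that queueing an operation costs no syscall, and one submit covers everything queued so far.

```zig
const std = @import("std");
const linux = std.os.linux;

/// Hypothetical helper: queues one no-op per "task" and hands all of them
/// to the kernel at once. Returns how many operations that submit accepted.
fn submitBatch(ring: *linux.IoUring, task_count: u32) !u32 {
    var id: u64 = 0;
    while (id < task_count) : (id += 1) {
        // Only fills a submission queue entry in shared memory; no syscall yet.
        _ = try ring.nop(id);
    }
    // A single io_uring_enter submits every queued entry and waits for them.
    return ring.submit_and_wait(task_count);
}

pub fn main() !void {
    // The entry count must be a power of two.
    var ring = try linux.IoUring.init(8, 0);
    defer ring.deinit();

    const submitted = try submitBatch(&ring, 2);

    var cqes: [2]linux.io_uring_cqe = undefined;
    const completed = try ring.copy_cqes(&cqes, submitted);
    for (cqes[0..completed]) |cqe| {
        std.debug.print("task {d} finished with res={d}\n", .{ cqe.user_data, cqe.res });
    }
}
```

Running this under strace -c should show a single io_uring_enter for both operations, which is the effect the scheduler tick is aiming for.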

Once it's more mature I'll write up docs and a proper readme.

Check the examples directory for basic usage.
There is also code that talks with an LSP server here: https://github.com/Cloudef/spurdo-editor/blob/master/src/spurdo/lang/Lsp.zig

Cloudef · June 17, 2024, 6:10pm

Added some crude docs (I’m not good at writing)
https://cloudef.github.io/zig-aio

Cloudef · June 19, 2024, 11:05am

Added ThreadPool so the coro code can be mixed with blocking code, similar to Tokio’s spawn_blocking.

https://cloudef.github.io/zig-aio/coro-blocking-code

Cloudef · June 21, 2024, 4:42am

There's now a fallback backend. It's poll based and only tested on Linux right now; I doubt it works elsewhere yet because it uses some Linux-specific things. I also want to make it use either epoll or kqueue if available, because poll requires you to iterate all the fds (granted, it's not a big issue here, because the fds are short lived). I gotta say the readiness-based poll/epoll/kqueue model really sucks, and io_uring really is the better way to do async IO. I've heard Windows IOCP is similar to io_uring, so I'm hoping I can fit it into this model well.

Cloudef · June 21, 2024, 6:03am

Results of strace -c for the coro example with log prints disabled for both io_uring and fallback backends.

io_uring

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
  0.00    0.000000           0         2           close
  0.00    0.000000           0        11           mmap
  0.00    0.000000           0        11           munmap
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0        84           msync
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         2           setsockopt
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           gettid
  0.00    0.000000           0         2           prlimit64
  0.00    0.000000           0         2           io_uring_setup
  0.00    0.000000           0         6           io_uring_enter
  0.00    0.000000           0         1           io_uring_register
------ ----------- ----------- --------- --------- ------------------
100.00    0.000000           0       131           total

fallback

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 70.41    0.000138           4        32           munmap
 27.55    0.000054           0       494           msync
  2.04    0.000004           4         1           close
  0.00    0.000000           0        14           read
  0.00    0.000000           0        23           poll
  0.00    0.000000           0        32           mmap
  0.00    0.000000           0        16           mprotect
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         2           setsockopt
  0.00    0.000000           0        16           clone
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           gettid
  0.00    0.000000           0        33         5 futex
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1           eventfd2
  0.00    0.000000           0         2           prlimit64
------ ----------- ----------- --------- --------- ------------------
100.00    0.000196           0       677         5 total

Cloudef · June 21, 2024, 6:20am

io_uring without GeneralPurposeAllocator:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 45.31    0.000087          14         6           io_uring_enter
 24.48    0.000047          23         2           io_uring_setup
  9.38    0.000018           4         4           mmap
  6.77    0.000013           3         4           munmap
  5.73    0.000011           5         2           close
  2.60    0.000005           5         1           bind
  2.08    0.000004           4         1           listen
  1.56    0.000003           1         2           setsockopt
  1.56    0.000003           3         1           io_uring_register
  0.52    0.000001           1         1           gettid
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         2           prlimit64
------ ----------- ----------- --------- --------- ------------------
100.00    0.000192           5        33           total

dude_the_builder · June 21, 2024, 12:56pm

Are you using io_uring submission queue polling? That reduces syscalls quite a bit.

dee0xeed · June 21, 2024, 9:31pm

stupid question - is it possible to watch io_uring completion events via epoll readiness events?

dee0xeed · June 21, 2024, 10:29pm

Cloudef/zig-aio/blob/master/src/aio/IoUring.zig#L173-L177


      
          // Disabled for now, this seems to increase syscalls a bit
          // However, it may still be beneficial from latency perspective
          // I need to play this with more later, and if there's no single answer
          // then it might be better to be exposed as a tunable.
          // std.os.linux.IORING_SETUP_SINGLE_ISSUER | std.os.linux.IORING_SETUP_DEFER_TASKRUN, // 6.1
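As a rough sketch of what "exposed as a tunable" could look like, the following uses the standard library's std.os.linux.IoUring with the two flags from the comment above. RingOptions and initRing are hypothetical names, not part of zig-aio, and the fallback on failure is just one possible policy.

```zig
const std = @import("std");
const linux = std.os.linux;

/// Hypothetical tunable; zig-aio does not necessarily expose this.
const RingOptions = struct {
    entries: u16 = 8, // must be a power of two
    defer_taskrun: bool = false,
};

fn initRing(opts: RingOptions) !linux.IoUring {
    if (opts.defer_taskrun) {
        const flags = linux.IORING_SETUP_SINGLE_ISSUER | linux.IORING_SETUP_DEFER_TASKRUN;
        return linux.IoUring.init(opts.entries, flags) catch {
            // Kernels older than 6.1 reject these flags, so retry without them.
            return linux.IoUring.init(opts.entries, 0);
        };
    }
    return linux.IoUring.init(opts.entries, 0);
}

pub fn main() !void {
    var ring = try initRing(.{ .defer_taskrun = true });
    defer ring.deinit();

    _ = try ring.nop(1);
    // With DEFER_TASKRUN, completions are only processed when the submitting
    // thread enters the kernel to wait for events, which this call does.
    _ = try ring.submit_and_wait(1);

    const cqe = try ring.copy_cqe();
    std.debug.print("user_data={d} res={d}\n", .{ cqe.user_data, cqe.res });
}
```

Comparing strace -c output with the option on and off is a simple way to check whether the flags help or hurt for a given workload.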

Cloudef · June 22, 2024, 3:13am

You need to use the io_uring-specific API to wait for it; you can specify the minimum number of events you want to wait for.

It's possible to make io_uring post events to an eventfd; you can't poll io_uring itself because it's not a fd.
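For reference, a minimal sketch of the eventfd route using the standard library's std.os.linux.IoUring (not zig-aio's API). The helper name waitForRing is made up for this example. The ring signals the eventfd whenever a completion is posted, so the eventfd can sit in any poll/epoll set alongside other fds.

```zig
const std = @import("std");
const linux = std.os.linux;
const posix = std.posix;

/// Hypothetical helper: blocks in poll() until the ring signals the eventfd,
/// then drains and returns the eventfd counter.
fn waitForRing(efd: posix.fd_t) !u64 {
    var fds = [_]posix.pollfd{.{ .fd = efd, .events = posix.POLL.IN, .revents = 0 }};
    _ = try posix.poll(&fds, -1);
    var counter: u64 = 0;
    _ = try posix.read(efd, std.mem.asBytes(&counter));
    return counter;
}

pub fn main() !void {
    var ring = try linux.IoUring.init(8, 0);
    defer ring.deinit();

    const efd = try posix.eventfd(0, linux.EFD.CLOEXEC);
    defer posix.close(efd);
    // Ask the kernel to bump the eventfd each time a completion is posted.
    try ring.register_eventfd(efd);

    _ = try ring.nop(42);
    _ = try ring.submit();

    const signals = try waitForRing(efd);
    const cqe = try ring.copy_cqe();
    std.debug.print("eventfd counter={d}, user_data={d}\n", .{ signals, cqe.user_data });
}
```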

In zig-aio, I’ll add io_uring specific operations later, right now the operations are something that’s portable across platforms.

Currently, though, the portable way to do this is to use aio.EventSource and the aio.NotifyEventSource operation, or simply to call Dynamic.complete in a while loop with .blocking mode if you only need to wait locally.

Cloudef · June 22, 2024, 3:16am

Readiness only works OK if you do it on a small scale or have a static set of fds that you wait for. If you can use it, io_uring is simply better in every scenario right now.

Readiness itself is not the problem; the problem is that when something is reported ready, you have a potential race condition where the fd will still block if you use it. Now there's a bunch of APIs that let you create NONBLOCK fds, and sometimes you have to fcntl, or sometimes it's simply not supported. Waiting for readiness and doing the operation are also each a syscall. It's honestly a mess, so people often cop out to a thread pool for a general-purpose async event loop. High-performance servers have specific things they need to do, so these issues probably don't affect them as much. The readiness-based model also does not let you do async operations that do not require readiness, like fsync.
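The pattern described above looks roughly like this with plain std.posix calls. This is a generic sketch, not code from zig-aio's fallback backend, and readWhenReady is a made-up name. Note the two syscalls per operation and the retry loop that is only safe because the fd is non-blocking.

```zig
const std = @import("std");
const posix = std.posix;

/// Hypothetical helper: waits for readability, then reads.
/// The fd must be non-blocking, otherwise a stale readiness report
/// could leave read() blocked.
fn readWhenReady(fd: posix.fd_t, buf: []u8) !usize {
    var fds = [_]posix.pollfd{.{ .fd = fd, .events = posix.POLL.IN, .revents = 0 }};
    while (true) {
        _ = try posix.poll(&fds, -1); // syscall 1: wait for readiness
        return posix.read(fd, buf) catch |err| switch (err) { // syscall 2: the actual IO
            // Something else consumed the data between poll and read; wait again.
            error.WouldBlock => continue,
            else => |e| return e,
        };
    }
}

pub fn main() !void {
    const pipe = try posix.pipe2(.{ .NONBLOCK = true });
    defer posix.close(pipe[0]);
    defer posix.close(pipe[1]);

    _ = try posix.write(pipe[1], "ping");

    var buf: [16]u8 = undefined;
    const n = try readWhenReady(pipe[0], &buf);
    std.debug.print("read {d} bytes: {s}\n", .{ n, buf[0..n] });
}
```

With a completion-based interface the read itself is what gets submitted, so there is no window between "ready" and "done" to handle.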

It's not in the kernel yet, but I'm hoping we get a clone operation in io_uring soon too, so we can spawn child processes using it as well.

I'm not proud of it, but I recently added a fallback backend which emulates io_uring using the readiness model. If you are interested, the code is here:

https://github.com/Cloudef/zig-aio/blob/master/src/aio/Fallback.zig

Cloudef · June 22, 2024, 7:01am

All tests pass; Darwin and BSD are now supported. Do note that both are supported through the fallback backend, which is not optimal for those systems. It's fine for common asynchronous code and such, but I would not write a web server for BSD/Darwin using it.

That said, it's possible to combine a designated event loop with zig-aio using either aio.EventSource or coro.ThreadPool.yieldForCompletition.

In fact, I think I might support multishot operations such as a TCP socket accept loop this way, we'll see.

Cloudef · June 23, 2024, 9:23am

Added some broken initial Windows code.

The plan is to first get the fallback backend to work at least with non-socket operations.

Then later, combine RIO (sockets), IoRing (disk IO) and the current fallback backend for anything that the former two don't support. It will most likely need 3 threads which multiplex 3 handles using WaitForMultipleObjects. Ideally we may not need the fallback backend at all on Windows, but we'll see.

It would be nice if IoRing was more general purpose, just like io_uring. The interface is pretty much identical to io_uring, but it's only for disk IO. Still the forever File vs Socket vs Handle dilemma in Windows land, I guess.

Cloudef · June 25, 2024, 12:32pm

Added a new example

Cloudef/zig-aio/blob/master/examples/coro_wttr.zig

const std = @import("std");
const aio = @import("aio");
const coro = @import("coro");
const log = std.log.scoped(.coro_aio);

// Just for fun, try returning an error from one of these tasks

fn getWeather(completed: *u32, allocator: std.mem.Allocator, city: []const u8, lang: []const u8) anyerror![]const u8 {
    defer completed.* += 1;
    var url: std.BoundedArray(u8, 256) = .{};
    try url.writer().print("https://wttr.in/{s}?AF&lang={s}", .{ city, lang });
    var body = std.ArrayList(u8).init(allocator);
    var client: std.http.Client = .{ .allocator = allocator };
    defer client.deinit();
    _ = try client.fetch(.{
        .location = .{ .url = url.constSlice() },
        .response_storage = .{ .dynamic = &body },
    });
    return body.toOwnedSlice();
}

This file has been truncated.

Cloudef · June 28, 2024, 12:45am

The examples should now compile and work on Windows. The tests don't pass under Wine, but I think I might be hitting Wine-specific bugs, similar to this issue: https://github.com/ziglang/zig/issues/5988

Windows right now uses the same fallback backend as all the other POSIX platforms that are not Linux, so it's not optimal and has some serious limitations, such as the limit of 64 pollable handles (WaitForMultipleObjects), and each socket operation will block on a separate "kludge" thread pool.

Oh, what is this "kludge" thread pool? The fallback backend has to support many platforms, and some platforms have peculiar issues: some resources simply can't be polled, or to be able to poll them you have to add a bunch of hairy code so that you can basically mix two completely different poll models into one. The simplest way out here is to not handle these corner cases at all. Instead, whenever an operation is known to be finicky on some platform, it gets performed on this kludge thread pool, which has a higher max thread limit (fallback_max_kludge_threads: usize = 1024 in aio.Options as of the time of writing). The thread pools in zig-aio only use as many threads as necessary, and the threads time out if they are inactive for 5 seconds or more, but the main thread pool in the fallback backend only goes up to the number of CPU cores by default.

Besides Windows, this kludge thread pool is used on macOS for reading /dev/tty. On macOS, polling /dev/tty has been broken since OS X Tiger and it has never been fixed. You can ask for readiness of /dev/tty with pselect and select, but those APIs are completely different from poll and would need a dedicated thread to multiplex with the main one, with its own event signal and fd bookkeeping system. I did write this code for a bit, until I decided to nuke it all and come up with this kludge thread pool solution in the end.

Speaking of /dev/tty, I added a new operation, aio.ReadTty, which you should use instead of aio.Read when reading a tty. The aio.Read operation on macOS won't work on /dev/tty anymore, but instead throws that pesky EINVAL at you. So the kludge hack is only applied when you use aio.ReadTty. aio.ReadTty also works on Windows, but what it returns differs depending on the mode you use.

The aio.ReadTty operation currently looks like this:

pub const ReadTty = struct {
    pub const TranslationState = switch (builtin.target.os.tag) {
        .windows => struct {
            /// Needed for accurate resize information
            stdout: std.fs.File,
            last_mouse_button_press: u16 = 0,

            pub fn init(stdout: std.fs.File) @This() {
                return .{ .stdout = stdout };
            }
        },
        else => struct {
            pub fn init(_: std.fs.File) @This() {
                return .{};
            }
        },
    };

    pub const Mode = union(enum) {
        /// On windows buffer will contain INPUT_RECORD structs.
        /// The length of the buffer must be able to hold at least one such struct.
        direct: void,
        /// Translate windows console input into ANSI/VT/Kitty compatible input.
        /// Pass reference of the TranslationState, for correct translation a unique reference per stdin handle must be used.
        translation: *TranslationState,
    };

    pub const Error = std.posix.PReadError || error{NoSpaceLeft} || SharedError;
    tty: std.fs.File,
    buffer: []u8,
    out_read: *usize,
    mode: Mode = .direct,
    out_id: ?*Id = null,
    out_error: ?*Error = null,
    link: Link = .unlinked,
    userdata: usize = 0,
};

It's not implemented yet, but the plan is to allow aio.ReadTty to translate the output on Windows to the VT escape sequences that are normally used outside Windows land, lessening the platform-specific code needed by the consumer.

Next up, I think, is a proper backend for Windows, now that I have something to reference against. I might have to set up a Windows VM to test on a real system rather than Wine, as it seems I can't rely on Wine for many things.

I'd also like to start adding more examples and benchmarks soon.
Recently I cleaned up the platform abstractions a bit, and put commonly used things like data structures into their own mini module.

The BSD coverage should be good, but I haven't tested BSD at all.
https://github.com/Cloudef/zig-aio/commit/7df88b4982e7239413d7c5faaab46e91414760d8

I made some changes for WASI:
https://github.com/Cloudef/zig-aio/commit/a91ca0320b23212661afa9f773cf533547c179cb
But I'm not sure yet what the WASI equivalent is for some things that zig-aio requires.

I also need to add the rest of the io_uring-specific ops, flags and modes, as I do want people to be able to use the io_uring backend and expect most io_uring stuff to work the same. The io_uring backend is quite a direct translation and honestly does not do much.
https://github.com/Cloudef/zig-aio/blob/master/src/aio/IoUring.zig

Aside from the aio module related stuff above, the coro module has also gone through some mass refactoring. It's now split into smaller pieces, the code is much smaller (!) https://github.com/Cloudef/zig-aio/tree/master/src/coro and the API has had some major changes, making the overall abstraction slightly simpler. I want to keep the coroutine part simple and instead give a reliable foundation to build on top of. I'm quite happy with the current state. The refactors were mainly prompted by me adding an aio backend and example to libvaxis, where I noticed some hairy bits and corner cases that weren't looking good.

Now coro tasks won't die on their own. Instead, their results always have to be collected, or they have to be canceled. It's also possible to ignore this and deinit the scheduler that spawned the tasks, and it will try to cancel all the tasks in the end. But it's still a good idea to always collect or cancel the tasks, because the scheduler has no idea what the tasks are doing; it simply tries to cancel them from newest to oldest and hopes that goes through. If it doesn't, the scheduler will simply shut down the IO and deallocate the managed stacks for the tasks. If the tasks needed cleanup, that's no longer possible, and memory may thus leak.

Vaxis aio integration and example are in this commit:
https://github.com/rockorager/libvaxis/commit/b84f9e58a6fd71328f55f994f8775f81f9849a08

Cloudef · July 4, 2024, 2:50pm

The Windows IOCP backend is now the default for Windows. It's not yet complete, but it has feature parity with the fallback backend (for what it was on Windows, at least).

The other thing is that I got annoyed by non-Linux platforms having lackluster timer facilities compared to io_uring's timeout or timerfd, so there's now a TimerQueue in the minilib directory whose job is to provide facilities with timerfd feature parity for all the supported platforms.

Cloudef/zig-aio/blob/master/src/minilib/TimerQueue.zig

//! Mimics linux's timerfd timers
//! Used on platforms where native timers aren't accurate enough or have other limitations
//! Requires threading support

// TODO: Bunch of stuff to complete here still, but works for what aio can do at the moment

const builtin = @import("builtin");
const std = @import("std");

const root = @import("root");
pub const options: Options = if (@hasDecl(root, "timer_queue_options")) root.timer_queue_options else .{};

pub const Options = struct {
    /// Force the use of foreign backend even if the target platform is linux
    /// Mostly useful for testing
    force_foreign_backend: bool = false,
};

pub const Closure = struct {
    pub const Callback = *const fn (context: *anyopaque, user_data: usize) void;

This file has been truncated.
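The options declaration in the snippet above relies on the common Zig pattern of looking for an override in the root source file. Below is a small self-contained sketch of that pattern. It reuses the Options struct and the timer_queue_options name from the snippet, but everything lives in one file here so that it can be run on its own (with zig run, so that this file is the root). In a real program the library half would be in TimerQueue.zig and only the override would be in your main file.

```zig
const std = @import("std");
const root = @import("root");

// --- Library side (mirrors the snippet above) ---
pub const Options = struct {
    /// Force the use of foreign backend even if the target platform is linux
    force_foreign_backend: bool = false,
};

/// Resolved at compile time: use the root file's override if present,
/// otherwise fall back to the defaults.
pub const options: Options = if (@hasDecl(root, "timer_queue_options"))
    root.timer_queue_options
else
    .{};

// --- Application side: declared in the root source file ---
pub const timer_queue_options: Options = .{
    .force_foreign_backend = true,
};

pub fn main() void {
    std.debug.print("force_foreign_backend={}\n", .{options.force_foreign_backend});
}
```

Because the lookup happens at comptime, an unused backend can be compiled out entirely rather than being selected at runtime.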

The POSIX CLOCK_ constants being wildly different between POSIX platforms is kinda funny. MONOTONIC / BOOTTIME having swapped meanings, Darwin's MONOTONIC not actually being monotonic …

dude_the_builder · July 4, 2024, 3:23pm

There's a saying about standards that states the good thing about them is that there are so many to choose from. I guess you can extend that to: "even if you choose just one, you have so many interpretations to choose from." lol

Cloudef · July 7, 2024, 1:52am

The Windows backend now implements everything except aio.ChildExit. It also supports canceling IO ops (those supported by IOCP, anyway). Tests still don't fully pass, but I think most of the failures are still Wine bugs, and I haven't gotten around to trying on a VM yet.

Since Windows requires sockets to be opened in the special OVERLAPPED mode, you need to open sockets either with the aio.Socket operation or using aio.socket instead of std.posix.socket. Unfortunately there is no API to dup a socket handle into OVERLAPPED mode, like you can for regular files with ReOpenFile.
https://devblogs.microsoft.com/oldnewthing/20130812-00/?p=3533

I think what I'm going to do next is to include tracing facilities in both aio and coro, so I can implement something like Tokio's console https://github.com/tokio-rs/console. Having these early on is a good idea, I think. Of course, tracing will be included by default only in debug builds, and is thus a comptime toggle.