That’s quite a big claim! I am confused enough about the whole context-switch thing to find this theoretically possible to believe, but it would really help to dissect specific benchmarks here, and also explain where, exactly, the difference is coming from, in terms of assembly code, CPU counters, and what not!
I was having trouble believing it myself, because I don’t focus on performance at this point at all.
But there are a few caveats. The “much faster” was in the context of a single thread, and that’s a big constraint. Both Go and Tokio are designed for multi-threaded workloads, and there is a lot of overhead coming from that. Stackless coroutines in Rust have a higher switching cost than stackful coroutines, purely because there is a lot more state management. I’m almost certain that, if I tried, I could beat any of them with Node in a single thread.
With multiple threads it’s much, much closer, but still surprisingly good. In any case, performance is not my focus for now. The reality is that Tokio has a huge overhead in async switching; other Rust frameworks do better. And even in Rust, people are starting to do stackful coroutines.
This is a good paper to read regarding performance of stackful coroutines:
This is a simple ping-pong benchmark. All runtimes end up effectively running it on a single thread, because the task blocked on the channel read continues on the same thread that performed the channel write.
Zig/Zio:
Total rounds: 1000000
Total time: 222.37 ms (0.222 s)
Time per round: 222 ns
Messages/sec: 8994010
Rounds/sec: 4497005
Go:
Total rounds: 1000000
Total time: 334.08 ms (0.334 s)
Time per round: 334 ns
Messages/sec: 5986608
Rounds/sec: 2993304
Rust/Tokio:
Total rounds: 1000000
Total time: 334.92 ms (0.335 s)
Time per round: 335 ns
Messages/sec: 5971564
Rounds/sec: 2985782
As I said, it’s mind-blowing to me, but this is reality.
Just merged an interesting change that’s not possible in the current std.Io design (without spawning a separate coroutine for each operation). I already had the option of using e.g. ResetEvent in select(), but that’s the easy case, because the event carries state. Now Channel operations can also participate in select(), so a task can wait on a channel send or receive alongside other waitable operations.
Unlike goroutines in Go or virtual threads in Java, tasks in Zio have a fixed stack size. This is a big limitation, coming from the fact that Zig is a language with manual memory management.
One thing I’ve been wondering here is whether it’s feasible to circumvent this using virtual memory. What you can do is map an 8MiB stack but rely on demand paging to allocate physical memory. This still isn’t great, because you have to do the work to set up memory maps for 8MiB worth of pages. But is there some way to directly reserve a penultimate level of the page table, so that you pay for only one entry up-front, and set up the virtual address mapping lazily as well?
Linux actually does something like that for the stack of the main thread as well, such that 8MiB of stack aren’t even present in the page table from the start (Stack probe puzzle). Something like that should be possible for green threads, no? (Though the answer here might start with writing your own kernel…)
I remember reading about this having been considered by other implementations (iirc by Rust as well back when green threads were used) but then discarded for some reason that currently escapes me.
This is how Windows Fibers work, for example. It’s possible, but to work efficiently it needs compiler support: you need the compiler to inject a check at function entry that tests the stack and possibly allocates more memory. I’m planning to do something like that during yield, but it’s not a high-priority task, because on Linux with overcommit enabled, the OS does it for me anyway. I can allocate fairly large stacks, more than the available memory, and Linux will not reserve all of it.
Rust used to have segmented stacks, an idea from earlier versions of Go. Once you are low on stack memory, you allocate a new segment and use that from then on; when you return, you deallocate the segment. That performs very badly if the segment boundary falls inside a hot loop that calls some function repeatedly (the “hot split” problem).
Regardless, I remember reading about it having been considered.
Thinking about it more, I think it was discarded because you can pull this off easily on 64-bit systems, but on 32-bit the address space starts getting crowded.
EDIT: for clarity, ‘it’ being the idea of using memory mapping tricks to avoid segmentation / reallocation.
I was looking into this and found something fascinating: because I already update the stack guard and stack base in the Windows TIB, I should get automatic stack growth handling on Windows if I allocate the memory correctly.
Linux would require a SIGSEGV handler and stack probing enabled, but that might not be needed because Linux supports overcommit.
I haven’t spent much time on stack allocation so far, just naively allocating a slice with the gpa, but I’ll spend some time on this: use mmap on Linux, VirtualAlloc on Windows, and properly set up the guard page everywhere. Even without the SIGSEGV handler, it should be a pretty good cross-platform solution. Not sure about macOS, but I don’t care about that too much.
Reminder that relying on overcommit is not portable. If you design a framework this way then you are creating an alternate ecosystem. That’s something that I’ve personally been striving very hard to avoid. Zig is meant to be a reusable language, which means that you should be able to create abstractions that can be used in any project.
I understand the sentiment, but scaling to millions of concurrent operations is going to be platform-dependent no matter what. For example, even on Linux with custom virtual memory reservations and moving the guard page in a signal handler, I need to inform users to configure vm.max_map_count if they want to scale. I could just as well inform users that they need to enable overcommit. The current option with naive fixed stacks is actually the most portable; it works everywhere. As I add more functionality to scale, it will be more and more dependent on what the platform provides, but always usable on all supported platforms.
I have some ugly AI-made proofs of concept, both for Windows using the native stack-growth system and for Linux using a SIGSEGV handler (I think this one should work on other Unixes as well).
Looking forward to integrating this properly. It would allow me to start with e.g. 64KiB of initial committed memory and grow up to, say, 8MiB.
Another important feature done: generic timeout support. Unlike with stackless coroutines, we can’t just select on everything, so I decided to include a dedicated timeout API.
It’s essentially automatic self-cancelation, with some tracking in the background so that you can recover from it (unlike user cancelation, which is unrecoverable).