Understanding std.Io - Proposal for a high-performance implementation

My Context:

I’m working on a messaging platform focused on low latency, high throughput, and scalability (which one isn’t…).

The platform receives messages and forwards them (mostly unchanged) as a result of routing based on a subject tree. Very similar to what NATS does.

To get optimal performance, several serial performance killers need to be avoided/mitigated:

  • system calls (batch/avoid/gather/scatter in dedicated threads)
  • memory copies (reuse input buffers, gather IO)
  • memory allocations (reuse buffers, allocate before need, keep dedicated IO buffers w/ short life time)
  • cache misses (minimize the data each core/thread operates on, sticky threads)

Our current strategy is to use io_uring: a dedicated IO thread receives IO jobs via MSG_RING and delivers results back to the caller. If the IO thread becomes a bottleneck, it can be scaled up. Non-IO syscalls can be dispatched to the IO thread or to dedicated syscaller threads, handled like IO.

At this stage, all remaining tasks are pure compute: no IO, no syscalls, just syscall-free message passing. These compute tasks can now be optimized for caching, e.g. N worker threads receive and process messages for sharded clients, chatty clients get high-priority threads with smaller shards, etc.
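The schedule loop of such a dedicated IO thread could look roughly like this. This is a sketch under stated assumptions: `Job`, the queue types, and their methods are hypothetical placeholders (the real design moves jobs between rings via MSG_RING rather than shared-memory queues), and the `std.os.linux.IoUring` method names have shifted between Zig versions:

```zig
const std = @import("std");
const linux = std.os.linux;

// Hypothetical job/result types flowing between workers and the IO thread.
const Job = struct { id: u64, prep: *const fn (*linux.io_uring_sqe) void };
const Result = struct { id: u64, res: i32 };

fn ioThreadLoop(
    ring: *linux.IoUring,
    jobs: anytype, // hypothetical queue: pop() ?Job
    results: anytype, // hypothetical queue: push(Result) void
) !void {
    while (true) {
        // Harvest requests: drain pending jobs into SQEs.
        while (jobs.pop()) |job| {
            const sqe = try ring.get_sqe();
            job.prep(sqe); // e.g. prep_read / prep_write / prep_statx
            sqe.user_data = job.id;
        }
        // One submit syscall for the whole batch; sleep until a completion.
        _ = try ring.submit_and_wait(1);
        // Dispatch results back to the originating workers.
        var cqes: [64]linux.io_uring_cqe = undefined;
        const n = try ring.copy_cqes(&cqes, 0);
        for (cqes[0..n]) |cqe| {
            results.push(Result{ .id = cqe.user_data, .res = cqe.res });
        }
        // rinse, repeat
    }
}
```

The point of the shape: workers never touch a syscall, and the IO thread amortizes one submit over however many jobs accumulated since the last iteration.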

std.Io:
I believe most of this should already work with std.Io, except for the dedicated IO thread and a way to express where compute actions should run.

The current std.Io implementation I looked at also behaves differently from what I expected: the uring implementation creates fibers for everything. If I run any IO (blocking or async), a fiber gets created. That’s probably because the code is new and not yet optimized. I also saw no infrastructure to control where and how compute should run, as a fiber on the same thread or on a separate one.

My understanding of blocking/async/concurrent:
std.Io, when called without async, should (or can) run blocking.

  • std.Io.async(action…) should dispatch the action. To receive a result later, the current workflow needs to be, or become, a fiber. Currently it seems like every action dispatch creates a new fiber, which is too expensive (if true). Worse, the created fiber is also dispatched to a thread pool, even if the results are collected in a group and only a single code path awaits them.

This implementation is (as I understand it) correct, but not optimal. In a test use case, I use this to batch IO operations in order to conserve syscalls. The uring implementation, using async and grouping to batch syscalls, is slower than default blocking syscalls for small numbers of calls (in the hundreds). The test case is stat on a small dir (1 file), on /tmp with 1K files, and on /nix/store with 160K files. The price of indiscriminately creating fibers is quite high:

| dir        | entries | single | std-batched | custom-batched |
|------------|--------:|-------:|------------:|---------------:|
| 1 file     | 1       | 1x     | 1.6x        | 1.1x           |
| /tmp       | 1019    | 1x     | 0.9x        | 0.2x           |
| /nix/store | 155K    | 1x     | 0.34x       | 0.17x          |

Custom directly batches the stat calls in a custom io_uring. At scale this is about 5x faster than single syscalls and twice as fast as std.Io; it loses only in the single-file case. std.Io gives away one of its huge conceptual advantages here; it should not be significantly slower than the custom solution.

In this test I disabled threading; with threading enabled, the results are even worse for uring (almost never better than single syscalls).
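For context, the “custom-batched” variant amounts to: fill as many statx SQEs as fit in the ring, submit them with a single syscall, drain the completions, repeat. A sketch assuming Zig’s `std.os.linux.IoUring` wrapper (prep method names are approximate and may differ across Zig versions):

```zig
const std = @import("std");
const linux = std.os.linux;

/// Stat `names` relative to `dir_fd`, batching many statx ops per submit syscall.
fn batchedStatx(
    ring: *linux.IoUring,
    dir_fd: linux.fd_t,
    names: []const [*:0]const u8,
    bufs: []linux.Statx,
) !void {
    var next: usize = 0;
    while (next < names.len) {
        // Fill SQEs until the ring is full or we run out of work.
        var in_flight: u32 = 0;
        while (next < names.len) {
            const sqe = ring.get_sqe() catch break; // ring full: flush this batch
            sqe.prep_statx(dir_fd, names[next], 0, linux.STATX_BASIC_STATS, &bufs[next]);
            sqe.user_data = @intCast(next);
            next += 1;
            in_flight += 1;
        }
        // One syscall submits the whole batch and waits for its completions.
        _ = try ring.submit_and_wait(in_flight);
        var done: u32 = 0;
        var cqes: [64]linux.io_uring_cqe = undefined;
        while (done < in_flight) {
            done += try ring.copy_cqes(&cqes, 1);
        }
    }
}
```

No fibers, no per-op allocation: the syscall cost is amortized over the whole batch, which is what the 0.17x number reflects.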

Questions:

  • Is my understanding of blocking/async/concurrent as defined by std.Io correct?
  • Do you believe that my concept of dedicated IO and/or syscall threads is promising? The assumption is that IO tasks rarely, if ever, saturate a core, and that the latency of a schedule loop for IO should be minimal (harvest requests, dispatch results, syscall uring, rinse, repeat).
  • I expect a lot of gains from sticky-thread, syscall-free compute. Is this reasonable, and do you have experience with it? If so, shouldn’t std.Io have a mechanism to declare “where” to await results or dispatch compute tasks? I didn’t see a way to integrate that into std.Io.

First, std.Io is an interface. But you are referring to a single implementation when you mention it, namely std.Io.Uring.

You could, if you wanted, make your own implementation that functions the way you want.

Unless you already did that in your custom-batched benchmark.

If you can show that Uring is slower in most use cases, and your implementation meets their other criteria, they may switch to yours.

The API is blocking in behaviour; that does not mean the implementation uses blocking syscalls.
io.async means “this can, but is not required to, run concurrently”, in one word: “asynchronously”.
io.concurrent means “this must run concurrently; error if that’s not possible”.

It does not in any way have any specifics on how the implementation accomplishes that.
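To make the three call styles concrete, here is a rough sketch. std.Io is still evolving, so the exact signatures may differ; `action` and `demo` are illustrative stand-ins, not real API:

```zig
const std = @import("std");

// A stand-in IO action; a real one would do a stat/read via `io`.
fn action(io: std.Io, path: []const u8) void {
    _ = io;
    _ = path;
}

pub fn demo(io: std.Io) !void {
    // 1. Plain call: blocking in behaviour. The implementation may still
    //    service it via io_uring; "blocking" says nothing about syscalls.
    action(io, "a.txt");

    // 2. io.async: *may* run concurrently, but the implementation is free
    //    to run it inline and complete it right here.
    var a = io.async(action, .{ io, "b.txt" });
    a.await(io);

    // 3. io.concurrent: *must* run on another execution context, otherwise
    //    it fails (e.g. with error.ConcurrencyUnavailable).
    var c = try io.concurrent(action, .{ io, "c.txt" });
    c.await(io);
}
```

Swapping the Io implementation changes how (2) and (3) are serviced without touching this code.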

Does this matter? It’s up to the programmer to choose the best implementation for their use case. And thanks to the interface, only a couple of lines need to be changed to experiment.

Yes, you have shown that for your benchmarked case at the very least, if not for your intended use case.

If the benchmarks are any indication? Yes! Whether they are actually a good indicator is a different question; I can’t say, I am not experienced in this area.

That should probably be the purview of the Io implementation, but they are open to expanding the API; in fact they need to, as many things are not currently supported.

Would be great to publish that benchmark!

There is something in std.Io for doing more than one syscall per fiber, though I don’t know how advanced that is: https://codeberg.org/ziglang/zig/pulls/30743

Eventually most of the file system and networking functionality is expected to migrate to being based on Operation, making it eligible to be used with Batch.

You can think of Batch as a low level concurrency mechanism which provides concurrency at an Operation layer, which is efficient and portable, but more difficult to abstract around, particularly if you need to run some logic in between operations.

Meanwhile Future (async/await/concurrent/cancel) is the equivalent at the function abstraction layer, which is very flexible and ergonomic, but it allocates task memory, and error.ConcurrencyUnavailable (when using concurrent) or unwanted blocking operations (when using async) can occur in more circumstances than with the lower-level Batch APIs.

So, generally, if you’re trying to write optimal, reusable software, Batch is the way to go if possible; otherwise, you can always fall back to the Future APIs if Batch turns out to be a pain in the butt. Or you can start with the Future APIs and then optimize by reworking some stuff to use Batch later.


I will make a repo with it and post a link later. I want to check if I can monkey-patch a fiber-free async into the current uring implementation, to see if that is actually what makes the difference between the custom implementation and std.Io.Uring. The code there is scary. Makes me feel stupid :wink:
From what I see in strace, there seems to be one fiber per scheduled IO task, instead of one per group-await. Though that seems implausible; that would be 60M * 155K. That shouldn’t be faster than single syscalls, given all the context switching. I’ll come back when I understand this better.