Understanding std.Io - Proposal for a high-performance implementation

My Context:

I’m working on a messaging platform focused on low latency, high throughput, and scalability (which one isn’t…).

The platform receives messages and forwards them (mostly unchanged) as a result of routing based on a subject tree. Very similar to what NATS does.

To get optimal performance, several serial performance killers need to be avoided/mitigated:

  • system calls (batch/avoid/gather/scatter in dedicated threads)
  • memory copies (reuse input buffers, gather IO)
  • memory allocations (reuse buffers, allocate before need, keep dedicated IO buffers w/ short life time)
  • cache misses (minimize the data each core/thread operates on, sticky threads)

Our current strategy is to use io_uring: a dedicated IO thread receives IO jobs via MSG_RING and delivers results back to the caller. If the IO thread becomes a bottleneck, it can be scaled up. Non-IO syscalls can be dispatched to the IO thread or to dedicated syscaller threads, handled like IO.

At this stage, all remaining tasks are pure compute: no IO, no syscalls, just syscall-free message passing. These compute tasks can now be optimized for caching, e.g. N worker threads receive and process messages for sharded clients, chatty clients get high-priority threads with smaller shards, etc.
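The schedule loop of such a dedicated IO thread could look roughly like this. This is a sketch under stated assumptions: `Job`, the queue types, and their methods are hypothetical placeholders (the real design moves jobs between rings via MSG_RING rather than shared-memory queues), and the `std.os.linux.IoUring` method names have shifted between Zig versions:

```zig
const std = @import("std");
const linux = std.os.linux;

// Hypothetical job/result types flowing between workers and the IO thread.
const Job = struct { id: u64, prep: *const fn (*linux.io_uring_sqe) void };
const Result = struct { id: u64, res: i32 };

fn ioThreadLoop(
    ring: *linux.IoUring,
    jobs: anytype, // hypothetical queue: pop() ?Job
    results: anytype, // hypothetical queue: push(Result) void
) !void {
    while (true) {
        // Harvest requests: drain pending jobs into SQEs.
        while (jobs.pop()) |job| {
            const sqe = try ring.get_sqe();
            job.prep(sqe); // e.g. prep_read / prep_write / prep_statx
            sqe.user_data = job.id;
        }
        // One submit syscall for the whole batch; sleep until a completion.
        _ = try ring.submit_and_wait(1);
        // Dispatch results back to the originating workers.
        var cqes: [64]linux.io_uring_cqe = undefined;
        const n = try ring.copy_cqes(&cqes, 0);
        for (cqes[0..n]) |cqe| {
            results.push(Result{ .id = cqe.user_data, .res = cqe.res });
        }
        // rinse, repeat
    }
}
```

The point of the shape: workers never touch a syscall, and the IO thread amortizes one submit over however many jobs accumulated since the last iteration.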

std.Io:
I believe most of this should already work with std.Io, except for the dedicated IO thread and a way to express where compute actions should run.

The current std.Io implementation I looked at also behaves differently from what I expected: the uring implementation creates fibers for everything. If I run any IO (blocking or async), a fiber gets created. That’s probably because the code is new and not yet optimized. I also saw no infrastructure to control where and how compute should run, as a fiber on the same thread or on a separate one.

My understanding of blocking/async/concurrent:
std.Io, when called without async, should (or can) run blocking.

  • std.Io.async(action…) should dispatch the action. To receive a result later, the current workflow needs to be, or become, a fiber. Currently it seems like every action dispatch creates a new fiber, which is too expensive (if true). Worse, the created fiber is also dispatched to a thread pool, even if the results are collected in a group and only a single code path awaits them.

This implementation is (as I understand it) correct, but not optimal. In a test use case, I use this to batch IO operations in order to conserve syscalls. The uring implementation, using async and grouping to batch syscalls, is slower than default blocking syscalls for small numbers of calls (in the hundreds). The test case is stat on a small dir (1 file), on /tmp with 1K files, and on /nix/store with 160K files. The price of indiscriminately creating fibers is quite high:

| dir        | entries | single | std-batched | custom-batched |
|------------|--------:|-------:|------------:|---------------:|
| 1 file     | 1       | 1x     | 1.6x        | 1.1x           |
| /tmp       | 1019    | 1x     | 0.9x        | 0.2x           |
| /nix/store | 155K    | 1x     | 0.34x       | 0.17x          |

Custom directly batches the stat calls in a custom io_uring. At scale this is about 5x faster than single syscalls and twice as fast as std.Io; it loses only in the single-file case. std.Io gives away one of its huge conceptual advantages here; it should not be significantly slower than the custom solution.

In this test I disabled threading; with threading enabled, the results are even worse for uring (almost never better than single syscalls).
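For context, the “custom-batched” variant amounts to: fill as many statx SQEs as fit in the ring, submit them with a single syscall, drain the completions, repeat. A sketch assuming Zig’s `std.os.linux.IoUring` wrapper (prep method names are approximate and may differ across Zig versions):

```zig
const std = @import("std");
const linux = std.os.linux;

/// Stat `names` relative to `dir_fd`, batching many statx ops per submit syscall.
fn batchedStatx(
    ring: *linux.IoUring,
    dir_fd: linux.fd_t,
    names: []const [*:0]const u8,
    bufs: []linux.Statx,
) !void {
    var next: usize = 0;
    while (next < names.len) {
        // Fill SQEs until the ring is full or we run out of work.
        var in_flight: u32 = 0;
        while (next < names.len) {
            const sqe = ring.get_sqe() catch break; // ring full: flush this batch
            sqe.prep_statx(dir_fd, names[next], 0, linux.STATX_BASIC_STATS, &bufs[next]);
            sqe.user_data = @intCast(next);
            next += 1;
            in_flight += 1;
        }
        // One syscall submits the whole batch and waits for its completions.
        _ = try ring.submit_and_wait(in_flight);
        var done: u32 = 0;
        var cqes: [64]linux.io_uring_cqe = undefined;
        while (done < in_flight) {
            done += try ring.copy_cqes(&cqes, 1);
        }
    }
}
```

No fibers, no per-op allocation: the syscall cost is amortized over the whole batch, which is what the 0.17x number reflects.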

Questions:

  • Is my understanding of blocking/async/concurrent as defined by std.Io correct?
  • Do you believe that my concept of dedicated IO and/or syscall threads is promising? The assumption is that IO tasks rarely, if ever, saturate a core, and that the latency of a schedule loop for IO should be minimal (harvest requests, dispatch results, syscall uring, rinse, repeat).
  • I expect a lot of gains from sticky-thread, syscall-free compute. Is this reasonable, and do you have experience with it? If so, shouldn’t std.Io have a mechanism to declare “where” to await results or dispatch compute tasks? I didn’t see a way to integrate that into std.Io.

First, std.Io is an interface. But you are referring to a single implementation when you mention it, namely std.Io.Uring.

You could, if you wanted, make your own implementation that functions the way you want.

Unless you already did that in your custom-batched benchmark.

If you can show that Uring is slower in most use cases, and your implementation meets their other criteria, they may switch to yours.

The API is blocking in behaviour; that does not mean the implementation uses blocking syscalls.
io.async means “this can, but is not required to, run concurrently”, in one word: “asynchronously”.
io.concurrent means “this must run concurrently; error if that’s not possible”.

It does not in any way have any specifics on how the implementation accomplishes that.
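To make the three call styles concrete, here is a rough sketch. std.Io is still evolving, so the exact signatures may differ; `action` and `demo` are illustrative stand-ins, not real API:

```zig
const std = @import("std");

// A stand-in IO action; a real one would do a stat/read via `io`.
fn action(io: std.Io, path: []const u8) void {
    _ = io;
    _ = path;
}

pub fn demo(io: std.Io) !void {
    // 1. Plain call: blocking in behaviour. The implementation may still
    //    service it via io_uring; "blocking" says nothing about syscalls.
    action(io, "a.txt");

    // 2. io.async: *may* run concurrently, but the implementation is free
    //    to run it inline and complete it right here.
    var a = io.async(action, .{ io, "b.txt" });
    a.await(io);

    // 3. io.concurrent: *must* run on another execution context, otherwise
    //    it fails (e.g. with error.ConcurrencyUnavailable).
    var c = try io.concurrent(action, .{ io, "c.txt" });
    c.await(io);
}
```

Swapping the Io implementation changes how (2) and (3) are serviced without touching this code.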

Does this matter? It’s up to the programmer to choose the best implementation for their use case. And thanks to the interface, only a couple of lines need to be changed to experiment.

Yes, you have shown that for your benchmarked case at the very least, if not for your intended use case.

If the benchmarks are any indication? Yes! Whether they are actually a good indicator is a different question; I can’t say, I am not experienced in this area.

That should probably be the purview of the Io implementation, but they are open to expanding the API; in fact they need to, as many things are not currently supported.

Would be great to publish that benchmark!

There is something in std.Io for doing more than one syscall per fiber, though I don’t know how advanced that is: https://codeberg.org/ziglang/zig/pulls/30743

Eventually most of the file system and networking functionality is expected to migrate to being based on Operation, making it eligible to be used with Batch.

You can think of Batch as a low level concurrency mechanism which provides concurrency at an Operation layer, which is efficient and portable, but more difficult to abstract around, particularly if you need to run some logic in between operations.

Meanwhile Future (async/await/concurrent/cancel) is the equivalent at the function abstraction layer, which is very flexible and ergonomic, but it allocates task memory, and error.ConcurrencyUnavailable (when using concurrent) or unwanted blocking operations (when using async) can occur in more circumstances than with the lower-level Batch APIs.

So, generally, if you’re trying to write optimal, reusable software, Batch is the way to go if possible; otherwise, you can always fall back to the Future APIs if Batch turns out to be a pain in the butt. Or you can start with the Future APIs and then optimize by reworking some stuff to use Batch later.


I will make a repo with it and post a link later. I want to check if I can monkey-patch a fiber-free async into the current uring implementation, to see if that is actually what makes the difference between the custom implementation and std.Io.Uring. The code there is scary. Makes me feel stupid :wink:
From what I see in strace, there seems to be one fiber per scheduled IO task, instead of one per group-await. Though that seems implausible; that would be 60M * 155K. That shouldn’t be faster than single syscalls, given all the context switching. I’ll come back when I understand this better.