My Context:
I’m working on a messaging platform that’s focused on low latency, high throughput, and scalability (which one isn’t…).
The platform receives messages and forwards them (mostly unchanged) as a result of routing based on a subject tree. Very similar to what NATS does.
To get optimal performance, several serial performance killers need to be avoided or mitigated:
- system calls (batch/avoid them; gather/scatter in dedicated threads)
- memory copies (reuse input buffers, gather IO)
- memory allocations (reuse buffers, allocate before need, keep dedicated short-lived IO buffers)
- cache misses (minimize the data each core/thread operates on, use sticky threads)
Our current strategy is to use io_uring with a dedicated IO thread that receives IO jobs via MSG_RING and delivers results back to the caller. If the IO thread becomes a bottleneck, it can be scaled up. Non-IO syscalls can be dispatched to the IO thread or to dedicated syscaller threads, handled like IO.
At this stage, all the remaining tasks are pure compute: no IO, no syscalls, just syscall-free message passing. These compute tasks can now be optimized for cache locality, e.g. N worker threads receive and process messages for sharded clients, chatty clients get high-priority threads with smaller shards, etc.
std.Io:
I believe most of this should already work with std.Io, with the exception of a dedicated IO thread and the lack of a way to express where compute actions should run.
The current std.Io implementation I saw also behaves differently from what I expected: the uring implementation creates fibers for everything. If I run any IO (blocking or async), a fiber gets created. That’s probably owed to the fact that the code is new and not yet optimized. I also saw no infrastructure to control where and how compute should run, as a fiber in the same thread or in a separate one.
My understanding of blocking/async/concurrent:
- std.Io, when called without async, should (or can) run blocking.
- std.Io.async(action…) should dispatch the action. To receive a result later, the current workflow needs to be (or become) a fiber. Currently it seems like every action dispatch creates a new fiber, which is too expensive (if true). Worse, the created fiber is also dispatched to a thread pool, even when the results are collected in a group and only a single code path awaits them.
This implementation is (in my understanding) correct, but not optimal. In a test use case, I use this to batch IO operations in order to conserve syscalls. The uring implementation, using async and grouping to batch syscalls, is slower than the default blocking syscalls for small numbers of calls (in the hundreds). The test case stats every entry in a small dir (1 file), in /tmp with ~1K files, and in /nix/store with 160K files. The price of indiscriminately creating fibers is quite high:
dir          entries   single   std-batched   custom-batched
1 file             1       1x          1.6x            1.1x
/tmp            1019       1x          0.9x            0.2x
/nix/store      155K       1x         0.34x           0.17x

(runtimes relative to single blocking syscalls; lower is faster)
Custom directly batches the stat calls on a custom io_uring. At scale this is about 5x faster than single syscalls and twice as fast as std.Io; only the single-file case doesn’t pay off. std.Io gives away one of its huge conceptual advantages here: it should not be significantly slower than the custom solution.
In this test I disabled threading; with threading enabled, the results are even worse for uring (almost never better than single syscalls).
Questions:
- Is my understanding of blocking/async/concurrent as defined by std.Io correct?
- Do you believe that my concept of dedicated IO and/or syscall threads is promising? The assumption is that IO tasks rarely, if ever, saturate a core, and that the latency of the IO scheduling loop is minimal (harvest requests, dispatch results, submit to uring, rinse, repeat).
- I expect a lot of gains from sticky threads running syscall-free compute. Is this reasonable, and do you have experience with this? If so, shouldn’t std.Io have a mechanism to declare where to await results or where to dispatch compute tasks? I didn’t see a way to integrate that into std.Io.