That’s quite a big claim! I am confused enough about the whole context-switch thing to find this theoretically possible to believe, but it would really help to dissect specific benchmarks here, and also explain where, exactly, the difference is coming from, in terms of assembly code, CPU counters, and what not!
I was having trouble believing it myself, because I don’t focus on performance at this point at all.
But there are a few caveats. The “much faster” was in the context of a single thread, and that’s a big constraint. Both Go and Tokio are designed for multi-threaded workloads, and there is a lot of overhead coming from that. Stackless coroutines in Rust have a higher switching cost than stackful coroutines, purely because there is a lot more state management. I’m almost certain that, if I tried, I could beat any of them with Node in a single thread.
With multiple threads it’s much, much closer, but still surprisingly good. In any case, performance is not my focus for now. The reality is that Tokio has a huge overhead in async switching; other Rust frameworks do better. And even in Rust, people are starting to do stackful coroutines.
This is a good paper to read regarding performance of stackful coroutines:
This is a simple ping-pong benchmark. All runtimes end up effectively running it on a single thread, because the task blocked on the channel read continues on the same thread that performed the channel write.
Zig/Zio:
Total rounds: 1000000
Total time: 222.37 ms (0.222 s)
Time per round: 222 ns
Messages/sec: 8994010
Rounds/sec: 4497005
Go:
Total rounds: 1000000
Total time: 334.08 ms (0.334 s)
Time per round: 334 ns
Messages/sec: 5986608
Rounds/sec: 2993304
Rust/Tokio:
Total rounds: 1000000
Total time: 334.92 ms (0.335 s)
Time per round: 335 ns
Messages/sec: 5971564
Rounds/sec: 2985782
As I said, it’s mind-blowing to me, but this is reality.
Just merged an interesting change that’s not possible in the current std.Io design (without spawning a separate coroutine for each operation). I already had the option of using e.g. ResetEvent in select(), but that’s the easy case, because the event carries state. Now Channel operations can also participate in select(), so a task can wait on a channel send or receive alongside other waitable operations.
Unlike goroutines in Go or virtual threads in Java, tasks in Zio have a fixed stack size. This is a big limitation, coming from the fact that Zig is a language with manual memory management.
One thing I’ve been wondering here is whether it’s feasible to circumvent this using virtual memory. What you can do is map an 8MiB stack but rely on demand paging to allocate physical memory. This still isn’t great, because you have to do the work to set up memory maps for 8MiB worth of pages. But is there some way to directly reserve a penultimate level of the page table, so that you pay for only one entry up-front, and set up the virtual address mapping lazily as well?
Linux actually does something like that for the stack of the main thread as well, such that 8MiB of stack aren’t even present in the page table from the start (Stack probe puzzle). Something like that should be possible for green threads, no? (Though the answer here might start with writing your own kernel…)
I remember reading about this having been considered by other implementations (iirc by Rust as well back when green threads were used) but then discarded for some reason that currently escapes me.
This is how Windows Fibers work, for example. It’s possible, but to work efficiently it needs compiler support: you need the compiler to inject a check at function entry that tests the stack and possibly allocates more memory. I’m planning to do something like that during yield, but it’s not a high-priority task, because on Linux with overcommit enabled, the OS does it for me anyway. I can allocate fairly large stacks, more than the available memory, and Linux will not reserve all of it.
Rust used to have segmented stacks, an idea from earlier versions of Go. Once you are low on stack memory, you allocate a new segment and use that from then on; when you return, you deallocate the segment. That performs very badly if the segment boundary falls inside a hot loop that calls some function repeatedly (the “hot split” problem).
Regardless, I remember reading about it having been considered.
Thinking about it more, I think it was discarded because you can pull this off easily on 64-bit systems, but on 32-bit the address space starts getting crowded.
EDIT: for clarity, ‘it’ being the idea of using memory mapping tricks to avoid segmentation / reallocation.
I was looking into this and found something fascinating: because I already update the stack guard and stack base in the Windows TIB, I should get automatic stack growth handling on Windows if I allocate the memory correctly.
Linux would require a SIGSEGV handler and stack probing enabled, but that might not be needed because Linux supports overcommit.
I haven’t spent much time on stack allocation so far, just naively allocating a slice with the gpa, but I’ll spend some time on this: use mmap on Linux, VirtualAlloc on Windows, and properly set up the guard page everywhere. Even without the SIGSEGV handler, it should be a pretty good cross-platform solution. Not sure about macOS, but I don’t care about that too much.
Reminder that relying on overcommit is not portable. If you design a framework this way then you are creating an alternate ecosystem. That’s something that I’ve personally been striving very hard to avoid. Zig is meant to be a reusable language, which means that you should be able to create abstractions that can be used in any project.
I understand the sentiment, but scaling to millions of concurrent operations is going to be platform-dependent no matter what. For example, even on Linux with custom virtual memory reservations and moving the guard page in a signal handler, I need to inform users to configure vm.max_map_count if they want to scale. I could just as well inform users that they need to enable overcommit. The current option with naive fixed stacks is actually the most portable; it works everywhere. As I add more functionality to scale, it will be more and more dependent on what the platform provides, but always usable on all supported platforms.
I have some ugly AI-made proofs of concept, both for Windows using the native stack-growth system and for Linux using a SIGSEGV handler (I think this one should work on other Unixes as well).
Looking forward to integrating this properly. It would allow me to start with e.g. 64KiB of initial committed memory and grow up to, say, 8MiB.
Another important feature done: generic timeout support. Unlike with stackless coroutines, we can’t just select on everything, so I decided to include a dedicated timeout API.
It’s essentially automatic self-cancelation, with some tracking in the background so that you can recover from it (unlike user cancelation, which is unrecoverable).