Alternative std.Io implementation in zio

I’ve managed to get the std.Io implementation in zio to a state where I’d like to ask people to help me test it. It’s still not complete: some filesystem parts are missing and UDP networking is missing, but it’s already usable for many things.

Here you can see an echo server example:

It works in the same runtime as zio, so you have stackful coroutines, meaning you can have hundreds or even thousands of tasks running on the same thread. I/O is implemented using event loops based on io_uring (Linux), kqueue (macOS, FreeBSD, NetBSD), IOCP (Windows), and poll (any other POSIX system that I don’t explicitly test for).

I’m currently in the middle of refactoring how executors work, which was mainly needed to allow for the same style of programming that std.Io prefers, where the main thread is also a coroutine. Due to the refactoring, the runtime currently runs single-threaded, unless you explicitly spawn additional threads and run rt.run() in them.

If you want to play with it, you need to use the zig-0.16 branch:

zig fetch --save "git+https://github.com/lalinsky/zio.zig#zig-0.16"

I’m trying to keep it up to date with Zig master, but it’s best effort; there are still too many breaking changes happening and it takes me a while to accommodate them.

IIRC Io.Group leaks memory when used in the way you’re using it with the Io.Threaded implementation of Io. Is that still the case with zio, or do you reap completed tasks in some way?

I wasn’t aware of this problem; my implementation currently has the same issue. But I have the infrastructure to reap the tasks as they complete, so I’ll make the changes.

However, with zio you also have the option of using the native API for the accept loop, and spawn+detach will definitely not leak.

Done, I’ve updated the code to avoid the leak. I have a hybrid doubly-linked list for the task list that allows me to remove tasks early while staying safe under concurrency, so this was easy.

Thank you, this is the kind of feedback I was hoping for.

I tried to build zio from source but I got an error:

install
└─ install zio
   └─ compile lib zio Debug native 2 errors
src/select.zig:140:12: error: invalid builtin function: '@Type'
    return @Type(.{ .@"struct" = .{
           ^~~~~
src/select.zig:180:12: error: invalid builtin function: '@Type'
    return @Type(.{ .@"union" = .{
           ^~~~~
error: 2 compilation errors

I’m using the latest master version.

Thanks.

AFAIK in 0.16.0 the builtin @Type is replaced by a dedicated builtin for each tag of the type union, so you should have @Struct, @Union, …

The error seems to be exactly that, so my guess is a version difference between the two of you.

You are using the main branch of zio, that’s for Zig 0.15 and doesn’t support Io.

You need the zig-0.16 branch of zio.

I replaced @Type with @Struct, but I still got errors from the examples: from Target, from .handler = .{ .sigaction = stackFaultHandler }, and from const fault_addr = getFaultAddress(info).

I switched to zig-0.16 branch, but I still got the same errors.

Are you building it standalone, or as part of your own project? Can you show me which commit ID you are trying to build, and which version of Zig you are using?

I’m testing it regularly myself and have CI, so I’m quite sure it works.

Sorry, I forgot to pull changes for the zig-0.16 branch.

I would like to test zio for a new project using libusb.

Thanks.

Let me know how it goes. I’m very interested in use cases like this, as my primary motivation for the library is very different, so you might find some things I missed.

One doubt about libusb is if io_uring will work.

What is the default runtime on Linux and how can I select a runtime using epoll?

Thanks

Both backends should work; however, I just looked at this and it seems that you need to poll on file fds. That’s currently not directly supported; you can fake it by creating a socket with the same fd. I was already planning to add file polling a few days ago, to support another use case, so I’ll add it over the weekend. Just a note that you will need to use the zio API for this, as the std Io interface has no concept of polling.

Looking at libusb’s Polling and timing documentation (“Detailed Description”), things are going to be complicated if you want to do polling without using the libusb API.

Thanks.

.x86_64 => asm volatile (
            \\ leaq 0f(%%rip), %%rdx
            \\ movq %%rsp, 0(%%rax)
            \\ movq %%rbp, 8(%%rax)
            \\ movq %%rdx, 16(%%rax)
            \\
            ++ (if (is_windows)
                \\ // Load TEB pointer and save TIB fields
                \\ movq %%gs:0x30, %%r10
                \\ movq 0x20(%%r10), %%r11
                \\ movq %%r11, 24(%%rax)
                \\ movq 0x1478(%%r10), %%r11
                \\ movq %%r11, 32(%%rax)
                \\ movq 0x08(%%r10), %%r11
                \\ movq %%r11, 40(%%rax)
                \\ movq 0x10(%%r10), %%r11
                \\ movq %%r11, 48(%%rax)
                \\
            else
                "")
            ++
            \\ // Restore stack pointer and base pointer
            \\ movq 0(%%rcx), %%rsp
            \\ movq 8(%%rcx), %%rbp
            \\
            ++ (if (is_windows)
                \\ // Load TEB pointer and restore TIB fields
                \\ movq %%gs:0x30, %%r10
                \\ movq 24(%%rcx), %%r11
                \\ movq %%r11, 0x20(%%r10)
                \\ movq 32(%%rcx), %%r11
                \\ movq %%r11, 0x1478(%%r10)
                \\ movq 40(%%rcx), %%r11
                \\ movq %%r11, 0x08(%%r10)
                \\ movq 48(%%rcx), %%r11
                \\ movq %%r11, 0x10(%%r10)
                \\

I’ve been using my own fiber implementation on Windows, long before std.Io was a thing. Since I wasn’t bound by a specific API, I did things my way. You can make the context swapping a lot faster.
I made my fiber intrusive, inspired by the current linked list implementation in std. The linked list doesn’t allocate its Nodes; instead the user needs to provide one. In my case, I ask the user to provide the memory that will be used for the coroutine’s stack. The memory needs to be committed already. With that, I don’t need stack guards. Disabling stack guards improves performance of every function, even those that are not using coroutines.
The whole business of growing the stack progressively is a stupid relic from the past. Today’s stacks should always be pre-committed. This saves the cycles that would be spent checking whether growing is necessary, and avoids performance hiccups during page commits. It also means that memory consumption is bounded by a realistic amount. Once you’re out, you’re out; you have to wait until other tasks finish. With the reserve-then-grow approach, memory consumption keeps increasing until, at a certain point, you start memory swapping, tanking performance, and if you keep adding more tasks, you’ll eventually run out of memory and crash. There is no mechanism to recover from an out-of-stack-space situation. This cannot happen with pre-committed memory (as long as its size was properly calculated).
With all that said, once you’re using pre-committed memory, you don’t have to update the TEB. In over a year of using my coroutines, the only use I’ve ever seen for the TEB is checking for running out of stack. At the beginning of your thread, set:

const teb = std.os.windows.teb();
teb.NtTib.StackLimit = @ptrFromInt(1);

This will prevent Windows from freaking out about running out of stack. After that, just forget the TEB. Nothing bad happens.
This makes the context swap code just:

\\ lea (-8 * 27)(%rsp), %rax
\\ lea (8 * 27)(%rcx), %rsp
pub inline fn switchContext(
    noalias current_context_param: *Context,
    noalias new_context: *Context,
) void {

You don’t need to take a pointer to structs where you’ll store your temp data. You have a perfectly good place for temp data: the stack itself. Storing data through a pointer is inefficient; you can just push a whole bunch of stuff onto the stack. It’s going to be frozen anyway. In my case, the tasks are linked through a linked list, and I also use the inactive stacks to hold the nodes for the linked list.

Also, I don’t think you want to make this inline. This function is going to be really hot; it might be better to have it loaded in a single place in the cache, with every coroutine accessing it there.

              .rax = true,
              .rcx = true,
              .rdx = true,
              .rbx = true,
              .rsi = true,
              .rdi = true,
              .r8 = true,
              .r9 = true,
              .r10 = true,
     ...

You shouldn’t trust Zig’s clobber handling. There was a point where it wasn’t properly saving and restoring the correct registers, and most often it would save a whole bunch of stuff unnecessarily. On Windows, you have to save the xmm registers and some of the scalar registers, and that’s it. It’s very likely this code is saving everything else too. This means you need to put your assembly at the top level, not in a function.

Putting everything together, this is my context swap function:

comptime {
    asm (
            \\.global swapFiber
            \\swapFiber:
            \\ mov %rbx, (8 * 4)(%rsp)
            \\ mov %rdi, (8 * 3)(%rsp)
            \\ mov %rsi, (8 * 2)(%rsp)
            \\ mov %r12, (8 * 1)(%rsp)
            \\
            \\ mov %rbp, (-8 * 1)(%rsp)
            \\ mov %r13, (-8 * 2)(%rsp)
            \\ mov %r14, (-8 * 3)(%rsp)
            \\ mov %r15, (-8 * 4)(%rsp)
            \\
            \\ vmovapd %xmm6,  (-8 * 7)(%rsp)
            \\ vmovapd %xmm7,  (-8 * 9)(%rsp)
            \\ vmovapd %xmm8,  (-8 * 11)(%rsp)
            \\ vmovapd %xmm9,  (-8 * 13)(%rsp)
            \\ vmovapd %xmm10, (-8 * 15)(%rsp)
            \\ vmovapd %xmm11, (-8 * 17)(%rsp)
            \\ vmovapd %xmm12, (-8 * 19)(%rsp)
            \\ vmovapd %xmm13, (-8 * 21)(%rsp)
            \\ vmovapd %xmm14, (-8 * 23)(%rsp)
            \\ vmovapd %xmm15, (-8 * 25)(%rsp)
            \\
            \\ lea (-8 * 27)(%rsp), %rax
            \\ lea (8 * 27)(%rcx), %rsp
            \\
            \\ restoreFiber:
            \\ mov (8 * 4)(%rsp), %rbx
            \\ mov (8 * 3)(%rsp), %rdi
            \\ mov (8 * 2)(%rsp), %rsi
            \\ mov (8 * 1)(%rsp), %r12
            \\
            \\ mov (-8 * 1)(%rsp), %rbp
            \\ mov (-8 * 2)(%rsp), %r13
            \\ mov (-8 * 3)(%rsp), %r14
            \\ mov (-8 * 4)(%rsp), %r15
            \\
            \\ vmovapd (-8 * 7)(%rsp), %xmm6
            \\ vmovapd (-8 * 9)(%rsp), %xmm7
            \\ vmovapd (-8 * 11)(%rsp), %xmm8
            \\ vmovapd (-8 * 13)(%rsp), %xmm9
            \\ vmovapd (-8 * 15)(%rsp), %xmm10
            \\ vmovapd (-8 * 17)(%rsp), %xmm11
            \\ vmovapd (-8 * 19)(%rsp), %xmm12
            \\ vmovapd (-8 * 21)(%rsp), %xmm13
            \\ vmovapd (-8 * 23)(%rsp), %xmm14
            \\ vmovapd (-8 * 25)(%rsp), %xmm15
            \\
            \\ ret
        );
}

But all in all, it’s awesome that we already have competing implementations of Io. The std lib doesn’t even have an implementation of evented I/O for Windows. Have you considered making a pull request?

The current context switching is actually the fastest you can get. If you check the other zio thread, I posted a paper that explains why it works the way it works.

Stack growing is implemented because there is currently no way of computing the stack usage of a function, and I really want this to scale to thousands and thousands of coroutines, so I can’t commit several megabytes of stack space per coroutine. I could on Linux with overcommit, but the current behavior is better, especially on Windows, where I actually use kernel functionality that is in place for Windows Fibers, which automatically extends the stack.

This can’t be contributed to std: it’s not done in the direct style that the Zig team prefers, and it’s also partially AI assisted, which is a strict no-no for Zig.

Yes, calculating the amount of stack usage is the main problem in my implementation. I wrote a helper that does a sort of watermark calculation of the maximum stack usage: I run the function once while writing the code, read off the watermark, and then write the final code that allocates the necessary amount of space. @stackSize would solve this.
About scalability, this approach actually goes against what you’re trying to do. If you actually run thousands and thousands of coroutines, you will allocate many gigabytes of memory, without any sort of control, and the system presumes these allocations won’t fail. At that level, you might actually run out of memory, and you would have no option for recovery; you would segfault and crash. When you are in control of the stack usage, if there isn’t enough space for a coroutine, you just put it in a queue, or pause until another coroutine finishes and frees up some space. This is actual cooperation between coroutines. The operating system’s implementation is just a free-for-all fight over memory. Linux’s overcommit doesn’t help here; if you run out of memory, you’d crash the same way.

I’m not sure if @stackSize could work. What about recursion? That’s essentially dynamic stack allocation.

Obviously, if you use more memory than is available, the OOM killer will most likely get you, but that would happen no matter what. As far as I know, most Linux distros overcommit memory by default, so there will be no OutOfMemory error.

I still think my solution of starting at 64 KiB and growing exponentially is better than giving each coroutine 4 MiB like the WIP version in std.

It’s the opposite: error.OutOfMemory doesn’t kill you. You are being informed that you can’t allocate more memory, so you can recover, for example by doing other things while waiting for more memory to become available.
When every coroutine gets a huge reserved space, you won’t get an OutOfMemory. At some point, one function is going to try to use a memory page, the OS will try to back it with a physical page, it won’t find a free one, and then your program is going to crash. You’re not informed of it, and control flow never comes back to you so that you could recover.
