Alternative std.Io implementation in zio

.x86_64 => asm volatile (
            \\ leaq 0f(%%rip), %%rdx
            \\ movq %%rsp, 0(%%rax)
            \\ movq %%rbp, 8(%%rax)
            \\ movq %%rdx, 16(%%rax)
            \\
            ++ (if (is_windows)
                \\ // Load TEB pointer and save TIB fields
                \\ movq %%gs:0x30, %%r10
                \\ movq 0x20(%%r10), %%r11
                \\ movq %%r11, 24(%%rax)
                \\ movq 0x1478(%%r10), %%r11
                \\ movq %%r11, 32(%%rax)
                \\ movq 0x08(%%r10), %%r11
                \\ movq %%r11, 40(%%rax)
                \\ movq 0x10(%%r10), %%r11
                \\ movq %%r11, 48(%%rax)
                \\
            else
                "")
            ++
            \\ // Restore stack pointer and base pointer
            \\ movq 0(%%rcx), %%rsp
            \\ movq 8(%%rcx), %%rbp
            \\
            ++ (if (is_windows)
                \\ // Load TEB pointer and restore TIB fields
                \\ movq %%gs:0x30, %%r10
                \\ movq 24(%%rcx), %%r11
                \\ movq %%r11, 0x20(%%r10)
                \\ movq 32(%%rcx), %%r11
                \\ movq %%r11, 0x1478(%%r10)
                \\ movq 40(%%rcx), %%r11
                \\ movq %%r11, 0x08(%%r10)
                \\ movq 48(%%rcx), %%r11
                \\ movq %%r11, 0x10(%%r10)
                \\

I’ve been using my own fiber implementation on Windows, long before std.Io was a thing. Since I wasn’t bound by a specific API, I did things my way. You can make the context swapping a lot faster.
I made my fiber intrusive, inspired by the current linked list implementation in std. The linked list doesn’t allocate its nodes; instead, the user needs to provide them. In my case, I ask the user to provide the memory that will be used for the coroutine’s stack. The memory needs to be committed already. With that, I don’t need stack guards. Disabling stack guards improves the performance of every function, even those that are not using coroutines. The whole idea of growing the stack progressively is a stupid relic from the past. Today’s stacks should always be pre-committed. This saves the cycles that would be spent checking whether growing is necessary, and avoids performance hiccups during page commits. It also means that memory consumption is bounded by a realistic amount. Once you’re out, you’re out; you have to wait until other tasks finish. With the reserve-then-grow approach, memory consumption keeps increasing until, at a certain point, you start swapping, which tanks performance, and if you keep adding more tasks, you’ll eventually run out of memory and crash. There is no mechanism to recover from an out-of-stack-space situation. This cannot happen with pre-committed memory (as long as its size was properly calculated).
With all that said, once you’re using pre-committed memory, you don’t have to update the TEB. In over a year of using my coroutines, the only use I’ve ever seen for the TEB is to check for running out of stack. At the beginning of your thread, set:
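To make the intrusive idea concrete, here is a minimal sketch of a fiber that borrows a caller-provided stack. The `Fiber` type, its fields, and `remaining` are hypothetical names for illustration, not the actual implementation, and the static buffer only stands in for memory that was committed up front.

```zig
const std = @import("std");

/// Hypothetical intrusive fiber: it allocates nothing and only borrows the caller's memory.
const Fiber = struct {
    /// Caller-provided stack memory; assumed to be fully committed already.
    stack: []align(16) u8,
    /// Stack pointer to resume from; starts at the top because x86_64 stacks grow downward.
    saved_sp: usize,

    fn init(stack: []align(16) u8) Fiber {
        return .{
            .stack = stack,
            .saved_sp = @intFromPtr(stack.ptr) + stack.len,
        };
    }

    /// Bytes still available below the saved stack pointer.
    fn remaining(self: Fiber) usize {
        return self.saved_sp - @intFromPtr(self.stack.ptr);
    }
};

// A static buffer stands in for memory the caller committed up front.
var stack_memory: [64 * 1024]u8 align(16) = undefined;

test "fiber borrows a caller-provided stack" {
    const fiber = Fiber.init(&stack_memory);
    try std.testing.expectEqual(@as(usize, 64 * 1024), fiber.remaining());
    try std.testing.expect(fiber.saved_sp % 16 == 0);
}
```

Because the fiber never allocates, the total stack memory in use is decided entirely by whoever hands out the buffers, which is what keeps consumption bounded.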

const teb = std.os.windows.teb();
teb.NtTib.StackLimit = @ptrFromInt(1);

This will prevent Windows from freaking out about running out of stack. After that, just forget the TEB. Nothing bad happens.
This makes the context swap code just:

\\ lea (-8 * 27)(%rsp), %rax
\\ lea (8 * 27)(%rcx), %rsp
pub inline fn switchContext(
    noalias current_context_param: *Context,
    noalias new_context: *Context,
) void {

You don’t need to take pointers to structs where you’ll store your temp data. You have a perfectly good place for temp data: the stack itself. Storing data through a pointer is inefficient. You can just push a whole bunch of stuff onto the stack. It’s going to be frozen anyway. In my case, the tasks are linked through a linked list, and I also use the inactive stacks to hold the nodes for the linked list.
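As a rough illustration of keeping the list nodes inside the inactive stacks, here is a minimal sketch. `Node`, `ReadyQueue`, and `nodeInStack` are hypothetical names, and placing the node at the lowest address of the buffer is an assumption made for the example; the real layout may differ.

```zig
const std = @import("std");

/// Hypothetical queue node; `saved_sp` stands in for whatever is needed to resume the fiber.
const Node = struct {
    next: ?*Node = null,
    saved_sp: usize = 0,
};

/// Minimal intrusive FIFO: it links nodes but never allocates them.
const ReadyQueue = struct {
    head: ?*Node = null,
    tail: ?*Node = null,

    fn push(self: *ReadyQueue, node: *Node) void {
        node.next = null;
        if (self.tail) |tail| {
            tail.next = node;
        } else {
            self.head = node;
        }
        self.tail = node;
    }

    fn pop(self: *ReadyQueue) ?*Node {
        const node = self.head orelse return null;
        self.head = node.next;
        if (self.head == null) self.tail = null;
        return node;
    }
};

/// Assumption: the node sits at the lowest address of the stack buffer,
/// the last bytes a downward-growing stack would reach.
fn nodeInStack(stack: []align(16) u8) *Node {
    std.debug.assert(stack.len >= @sizeOf(Node));
    const node: *Node = @ptrCast(stack.ptr);
    node.* = .{};
    return node;
}

var stack_a: [4096]u8 align(16) = undefined;
var stack_b: [4096]u8 align(16) = undefined;

test "queue nodes live inside the fibers' own stack memory" {
    var queue: ReadyQueue = .{};
    const a = nodeInStack(&stack_a);
    const b = nodeInStack(&stack_b);
    queue.push(a);
    queue.push(b);
    try std.testing.expect(queue.pop() == a);
    try std.testing.expect(queue.pop() == b);
    try std.testing.expect(queue.pop() == null);
}
```

The scheduler then needs no allocator at all: every suspended fiber carries its own queue node in memory it already owns.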

Also, I don’t think you want to make this inline. This function is going to be really hot, so it might be better to have a single copy of it that stays in the cache and have every coroutine call that copy.

              .rax = true,
              .rcx = true,
              .rdx = true,
              .rbx = true,
              .rsi = true,
              .rdi = true,
              .r8 = true,
              .r9 = true,
              .r10 = true,
     ...

You shouldn’t trust Zig’s clobber thing. At one point it wasn’t saving and restoring the correct registers and, most often, it would save a whole bunch of stuff unnecessarily. On Windows, you have to save the xmm registers, some of the scalar registers, and that’s it. It’s very likely this code is saving everything else. To control exactly what gets saved, you need to put your assembly at the top level, not in a function.

Putting everything together, this is my context swap function:

comptime {
    asm (
            \\.global swapFiber
            \\swapFiber:
            \\ mov %rbx, (8 * 4)(%rsp)
            \\ mov %rdi, (8 * 3)(%rsp)
            \\ mov %rsi, (8 * 2)(%rsp)
            \\ mov %r12, (8 * 1)(%rsp)
            \\
            \\ mov %rbp, (-8 * 1)(%rsp)
            \\ mov %r13, (-8 * 2)(%rsp)
            \\ mov %r14, (-8 * 3)(%rsp)
            \\ mov %r15, (-8 * 4)(%rsp)
            \\
            \\ vmovapd %xmm6,  (-8 * 7)(%rsp)
            \\ vmovapd %xmm7,  (-8 * 9)(%rsp)
            \\ vmovapd %xmm8,  (-8 * 11)(%rsp)
            \\ vmovapd %xmm9,  (-8 * 13)(%rsp)
            \\ vmovapd %xmm10, (-8 * 15)(%rsp)
            \\ vmovapd %xmm11, (-8 * 17)(%rsp)
            \\ vmovapd %xmm12, (-8 * 19)(%rsp)
            \\ vmovapd %xmm13, (-8 * 21)(%rsp)
            \\ vmovapd %xmm14, (-8 * 23)(%rsp)
            \\ vmovapd %xmm15, (-8 * 25)(%rsp)
            \\
            \\ lea (-8 * 27)(%rsp), %rax
            \\ lea (8 * 27)(%rcx), %rsp
            \\
            \\ restoreFiber:
            \\ mov (8 * 4)(%rsp), %rbx
            \\ mov (8 * 3)(%rsp), %rdi
            \\ mov (8 * 2)(%rsp), %rsi
            \\ mov (8 * 1)(%rsp), %r12
            \\
            \\ mov (-8 * 1)(%rsp), %rbp
            \\ mov (-8 * 2)(%rsp), %r13
            \\ mov (-8 * 3)(%rsp), %r14
            \\ mov (-8 * 4)(%rsp), %r15
            \\
            \\ vmovapd (-8 * 7)(%rsp), %xmm6
            \\ vmovapd (-8 * 9)(%rsp), %xmm7
            \\ vmovapd (-8 * 11)(%rsp), %xmm8
            \\ vmovapd (-8 * 13)(%rsp), %xmm9
            \\ vmovapd (-8 * 15)(%rsp), %xmm10
            \\ vmovapd (-8 * 17)(%rsp), %xmm11
            \\ vmovapd (-8 * 19)(%rsp), %xmm12
            \\ vmovapd (-8 * 21)(%rsp), %xmm13
            \\ vmovapd (-8 * 23)(%rsp), %xmm14
            \\ vmovapd (-8 * 25)(%rsp), %xmm15
            \\
            \\ ret
        );
}
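A quick sanity check of the slot layout, as a sketch rather than part of the implementation: on Win64, rsp is 8 mod 16 at function entry (the caller keeps it 16-byte aligned and `call` pushes the return address), and the four stores at positive offsets land in the 32-byte shadow space the caller reserves. The test below copies the xmm offsets from the listing and checks that every `vmovapd` slot is 16-byte aligned; `entry_rsp` is a made-up value.

```zig
const std = @import("std");

// Byte offsets relative to rsp at entry, copied from the xmm6..xmm15 stores in the listing.
const xmm_offsets = [_]isize{
    -8 * 7,  -8 * 9,  -8 * 11, -8 * 13, -8 * 15,
    -8 * 17, -8 * 19, -8 * 21, -8 * 23, -8 * 25,
};

test "xmm save slots are aligned and packed" {
    // Hypothetical entry value: any rsp with rsp % 16 == 8 behaves the same.
    const entry_rsp: isize = 0x10008;
    for (xmm_offsets, 0..) |offset, i| {
        // vmovapd faults on addresses that are not 16-byte aligned.
        try std.testing.expect(@mod(entry_rsp + offset, 16) == 0);
        // Consecutive slots are exactly 16 bytes apart, so they cannot overlap.
        if (i > 0) try std.testing.expect(xmm_offsets[i - 1] - offset == 16);
    }
    // The pointer handed back in rax (rsp - 8 * 27) is 16-byte aligned as well.
    try std.testing.expect(@mod(entry_rsp - 8 * 27, 16) == 0);
}
```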

But all in all, it’s awesome that we already have competing implementations of Io. The std lib doesn’t even have an implementation of evented IO for Windows. Have you considered making a pull request?
