Is it possible to implement Stack Stitching with Hot Split and have it *just work* with Zig?

I planned to implement a Go-like runtime (v1) in Zig - with Stack Stitching w/ Hot Split.

I got this working for the happy path, but I want to be able to use Zig’s defer stack to have fiber-fault isolation.

I transpile a language to Zig, so I can make sure that any locks, heap allocations, resources, etc. are put on the defer / errdefer stack, which in theory makes cleaning up after a fiber overflow easy.

But is there any way to get this working with Stack Stitching w/ Hot Split?

Essentially, I have an LLVM Machine Pass that injects asm at the very top of every transpiled function, like so:

entry:
  # START MY CUSTOM PROLOGUE BEFORE ANYTHING ELSE
  jmp resume if (rsp - frame_size > LIMIT)   # enough room: skip the split
  call morestack
  # morestack creates a larger stack, switches to it, and hijacks the return
  # address to CALL lessstack. lessstack moves the stack pointer back to the
  # old stack, then RETs whatever would normally be returned.
  # morestack JUMPs to here (not RET)

resume:
  # END MY PROLOGUE
  # the existing function …
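
The comparison at the top of that prologue can be modeled in C to pin down the boundary cases. This is only a sketch: `needs_more_stack` is a hypothetical helper, the stack is assumed to grow downward, and landing exactly on LIMIT is assumed to count as out of room (which is what the `>` in the jmp implies).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the prologue's check. Returns true when the
 * function must call morestack before building its frame.
 * Assumes a downward-growing stack, so lower addresses are "deeper". */
static bool needs_more_stack(uintptr_t sp, size_t frame_size, uintptr_t limit)
{
    /* If the frame is larger than sp itself, sp - frame_size would wrap
     * around to a huge value and wrongly look like "plenty of room". */
    if (frame_size > sp)
        return true;

    /* Mirrors: jmp resume if (rsp - frame_size > LIMIT) */
    return (sp - frame_size) <= limit;
}
```

The wraparound guard is the part that is easy to miss when the check is written directly in asm with an unsigned subtract.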

This just works if the fiber doesn’t error. But a fiber that never errors for any reason is completely unrealistic. I’m lost as to how I can prove this will just work with the defer and errdefer stack, unwinding, backtracing, etc.

For one, I’m in way over my head. I can’t find out definitively what I must guarantee to ensure Zig’s unwinding would just work. Under the hood, I thought errors are just returned, not unwound, so this shouldn’t really be a problem: if it works with normal returns, it should just work with errors. But I can’t make assumptions. I’m not sure how much testing it would take for me to feel confident it actually works, rather than just coincidentally passing whatever tests I write. Lastly, Zig is in flux, so I’m hoping there could be a solution that is future-proof.
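
To make that assumption concrete, here is a rough C model of what `try` plus `errdefer` would look like if errors are plain return values. Everything here is hypothetical (`result_i32`, `parse_positive`, `work`, the error codes), and it reflects my assumption about the lowering, not something taken from the compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a Zig error union such as `Error!i32`. */
typedef struct {
    uint16_t err;    /* 0 means success, nonzero is an error code */
    int32_t  value;  /* only meaningful when err == 0 */
} result_i32;

static int live_allocs = 0;  /* buffers allocated by work() and not yet freed */

static result_i32 parse_positive(int32_t x)
{
    if (x < 0)
        return (result_i32){ .err = 1, .value = 0 };
    return (result_i32){ .err = 0, .value = x };
}

/* Models a Zig function that allocates, registers an errdefer to free,
 * then calls a fallible function with `try`. */
static result_i32 work(int32_t x, void **out_buf)
{
    assert(out_buf != NULL);

    void *buf = malloc(16);
    if (buf == NULL)
        return (result_i32){ .err = 2, .value = 0 };
    live_allocs++;

    result_i32 r = parse_positive(x);  /* `try`: check the code, propagate */
    if (r.err != 0) {
        free(buf);                     /* errdefer body, run inline */
        live_allocs--;
        return r;                      /* an ordinary return, no unwinder */
    }

    *out_buf = buf;                    /* success: caller owns the buffer */
    return r;
}
```

In this model the error path ends in the same kind of return as the success path, so the hijacked return address would be hit either way. Whether the real lowering matches is exactly the part I can't confirm.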

Naively, I assumed this would work. Now I’m feeling the pain of my stupidity /=

Stack Splitting would effectively be useless if it only works with fibers that don’t ever error.

For one thing, there’s no guarantee that the backend is LLVM. When Zig reaches its goal of removing the LLVM dependency and gains parity with LLVM optimized code generation, your solution will have more limited value.

Theoretically, this is not a hard problem to solve even without LLVM, no?

As long as Zig can output asm, I can pretty easily insert a prologue at the top of functions.

I’m less concerned with that, and more concerned that I don’t know anything about how Zig actually stores errors and errdefers on the stack, or whether something I’m doing is incompatible in some weird edge case.

I can test that at least some errors do propagate through what I’m doing, but I don’t know enough to tell whether this actually works or just passes the one test case I can think of.

I believe you are correct that errors are just returned, not unwound. I don’t know of anything in Zig that unwinds the stack (that is, recovers the state of a previous frame). Unwinding is mentioned in the std library, but I don’t know under which circumstances Zig itself does it. Of course, if you call into C++ code, that code may throw exceptions, but they are not initiated by Zig.
There is, however, stack tracing, that is, recording which functions were called, in order. Zig doesn’t do anything special here; it walks the stack in the platform-prescribed way. For Windows, that information is in the unwind tables, which are indexed from the .pdata section, even though you don’t necessarily need to use them for unwinding. Zig and debuggers use this information to find where the return address is within the stack. I believe that to get your code to work with stack traces, you just need to write an appropriate entry in this table. The static table cannot be edited after the program is loaded (runtime-generated code has to register its own tables through RtlAddFunctionTable), so for a prologue emitted at compile time you need to write the entry during the build.
For Linux, all I know is that this information is called call frame information (CFI), and it is stored in DWARF format, typically in the .eh_frame or .debug_frame section.
Other than messed-up stack traces, I think your code will work, regardless of whether the function returns an error.
If you call into external functions on Windows, you have to be careful with the TIB (Thread Information Block), otherwise Windows will think you caused a stack overflow. I talked about it here.
With all that said, there’s a reason why Go abandoned this approach. It’s not really performant. It’s much better to just know the stack size you need for your function, and allocate it all up front. I also talked about this in the linked discussion. It’s ridiculous that in 2026 we still don’t know the stack usage of our functions. The compiler needs to know this, except for the obvious corner cases, like recursive functions, so why doesn’t it just give it to us? It’s great that Zig is tackling this, but this should have been standard practice long before Zig even showed up.
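
As a rough illustration of what knowing the stack size up front would mean: for a call graph without recursion, the worst case is a function's own frame plus the deepest of its callees. A minimal C sketch, where `func_info`, `worst_case`, and the hand-written frame sizes are all hypothetical (a real tool would get them from the compiler and would also have to handle indirect calls and variable-sized frames):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FUNCS 8
#define UNBOUNDED ((size_t)-1)  /* returned when recursion is reachable */

typedef struct {
    size_t frame_size;          /* bytes this function's own frame needs */
    int    callees[MAX_FUNCS];  /* indices of the functions it calls */
    int    n_callees;
} func_info;

/* Depth-first walk over the call graph with memoization.
 * state[i]: 0 = unvisited, 1 = on the current path, 2 = finished.
 * Both state and memo must be zero-initialized by the caller. */
static size_t worst_case(const func_info *f, int i, int *state, size_t *memo)
{
    assert(f[i].n_callees <= MAX_FUNCS);

    if (state[i] == 2)
        return memo[i];
    if (state[i] == 1)
        return UNBOUNDED;       /* we came back to a function on the path */

    state[i] = 1;
    size_t deepest = 0;
    for (int k = 0; k < f[i].n_callees; k++) {
        size_t d = worst_case(f, f[i].callees[k], state, memo);
        if (d == UNBOUNDED) {
            deepest = UNBOUNDED;
            break;
        }
        if (d > deepest)
            deepest = d;
    }

    memo[i] = (deepest == UNBOUNDED) ? UNBOUNDED : f[i].frame_size + deepest;
    state[i] = 2;
    return memo[i];
}
```

Calling `worst_case` on a fiber's entry function would give the size to allocate for it; an `UNBOUNDED` result marks the recursive corner case where some other strategy is still needed.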