I was working on a little audio visualizer using 2D HTML5 canvas and Web Audio, and decided to port the inner loop to Zig. But doing it the easy way means the WebAssembly will be context switching to a JavaScript bridge function thousands of times per frame! There has to be a better way!
It’s a pretty classic Zig data-oriented design. Instead of doing a thousand “syscalls” per frame, we accumulate commands in a buffer. And instead of having large heterogeneous commands, we use u16 indices into per-frame pools of coordinates and colors, and fixed-width 8-byte commands.
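Roughly, the layout looks something like this (a minimal sketch; the names, fields, and pool sizes are illustrative, not the actual code):

```zig
const std = @import("std");

// Sketch of the per-frame pools and fixed-width commands
// (names and exact fields are hypothetical; the real code surely differs).
const Point = extern struct { x: f32, y: f32 };
const Color = extern struct { r: u8, g: u8, b: u8, a: u8 };

// Every command is exactly 8 bytes: a small opcode plus u16 operands that
// index into the per-frame pools instead of embedding the data itself.
const Command = extern struct {
    op: u16, // e.g. begin_path, move_to, line_to, fill, ...
    a: u16,  // index into the point pool
    b: u16,  // second point index, or index into the color pool
    c: u16,  // spare operand / flags
};

comptime {
    std.debug.assert(@sizeOf(Command) == 8);
}

// Per-frame pools: plain arrays plus a length, reset at the start of each frame.
const Frame = struct {
    points: [4096]Point = undefined,
    points_len: u16 = 0,
    colors: [256]Color = undefined,
    colors_len: u16 = 0,
    commands: [8192]Command = undefined,
    commands_len: u16 = 0,
};
```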
Now for me there was a point during this hacking session when I had a sudden exhilarating moment of understanding the big deal about Unix. It was when I had written enough of the serialization code to wonder how to actually send it from WebAssembly to the JavaScript world. Do I need to implement my own Writer vtable? And then I realized: wait, I already have a simple file I/O write syscall handler for debug output, so if I just use Zig’s file writer with, say, fd 3, I don’t have to implement anything!
And that’s how it works. I write a bunch of struct vectors to fd 3 with the standard file writer, using a simple packet protocol. The JavaScript side can decode everything with super efficient typed array views. And if you compile it to a native program, you just redirect fd 3 to a file and inspect the output!
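Concretely, the end-of-frame flush is just a few writeAll calls on a writer wrapping fd 3. Here’s a minimal sketch, written against the pre-0.15 `std.fs.File.writer()`/`bufferedWriter` API (with the new `std.Io.Writer` you’d hand the writer an explicit buffer and `flush()` at the end); the `Header` layout and `flushFrame` name are made up for illustration:

```zig
const std = @import("std");

// Hypothetical packet header so the receiving side knows how much of each to expect.
const Header = extern struct {
    points_len: u32,
    colors_len: u32,
    commands_len: u32,
};

// Flush one frame's worth of pools and commands to fd 3.
fn flushFrame(frame: *const Frame) !void {
    const pipe = std.fs.File{ .handle = 3 };
    var buffered = std.io.bufferedWriter(pipe.writer());
    const w = buffered.writer();

    const header = Header{
        .points_len = frame.points_len,
        .colors_len = frame.colors_len,
        .commands_len = frame.commands_len,
    };
    try w.writeAll(std.mem.asBytes(&header));
    try w.writeAll(std.mem.sliceAsBytes(frame.points[0..frame.points_len]));
    try w.writeAll(std.mem.sliceAsBytes(frame.colors[0..frame.colors_len]));
    try w.writeAll(std.mem.sliceAsBytes(frame.commands[0..frame.commands_len]));
    try buffered.flush();
}
```

On the JavaScript side, a DataView over the header plus Float32Array/Uint8Array/Uint16Array views over the rest of the bytes is basically all the decoding there is.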
For me this experiment was really interesting and beautiful: it made me start to really understand the point of the new buffered writer interface, and even of Unix itself!
It would be interesting to know if and how much this approach reduces overhead compared to thousands of WASM=>JS calls (e.g. according to the linked article, a hundred million WASM=>JS calls take about half a second, which means 5 nanoseconds per call, or 10 clock cycles at 2 GHz; apologies in advance if my math is wrong).
But from my own experience with WebGL2 and WebGPU: a few thousand WASM=>JS calls per frame at 60 or 120 Hz are really nothing compared to the time that’s actually spent inside those calls (e.g. in the WebGL2 or WebGPU implementation code).
Whatever feels right about the buffering approach may not come to fruition in the canvas “backend”, but for rendering to something like WebGPU you would definitely want to gather commands into a buffer, right?
It’s also easy to imagine compiling this to native code and then writing a separate program that simply reads command buffers from stdin and pipes them into some native canvas or GPU API!
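A rough sketch of what that separate program’s read loop could look like, reusing the hypothetical `Header`/`Point`/`Color`/`Command` layout sketched above (again pre-0.15 std APIs, with the actual drawing dispatch left as a print):

```zig
const std = @import("std");

// Hypothetical native consumer: read frame packets from stdin until EOF and
// hand each one to whatever native drawing backend you have.
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const stdin = std.io.getStdIn().reader();
    while (true) {
        var header: Header = undefined;
        stdin.readNoEof(std.mem.asBytes(&header)) catch break; // EOF: we're done

        const points = try allocator.alloc(Point, header.points_len);
        defer allocator.free(points);
        try stdin.readNoEof(std.mem.sliceAsBytes(points));

        const colors = try allocator.alloc(Color, header.colors_len);
        defer allocator.free(colors);
        try stdin.readNoEof(std.mem.sliceAsBytes(colors));

        const commands = try allocator.alloc(Command, header.commands_len);
        defer allocator.free(commands);
        try stdin.readNoEof(std.mem.sliceAsBytes(commands));

        // ...this is where you'd replay the commands into a native canvas/GPU API.
        std.debug.print("frame: {} points, {} colors, {} commands\n", .{
            header.points_len, header.colors_len, header.commands_len,
        });
    }
}
```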
It depends, I guess. WebGPU has its own command buffer anyway, which is populated via WGPURenderPassEncoder calls. The difference would be doing those WGPURenderPassEncoder calls directly on the WASM side, where each individual call goes through the JS shim, vs buffering (maybe higher-level?) render commands into a memory buffer on the WASM side, passing that to the JS side, and then looping over those buffered commands and calling into the JS interface of WGPURenderPassEncoder.
I could imagine that the ‘custom buffer’ approach could be slightly faster, not because the WASM=>JS call overhead is high, but because you would have one tight encoding loop on the WASM side and another tight decoding loop on the JS side, and maybe those could be individually better optimized than the sort-of ‘vertical slice’ of crossing the WASM/JS boundary for each render command…
The question is just: is the performance difference big enough to justify the extra code?
PS: What would be really interesting is a WebGPU implementation where the entire WGPUCommandBuffer lives within the WASM side, along with all the CommandEncoders, so that the entire process of populating the CommandBuffer doesn’t leave WASM, and the entire CommandBuffer would be passed to the JS side. Interesting idea, but also extremely unlikely to ever happen.