I created the fastest WASM interpreter — WASMZ

Introduction

wasmz is a WebAssembly runtime written in Zig, designed to be fast, compact, and easy to embed. It implements a full-featured interpreter with support for modern WebAssembly proposals and WASI.

Benchmark

https://ray-d-song.github.io/wasmz/bench.html

I’ve also documented most of the performance optimizations done by Wasmz at the bottom of this page.

Real-World Testing

wasmz passes real-world WebAssembly module tests:

  • esbuild - JavaScript bundler compiled to WASM

  • QuickJS - Lightweight JavaScript engine compiled to WASM

  • SQLite - Database engine compiled to WASM

Motivation

Existing mainstream WASM interpreters (pure interpreters, not heavyweights like wasmtime that include AOT and JIT) lack support for modern features such as the exception-handling and GC proposals, making it difficult to run my other project, which compiles Kotlin to Wasm.

At the same time, I want to build something to verify what I’ve learned from studying Lua and Ruby source code. Wasm is a very suitable target, and Zig is a handy tool.

Initially, I tried C3 and Go. Honestly, I don’t like C3’s syntax, both have spotty support on some niche platforms, and Go is hard to run on bare metal.

This is my first Zig program, so there are many rough parts. Sorry about that—this is also what I’ll mainly polish in the next stage.

If you find workloads where wasmz has a clear disadvantage, please let me know. I will do my best to optimize them :slight_smile:

25 Likes

Oh this is really nice. I like the wasmi crate (honestly mostly because it can be compiled to wasm itself, like it’s used in Typst) but for whatever reason it still has not implemented the GC spec. (No longer a proposal!)

I might actually use this implementation as a reference to add GC to wasmi if the architecture is close enough.

Edit: here is some of my playing around with wasmi. I might repeat this exercise with zig now.

1 Like

You did the GC extensions :slight_smile: I was hoping someone would. Great project, thanks.

Welcome to Ziggit!

2 Likes

This is cool. Register machines rule!

I don’t know much about WASM, but I’m surprised that most of them seem to be stack machines.

If you want some feedback (from someone who doesn’t know much about WASM) - it’s not clear from the project (and your benchmark) how you got the .wasm module.

Where/how do you compile fib.c to fib.wasm - I’m assuming you just used something like emscripten?

I may use some of this in the future - thanks!

1 Like

LLVM has a wasm target, so that’s how you get from fib.c to fib.wasm. Emscripten and WASI are (as far as I understand them) just definitions of the API surface between the wasm module and the host (in this case, WASMZ).

It would be rude to post the link twice in as many comments, but my article linked in my previous comment shows a very simple example of creating a wasm module in rust. Zig is just as simple. I am not familiar with C toolchains.

2 Likes

Very pleasantly surprised it wasn’t just LLM dumped code. Nice work!

1 Like

Just as asibahi said, you need emscripten and wasi-sdk.

From my experience, Rust and Zig natively support Wasm as a target platform, so compilation is the easiest, while other languages are very troublesome…

Great article! Compiling Wasmz itself to Wasm and running it in the browser is also my goal for the next phase. Currently, there are couplings between the various parts, and I’ve realized this isn’t good design…

By the way, I initially planned to use Zola for Wasmz’s documentation, but I couldn’t find a suitable theme for documentation (easydocs felt a bit too fancy for me…), so I used mdBook instead. When I have free time, I plan to contribute a Zola documentation theme similar to mdBook’s and migrate to it.

2 Likes

Wasmi author here.

@Ray-D-Song very impressive work! I am always looking forward to seeing new efficient Wasm interpreters. Have you seen makepad-stitch or silverfir-nano’s interpreter?

The benchmarks look well made and realistic - although, in all honesty, too small to actually crown a “fastest” interpreter. Both makepad-stitch and silverfir-nano claim to be faster than Wasm3, yet are not compared. I am actually a bit surprised that Wasmi performs so well on x86_64, since I have not (yet) optimized it there. Wasmi is also missing the accumulator-based interpreter architecture (your fp0 and r0); I am currently working on this.

The lazy approach behind your GC implementation is certainly a very good idea and will likely influence Wasmi once I build support for the Wasm GC proposal. Unfortunately, Wasmi required a lot of internal work to adopt a more efficient and flexible architecture, so I haven’t had the time to focus on implementing the missing Wasm proposals.

One minor nit:

Inspired by the Wasm3 M3 architecture [..]

I would advise against using the “M3” name, since Wasm3’s interpreter architecture was never a new invention but was known and used for decades prior to its creation. The Wasm3 developers simply didn’t know those techniques and “re-invented” them. They are called the (direct or indirect) “threaded-code” architecture for the tail-call or computed-goto based instruction dispatch, and the “accumulator-based interpreter architecture” for keeping state in registers (your fp0 and r0).

4 Likes

Welcome to Ziggit @Herobird!

It’s true that threaded code is not original to Wasm3, but originality isn’t required for it to be a source of inspiration. They call their backend “M3”; it’s all over the source code.

I remember reading a post from the wasm3 author about M3, but I didn’t find it just now when I looked. I don’t recall that he made any claims to have invented the fundamental technique, although he may have and I just forgot: my recollection is simply the claim that it’s better and faster than naïve switching, and that’s probably true.

Not much new under the sun. It’s also interesting to ponder how authentic reinvention should be presented: it’s nowhere near as easy as just looking something up, after all. Almost a matter of luck, really.

I do agree that “fastest gun in the west” is a claim which should be copiously backed by evidence before it’s made. I’d say burntsushi sets the standard here: he makes absolutely sure the audience is convinced by the claim.

I found the article btw. He does say “novel architecture”, and it now has a prior art section at the bottom talking about work he was unaware of which is appreciably similar, at least.

That seems fine to me. “novel” claims less than, say, “fastest”.

Thank you for your friendly reply!

When using “fastest” in the title, I was actually very apprehensive. Honestly, before running the benchmarks, I didn’t expect the results to be so good… It started as a fun side project spun off from another project of mine, just taken a bit more seriously.
I know engineers should be rigorous, but in the end, when writing the title, the devil inside me won. :weary_face:
During my May vacation, I will add more benchmarks and refactor part of the code.

Regarding the M3 architecture: I actually mention in the bench section of the documentation that the Direct Threaded Code Dispatch optimization comes from this article: Squeezing a Little More Performance Out of Bytecode Interpreters · Stefan-Marr.de. However, I initially didn’t implement the accumulator register optimization (that is, there was no r0 or fp0), and that part was taken from wasm3. You’re right; once I’ve read up on the prior art, I’ll correct the terminology.

By the way, while writing wasmz I read many Wasm implementations extensively, and Wasmi’s engineering is the best I’ve seen. Although I haven’t written much Rust, the project’s multi-layered architecture and well-encapsulated macros made your pursuit of quality evident.

But I have a question about Wasmi’s binary size. While reading its code, I also looked for ways to significantly reduce it. In fact, I forked the project and ran some small experiments, but none were satisfactory. Do you have any ideas on this?

5 Likes

Hi and sorry about the delay for this reply.

I brought up the “M3” critique because I think using the term doesn’t honor the actual inventors of this technique from decades earlier. To some people it might even come across as ignorance of their research. The more the “M3” term is used, the more people will reuse it, so I felt it was my duty to intervene where possible. I was and still am unsure whether it is appropriate to “lecture” people about this, to be honest. I am sorry if it wasn’t.

I know engineers should be rigorous enough, but in the end, when writing the title, the devil inside me won. :weary_face:

When I was writing the article about Wasmi v0.32 some time ago, the hardest part for me was coming up with a proper title that is not clickbait but still sparks interest in the topic.

By the way, while writing, I extensively read many Wasm implementations, and Wasmi’s engineering is the best I’ve seen.

Thank you for the kind words. I am really interested in the Zig programming language and really wanted to dig into a Zig project to learn more about it and yours is very much on top of my list now. :slight_smile:

But I have a question: about Wasmi’s binary size. While reading its code, I also pondered ways to significantly reduce its size. In fact, I forked it and made some small experiments, but none were satisfactory. Do you have any ideas on this?

I have also wondered a bit about this, but Rust is unfortunately known for its somewhat bloaty codegen. As a Rust developer you really have to be careful about this, and for the longest time Wasmi development was focused on other things. However, it also heavily depends on the optimization and compiler settings. Additionally, Wasmi can be modularized a lot; e.g. you can disable its WASI support, which should get rid of a huge chunk since it is a very heavyweight Wasmtime dependency.

Furthermore, Wasmi’s SIMD support takes a whole lot of space but can also be disabled.
Finally, Wasmi’s CLI argument parser uses clap which again is known to be pretty fast and featureful but also bloaty.

There are some more or less unstable compiler flags which can drastically reduce binary artifact size by e.g. not using Rust’s internal formatting facilities (which are known to be bloaty) or using a simpler panic infrastructure (plain abort) etc.
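For concreteness, settings of this kind might look like the following Cargo profile and nightly-only build-std invocation. The profile name is hypothetical and the flags come from Cargo/rustc documentation, not from Wasmi’s actual setup:

```toml
# Hypothetical size-focused profile (not Wasmi's actual configuration).
[profile.tiny]
inherits = "release"
opt-level = "z"     # optimize aggressively for size
lto = "fat"
codegen-units = 1
panic = "abort"     # simpler panic infrastructure: plain abort
strip = true        # strip symbols from the final artifact

# Rebuilding std without the formatting-heavy panic machinery requires
# a nightly toolchain:
#   cargo +nightly build --profile tiny --target <triple> \
#       -Z build-std=std,panic_abort \
#       -Z build-std-features=panic_immediate_abort
```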

However, I assume the reason is that Wasmi simply uses a lot of generics which are monomorphized during compilation in a way that might be efficient for runtime performance but somewhat bloaty for binary artifact size.

IIRC I was able to get Wasmi’s CLI application stripped down to ~1.5-2 MB.

2 Likes

Hi and sorry for bringing up this kinda old topic. However, I wanted to check the artifact size of Wasmi myself.

  • cargo build -p wasmi_cli --profile bench yields a 5.2MB binary
  • cargo build -p wasmi_cli --profile bench --no-default-features -F run yields a 2.0MB binary
  • cargo build -p wasmi_cli --profile size --no-default-features -F run yields a 1.9MB binary

Where

[profile.bench]
lto = "fat"
codegen-units = 1

[profile.size]
inherits = "release"
lto = "fat"
codegen-units = 1
opt-level = "s"

Note that --no-default-features disables features such as WASI, SIMD, and .wat/.wast support.

Built on an Apple MacBook Pro (M2 Pro).

With some more tricks it is possible to bring down these numbers to ~1.2MB.
That’s still too big, and I intend to bring these numbers down further.

Thank you very much for being willing to continue this conversation with me. I recently started my May holiday, so I’ve returned to optimizing wasmz.

One of the optimizations is to provide more and finer-grained compilation flags, similar to wasmi.

Regarding the --no-default-features option you mentioned, my view is that most people don’t need WAT format support, so this option could be disabled by default.

If we don’t approach it from the angle of stripping features (for example, replacing hardware SIMD with software emulation), then the remaining entry points for reducing size are using unsafe Rust, or going a step further and implementing some functionality directly in C (though that seems counterproductive).

In fact, after I tried implementing some features of wasmz using C, the binary size was reduced by 120k, which surprised me a lot. It even made me start considering whether I should adopt C as the primary language for implementing certain modules.

Going further, I began to wonder whether there is a general method for optimizing binary size in compiled languages with generics, like Rust and Zig. Another Rust project of mine (providing PG protocol support for DuckDB) currently produces an astonishing 80M executable.

Sorry, my reply is a bit scattered. In the past I have mostly used languages like Go, which have a heavier runtime and rely less on generics, so I don’t have much experience in this area. If you have any practical experience with this kind of optimization (reducing the binary size of compiled languages with generics), please let me know.

1 Like