I created the fastest WASM interpreter — WASMZ

Introduction

wasmz is a WebAssembly runtime written in Zig, designed to be fast, compact, and easy to embed. It implements a full-featured interpreter with support for modern WebAssembly proposals and WASI.

Benchmark

https://ray-d-song.github.io/wasmz/bench.html

I’ve also documented most of the performance optimizations done by Wasmz at the bottom of this page.

Real-World Testing

wasmz passes real-world WebAssembly module tests:

  • esbuild - JavaScript bundler compiled to WASM

  • QuickJS - Lightweight JavaScript engine compiled to WASM

  • SQLite - Database engine compiled to WASM

Motivation

Existing mainstream WASM interpreters (not heavyweights like wasmtime that include AOT and JIT, but pure interpreters) lack support for modern features (new exception and GC proposals), making it difficult to run my other project that compiles Kotlin to Wasm.

At the same time, I want to build something to verify what I’ve learned from studying Lua and Ruby source code. Wasm is a very suitable target, and Zig is a handy tool.

Initially, I tried C3 and Go. Honestly, I don’t like C3’s syntax, and it has support issues on some niche platforms, while Go is hard to run on bare metal.

This is my first Zig program, so there are many rough parts. Sorry about that—this is also what I’ll mainly polish in the next stage.

If you find workloads where wasmz has a clear disadvantage, please let me know. I will do my best to optimize them :slight_smile:

32 Likes

Oh this is really nice. I like the wasmi crate (honestly mostly because it can be compiled to wasm itself, like it’s used in Typst) but for whatever reason it still has not implemented the GC spec. (No longer a proposal!)

I might actually use this implementation as a reference to add GC to wasmi if the architecture is close enough.

Edit: here is some of my playing around with wasmi. I might repeat this exercise with zig now.

1 Like

You did the GC extensions :slight_smile: I was hoping someone would. Great project, thanks.

Welcome to Ziggit!

4 Likes

This is cool. Register machines rule!

I don’t know much about WASM, but I’m surprised that most of them seem to be stack machines.

If you want some feedback (from someone who doesn’t know much about WASM) - it’s not clear from the project (and your benchmark) how you got the .wasm module.

Where/how do you compile fib.c to fib.wasm - I’m assuming you just used something like emscripten?

I may use some of this in the future - thanks!

1 Like

LLVM has a wasm target so that’s how you get fib.c to fib.wasm . Emscripten and WASI are (as far as I understand them) just a definition of the API surface between the wasm module and the host (in this case, WASMZ).

It would be rude to post the link twice in as many comments, but my article linked in my previous comment shows a very simple example of creating a wasm module in rust. Zig is just as simple. I am not familiar with C toolchains.
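For readers wondering what the source side of such a module looks like, here is a minimal sketch in Rust rather than C. The function name `fib` only mirrors the fib.c benchmark discussed above; it is an assumption for illustration and not code from the wasmz repository.

```rust
// Hypothetical example module, not taken from wasmz or its benchmarks.

/// Naive recursive Fibonacci, exported under the unmangled symbol name "fib"
/// with the C ABI so that a Wasm host can look it up by name.
/// (In the Rust 2024 edition this attribute is spelled `#[unsafe(no_mangle)]`.)
#[no_mangle]
pub extern "C" fn fib(n: u32) -> u64 {
    if n < 2 {
        n as u64
    } else {
        fib(n - 1) + fib(n - 2)
    }
}
```

Built as a `cdylib` for the `wasm32-unknown-unknown` target (for example with `cargo build --release --target wasm32-unknown-unknown`), this should produce a .wasm file exporting `fib`. No WASI or Emscripten imports are involved because the function never touches the host.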

2 Likes

Very pleasantly surprised it wasn’t just LLM dumped code. Nice work!

1 Like

Just as asibahi said, you need emscripten and wasi-sdk.

From my experience, Rust and Zig natively support Wasm as a target platform, so compilation is the easiest, while other languages are very troublesome…

Great article! Compiling Wasmz itself to Wasm and running it in the browser is also my goal for the next phase. Currently, there are couplings between the various parts, and I’ve realized this isn’t good design…

By the way, I initially planned to use Zola for Wasmz’s documentation, but I couldn’t find a suitable theme for documentation (easydocs felt a bit too fancy for me…), so I used mdBook instead. When I have free time, I plan to contribute a documentation theme similar to mdBook’s and migrate to it.

3 Likes

Wasmi author here.

@Ray-D-Song very impressive work! I always am looking forward to seeing new efficient Wasm interpreters. Have you seen makepad-stitch or silverfir-nano’s interpreter?

The benchmarks look well made and realistic - although too small to actually conclude a “fastest” interpreter in all honesty. Both makepad-stitch and silverfir-nano claim to be faster than Wasm3, yet are not compared. I am actually a bit surprised that Wasmi performs so well on x86_64 since I have not (yet) optimized it there. Also it is missing the accumulator-based interpreter architecture (your fp0 and r0) - currently working on this.

The laziness behind your GC implementation is certainly a very good idea which will likely influence Wasmi once I am building the Wasm gc proposal support. Unfortunately, Wasmi required a lot of internal work to use a more efficient and flexible architecture so I haven’t had the time to focus on implementing the missing Wasm proposals.

One minor nit:

Inspired by the Wasm3 M3 architecture [..]

I would advise against using the “M3” name since Wasm3’s interpreter architecture was never a new invention but known and used for decades prior to its creation. The Wasm3 developers simply didn’t know those techniques and “re-invented” them. Those techniques are called (direct or indirect) “threaded-code” architecture for the tail-call or computed-goto based instruction dispatch and “accumulator-based interpreter architecture” for storing state in registers (your fp0 and r0).
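To make the two terms concrete, here is a rough sketch of a toy interpreter, written in Rust. Everything in it (`Vm`, `Op`, the handler names) is hypothetical and not code from wasmz, Wasmi or Wasm3. Stable Rust has neither guaranteed tail calls nor computed goto, so the sketch approximates threaded code with a small loop over handler function pointers. A real threaded-code interpreter would let each handler jump straight to the next one instead of returning to a loop.

```rust
// Hypothetical toy VM for illustration only.

struct Vm {
    r0: i64,         // accumulator: holds the most recent result
    slots: [i64; 4], // small register file ("locals")
    pc: usize,       // index of the next instruction
    halted: bool,
}

type Handler = fn(&mut Vm, i64);

/// Each instruction stores a pointer to its handler, so dispatch is one
/// indirect call instead of a `match` over opcodes.
#[derive(Clone, Copy)]
struct Op {
    handler: Handler,
    arg: i64,
}

fn load_const(vm: &mut Vm, arg: i64) {
    vm.r0 = arg;
    vm.pc += 1;
}

fn add_slot(vm: &mut Vm, arg: i64) {
    // Reads one operand from the accumulator and writes the result back to it.
    vm.r0 += vm.slots[arg as usize];
    vm.pc += 1;
}

fn store_slot(vm: &mut Vm, arg: i64) {
    vm.slots[arg as usize] = vm.r0;
    vm.pc += 1;
}

fn halt(vm: &mut Vm, _arg: i64) {
    vm.halted = true;
}

/// Runs the program and returns the final accumulator value.
fn run(code: &[Op]) -> i64 {
    let mut vm = Vm { r0: 0, slots: [0; 4], pc: 0, halted: false };
    while !vm.halted {
        let op = code[vm.pc];
        (op.handler)(&mut vm, op.arg);
    }
    vm.r0
}

/// Computes 40 + 2 through the accumulator.
fn demo() -> i64 {
    let code = [
        Op { handler: load_const, arg: 40 },
        Op { handler: store_slot, arg: 0 },
        Op { handler: load_const, arg: 2 },
        Op { handler: add_slot, arg: 0 },
        Op { handler: halt, arg: 0 },
    ];
    run(&code)
}
```

The point of the accumulator is that `add_slot` needs only one explicit operand: the other input and the result both live in `r0`, which a compiler can keep in a machine register across handlers.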

4 Likes

Welcome to Ziggit @Herobird!

It’s true that threaded code is not original to Wasm3, but I don’t think that has to be true in order to have drawn inspiration from it. They call their backend “M3”, it’s all over the source code.

I remember reading a post from the wasm3 author about M3, but I didn’t find it just now when I looked. I don’t recall that he made any claims to have invented the fundamental technique, although he may have and I just forgot: my recollection is simply the claim that it’s better and faster than naïve switching, and that’s probably true.

Not much new under the sun. It’s also interesting to ponder how authentic reinvention should be presented: it’s nowhere near as easy as just looking something up, after all. Almost a matter of luck, really.

I do agree that “fastest gun in the west” is a claim which should be copiously backed by evidence before it’s made. I’d say burntsushi sets the standard here: he makes absolutely sure the audience is convinced by the claim.

1 Like

I found the article btw. He does say “novel architecture”, and it now has a prior art section at the bottom talking about work he was unaware of which is appreciably similar, at least.

That seems fine to me. “novel” claims less than, say, “fastest”.

Thank you for your friendly reply!

When using “fastest” as the title, I was actually very apprehensive. Honestly, before running the benchmarks, I didn’t expect the results to be so good… It was just a fun side project from another one of my businesses, just a bit more serious.
I know engineers should be rigorous enough, but in the end, when writing the title, the devil inside me won. :weary_face:
During my May vacation, I will add more benchmarks and refactor part of the code.

Regarding the M3 architecture: in the bench section of the documentation I mention that the Direct Threaded Code Dispatch optimization comes from the article Squeezing a Little More Performance Out of Bytecode Interpreters · Stefan-Marr.de. However, I initially didn’t implement the accumulator register optimization (that is, there was no r0 or fp0); that part was borrowed from wasm3. You’re right; once I have read up on the relevant background, I’ll consider making the correction.

By the way, while writing, I extensively read many Wasm implementations, and Wasmi’s engineering is the best I’ve seen. Although I haven’t written much Rust, the project’s multi-layered architecture and well-encapsulated macros made me feel your pursuit of quality.

But I have a question: about Wasmi’s binary size. While reading its code, I also pondered ways to significantly reduce its size. In fact, I forked it and made some small experiments, but none were satisfactory. Do you have any ideas on this?

5 Likes

Hi and sorry about the delay for this reply.

I just told you about the “M3” critique because I think using the term does not honor the actual inventors of this technique, who came up with it decades earlier. To some people it might even be perceived as ignorance of their research. The more the “M3” term is used, the more people might reuse it, so I thought it was my duty to intervene where possible. To be honest, I was and still am unsure whether it is appropriate to “lecture” people about this. I am sorry if it wasn’t.

I know engineers should be rigorous enough, but in the end, when writing the title, the devil inside me won. :weary_face:

When I was writing the article about Wasmi v0.32 some time ago, the hardest part for me was coming up with a proper title that is not clickbait but still sparks interest in the topic.

By the way, while writing, I extensively read many Wasm implementations, and Wasmi’s engineering is the best I’ve seen.

Thank you for the kind words. I am really interested in the Zig programming language and really wanted to dig into a Zig project to learn more about it and yours is very much on top of my list now. :slight_smile:

But I have a question: about Wasmi’s binary size. While reading its code, I also pondered ways to significantly reduce its size. In fact, I forked it and made some small experiments, but none were satisfactory. Do you have any ideas on this?

I also wondered a bit about this, but Rust is unfortunately known for its somewhat bloaty codegen. As a Rust developer you really have to be careful about this, and for the longest time Wasmi development was focused on other things. However, it also heavily depends on the optimization and compiler settings. Additionally, Wasmi can be modularized a lot, e.g. you can disable its WASI support, which should get rid of a huge chunk since it relies on a very heavyweight Wasmtime dependency.

Furthermore, Wasmi’s SIMD support takes a whole lot of space but can also be disabled.
Finally, Wasmi’s CLI argument parser uses clap which again is known to be pretty fast and featureful but also bloaty.

There are some more or less unstable compiler flags which can drastically reduce binary artifact size by e.g. not using Rust’s internal formatting facilities (which are known to be bloaty) or using a simpler panic infrastructure (plain abort) etc.

However, I assume the reason is that Wasmi simply uses a lot of generics which are monomorphized during compilation in a way that might be efficient for runtime performance but somewhat bloaty for binary artifact size.

IIRC I was able to get Wasmi’s CLI application stripped down to ~1.5-2 MB.

2 Likes

Hi and sorry for bringing up this kinda old topic. However, I wanted to check the artifact size of Wasmi myself.

  • cargo build -p wasmi_cli --profile bench yields a 5.2MB binary
  • cargo build -p wasmi_cli --profile bench --no-default-features -F run yields a 2.0MB binary
  • cargo build -p wasmi_cli --profile size --no-default-features -F run yields a 1.9MB binary

Where

[profile.bench]
lto = "fat"
codegen-units = 1

[profile.size]
inherits = "release"
lto = "fat"
codegen-units = 1
opt-level = "s"

Note that --no-default-features disables features such as WASI, SIMD, and .wat and .wast support.

Built on an Apple MacBook (M2 Pro).

With some more tricks it is possible to bring these numbers down to ~1.2MB.
That’s still too big, and I am sure these numbers can be brought down further.

Thank you very much for being willing to continue this conversation with me. I recently started my May holiday, so I’ve returned to optimizing wasmz.

One of the optimizations is to provide more and finer-grained compilation flags, similar to wasmi.

Regarding the --no-default-features option you mentioned, my view is that most people don’t need WAT format support, so this option could be disabled by default.

Leaving aside approaches that strip or replace features (for example, using software emulation in place of hardware SIMD), the remaining options for reducing size are: using unsafe Rust, or going a step further and implementing some functionality directly in C (though that seems counterproductive).

In fact, after I tried implementing some features of wasmz using C, the binary size was reduced by 120k, which surprised me a lot. It even made me start considering whether I should adopt C as the primary language for implementing certain modules.

Going further, I began to think about whether there is a method to optimize the binary size for compiled generic languages like Rust and Zig. Another Rust project of mine (providing PG protocol support for DuckDB) currently produces an astonishing 80M executable file.

Sorry, my reply is a bit scattered. In the past, I have been using languages like Golang, which have a heavier runtime and rely less on generics, so I don’t have much experience in this area. If you have any practical experience with this kind of optimization (reducing the binary size of compiled generic languages), please let me know.

1 Like

@Ray-D-Song hi and sorry for the late reply.

[..] my view is that most people don’t need WAT format support, [..]

There are mainly two groups of people: the ones using a Wasm interpreter as a library component and the ones using your Wasm interpreter as a CLI application. The former likely use only .wasm; the latter likely use both the .wasm and .wat (and maybe even .wast) formats.

[..] using software emulation to simulate hardware SIMD functionality [..]

Concerning Wasm SIMD support, there are again two different approaches. Wasmi, for example, implemented its SIMD internals using unsafe Rust so that the Rust/LLVM codegen can generate SIMD instructions where possible. In contrast, tinywasm conditionally uses unsafe SIMD intrinsics.
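As a point of comparison, a plain software emulation of a single SIMD operation could look roughly like the sketch below. It is not how Wasmi or tinywasm implement SIMD. The helper names `lanes` and `pack` are made up for illustration, and the only thing taken from Wasm itself is the behavior of `i32x4.add`: four little-endian i32 lanes added with wrap-around.

```rust
// Hypothetical scalar emulation of one Wasm SIMD instruction.

/// Splits a 128-bit vector into four little-endian i32 lanes.
fn lanes(v: [u8; 16]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for (i, lane) in out.iter_mut().enumerate() {
        *lane = i32::from_le_bytes([v[4 * i], v[4 * i + 1], v[4 * i + 2], v[4 * i + 3]]);
    }
    out
}

/// Packs four i32 lanes back into a 128-bit vector.
fn pack(l: [i32; 4]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (i, lane) in l.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&lane.to_le_bytes());
    }
    out
}

/// Lane-wise wrapping addition, matching the semantics of `i32x4.add`.
fn i32x4_add(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    let (x, y) = (lanes(a), lanes(b));
    let mut sum = [0i32; 4];
    for i in 0..4 {
        sum[i] = x[i].wrapping_add(y[i]);
    }
    pack(sum)
}
```

Code in this style contains no unsafe and no intrinsics, so whether it turns into real SIMD instructions is entirely up to the optimizer. That is the trade-off between the approaches described above.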

[..] I should adopt C as the primary language for implementing certain modules [..]

Despite Wasmi’s really bad artifact binary size, I still think it is generally possible to achieve artifact binary sizes comparable to C. However, given Rust’s variety of language features, one has to be more careful. In the past, Wasmi was primarily optimized for runtime performance, so artifact binary size was neither measured nor controlled, which allowed it to explode. There are some more or less well-known techniques to prevent bloating the binary in Rust.

For example, the panic and formatting infrastructure in Rust regularly turns out to be an offender, as panics with internal formatting are often used to signal unreachable or bad state in a program. Given that generics in languages like C++ and Rust are monomorphized, a single panic with a formatted string output can easily end up bloating the binary when there are many instances of this generic code.

Since LLVM’s optimizer currently fails to merge common paths of monomorphized generic code, it is still important for programmers to do that painful job themselves. Below is an example from a recent code change in Wasmi:

#[cold]
#[inline]
fn unsupported_operand_pair(lhs: impl AsRef<Operand>, rhs: impl AsRef<Operand>) -> ! {
    #[inline(never)]
    fn impl_(lhs: &Operand, rhs: &Operand) -> ! {
        unreachable!("unsupported operator pair: lhs = {lhs:?}, rhs = {rhs:?}")
    }
    let lhs = lhs.as_ref();
    let rhs = rhs.as_ref();
    impl_(lhs, rhs)
}

The #[cold] attribute tells the compiler that this path is unlikely to be taken and not important for performance. The bloaty part is what happens inside the impl_ function, since it contains a formatted panic (unreachable!). We explicitly put #[inline(never)] on it to tell the compiler not to embed it into the surrounding code, because it is used by a generic routine that has lots of instances. That way we make sure to have only one copy of this code fragment in the final binary despite having tons of monomorphized call sites. We put #[inline] on the outer unsupported_operand_pair function since its body is likely just a no-op besides the call to impl_.

For Wasmi, a friend of mine and I designed a tool to show where the bloat is located in a Rust binary. For example, we compiled Wasmi using: cargo bloat --message-format=json -p wasmi_cli --profile bench -n 0 -w > out.json to get the JSON output which we fed to:

And this is the result:

1 Like

Looks like getting rid of clap would be an easy win.

IIRC, the og author of cargo bloat, RazrFalcon, made a binary size oriented parser called pico-args.

1 Like

@asibahi yes that could be a win for Wasmi’s binary size.

There were attempts to replace clap in the past:

The issue is that it is always a trade-off. clap is heavyweight for usability reasons.
For example, it automatically generates the --help output (and even a different one for -h) for the app and all its subcommands, keeping everything in sync automatically.
Furthermore, thanks to proc macros one can specify the entire CLI in a single struct, which is unmatched for maintainability.

The real question is: how important is it to optimize the binary size of a CLI application?
Usually, users who have strict requirements (e.g. for embedded) use Wasmi as a library which drops the whole clap question entirely.

According to:

Key takeaways:

  • The main offender to artifact binary size is Wasmi’s complex translator and wasmparser’s validation logic.
  • One could introduce a crate feature to disable Wasm validation support in Wasmi - useful if for example a user knows in advance that all inputs are pre-validated.
  • Also you can easily spot the aforementioned panic and formatting bloat, which can be handled to some extent as described above.
  • Another idea is to introduce (de)serialization of Wasmi’s internal IR and drop its Wasm translation logic entirely. This is how tinywasm and Wasmtime achieve good artifact binary size wins compared to Wasmi at the moment.
  • Finally, of course, refactoring generic code should help, but this is a lengthy, labor-intensive process with few really low-hanging fruits.
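To illustrate the IR (de)serialization idea from the list above, here is a minimal sketch. The `Instr` enum and its byte encoding are invented for this example and have nothing to do with Wasmi's actual internal IR. The idea is that a runtime shipping only `decode` would not need the Wasm-to-IR translator in its binary.

```rust
// Hypothetical three-instruction IR with a made-up byte encoding.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Instr {
    Const(i32),
    Add,
    Return,
}

/// Serializes the IR: one tag byte per instruction, followed by a
/// little-endian i32 immediate for `Const`.
fn encode(code: &[Instr]) -> Vec<u8> {
    let mut out = Vec::new();
    for instr in code {
        match *instr {
            Instr::Const(v) => {
                out.push(0);
                out.extend_from_slice(&v.to_le_bytes());
            }
            Instr::Add => out.push(1),
            Instr::Return => out.push(2),
        }
    }
    out
}

/// Deserializes the IR, returning `None` on an unknown tag or truncated input.
fn decode(bytes: &[u8]) -> Option<Vec<Instr>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            0 => {
                let raw = bytes.get(i + 1..i + 5)?;
                out.push(Instr::Const(i32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]])));
                i += 5;
            }
            1 => {
                out.push(Instr::Add);
                i += 1;
            }
            2 => {
                out.push(Instr::Return);
                i += 1;
            }
            _ => return None,
        }
    }
    Some(out)
}
```

In such a setup the translation and validation would run ahead of time on a development machine, and only the serialized bytes plus a decoder like this would be deployed.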
1 Like

Hey, Wasm3 (co-)author here.

First of all, VERY nice work. It is always great to see new interpreter implementations pushing the space forward, especially ones that are compact and embeddable rather than immediately jumping into JIT/AOT.

On the “M3” / threaded-code / tail-call architecture topic

I completely agree that the underlying ideas are much older than Wasm3. Wasm3 did not invent threaded interpreters, accumulator-style state, or tail-call-oriented dispatch. But I do think Wasm3 helped popularize this particular shape of interpreter architecture in the WebAssembly world, and made it more visible to people building small, fast, portable Wasm runtimes.

One thing I’m personally happy about is that Wasm3 created some healthy competition in the interpreter space. Before that, many people treated “Wasm interpreter” (and interpreters in general) as something that would naturally be slow or incomplete, and the serious performance discussion was mostly around JITs and AOT engines.

Wasm3 showed that an interpreter could be quite competitive, while staying small enough to run in places where the bigger runtimes simply do not fit. It also demonstrated very fast cold-start behavior, which matters a lot for embedded, FaaS and short-lived execution scenarios.

WAMR is a good example here. Its original interpreter was rather slow, and at the time the official position was essentially that they were not going to invest much effort into interpreter speed optimizations. Later, the WAMR team changed course and implemented the “fast interpreter”, which they clearly indicate is inspired by Wasm3 architecture.


P.S. split into multiple posts, as I needed to use more than 2 links.

10 Likes

Main focus and achievement of Wasm3

Running WebAssembly on tiny MCUs and deeply embedded targets. Not “fastest interpreter on every benchmark”, but making Wasm practical on devices with very limited RAM, flash, toolchains, and operating-system support. A lot of the design tradeoffs came from that world.

Wasm3 also helped push some ecosystem discussions around embedded Wasm, including the custom page size proposal. The default 64 KiB Wasm page size is very natural for desktop/server engines, but it is quite painful on small microcontrollers. Having a way to formulate that problem clearly for the spec/community was, in my opinion, one of the more important outcomes of the project.

And yes, I’m also glad the tail-call / TCO-style interpreter approach got more attention afterwards. Josh Haberman’s “Parsing protobuf at 2+GB/s” article mentions Wasm3 specifically, and that line of work later influenced broader discussion around efficient interpreters, including improvements in the Python interpreter. That is probably one of the nicest forms of impact an open-source project can have: not only being useful itself, but also giving other people a concrete reference point for improving their own runtimes.

So, from my side, I would not frame this as “Wasm3 invented fast interpreters”. It did not.
I think it is fair to say that Wasm3 helped bring this family of techniques into the modern Wasm interpreter conversation, especially for constrained and embedded environments.

4 Likes