Bun’s Zig fork got 4x faster compilation times

I think it was foolish of me not to consider that these enhancements were already planned. It seems like I didn’t have the trust that the Zig team is due.

9 Likes

I actually find it telling in this case that the Bun devs don’t feel like putting in whatever extra time it would take to vet/edit their own code so that they could propose it for upstreaming into Zig and stand behind it as their own.

Perhaps they are aware that there is ongoing work in the Zig project which will achieve the same overall feature but in a more rigorous manner.

5 Likes

It doesn’t matter how much they vetted/edited it. It was produced with the help of AI and per Zig’s Code of Conduct it wouldn’t be welcomed.

3 Likes

I certainly wouldn’t mind if an AI would look at a bug report and automatically generate the exact steps to reproduce it.

5 Likes

I think beyond debug builds there are good reasons to pursuit a faster llvm-backend, particularly for release builds. I often need to make release builds when testing an optimization, and waiting two minutes (and then waiting another minute for benchmark results) every time I make a change is kind of annoying. It would be nice if we could get faster iteration speed for this use-case as well.

5 Likes

I also often have this problem(and even more so when working on a C++ or Rust codebase) such that I use a different workflow for this iterative optimization.

I have a script that basically just copies the file or entire directory, that I’m working on, into a new directory(name is a timestamp or given by an argument), compiles the code in the current state, and runs the benchmarks.

Then I can continue my train of thought and test other ideas I have and just repeatedly fire off new benchmark runs. Though one must ensure that only one runs at a time with a lock file or something. I often do this on a cluster so it just issues a new batch job and things can even run in parallel on different nodes. But if I do it locally, I make sure that I’m just in the tty running vim, such that the impact from me is minimal.

Then when some things are ready I can look at the results and see if it worth it or not.

1 Like

@Karim.MK Would you mark @mlugg’s writeup as a “solution” of this thread? To pin it for people just stumbling upon this.

2 Likes

The approach used by Bun here isn’t appropriate for release builds because it hinders optimization.

Fundamentally, the reason that splitting your bitcode across multiple LLVM modules works to speed up compilation is because each LLVM module is analyzed, optimized, and codegenned entirely independently. This means it is impossible for LLVM to optimize across module boundaries. In languages where there’s a clear boundary where optimization doesn’t matter so much (e.g. Rust crates) that might be fine, but Bun’s approach here is to split functions across their N LLVM modules based on a hash of the function name (resulting in an approximately even distributions of functions to modules), which makes the optimization boundaries quite literally pseudo-random (albeit deterministic, though their parallel Sema implementation will make it non-deterministic).

If, as in your case, you’re trying to test optimizations, you don’t want those optimizations to be randomly changing on you all the time!

The only case where that might sound okay is if the optimization you’re testing doesn’t affect your entire application, but only a small part of it. However, there’s a better solution in that case: only build the part of your application you’re actually trying to test! Thanks to Zig’s lazy analysis this is fairly straightforward (example implementation below). We use this when developing the Zig compiler itself—we can pass -Ddev=x86_64-linux to our zig build command to build a compiler which only supports the self-hosted x86_64 backend and ELF linkers, which results in a smaller binary which is faster to compile (especially if making an optimized build, IIRC it’s over 5x less time in LLVM Emit Object compared to a full build).

Example implementation of building subsets of your application
const enabled_features: std.enums.EnumSet(Feature) = .initMany(&.{
    foo, bar,
});

pub const Feature = enum {
    foo,
    bar,
    qux,

    pub fn check(f: Feature) if (enabled_features.contains(f)) void else noreturn {
        if (enabled_features.contains(f)) return;
        std.debug.panic("feature '{t}' is disabled in this build", .{f});
    }
};

// example usage:
fn doFoo() void {
    Feature.check(.foo);
    // code here will only be analyzed if `Feature.foo` is enabled
}
fn doSomeStuff(something: bool) void {
    if (something) {
        Feature.check(.bar);
        // code here will only be analyzed if `Feature.bar` is enabled
    } else {
        Feature.check(.qux);
        // code here will only be analyzed if `Feature.qux` is enabled
    }
}

In the simplest version of this, you just change the enabled_features definition in source when you want to compile a different feature set, but you can also make up some build system interaction if you find it useful!

In the Zig compiler, where the “feature” type is called dev.Feature, we decided to have another type dev.Env which is another enum specifying a few common configurations (i.e. sets of features) you might want to build with, and that dev.Env type is what we allow you to specify over zig build using -Ddev=.... Feel free to copy that idea if it sounds useful!

22 Likes

Two clarifications here (low confidence, as this is something I “know about”, rather than just “know”):

  • I believe that even with multiple LLVM codegen units, it still can do thin-lto across them. So its not only “everything in one pile” and “a bunch of neat independent piles”, there’s also a map-reduce middle ground, where codegen is mostly independent&parallel, but then also unit-summaries are computed, which allow for (limited) cross-unit inlining.
  • The “clear boundary” doesn’t really hold for rust, for two reasons:
    • First, while “crate” is a user-visible separate compilation unit, when compiling, a crate is split into multiple codegen-units for LLVM, and that split is not something the user has control of. It’s a bit smarter than just hashing the name of every function in a crate (Back-end parallelism in the Rust compiler | Nicholas Nethercote), but it is the same kind of thing.
    • Second even though crates are human-authored, they are often not particularly great boundaries for splitting codegen work, as its way to easy to end up with a non-inlinable trivial helper (Inline In Rust)

So, in Rust, we have a graph of crates (subdivided by human), and each crate is split heuristically by the compiler into a bunch of codegen units (CUs). Then, there are four modes of how we get from this partition to a binary:

  • no LTO whatsoever, the Debug build
  • Thin local LTO – CUs within the same crate are thin-LTO’ed against each other, crates are not LTOed, the default Release build
  • Thin LTO – everything is thin-LTO’ed, what should have been default Release build, because that’s the only sane compilation model for zero-cost abstraction language with separate compilation.
  • Full LTO – compilation step is morally just cating all the sources together, and the actual compiler is hiding inside the linker, compiling the project as just one big ball of code (requires tens of gigabytes of RAM).

EDIT: to pull in the key quote from Nicholas article (rust module == zig file):

Graph partitioning is an NP-hard problem. There are several common algorithms which are moderately complex to implement. rustc instead does something much simpler. It starts by simply creating one CGU per Rust module: every function in a module is put into the same CGU. Then, if the numbers of CGUs exceeds the limit (by default 16 for non-incremental builds and 256 for incremental builds) it repeatedly merges the two smallest CGUs until the limit is reached. This approach is simple, fast, and takes advantage of domain-specific knowledge in a useful way—program modules tend to provide good natural boundaries.

19 Likes

I believe that even with multiple LLVM codegen units, it still can do thin-lto across them.

Good point, I did neglect to mention that. Of course, this will necessarily come at the cost of some compilation speed—I don’t know much about thin-LTO so I’m not sure what the compilation speed cost is nor the optimization benefit. But you’re right to say that there’s a bit more nuance to this than I suggested!

First, while “crate” is a user-visible separate compilation unit, when compiling, a crate is split into multiple codegen-units for LLVM

Ah, I was unaware of that, interesting! Thanks for the link :grinning_face_with_smiling_eyes:

Second even though crates are human-authored, they are often not particularly great boundaries for splitting codegen work

Also a good point. I opted not to get into this because my first comment was already the length of a blog post, but I agree with pretty much everything you said.

The only thing I might push back on a little is:

[Thin LTO is] the only sane compilation model for zero-cost abstraction language with separate compilation

Perhaps I’m underestimating how effective thin LTO is, but to me, full LTO seems like the “correct” way to do this—fundamentally, I do not want to give up potentially significant optimization opportunities based on semi-arbitrary module splits! Solutions like thin LTO have always felt to me like workarounds for LLVM’s optimization framework not scaling very well. To be clear, I am in no way trying to suggest that it is easy to make an optimization framework which uses memory efficiently enough for full LTO to be a non-issue!—but I do feel (though with the recognition that I could be wrong, given that I’m entirely unfamiliar with the codebase) that a huge amount of LLVM’s memory usage could theoretically be eliminated with no impact on optimization.

Also, even if you do want to set arbitrary optimization boundaries, I think my biggest gripe is that the component which best knows where those boundaries would be best placed is LLVM itself! LLVM could start by running some efficient passes full-LTO-style, i.e. on the whole graph—maybe some inlining to eliminate small shared functions?—but after that initial work, it could consider the call graph and determine the best partition into smaller units which it would be reasonable to optimize independently (plus it can always do some thin-LTO-style inlining at the end if useful). As I’m sure you’re thinking, that is pretty much exactly what’s going on in the blog post you linked on back-end parallelism in Rust! Analyzing the function call graph to partition into sets of functions which are mostly independent is, at its core, optimization work, and I find it quite odd that right now LLVM expects every frontend to handle that itself somehow.

11 Likes

To push back against push-back, I would say that “full LTO” is definitionally incompatible with separate compilation — if you look at all the code before you start compiling (to machine code), you aren’t compiling separately.

But I think we are on the same page — the way I’d phrase this, the separate compilation comes from the days where even the code for your program didn’t fit in RAM and necessitated the use of overlays. Today, “source code fits in RAM” and “the resulting binary fits in RAM” are fair assumptions, and we can just make that work efficiently, instead of contorting the compiler to fit the shape of makefiles from 50 years ago.

EDIT: to clarify a bit more, I am primarily thinking here about lto="thin" codegen-units=1 Rust configuration, where we let go of the internal boundaries, and only cut at crates.

11 Likes

To push back against push-back, I would say that “full LTO” is definitionally incompatible with separate compilation

Oops, sorry, you’re completely right. For some reason I read “separate compilation” as talking about different files/crates/modules/whatever-s being handled in isolation by the compiler frontend, which is quite clearly not what that means! My apologies, no idea why I read it that way :stuck_out_tongue:

But I think we are on the same page

Indeed, I would agree with everything you just said.

6 Likes

I wonder, what sort of plan do they have for evolving past 0.15, if any. The compiler enhancements, on paper, wouldn’t provide additional friction, however they have also added things such as private fields, which would not be compatible obviously. The only way I can see in which this could work for the future is tech debt is nonexistent, maintain the language fork alongside Bun. Amazing plan given LLMs are godly machines which have no limitations and cost 0.00 to run.
Not that I have any reason to care, I’m not going to be the one maintaining the code or juggling bugs back and forward between my bespoke DSL and runtime, I just feel like it’s shortsighted even if we assume LLMs will evolve to be magical.
Best of luck to whoever has to maintain it.

3 Likes

Yeah, that sounds really inconsisten, but may you could you create a dependency graph of the functions (ignoring function pointers, but llvm can’t optimize much across function pointers anyways), and then split the graph into regions of the same code size, such that you cut the least number of edges. Doing that should keep the performance impact minimal, and since the dependency graph is usually like DAG, with only a few and usually small circles, the number of cuts you need should be pretty small.

In a game this is not so easy. E.g. I recently tested the performance of block updates on the client, in order to do that and on a realistic test scenario, I need the following features: asset loading, server ticks, networking, mesh generation, world generation to get the initial terrain, gui so I can look at the frametime graph. I think overall it actually takes less time to just wait the 2 minutes, than to find and isolate all these modules.

What would satisfy me though is replace @setRuntimeSafety with @optimizeFor · Issue #978 · ziglang/zig · GitHub, for a quick test this would make it trivial to only optimize the code that I care about. And even for more thorough tests, there are many codepaths that I don’t care about performance at all not even in releases (e.g. logging which thanks to comptime code duplication still takes a significant part of the compile time, but practically no part of the run time) that I could disable optimizations with this.

6 Likes

That would be enough, in spite of the fact that it only impacts the function it is called in, not other functions called by that function?

Yes, it only affects the scope in which it’s set: Documentation - The Zig Programming Language

I’m glad you broke it down like this, because my first thought reading the linked article was “why not just use the self hosted backend and enjoy incremental complitation?”

2 Likes

I would hope that it at least attempts to inline functions and optimize the inlined ones.

remember that feeling when some cracked out dev submits a PR and it gets merged, and you’re like holy sh*t dave, great job! we need to cherish those moments not only from the past, but for the future! why let AI rob an awesome human experience that we get only a few times in life? when our lives are bounded by time which is gold

10 Likes