Bun’s Zig fork got 4x faster compilation times

I believe that even with multiple LLVM codegen units, it still can do thin-lto across them.

Good point, I did neglect to mention that. Of course, running thin LTO across codegen units necessarily costs some compilation speed. I don’t know much about thin LTO, so I’m not sure how large that cost is, or how much optimization it buys back. But you’re right to say that there’s a bit more nuance to this than I suggested!

First, while “crate” is a user-visible separate compilation unit, when compiling, a crate is split into multiple codegen-units for LLVM

Ah, I was unaware of that, interesting! Thanks for the link 😄

Second even though crates are human-authored, they are often not particularly great boundaries for splitting codegen work

Also a good point. I opted not to get into this because my first comment was already the length of a blog post, but I agree with pretty much everything you said.

The only thing I might push back on a little is:

[Thin LTO is] the only sane compilation model for zero-cost abstraction language with separate compilation

Perhaps I’m underestimating how effective thin LTO is, but to me, full LTO seems like the “correct” way to do this: fundamentally, I do not want to give up potentially significant optimization opportunities based on semi-arbitrary module splits! Solutions like thin LTO have always felt to me like workarounds for LLVM’s optimization framework not scaling very well. To be clear, I am in no way suggesting that it is easy to build an optimization framework that uses memory efficiently enough for full LTO to be a non-issue! But I do feel (while recognizing that I could be wrong, given that I’m entirely unfamiliar with the codebase) that a huge amount of LLVM’s memory usage could theoretically be eliminated with no impact on optimization.

Also, even if you do want to set arbitrary optimization boundaries, my biggest gripe is that the component which knows best where those boundaries should go is LLVM itself! LLVM could start by running some efficient passes full-LTO-style, i.e. on the whole graph (maybe some inlining to eliminate small shared functions?), but after that initial work, it could consider the call graph and determine the best partition into smaller units which it would be reasonable to optimize independently (plus it can always do some thin-LTO-style inlining at the end if useful). As I’m sure you’re thinking, that is pretty much exactly what’s going on in the blog post you linked on back-end parallelism in Rust! Analyzing the function call graph to partition it into sets of functions which are mostly independent is, at its core, optimization work, and I find it quite odd that right now LLVM expects every frontend to handle that itself somehow.
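To make the "partition the call graph" idea a bit more concrete, here is a toy sketch of the kind of heuristic I have in mind. To be clear, this is not what LLVM or rustc actually does: the function name `partition_call_graph`, the per-function "size" estimate, the call weights, and the greedy merge rule are all assumptions I made up for illustration. It just merges the most heavily connected functions into the same unit first, as long as the unit stays under a size budget.

```python
from collections import defaultdict


def partition_call_graph(sizes, calls, max_unit_size):
    """Greedily group functions into optimization units.

    sizes: {function_name: estimated_cost}            (hypothetical estimate)
    calls: {(caller, callee): call_weight}            (hypothetical weights)
    max_unit_size: upper bound on the summed cost of one unit
    Returns a sorted list of sorted function-name lists.
    """
    # Union-find: every function starts in its own unit.
    parent = {fn: fn for fn in sizes}
    unit_size = dict(sizes)

    def find(fn):
        # Path halving keeps lookups cheap.
        while parent[fn] != fn:
            parent[fn] = parent[parent[fn]]
            fn = parent[fn]
        return fn

    # Visit the heaviest call edges first (ties broken by name, so the
    # result is deterministic) and merge their units if the budget allows.
    ordered = sorted(calls.items(), key=lambda kv: (-kv[1], kv[0]))
    for (caller, callee), _weight in ordered:
        a, b = find(caller), find(callee)
        if a != b and unit_size[a] + unit_size[b] <= max_unit_size:
            parent[b] = a
            unit_size[a] += unit_size[b]

    units = defaultdict(set)
    for fn in sizes:
        units[find(fn)].add(fn)
    return sorted(sorted(unit) for unit in units.values())


# Made-up example: two tightly coupled pairs plus a cold entry point.
sizes = {"main": 1, "parse": 4, "lex": 3, "render": 5, "blend": 2}
calls = {
    ("main", "parse"): 1,
    ("parse", "lex"): 100,
    ("main", "render"): 1,
    ("render", "blend"): 80,
}
print(partition_call_graph(sizes, calls, max_unit_size=8))
```

With these made-up numbers, `parse`/`lex` and `render`/`blend` end up together because those are the hot edges, and `main` joins whichever unit still has room. The only edge left crossing a unit boundary is a cold one, which is exactly the kind of boundary I would be comfortable giving up optimization across (or patching up with thin-LTO-style inlining afterwards).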
