Systems Distributed: Don't Forget To Flush

Guess what LTO stands for :grin:

2 Likes

I am aware, I just have the totally unfounded assumption that LTO is less capable of optimisation than the compiler.

This mostly comes from linkers usually being separate programs from the compiler, so the linker would have to redo analysis while having less information to begin with.

Many unfounded assumptions here, feel free to correct me.

Sorry for assuming incorrectly.

LTO is fairly simple conceptually - I’m guessing you’re just missing one little detail and then it will all click into place:

  • Instead of emitting machine code, in LTO mode compilers emit higher-level, compiler-specific IR into pseudo-object files. GCC emits GIMPLE; Clang emits LLVM bitcode (the binary form of LLVM IR). These pseudo-object files are not compatible with third-party linkers.
  • When the linker is invoked, it notices that some object files are not in fact machine code and still need to go through the compilation pipeline to be transformed into machine code. The linker combines all the IR into one big module, runs the full compiler pipeline on it, and produces one big object that is then combined with the rest of the actual machine-code objects.

Anyway, point being, yeah it runs the whole analysis on everything all at once. It makes the compilation phase fast and the linking phase really slow.

It’s very similar to “unity builds” in C/C++, which are effectively LTO but done manually by the programmer.

15 Likes

I had the thought that there might be some way to pass extra information on to the linker in addition to the machine code. But that makes more sense, thank you. :slight_smile:

1 Like

Good mental model for LTO is “square peg in a round hole”. The C compilation model, with separate compilation of object files and concatenation of them into the final executable (with slight adjustment of certain addresses), doesn’t really work for languages with monomorphisation and source-level abstraction, like C++ or Rust. The shape of the language sometimes compels you to do things like getters, but, if a getter is compiled separately, it won’t be inlined, so the downstream code won’t be able to see that it is, in fact, a getter, and that, e.g., it is safe to cache the result in registers rather than loading from memory every time.
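As a hedged illustration (the names here are made up for the example), the getter problem looks like this:

```rust
struct Counter {
    value: u64,
}

impl Counter {
    // If this getter lives in another compilation unit and no LTO is
    // used, callers see only an opaque call and must go through it
    // (and through memory) on every access.
    fn get(&self) -> u64 {
        self.value
    }
}

fn sum_twice(c: &Counter) -> u64 {
    // Once `get` is inlined, the compiler can see that both calls load
    // the same field and can keep it in a register.
    c.get() + c.get()
}

fn main() {
    let c = Counter { value: 21 };
    assert_eq!(sum_twice(&c), 42);
}
```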

The reasonable compilation model for this sort of language is along the lines of what Zig does: compile everything in one go, give the compiler full context about everything to allow inlining.

The problem with reasonable is that it doesn’t fit into existing build files, which are already structured as a manual map-reduce job.

The solution is basically a hack: turn the linker into a compiler, smuggle IR inside object files, and make the final link step do the bulk of the compilation. Not because it makes sense, but simply because this allows re-using existing makefiles.


Though, worth mentioning that there are two different kinds of LTO: full LTO and a more recent thin LTO. Full LTO is basically unity build, yeah.

Thin LTO is what you get if you really want separate compilation, and you really want zero-cost abstraction, and you actually think about how you would solve it. With thin LTO, we structure our entire build as a map-reduce job for real, rather than just pretending to. But in the map phase, where we compile individual object files, we also collect a bit of meta information like “this function is small, it’s worth inlining, here’s its code”. So, in the “reduce” phase, we can run cross-CU inlining and further optimizations, but our decisions are based on the information we gathered in the map phase, and not on a global view of the entire world.

Corollary: it doesn’t make sense that Rust doesn’t default to lto=thin for release builds. I probably should blog about that at some point…
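(For reference, opting into thin LTO in a Cargo project today is a one-line profile setting:

```toml
[profile.release]
lto = "thin"
```

)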

10 Likes

Regarding getters, I find it pretty neat that compilers can ‘extract’ the one item the caller is interested in, as long as the compiler can see the entire code and the getter has no side effects (I would expect this also works with LTO when the complex getter is in another compilation unit):

…e.g. instead of tons of granular getter functions you can have a single “just give me everything getter” and even if the user just picks one or two items from the result the code will be specialized to just those items.
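A hedged sketch of that “give me everything” shape (names invented for the example):

```rust
struct Stats {
    min: f32,
    max: f32,
    mean: f32,
}

// One coarse getter that computes everything.
fn stats(xs: &[f32]) -> Stats {
    let (mut min, mut max, mut sum) = (f32::INFINITY, f32::NEG_INFINITY, 0.0);
    for &x in xs {
        min = min.min(x);
        max = max.max(x);
        sum += x;
    }
    Stats { min, max, mean: sum / xs.len() as f32 }
}

fn main() {
    // The caller only uses `max`; after inlining, dead-code elimination
    // is free to drop the min/mean computations entirely.
    let m = stats(&[1.0, 3.0, 2.0]).max;
    assert_eq!(m, 3.0);
}
```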

PS: and here’s an example where this idea breaks down because the compiler can’t make sure that the extern functions don’t have side effects, so it needs to call them all even though the caller is only interested in a single result :slight_smile:

PPS: nice, Clang actually has an attribute for that situation (to declare that a getter function is side-effect free):

1 Like

My mental model in general is that the compiler doesn’t care at all where you draw your function boundaries. It will inline & SROA, and then all the usual redundancy-elimination optimizations go to town. The only thing that really matters is “can the compiler see the code it is supposed to optimize?” (so, no extern functions / function pointers).

3 Likes

That was an interesting learning excursion; I hadn’t heard of Scalar Replacement of Aggregates before.

Yup, SROA is very important, and poorly known.

It essentially is inlining for data.

Normal inlining, inlining for code, replaces a function call with the body of the callee. The purpose here is not to eliminate function-call overhead per se, but rather to let the compiler “see” what the nested function does, and apply optimizations across the two functions. For example, if both the caller and the callee invoke this.get_x(), then, after inlining, the compiler can notice that the second get_x is redundant. Or the callee might be called inside a loop and do a bounds check, which, after inlining, can be hoisted out of the loop.
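The bounds-check case can be sketched like this (a hedged example with made-up names):

```rust
fn first(xs: &[u32]) -> u32 {
    // Indexing carries a bounds check.
    xs[0]
}

fn sum_firsts(xs: &[u32], n: usize) -> u32 {
    let mut total = 0;
    for _ in 0..n {
        // After inlining `first`, the bounds check is visibly
        // loop-invariant and can be hoisted out of the loop.
        total += first(xs);
    }
    total
}

fn main() {
    assert_eq!(sum_firsts(&[5, 1], 3), 15);
}
```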

SROA is the same idea, but applied to data. Aggregates (structs) are, in general, slabs of memory with a specific layout. A struct is accessed through a pointer, and you “know” that you can find fields by offsetting that pointer. Similarly to how the compiler has trouble understanding what a function call can do, it is bad at reasoning about memory. The memory might be aliased by some different pointer, it might be accessible through a global, or might even be written by some other thread. In general, if a compiler sees code like

point.x;
f();
point.x;

where point.x is understood as a load from memory, it’s pretty hard for the compiler to “see” that the two results would be the same.

What SROA does is bulk-load the struct into registers up front, and store it back at the end:

let x = point.x;
let y = point.y;

f();

point.x = x;
point.y = y; 

With locals, the compiler has the full picture: it knows that no one can touch a local unless its address escapes, so it can start applying simplifications to them.

So that’s SROA: we take a data structure, which is memory in a specified layout, and “inline” it into a set of local registers.
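A runnable version of the sketch above, with the transformation SROA performs written out by hand (`scale` and the opaque `f` are invented for the example):

```rust
struct Point {
    x: f64,
    y: f64,
}

// Stand-in for an opaque call the compiler can't see through.
fn f() {}

fn scale(point: &mut Point, k: f64) -> f64 {
    // Load the fields into locals first...
    let x = point.x;
    let y = point.y;

    f();

    // ...work on the locals, which nothing else can alias, and store
    // back to memory only at the end.
    point.x = x * k;
    point.y = y * k;
    point.x + point.y
}

fn main() {
    let mut p = Point { x: 1.0, y: 2.0 };
    assert_eq!(scale(&mut p, 2.0), 6.0);
}
```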

8 Likes

When will you start work on the Zig optimization pipeline? :wink:

Hopefully not any time soon, two Rust compiler front-ends is enough compilering for me for the time being :rofl:

6 Likes

Which, at least in principle, is a good argument for the “no defined memory layout” policy for Zig structs. Instead of burdening the optimizer with proving that code doesn’t rely on specifics of struct layout, just decline to make those guarantees to begin with.

I’m not sure how much of a difference it makes in practice, but the fewer ways to accidentally pessimize code are made available, the better.

6 Likes

That’s a great point. I was just thinking about this today and wondering if Zig would ever get guarantees about struct layout. There seem to be a lot of things that are precluded because struct layouts are not guaranteed, but I never thought about how that can help with optimization.

EDIT:
I mean, I did know how it could help by letting the compiler pick the best field order, but not in other kinds of optimizations.

1 Like

What’s precluded? Genuinely curious.
It seems to me that the goal is that any trick you could play based on guaranteed struct layout, is something you can just do explicitly, no tricks or hidden assumptions required.

An example from something I’m working on: the struct has a few ‘registers’ and a much larger stack(s) portion. It would be useful to ensure that everything which is ‘hot’ ends up in the same cache line, and that said cache line is what the pointer to the entire struct points to.

I’ve figured out how to ensure this manually, but it’s messy compared to an explicitly-specified layout.

3 Likes

If you (really) want that, use extern struct and read up on how it works.

But most people don’t need (or want) to specify exactly how every single struct is laid out in memory, just the important ones (if at all).
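For comparison, the same idea written in Rust terms (a hedged analogue; Zig’s extern struct similarly fixes field order to the C ABI, and the field names here are invented):

```rust
// #[repr(C)] pins the field order, and align(64) pins the struct to a
// cache-line boundary, so the hot fields sit in the first cache line
// that a pointer to the struct points at.
#[repr(C, align(64))]
struct Machine {
    head: u32,         // hot 'register'-like fields come first...
    tail: u32,
    stack: [u8; 4096], // ...followed by the large cold portion.
}

fn main() {
    assert_eq!(std::mem::align_of::<Machine>(), 64);
}
```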

2 Likes

Please do!

1 Like