Systems Distributed: Don't Forget To Flush

31 Likes

a bit unfortunate the camera was on me during most slide transitions - it kinda made things awkward. still, thanks for watching :slight_smile:

19 Likes

Excellent talk, and nice touch on the as above, so below "atthair"

The c++ critique was on point. :joy:
Great presentation as always.

2 Likes

Really good presentation.
My knowledge is a bit lacking when it comes to discerning the optimal methodology to get the best machine output, but this talk really made it simple to understand for this particular use-case, and answered the "why" that my mind was failing to fully grasp.

1 Like

Great talk, organized, fun and delivered well.

I’m not sure I understand what you mean by this, or how it works.

Sink supports vectors and splats, including together. A splat means repeating the last buffer n times, which in short means you can logically send a memset across a chain of sinks without redundantly writing out and copying that memory.

I understand that the memory can be pipelined so that memory isn’t copied… it kinda makes sense in my head, but I don’t think I have a good mental grasp on it. Perhaps playing around with the IO interface will give more intuition on it.
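If it helps intuition, here is a hedged sketch of the splat idea in plain Rust (the names `drain`, `data`, and `splat` are illustrative, not the actual interface): every buffer is written once except the last, which carries a repeat count, so a memset-like fill is described logically rather than materialized in memory.

```rust
// Hypothetical sketch of a vectored write with "splat":
// each buffer in `data` is written once, except the last,
// which is logically repeated `splat` times. A memset-like fill
// is therefore just a tiny buffer plus a count, and the repeated
// bytes never exist as a large allocation anywhere.
fn drain(out: &mut Vec<u8>, data: &[&[u8]], splat: usize) {
    let (last, rest) = data.split_last().expect("data must be non-empty");
    for buf in rest {
        out.extend_from_slice(buf);
    }
    for _ in 0..splat {
        out.extend_from_slice(last);
    }
}

fn main() {
    let mut out = Vec::new();
    // header written once, then a "logical memset" of 5 zero bytes
    drain(&mut out, &[b"header:", b"\0"], 5);
    assert_eq!(out, b"header:\0\0\0\0\0");
}
```

Chaining such sinks preserves the property: each hop forwards the small last buffer and the count instead of the expanded bytes.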

I always enjoy a little ostream of C++ trashing. Really cool presentation

2 Likes

really nice lecture!

1 Like

instructions unclear my stdio is now full of shit

3 Likes

Regarding the C + MUSL + putc example and not being able to inline across compilation units, I wonder what it looks like when MUSL is linked statically and LTO is enabled (which is good and ‘slim’ enough these days that I generally use it on all my C projects in release mode).

Usually LTO manages to strike a good balance between inlining across compilation units and reducing binary size (which sounds like an impossibility, but only until one considers that LTO also strips all dead code and data, and eliminates code that only turns out to be dead after inlining)…

Speaking of size, I currently have a curious binary size puzzle where this project (GitHub - floooh/sokol-zig-imgui-sample: Sample to use sokol-zig bindings with Dear ImGui) compiled with --release=small on macOS yields a native executable that is almost twice as big as the WebAssembly ‘executable’ (700 vs 400 KBytes, both uncompressed).

Both link statically with the C++ stdlib and Dear ImGui (which should be the biggest contributors to size). The main difference is that the WASM build is linked with the Emscripten linker (which does much more than linking, like also running wasm-opt), vs the Zig linker.

For comparison, the equivalent C project built with cmake in MinSizeRel mode is 515 KBytes, but this links the Mac’s system C++ stdlib dynamically, so it makes sense that it is smaller than the Zig build (though not that it is still bigger than the WASM build).

I’ll have to investigate why the WASM build is actually so surprisingly small…

…of course WASM byte code might simply be more compact than ARM64 machine code… but nearly twice as much?

1 Like

FWIW, I think one of the "bugs" of the Rust toolchain is that --release does not imply thin LTO by default: Is "`#[inline(const)]`" possible? - #9 by matklad - compiler - Rust Internals

It just makes sense from the compilation-model point of view.

1 Like

Can we find the slides anywhere to fix it ourselves? :slight_smile:

2 Likes

I was thinking about doing a blog post version of the talk

27 Likes

I have a couple questions about the language comparison part:

  1. When talking about the C musl interface, you say

    I can confirm that the buffering at least happens before it calls the function pointers

    Then later you say this about many languages, including C:

    None of these mainstream languages manage to get buffering into their interfaces

    Is this a mistake?

  2. When analysing Rust’s Writer, you say

    I did notice that Rust was extraordinarily good at devirtualization […] That’s cheating though, this analysis is specifically for the cases where the stream implementation is runtime known.

But you commend C, C++ and Go for avoiding virtual/indirect calls. How would these other languages be able to avoid indirect calls without devirtualization, and how would they perform devirtualization if the stream implementation were runtime-known?

    Either this is an unfair comparison, or I don’t understand devirtualization well enough. I’m hoping it’s the latter.

1 Like

musl libc does get the buffer into the struct, which almost succeeds at being transparent to the optimizer, but the C language then falls short due to all the libc functions being across a compilation unit boundary. Related, we just saw in the news FILE became opaque in OpenBSD which takes it even a step further away from being in the interface.

The key consideration is about the hot path of I/O methods that only operate on the buffer and do not make vtable calls. The functions in the vtable will be runtime-known, however the hot path logic that operates on the buffer should be fully concrete, optimized code, with no virtual function calls.
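A hedged Rust sketch of that shape (names and layout hypothetical, not the actual interface): the buffer and its position live in the concrete writer struct, so the common append path is plain, fully optimizable code, and the runtime-known sink is only reached when the buffer fills or on flush.

```rust
// Hypothetical sketch: the buffer is part of the concrete interface
// struct, so the hot path below never makes an indirect call.
struct Writer {
    buf: [u8; 8], // tiny buffer so the demo actually overflows
    end: usize,
    drain: Box<dyn FnMut(&[u8])>, // runtime-known sink, stands in for the vtable
}

impl Writer {
    fn write_byte(&mut self, byte: u8) {
        if self.end == self.buf.len() {
            // cold path: the only indirect call, taken when the buffer is full
            (self.drain)(&self.buf[..self.end]);
            self.end = 0;
        }
        // hot path: concrete buffer access, no virtual dispatch
        self.buf[self.end] = byte;
        self.end += 1;
    }

    fn flush(&mut self) {
        (self.drain)(&self.buf[..self.end]);
        self.end = 0;
    }
}

fn main() {
    let mut w = Writer {
        buf: [0; 8],
        end: 0,
        drain: Box::new(|bytes: &[u8]| print!("{}", String::from_utf8_lossy(bytes))),
    };
    for &b in b"don't forget to flush\n" {
        w.write_byte(b);
    }
    w.flush();
}
```

The point being illustrated: the optimizer sees `write_byte` in full, because only `drain` hides behind the indirection.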

In my analysis, Rust was good at devirtualization when it had access to all the code statically in the same crate. But if the stream implementation was across a crate boundary, then even the hot path accessing the buffer went through an indirect call.

However, there was a bigger thing that I missed in my analysis, pointed out by @matklad, which is io::BufWriter<dyn io::Write>. I guess people don’t really do this in Rust since devirtualization usually does the trick, but I imagine it could be a valuable pattern when used across crate boundaries.
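For illustration, that pattern looks roughly like this in practice (using `Box<dyn Write>` as the usable form of a runtime-known stream; the borrow of a local `Vec` is just so the demo can inspect the result): the buffering layer is concrete `BufWriter` code, and only the flush goes through the vtable.

```rust
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    let mut storage: Vec<u8> = Vec::new();
    {
        // The stream implementation is only runtime-known (a trait object),
        // but the buffering wrapped around it is concrete BufWriter code:
        // small writes land in the buffer without touching the vtable.
        let sink: Box<dyn Write + '_> = Box::new(&mut storage);
        let mut w = BufWriter::new(sink);
        for _ in 0..1000 {
            w.write_all(b"x")?; // hot path: copy into BufWriter's buffer
        }
        w.flush()?; // cold path: the indirect call into the underlying stream
    }
    println!("{} bytes written", storage.len());
    Ok(())
}
```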

10 Likes

This makes things clearer, thank you.

This still confuses me because Rust doesn’t buffer in the interface, but I’m guessing you’re talking about provided methods like write_all, which presumably use indirect calls to write/flush but should themselves be optimizable?
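For reference, `write_all` is indeed a provided method on `io::Write` with a default body that loops over the required `write`. A simplified sketch of its shape (not the exact std source), written as a free function here: the loop itself is concrete code, while the inner `write` may be an indirect call when the writer is a trait object.

```rust
use std::io::{self, Write};

// Roughly what the provided io::Write::write_all does (simplified):
// keep calling `write` until the whole buffer has been consumed.
fn write_all_sketch<W: Write + ?Sized>(w: &mut W, mut buf: &[u8]) -> io::Result<()> {
    while !buf.is_empty() {
        match w.write(buf) {
            // a write of zero bytes means the sink can make no progress
            Ok(0) => {
                return Err(io::Error::new(
                    io::ErrorKind::WriteZero,
                    "failed to write whole buffer",
                ))
            }
            Ok(n) => buf = &buf[n..], // advance past what was accepted
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut out: Vec<u8> = Vec::new();
    write_all_sketch(&mut out, b"don't forget to flush")?;
    assert_eq!(out, b"don't forget to flush");
    Ok(())
}
```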

Shouldn’t that be taken care of by LTO though? I don’t think that the concept of ‘compilation units’ is all that relevant in C anymore; the only remaining ‘optimization boundaries’ are syscalls and calling into DLLs.

It would be if C compilers provided their standard libraries in source form like Zig does. However, they don’t! They ship them pre-compiled, which is also why they can’t cross-compile.

They also often dynamically link, and you certainly can’t inline a function that isn’t provided until runtime.

3 Likes

Even if they did provide the source, it would be consumed at link time, and I assume linkers don’t do nearly as much analysis as a compiler.

Or am I underestimating linkers?