How to avoid implicit memcpys?

matklad · July 17, 2023, 2:14pm

In TigerBeetle, there’s one problem which bites us again, and again, and again — Zig makes it easy to implicitly copy large structs, and that hurts. Consider this example progam:

const Big = struct {
    ballast: [4096]u8 = undefined,
    small: Small,
};

const Small = struct {
    ballast: [128]u8 = undefined,
};

export fn foo(xs_ptr: [*]const Big, xs_len: usize) callconv(.C) void {
    const xs: []const Big = xs_ptr[0..xs_len];
    for (xs) |x| {
        bar(&x.small);
    }
}

noinline fn bar(x: *const Small) void {
    _ = x;
}

If I compile it with

$ zig build-lib stack-copies.zig -Drelease-fast -fstrip -femit-llvm-ir

then the resulting .ll file contains

Block5:                                           ; preds = %Then3
  %16 = extractvalue { ptr, i64 } %10, 0
  %17 = getelementptr inbounds %stack-copies.Big, ptr %16, i64 %12
  call void @llvm.memcpy.p0.p0.i64(ptr align 1 %3, ptr align 1 %17, i64 4224, i1 false)
  %18 = getelementptr inbounds %stack-copies.Big, ptr %3, i32 0, i32 1
  call fastcc void @stack-copies.bar(ptr %18)
  br label %Block6

That’s a needless copy of 4224 bytes! It can be fixed by changing |x| to |*x|, but its a subtle thing, hard to notice without looking at the asm / ll.

I know that there are some long-term approaches to improve the situation here, like pinned structs, but I am wondering if there are some short-term remedies here?

Q: Can we somehow lint our existing code to warn about large copies?

I think ideally such a lint should live in the Zig compiler, and implemented in a backend-independent way. However, I don’t think it is easy to plug custom code there — afaiu, zig doesn’t expose it’s internal IRs for analysis.

My next thought here is to look at LLVM IR. The .ll files are easy to produce, fairly stable, and easy to work with, because they are text.

Q: Would detecting memcpys on the LLVM level work? Or are there any non-obvious problem on that path, like false positives or something?

Preliminary investigation shows that look at .ll files can uncover at least some problems

$ λ rg -F 'llvm.memcpy' tigerbeetle.ll | rg '.*(i64 \d+).*' -r '$1' | sort -n -k 2 | tail -n 32
i64 11904
i64 11904
i64 11904
i64 11904
i64 11904
i64 12032
i64 12032
i64 12032
i64 12032
i64 12032
i64 12032
i64 13568
i64 13568
i64 13568
i64 13568
i64 13568
i64 37376
i64 37376
i64 47616
i64 47616
i64 58832
i64 60624
i64 101888
i64 101888
i64 125136
i64 255808
i64 255808
i64 255856
i64 255856
i64 255856
i64 265248
i64 265248

We do indeed seem to copy quarter-of-a-megabyte here and there! Manually looking at the last entry even attributes it to this specific line:

github.com

tigerbeetle/tigerbeetle/blob/2c7e0074ebcb69b2c5647213d787f056875ad26b/src/vsr/replica.zig#L832


      
          .quorum_nack_prepare = quorum_nack_prepare,
          .quorum_majority = quorum_majority,
          .nonce = options.nonce,
          // Copy the (already-initialized) time back, to avoid regressing the monotonic
          // clock guard.
          .time = self.time,
          .clock = self.clock,
          .journal = self.journal,
          .client_replies = client_replies,
          .message_bus = self.message_bus,
          .state_machine = self.state_machine,
          .superblock = self.superblock,
          .grid = self.grid,
          .grid_write_blocks = self.grid_write_blocks,
          .opened = self.opened,
          .view = self.superblock.working.vsr_state.view,
          .log_view = self.superblock.working.vsr_state.log_view,
          .op = undefined,
          .commit_min = self.superblock.working.vsr_state.commit_min,
          .commit_max = self.superblock.working.vsr_state.commit_max,
          .pipeline = .{ .cache = .{} },

What happens here is that we want to intialize a struct with many fields, so we use self.* = .{} syntax to statically check that all fields are set. But, because the struct is large, and we need stable addresses, some fields are initialized separately using self.state_machine = try StateMachine.init(...), and then are set-to-themselves in this last line which produces memcopies.

I am thinking about automating this process, for which I need to find all .ll functions which call memcpy with large comptime-know length argument, demangle them, and print their names and locations. I think llvm-ir-analysis can help here, but that is in Rust, and I’d rather we stick to a single language in TigerBeetle

Q: is there an existing Zig library which allows me to conveniently load .ll files, build call-graphs, and run ad-hoc analysis?

Another other thoughts or ideas here?

gonzo · July 17, 2023, 3:26pm

Great question Mat! I personally believe this should be declared more explicitly in the code. For example:

    for (xs) |const x| {
        bar(&x.small);
    }

vs

    for (xs) |var x| {
        bar(&x.small);
    }

Another possibility would be issuing warnings, or, as you mentioned, having a linter mode in the zig compiler. But there is well-known opposition to warnings, so I am not sure that would fly – but linter diagnostics could be allowed, since you explicitly requested for them?

HurricanKai · July 17, 2023, 3:33pm

For a current project of mine I’ve opted to change a lot of []const Big to []*const Big.
It seems like a usability win and often I find it easier to model data this way instead.

I also think that given the much more advanced control over allocations in general, doing things like using linked lists for small arrays of large objects become much more attractive.

I agree that in cases where this is necessary it’s annoying and quite the performance pitfall, but I don’t find myself modeling data this way so often. IMO standard slices of such large objects are commonly a bad idea, and we’re only used to it from other languages because other constructs are so badly integrated into those languages.

So yeah, try linked lists or lists of pointers, it’s often the more natural and you have control over the allocator anyways!

LucasSantos · July 17, 2023, 3:38pm

This is, indeed, a well hidden footgun.
Defaulting to always capturing by pointer would be suboptimal, as it would create unnecessary indirections, and may create loop carried dependencies.
I believe the most definitive solution would be for captures to follow the same logic as parameter passing. That is, zig should analyse if it’s best to copy the object or create a reference to it. Before you post, I actually assumed that it was already doing it.

squeek502 · July 17, 2023, 4:41pm

It seems very possible that this is effectively a bug, as there are similar issues with T vs *const T function parameters:

github.com/ziglang/zig

InternPool: pass by const pointer

committed 03:47AM - 11 Jun 23 UTC

andrewrk

+94 -94

The Zig language allows the compiler to make this optimization automatically. We… should definitely make the compiler do that, and revert this commit. However, that will not happen in this branch, and I want to continue to explore achieving performance parity with merge-base. So, this commit changes all InternPool parameters to be passed by const pointer rather than by value. I measured a 1.03x ± 0.03 speedup vs the previous commit compiling the (set of passing) behavior tests. Against merge-base, this commit is 1.17x ± 0.04 slower, which is an improvement from the previous measurement of 1.22x ± 0.02. Related issue: #13510 Related issue: #14129 Related issue: #15688

matklad · July 17, 2023, 5:14pm

llvm-ir-analyzer panicked on the bitcode that Zig produces, so I went down the inglorious road of “parsing” textual .ll:

github.com

tigerbeetle/tigerbeetle/blob/ee67e2ab95ed7ccf909be377dc613869738d48b4/src/copyhound.zig

//! Analyze LLVM IR to find large copies.
//!
//! To get a file with IR, use `-femit-llvm-ir` cli argument for `zig build-exe`, or set
//!
//!     exe.emit_llvm_ir = .{ .emit_to = "my-prog.ll" };
//!
//! in build.zig
//!
//! Pass the resulting file to copyhound:
//!
//!     $ zig run src/copyhound.zig -- --threshold-bytes 4096 < my-prog.ll 2>&1 | sort -n -k 3
//!
//! This only detects memory copies with comptime-know size (eg, when you copy a `T`, rather than a
//! `[]T`).

const std = @import("std");
const stdx = @import("./stdx.zig");
const assert = std.debug.assert;

const log = std.log;

This file has been truncated. show original

Seems to work somewhat:

warning: .vsr.vsr.replica.ReplicaType.init: 7480 bytes memcpy
warning: .vsr.vsr.replica.ReplicaType.init: 7480 bytes memcpy
warning: .vsr.lsm.tree.TreeType.init: 7968 bytes memcpy
warning: .vsr.lsm.tree.TreeType.init: 7968 bytes memcpy
warning: .vsr.lsm.tree.TreeType.reset: 7968 bytes memcpy
warning: .vsr.lsm.tree.TreeType.reset: 7968 bytes memcpy
warning: .vsr.lsm.tree.TreeType.init: 8064 bytes memcpy
warning: .vsr.lsm.tree.TreeType.reset: 8064 bytes memcpy
warning: .vsr.vsr.superblock.SuperBlockType.read_header_callback: 8192 bytes memcpy
warning: .vsr.vsr.superblock.SuperBlockType.read_header_callback: 8192 bytes memcpy
warning: .vsr.vsr.superblock.SuperBlockType.repair: 8192 bytes memcpy
warning: .vsr.vsr.superblock.SuperBlockType.write_staging: 8192 bytes memcpy
warning: .vsr.lsm.tree.TreeType.init: 9216 bytes memcpy
warning: .vsr.lsm.tree.TreeType.reset: 9216 bytes memcpy
warning: .vsr.lsm.groove.GrooveType.init: 11136 bytes memcpy
warning: .vsr.lsm.groove.GrooveType.init: 11136 bytes memcpy
warning: .vsr.lsm.groove.GrooveType.init: 11136 bytes memcpy
warning: .vsr.lsm.groove.GrooveType.init: 11136 bytes memcpy
warning: .vsr.lsm.groove.GrooveType.init: 11136 bytes memcpy

AndrewCodeDev · July 17, 2023, 5:31pm

Yeah, this same issue gets a lot of people in trouble with the “auto” keyword in C++. At least in the C++ case, the keyword expresses the intention there (as opposed to auto* or auto&&)… there’s not much of a “semantic tell” with the capture syntax besides just declaring it as a pointer.

There isn’t really a concept of “perfect forwarding” in Zig. It is very odd though given the [*] argument circumstances.