Efficient instantiation of a union of pointers

The title is a bit confusing because I cannot figure out a better way to phrase this; I hope the post itself makes my issue clearer.

I am writing something that operates over a buffer of bytecode. To facilitate iteration over this nicely, I’ve written something like the following:

const OpCode = enum {
    add,
    sub,
    mul,
    div,
    // [snip]
};

// These union fields store pointers to avoid potentially expensive copies for some instructions
const Instruction = union(OpCode) {
    add: *align(1) const Add,
    mul: *align(1) const Mul,
    div: *align(1) const Div,
    sub: *align(1) const Sub,
    // [snip]
    pub const Add = packed struct {
        a1: u8, // This code is generated, ignore the bad names
        a2: u8,
        a3: u8,
    };
    pub const Mul = packed struct {
        a1: u8,
        a2: u8,
        a3: u8,
    };
    pub const Div = packed struct {
        a1: u8,
        a2: u8,
        a3: u8,
    };
    pub const Sub = packed struct {
        a1: u8,
        a2: u8,
        a3: u8,
    };
    // [snip]
};

const BytecodeIterator = struct {
    bc: []const u8,
    ip: u32 = 0,

    pub fn next(self: *@This()) ?Instruction {
        if (self.ip >= self.bc.len) return null;
        const op: OpCode = @enumFromInt(self.bc[self.ip]);
        self.ip += 1;
        defer self.ip += size[@intFromEnum(op)]; // size is an array of instruction sizes
        return switch (op) {
            inline else => |o| @unionInit(Instruction, @tagName(o), @ptrCast(self.bc.ptr + self.ip)),
        };
    }
};

The issue here is hopefully apparent: the generated code for BytecodeIterator.next includes every single branch from the inline else, and they are not optimized away.

Is there some way for me to prevent that from happening? My union fields are all the same thing: a pointer.
I have considered using an untagged union instead; however, a big benefit of a tagged union is being able to switch over an Instruction and directly get the union’s value. I really do not want to lose that: it makes the code a lot nicer and lessens the chance of an error sneaking in.
I have also considered manually constructing the union, but I am not certain what the memory layout of unions, or even packed unions, looks like, and I cannot find any examples or documentation about it at the moment.
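
For context, the dispatch code that makes the tagged union worth keeping looks roughly like this. This is a hypothetical sketch: vm and its methods are placeholders, not part of my real code.

```zig
// Hypothetical consumer of BytecodeIterator; `Vm` and its
// `add`/`sub` methods are made-up names for illustration.
fn run(vm: *Vm, bc: []const u8) void {
    var it = BytecodeIterator{ .bc = bc };
    while (it.next()) |inst| {
        switch (inst) {
            // The switch payload is already the typed pointer,
            // so the use site needs no second @ptrCast.
            .add => |args| vm.add(args.a1, args.a2, args.a3),
            .sub => |args| vm.sub(args.a1, args.a2, args.a3),
            // [snip]
            else => {},
        }
    }
}
```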

Are you sure about that? Did you actually use an optimized release mode?

Edit: improved godbolt example:


It is quite odd that these aren’t optimised down to no branches at all. What build mode are you using? And which Zig version?

You seem to have jumped straight to sketchy memory manipulation, when you could just use:

const Args = struct { u8, u8, u8 };
const Instruction = struct {
    op: OpCode,
    args: Args,
};

But still, that should be unnecessary; it should be getting optimised the way you expect.
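
With that by-value layout, next could be sketched roughly as follows. This assumes every instruction really is an opcode byte followed by three u8 arguments (the OP later clarifies that is not actually the case), and it omits bounds checks on the argument bytes for brevity.

```zig
pub fn next(self: *@This()) ?Instruction {
    if (self.ip >= self.bc.len) return null;
    const op: OpCode = @enumFromInt(self.bc[self.ip]);
    // One unconditional copy of the three argument bytes;
    // no per-opcode branch is needed to build the value.
    const args = Args{
        self.bc[self.ip + 1],
        self.bc[self.ip + 2],
        self.bc[self.ip + 3],
    };
    self.ip += 4; // opcode byte + three argument bytes
    return .{ .op = op, .args = args };
}
```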

So, I played with it a bit and made the bytes array a runtime slice to prevent the optimiser from precomputing everything. (I made a few changes, and the behaviour changed: the example always put 0 into eax, since bytes was compile-time known.) runtime bytes

Interestingly enough, this allowed this version to be optimised down even further than @Sze’s version, to basically just calculating the offset into bytes plus one branch depending on whether it’s .add or not.

My guess would be that this is a case of the backend taking a wrong path in its decision tree and ending up in a place where the LLVM developers decided to stop letting it continue. (After all, given a big enough program, one could “just” remove limits like that and throw more computation time at it, getting slightly better output at the cost of multiple days of optimising.)

Thank you all for your responses. I may have reduced my example a bit too much for it to make sense; allow me to clarify.

I am using ReleaseFast, on the latest alpha at the time I was testing this: 0.16.0-dev.565+f50c64797.

I didn’t include any in my example, but there are instructions that take different numbers and/or types of arguments.

Here’s an updated example that showcases the issue: Compiler Explorer


Okay, so I was looking at the LLVM IR, and I noticed that next was allocating space for the Instruction separately in every single branch, and that is the part that is not being optimized away.

I made a variable outside the switch that is assigned to instead, and this is now being optimized correctly.

This feels very unintuitive and wrong. Is this something that can be improved in Zig’s compiler, or is it a fault on LLVM’s part?

Here is a comparison of the two functions; next2 looks a lot more like what I’d expect.
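
For reference, the rearrangement I described is roughly the following. This is a sketch using the names from my original snippet, not the exact code from the godbolt link.

```zig
pub fn next2(self: *@This()) ?Instruction {
    if (self.ip >= self.bc.len) return null;
    const op: OpCode = @enumFromInt(self.bc[self.ip]);
    self.ip += 1;
    defer self.ip += size[@intFromEnum(op)];
    // A single result slot declared outside the inline switch, so
    // LLVM sees one alloca instead of one per generated branch.
    var result: Instruction = undefined;
    switch (op) {
        inline else => |o| {
            result = @unionInit(Instruction, @tagName(o), @ptrCast(self.bc.ptr + self.ip));
        },
    }
    return result;
}
```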
