The int from enum performance drama

To have direct access to an enum in my chess program I use some packed unions.
As a test I refactored two of these to ‘raw enums’ (CastleType, and the same idea with Color, which is also a u1).
(I tried that because packed unions have no easy native comparison, no std.mem.eql support, etc.)
Like this:

// packed union (original)
pub const CastleType = packed union {
    pub const Enum = enum(u1) { short, long };
    /// The enum value
    e: Enum,
    /// The numeric value
    u: u1,
};

// raw enum (testing)
pub const CastleType = enum(u1) {
    short,
    long,

    pub fn u(self: CastleType) u1 {
        return @intFromEnum(self);
    }
};

I simply replaced, everywhere I needed access to my arrays,
my_array[color.u] with my_array[color.u()].

The effect was a drastic performance boost downwards :slight_smile:

1362.1632 Mnodes/s // original
 696.4148 Mnodes/s // testing

How is that possible? And what can I do about it?
I had similar experiences earlier.
Are @intFromEnum and @enumFromInt not really the no-ops I expected?
Am I missing something?


The only way to know is to look at the assembly!

I have 3 guesses:

  1. There might be a safety check? I think that’s unlikely.
  2. It’s function call overhead. You’d assume LLVM would inline that, and probably optimise it out entirely, but that’s not guaranteed. Forcing inline should solve that, or using @intFromEnum directly.
  3. Perhaps the function call, even if inlined, is preventing other optimisation.
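To make guess 2 concrete, a minimal sketch of both workarounds (forcing inline on the wrapper, or calling the builtin at the use site) could look like this:

```zig
pub const CastleType = enum(u1) {
    short,
    long,

    // `inline` makes Zig inline the call itself, even in debug builds,
    // so no call overhead can survive to reach the optimizer.
    pub inline fn u(self: CastleType) u1 {
        return @intFromEnum(self);
    }
};

// Or skip the wrapper entirely and use the builtin at the call site:
// my_array[@intFromEnum(color)]
```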

I noticed that you can’t @bitCast() an enum(u1) to a u1, but you can @bitCast() a packed struct containing the same enum to a u1.
Perhaps this would grant a performance improvement? I doubt it’d be any better than the packed union equivalent, though.
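A sketch of that workaround (the wrapper name CastleBits is made up here; assumes the result-type-inferred @bitCast of Zig 0.11+):

```zig
const std = @import("std");

pub const CastleType = enum(u1) { short, long };

// @bitCast on the bare enum(u1) is rejected, but wrapping it in a
// packed struct with the same 1-bit layout makes the cast legal.
pub const CastleBits = packed struct {
    e: CastleType,

    pub fn u(self: CastleBits) u1 {
        return @bitCast(self);
    }
};

test "round-trip through the packed struct" {
    const c = CastleBits{ .e = .long };
    try std.testing.expectEqual(@as(u1, 1), c.u());
}
```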

True. But then the packed union is easier.
I was trying to get rid of the extra layer involved. Somewhere the compiler fails to optimize, I guess.

I tried a really simple assembly test in godbolt, and it seems that there is a small difference - Compiler Explorer

Tbh, I’m not an expert at reading assembly, but it looks like using @intFromEnum adds a bitmask operation and a weird dead-store (the line accessing the sil register). Maybe it’s a compiler bug? Or maybe I’m misreading it…

This example is probably completely different to how you use your code. I imagine you may even pack the u1’s together, which would look very different to this.

What can you do? Well, if you wait, someone will probably eventually fix it so that the enum conversions are actually zero-cost. I didn’t check whether an issue has been raised for this; maybe you could look into raising one?

If I were you, and I needed the extra performance now, I would go back to using the unions. You said the main reason to change was for operations like std.mem.eql - I would probably use std.mem.asBytes when using comparisons if that makes sense for your use-case.
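For the comparison problem mentioned above, a hedged sketch of the std.mem.asBytes idea applied to the original packed union (the helper name eqlCastle is illustrative):

```zig
const std = @import("std");

pub const CastleType = packed union {
    pub const Enum = enum(u1) { short, long };
    e: Enum,
    u: u1,
};

// Unions have no `==`, but both fields alias the same bits,
// so a byte-wise comparison of the backing memory is enough.
fn eqlCastle(a: CastleType, b: CastleType) bool {
    return std.mem.eql(u8, std.mem.asBytes(&a), std.mem.asBytes(&b));
}
```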


In your code, does using pub inline fn u change anything? When you say “simply everywhere” I wonder if the optimizer made the wrong decision for inlining the calls.

I tried the trunk version of the compiler. In the new version, the compilation results of bar and bar2 are the same, both corresponding to bar2 of version 0.15.2.

Note that you removed bar2 in the link.

Also, in this example, inline does not change anything

Yes it wouldn’t. Any compiler will inline a function which is only used in one place.

I think it’s unlikely that going from 6 instructions to 7, without adding a branch, is responsible for degrading performance by a factor of two. That sounds more like a case where an omniscient compiler would inline, but alas, we do not have one of those.

Inlining did not change anything.
Can we imagine how slow it would get if I did the same refactor with Square?
I don’t understand how it could become slower in any way, because in my simple mind these simple enums should translate directly into an index.

I also don’t think looking at the generated code of one function on Godbolt can explain things. It could be completely optimized there, because it is just one simple function.
In my Position and Search the indexes are used all over the place.

BTW: I did not look at the assembly (don’t know how).
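For reference, one way to get at the assembly without Godbolt (flag name as listed in recent `zig build-obj --help` output; file paths here are illustrative):

```shell
# Build with optimizations and ask the compiler to also write a .s file:
zig build-obj -O ReleaseFast -femit-asm=chess.s src/main.zig

# Or disassemble an already-built binary:
objdump -d -M intel zig-out/bin/chess | less
```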


Explain this please. What do you mean?

out of curiosity (because it is not clear to me from the thread) are you measuring with the native backend or the llvm backend?

Now that I don’t know. I’m kinda noobish in this area.
I use Windows and VSCode, do a zig build, run my exe, and run the standard speed and correctness test in my program.

You have at least verified you’re not measuring the (default) debug build?

absolutely

For the interested ones:

I looked at your godbolt. I would guess it’s this:

Yes, it is a compiler bug/missing optimization. You’re right that it’s a dead store.


Interesting. Thanks for solving the mystery.
Should I avoid types that are not u8, u16, u32, u64? I use them a lot.

Has an issue been raised regarding this? Spurious stack store not eliminated · Issue #181529 · llvm/llvm-project

The fix is going to be on LLVM’s side. Support is planned, just not there yet.