Why doesn't Zig optimize this, whereas the C compiler does?

IntegratedQuantum · July 29, 2024, 12:26pm

I came across this while profiling my game. The code looks roughly like this:

const Example = struct {
    val: [*]u32,
    fun: u32,
};

pub export fn fun(self: *Example, idx: u32) void {
    self.val[idx] -= 1;
    if(self.val[idx] == 0) {
        self.fun -= 1;
    }
}

The important part is this:

self.val[idx] -= 1;
if(self.val[idx] == 0) {

This piece of code is modifying a value and then checking if the result is 0.
With the Zig compiler it produces the following assembly code (see godbolt):

...
        mov     rcx, qword ptr [rdi]
        dec     dword ptr [rcx + 4*rax]
        mov     rcx, qword ptr [rdi]
        cmp     dword ptr [rcx + 4*rax], 0
...

As you can see the address is fetched twice. Why doesn’t the compiler optimize this?
I guess the compiler is assuming that self and self.val may alias, but why does Zig make these assumptions while other languages don’t?

Also I checked with a C compiler, and in C this extra code isn’t there (see godbolt). Why does clang optimize this, but not Zig?

dimdin · July 29, 2024, 12:51pm

It looks like a bug.
Removing the mov and cmp statements gives the same results even with aliasing (e.g. val = &fun)

korke · July 29, 2024, 2:05pm

Just curious how much was the performance penalty had to pay. Since it’s a single extra mov instruction normally you wouldn’t expect it to spend so much time, moreover it’s accessing same location so cache line miss is less likely to happen. And if it’s a bug I wonder this type of extra unnecessary mov instruction were appended all over the place.

mlugg · July 29, 2024, 3:08pm

You’ve correctly identified that this is to do with aliasing. @dimdin seems to have assumed it’s about val pointing to fun, but that’s not true: the concern is with val pointing to val, through a cast pointer, such as like this:

var ex: Example = .{
    .val = undefined,
    .fun = 123,
};
ex.val = @ptrCast(&ex.val);

In this case, the assignment ex.val[0] -= 1 would change the pointer ex.val, so that the next comparison has to work differently. Thus, since for an export fn the compiler has no useful provenance data for any parameters, the optimizer has no choice but to load the pointer again, in case val aliased it.

The last piece of the puzzle, then: why does this not happen in C? The answer is a piece of UB in the C standard called Strict Aliasing. This is a form of TBAA (type-based alias analysis) – in short, it states that it is UB to use an A * which “actually” points in memory to a different type B (unless either A or B is a char type, which is an explicit exception). So, in your second Godbolt snippet, Clang notices that a pointer to val has type int ** while the pointer we’re accessing (self->val[idx]) has type int *; since the pointee types int * and int are different and neither is a character type, the optimizer is allowed to assume they won’t alias! Clang and GCC allow using the compiler flag -fno-strict-aliasing to disable this class of optimizations; if you pass that, you’ll see you get the same code as in the Zig case.

LucasSantos91 · July 29, 2024, 3:21pm

One question. val is of type [*]u32 (a 64-bit pointer), but the expression ex.val[0] -= 1 is modifying a 32-bit value. The compiler still thinks that this 32-bit value could overlap with a pointer? Because it sounds like the compiler is being excessively loose with types.

mlugg · July 29, 2024, 3:31pm

It does think that, because it literally can. Sure, it’d be referring to a 4-byte sub-region of the pointer value, but that still affects the pointer value. The example snippet I sent is exactly the code which would trigger this case (if you passed idx as 0).

LucasSantos91 · July 29, 2024, 5:55pm

I was under the impression that Zig’s type system was stricter than C’s. In particular, type punning is legal in C but illegal in Zig. So, from this point of view, using a [*]u32 as an alias to a sub-region of a pointer would be illegal, wouldn’t it? Shouldn’t Zig leverage that to enable the same optimizations that Clang did?

mlugg · July 29, 2024, 6:41pm

Where did you get that impression? Zig is generally more permissive than C regarding type punning. C is notoriously strict about this; the only well-defined way to type pun in C is with memcpy (unless the source or destination type is char, in which case pointer casts are fine per the aliasing rules I described above). This is because C has a somewhat bizarre memory model which kind of assumes that values aren’t actually “made of” bytes in general.

The only thing I can think of which you might be talking about is the fact that Zig doesn’t allow you to use a union to type pun. This is because union types in Zig have ill-defined layout. Using an extern union, which is the equivalent of a C union, allows you to type pun (which is notably not allowed in C, even if everyone does it!).

LucasSantos91 · July 29, 2024, 6:48pm

Yes, that’s exactly what gave me that impression. Thanks for your clarifications.

andrewrk · July 29, 2024, 7:01pm

I want to add to this that reworking zig’s rules about aliasing is still one of the big language design areas left. The challenge is to ensure that whatever rules are instated are enforceable as cheap safety checks.

Best tracking issue for this is probably noalias on all parameters by default (with debug safety); ability to specify mayalias · Issue #1108 · ziglang/zig · GitHub

permutationlock · July 29, 2024, 7:34pm

Yeah, type punning through unions is an odd case in C. There is a footnote in the C11 (and C99) standard that specifically states punning through unions is ok. But, as I understand it, footnotes are non-normative and the normative body of the standard seems ~~to state that such type punning is UB~~ a little unclear.

Edit: as pointed out below, the footnote doesn’t fully clarify type punning and the standard body is more clear than I recalled and doesn’t forbid it.

mnemnion · July 29, 2024, 8:10pm

In C89 the rule was completely clear:

if a member of a union object is accessed after a value has been stored in a different member of the object, the behavior is implementation-defined

As often happens, the specific standard-ese on this topic has muddied over time. As a matter of practice, no compiler would dare break type punning through unions, because too many programs depend on it functioning correctly.

The footnote you’re referring to, which does expliticly mention type punning, points to this section:

When a value is stored in a member of an object of union type, the bytes of the object representation
that do not correspond to that member but do correspond to other members take unspecified values.

Which is one of those aggravating negative-space statements which is just less clear than C89 was. But we can connect all of the dots: two members of a union both have well-defined memory layouts (essential in C), so the non-padding bytes common to both unions have a specified value, even when accessed through the non-active tag.

So not quite UB, although it’s possible to reach UB by, in particular, reading a padding byte from a field of another member of the union. But if all members of the union are dense, with no padding, then behavior when accessing any one through any other is implementation defined, which in practice means that it’s predictable given the known endianness of the platform.

LucasSantos91 · July 29, 2024, 9:36pm

We took a bit of a detour here, but returning to OP, it’s possible to generate the optimal machine code in two ways: storing the pointer as a const or using noalias. Godbolt

pub export fn fun2(self: *Example, idx: u32) void {
    const ptr = &self.val[idx]; 
    ptr.* -= 1;
    if(ptr.* == 0) {
        self.fun -= 1;
    }
}

pub export fn fun3(noalias self: *Example, idx: u32) void {
    self.val[idx] -= 1;
    if(self.val[idx] == 0) {
        self.fun -= 1;
    }
}

This shows that noalias also means “don’t alias yourself”.

IntegratedQuantum · July 30, 2024, 7:13am

Thanks for explaining, and thanks for the noalias trick. It works great. I think having noalias by default would be great for cases like this.

Well it was actually a good amount. In total it saved 6 seconds out of a 44 second function. And according to my profiler the second mov + cmp is actually almost as expensive as the jump:

I guess it maybe has to wait until the previous writes to memory are done before reading back the value.