Unexpected code generation with short-circuit evaluation

Hey guys, I didn’t have much success on the Discord, so I’ll try here :grinning:

Not long ago, I was messing around with Godbolt and noticed something that felt weird. Here is the link to the sample: Compiler Explorer

Basically, I wanted to see whether code written to be easier to debug still produces the same output when compiled in ReleaseXX mode. Here are some examples so that you better understand what I mean (the examples are dummies, but I had a real use case, so the question is still valid):

// Harder to debug because the intermediary variables are not observable
fn compareHard(a: T, b: T) bool {
  return a.bypass or (a.field1.eql(b.field1) and a.field2.eql(b.field2));
}

// Easier to debug because the intermediary variables are observable
fn compareEasy(a: T, b: T) bool {
  const are_field1_equals = a.field1.eql(b.field1);
  const are_field2_equals = a.field2.eql(b.field2);
  return a.bypass or (are_field1_equals and are_field2_equals);
}

The compareHard version produces what you would expect, with an early return when a.bypass is true. However, the compareEasy version produces code that calls the eql functions first and only then evaluates the final condition, even in Release mode. I would expect the compiler to be able to reduce the two functions to the same assembly, but apparently not. Any idea what I could do to get the result I want?

Honestly, I’m also interested in an explanation of what is going on in the compiler. I have the feeling that this is an optimization we could expect from it, and I’m curious about your thoughts as well!

The compiler has no way to know whether the functions you call during the compare have side effects or not.

1 Like

The compiler likely hesitates to optimize compareEasy like compareHard because it can’t assume eql is free of side effects. Purity matters because it allows the compiler to make assumptions for optimization, like reordering or eliminating calls. As far as I know, there’s no explicit way to mark a function as pure to enable more aggressive optimizations. Maybe you could try marking the eql functions you’ve implemented as inline and see if that changes anything?
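
For illustration, a minimal sketch of what that could look like, with a made-up Inner type and field that are not from the original Godbolt sample:

const Inner = struct {
  x: u32,

  // Marking eql as `inline` forces it to be inlined at every call site,
  // which may let the optimizer see that it has no side effects.
  inline fn eql(self: Inner, other: Inner) bool {
    return self.x == other.x;
  }
};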

2 Likes

I tend to design my code this way, so I may be biased, but I see no problem with:

// Still easy to debug
fn compareEasy(a: T, b: T) bool {
  if (a.bypass) {
    return true;
  }
  const are_field1_equals = a.field1.eql(b.field1);
  const are_field2_equals = a.field2.eql(b.field2);
  return are_field1_equals and are_field2_equals;
}

1 Like

Hmm, I didn’t think about that. It’s obvious to me because I know that eql doesn’t have side effects, but not so much for the compiler.

Just after posting the message, I thought about writing it in a more “straightforward” way and basically came up with what @alp mentioned. That’s what I love about Godbolt: it’s an invaluable tool for calibrating ourselves to what the actually executed code will look like and for building a feel for the best way to write code that doesn’t waste time.

Anyway, I think that answers my question, thanks all !

1 Like

I saw a YouTube video recently of Jonathan Blow describing his vision for how his Jai compiler should eventually have its own backends that can be used instead of LLVM. In it, he said there are two types of consumers of compilers:

The first type is the person who thinks the compiler should be able to give you whatever assembly it wants, so long as the effects are the same. To this person, how they write the code is mainly a matter of what looks nicest; they may not know that much about how to make a program efficient, or they don’t care. They just want whatever free performance gains the compiler can offer them.

The second type is the person who pretty much knows what an efficient way to accomplish their computation looks like, and really just wants the compiler to do all the tedious, mundane optimizations, like finding the best combination of instructions that maps to what you expressed in your code. Of course, you might only know what you’re doing once you’ve written similar code before, but the point remains the same.

I personally prefer to be in the second category, and it’s nice for me when I can do my own experiments on code I want to run as fast as possible. I would expect/hope that your two code samples compile differently, as you are specifying (IMO) that the two potentially unnecessary statements should be executed unconditionally, regardless of need. This is a form of software speculation, which can be a performance win in some cases and a loss in others.

Others have touched on the issue of what the compiler can assume/prove for your particular example, but I want to touch on something more general:

One thing to keep in mind is that the compiler does not actually know the best way to compile all code; it simply uses heuristics and tries to apply optimizations that are good on average. In some scenarios, however, these heuristics backfire. Current compilers are also incapable of asking the programmer for more information, whether about an optimization they are almost able to prove is valid, or about whether they should do an optimization in the first place.

Aggressive auto-SIMDizing and loop unrolling can get you major performance wins in many cases; however, imagine such techniques applied to a function that expects very small inputs. Let’s say I have a “sum” function that adds all the numbers in a slice together. The compiler might choose to auto-SIMDize and unroll the SIMDized loop 4 or 8 times. On massive inputs this gives a major speedup; however, if it so happens that I’m often passing slices smaller than or just barely over the hardware vector size, the optimization will actually make my code run much slower and needlessly bloat my binary, which means my instruction caches are going to be needlessly bloated too.
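
For concreteness, here is a minimal sketch of the kind of “sum” function I mean (the u64 element type is an assumption on my part):

fn sum(values: []const u64) u64 {
  var total: u64 = 0;
  // In a Release build, the backend may auto-vectorize and unroll this
  // loop, which helps large slices but adds overhead for tiny ones.
  for (values) |v| total += v;
  return total;
}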

I want more choice in these matters. I want to run my own tests and decide for myself how the final binary should be optimized. In the case of SIMD, if I were writing the “sum” function and wanted it to be SIMD, I’d write it as such. There are instances where I think auto-SIMDization should occur, primarily in cases where you have a bunch of statements that can be smushed into one or a few vector instructions. Most of the time, though, if I wanted SIMD, I’d write it that way. When it comes to unrolling loops, I want to choose how much. I don’t like that changing whitespace or a comment sometimes leads to a very different compilation, with loops unrolled by different amounts and different performance characteristics. I also don’t like it when I tell the compiler to do an optimization and it automatically de-optimizes it back to what it was before; I can’t tell whether my change will make a difference down the road, when the code does more things and the compiler finally goes along with my “optimization”, which might be a win or a loss.
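
To make “I’d write it as SIMD myself” concrete, here is a rough sketch, assuming a vector width of 8 lanes that I, the programmer, chose:

fn sumSimd(values: []const u64) u64 {
  const lanes = 8; // width chosen by the programmer, not the compiler
  const Vec = @Vector(lanes, u64);
  var acc: Vec = @splat(0);
  var i: usize = 0;
  // Vector loop over whole chunks of `lanes` elements.
  while (i + lanes <= values.len) : (i += lanes) {
    const chunk: Vec = values[i..][0..lanes].*;
    acc += chunk;
  }
  var total: u64 = @reduce(.Add, acc);
  // Scalar loop for the leftover elements.
  while (i < values.len) : (i += 1) total += values[i];
  return total;
}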

I can understand the appeal of being in the first group I mentioned, but I heavily prefer having more control, and to me low-level control is one of the most important features of Zig. Therefore, I think we should argue for fewer optimizations for Zig code in the situations I’ve mentioned, not more. I think your example is another good example of when optimizations should not happen automatically.

2 Likes

I think there is a third group: those who want the compilation itself to be as fast as possible, even if the resulting binary is larger or slower – for example, when prototyping something.

This is where Zig’s plan for its own backends, which enables incremental compilation (and pretty much requires leaving room in the binary for the deltas), comes in, and I am very excited about it. I am happy to use the LLVM backend to generate a production build when I want the fastest / smallest / bestest binary.

2 Likes

Yes, build speed is an important feature for prototyping. Eventually, I think Zig might try to compete on the optimization front too though. I still think you kind of side-stepped my overarching point.

LLVM does not produce the “fastest / smallest / bestest binary”. For one, GCC still does better in a lot of cases. And going back to my main point, compilers make decisions based on incomplete information; at the end of the day, the compiler is simply guessing what might be more advantageous for you. Like I said about the sum example, the compiler often does not know how such a thing should be optimized.

In the future, I think such auto-SIMDization optimizations should be provided as suggestions to modify the (Zig) source code, NOT something that happens behind programmers’ backs. This would also enable us to suggest changes that shift the semantics slightly. For example, we could suggest saturating addition rather than (implicit) modulo, treat the maximum integer as an overflow, check for it at the end, and return error.Overflow. Semantically, this decreases our universe size by 1, i.e. we can no longer represent the valid case where all the numbers legitimately add up to the maximum possible integer. However, this strategy lets us maintain safety while still getting the performance of SIMD, so it is definitely worth it. We could also suggest aligning slice sizes so that we don’t need a scalar remainder loop alongside our SIMD loop. Neither of these optimizations can be done automatically by the compiler.
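
As a rough illustration of the saturate-then-check idea (my own sketch, not something the compiler suggests today):

const std = @import("std");

fn sumOrOverflow(values: []const u64) error{Overflow}!u64 {
  var total: u64 = 0;
  // Saturating adds (+|=) keep the loop branch-free and SIMD-friendly.
  for (values) |v| total +|= v;
  // One check at the end: treat a saturated/maximum result as overflow,
  // giving up the single case where the sum is legitimately maxInt(u64).
  if (total == std.math.maxInt(u64)) return error.Overflow;
  return total;
}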

With regards to the optimizations that a backend can perform automatically, I think the ultimate goal should be to strip the magic optimizations from the compiler, and allow the programmer to make more informed decisions about how to get better safety and performance. And this should result in dramatically better compile times too. Just because a program is being optimized doesn’t mean it has to take a super long time to compile.

2 Likes

I think you have a lot of interesting input here and I’d like to respond to it. May I suggest we continue this discussion on a new thread? This thread is currently solved and I think we can have a more robust discussion in the brainstorming category if we start a dedicated thread there.

An interesting title might be “What should your compiler do for you?” or something of that nature.

More than that, I was stating my opinion on what you quoted Jonathan Blow as saying with regard to there being “two types of consumers of compilers”. I rather agree with your point in general.

Don’t get me wrong, I definitely agree with you and I’d even say that I’m generally in the second category you describe.

The way I see it, my question is just a side effect of what we are used to in other languages. Similar optimizations are often done there, and you kind of end up expecting them to happen. But in the end, I agree that it’s not necessarily a good thing. The snippet I ended up with is still simple and has the advantage of being closer to how things should be done by default (because it roughly sticks to the way you would write it if you were writing assembly directly). So, having a compiler that behaves this way is a good incentive to write code that is efficient without being particularly harder to read or more complex :smile:

Another point I thought of later is that software speculation can be considered a security feature, even if it’s slower. There’s a whole subfield of computer security that studies timing attacks and how inferences can be made about data by observing performance characteristics. In this case, a.bypass is observable to callers because the early bailout makes such a large performance difference.

It probably doesn’t matter in this case, or in most cases, but imagine a naive implementation of a password or passkey matcher that goes character by character, returning false as soon as it finds a mismatch and true only if it makes it to the end. This is a scenario where the closer I get to the correct answer, the longer it takes to validate the answer. So I just feed in a few random characters until I see that validation takes slightly longer, then I build on that until it takes slightly longer again, until eventually I’ve inferred the secret data with complete accuracy. It’s literally a game of “you’re getting warmer” or “you’re getting colder”.
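
A sketch of that naive matcher (the names are made up):

fn naiveSecretEql(secret: []const u8, guess: []const u8) bool {
  if (secret.len != guess.len) return false;
  for (secret, guess) |s, g| {
    // Early return: guesses sharing a longer prefix with the secret
    // take measurably longer to reject.
    if (s != g) return false;
  }
  return true;
}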

How do you protect against timing attacks? Well, one thing you can do is something like the change you made to your code. You can speculatively/eagerly do work that is conditionally unnecessary, to get closer to doing the same amount of work every time, regardless of the input.
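
A sketch of that eager, timing-resistant approach (again with made-up names; real code should prefer a vetted constant-time comparison from a crypto library):

fn constantTimeEql(secret: []const u8, guess: []const u8) bool {
  if (secret.len != guess.len) return false;
  var diff: u8 = 0;
  // Always walk the whole buffer and accumulate differences, so the
  // runtime does not depend on where the first mismatch occurs.
  for (secret, guess) |s, g| diff |= s ^ g;
  return diff == 0;
}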

This is another reason why I don’t want your code to be automatically optimized behind the programmer’s back! I think more optimizations should be suggestions to change the code, and not performed by the compiler itself.