Benchmarking isDigit

To clarify, I don’t claim ownership of the code I used in the example. My point was to put things into perspective, not to propose a “better” solution.

If you think your variant is objectively better, then that is a different topic to me.

The synthesis here is that hand-tuned assembly usually can’t beat compilers in real programs, even though it’s relatively easy to write better ASM than a compiler for any given code path.

The problem is that compilers treat inserted ASM as a black box, because they have to. They can’t incorporate it into whole-program analysis, and it’s rarely practical to write the entire program in ASM (embedded is an exception here; demanding work on AVR is still routinely done entirely in pure assembly).

So some code which is “just ok”, but which can be inlined, elided if it doesn’t actually get used, lifted out of hot loops, etc., is usually going to be better to have around than a razor-sharp hand-tuned ASM black box. It will also run on every platform the compiler supports, which is a major advantage.
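To illustrate the black-box point with a hypothetical sketch (GCC/Clang extended asm on x86-64, not code from this thread): both functions below compute the same thing, but the compiler cannot look inside the asm version or reason about what it does.

```c
#include <stdint.h>

/* Plain C: the compiler can inline this, constant-fold it,
 * hoist it out of loops, or delete it if the result is unused. */
static inline uint32_t add_c(uint32_t a, uint32_t b) {
    return a + b;
}

/* Hand-written x86-64 asm (GCC/Clang extended asm syntax): the
 * compiler must treat the asm statement as opaque, so it cannot
 * constant-fold it, combine it with surrounding code, or prove
 * anything about what it actually computes. */
static inline uint32_t add_asm(uint32_t a, uint32_t b) {
    __asm__("addl %1, %0" : "+r"(a) : "r"(b));
    return a;
}
```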

3 Likes

Sorry, maybe I expressed myself poorly; English isn’t my first language. The point I was trying to make is simply that the bit twiddling is not necessary in most cases, and that simpler code is usually translated better by compilers, because compilers are designed to optimize the common case. I know they aren’t perfect, and sometimes in C it’s very frustrating to see the optimizer fail to do some very basic optimization. But at the end of the day, for 90% of cases the compiler output is good enough. Especially when we take into consideration how modern hardware works and where the bottlenecks are: the number of instructions, or one or two branches, isn’t what limits most CPUs. More often than not, a good data structure, good cache locality, parallelization, etc. are going to be more impactful than three fewer instructions, particularly because of out-of-order execution.

Now, having said that, I’m sure that in some cases, on some hardware, my statement is wrong, and then yes, go for the bit twiddling and the micro-optimizations. But at that point you are rarely even touching something from the std. Which is why I was saying that it’s probably better to trust the compiler to do 80% of the job. My bad if my point didn’t come across the first time; maybe the wording didn’t exactly reflect what I had in mind.

1 Like

Sorry if my point didn’t come across; I typed it in haste. My point is that it’s not fancy bit-twiddling. y % x is equal to y & (x - 1) when x is a power of two (i.e. has a popcount of 1). That’s just a basic equivalence that doesn’t actually affect what the compiler emits either way. The reason it can return after only 3 instructions has nothing to do with the bit twiddling; it has to do with the ordering. You can move the y & 3 check to be the first thing you test, just as you can with y % 4.

If you wanted to find the best implementation for your benchmark, you’d test factored vs unfactored vs factored with mod 25 done first (branchless).
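For concreteness, here’s a hypothetical C sketch of the equivalence and the ordering effect (a stand-in, not the actual function from the benchmark; note that the % / & identity only holds as written for unsigned values):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in predicate, just to show the shape of the argument.
 * For unsigned y, y % 4 == (y & 3), so both forms below compile
 * to the same mask-and-test. */
static bool check_mod(uint32_t y) {
    if (y % 4 != 0)        /* cheap power-of-two test first: */
        return false;      /* for uniform inputs, ~75% return here */
    return y % 25 != 0;    /* the more expensive modulo runs only when needed */
}

static bool check_mask(uint32_t y) {
    if ((y & 3) != 0)      /* identical codegen to y % 4 != 0 */
        return false;
    return y % 25 != 0;
}
```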

1 Like

timestamped at 1:35:48

loosely transcribed:

When people say stuff like “make it work, make it right, make it fast”, it is incorrect, because if you weren’t thinking about how to make it fast at the “make it work” stage, you will never make it fast.

Make it work and figure out what it needs to be fast; that is the first step.
Then you make it right, then you make it fast.
That is fairly accurate.
But if you didn’t do the structural work during the first phase, to make sure that it could be accelerated later, you will be royally screwed.


I think profilers won’t help you if there is some fundamental structuring issue. They may help you find a bunch of bottlenecks, but eventually everything will just be kind of meh, where nothing is clearly “the slow part”. That doesn’t tell you whether there is another way to structure everything that would improve performance overall.
So they help with finding local problems, but not necessarily with problems that have spread throughout the entire code base.

The coz project’s causal profiler is able to tell you which parts are worth speeding up, but I think even that can’t tell you whether the parts you are measuring should be structured differently so that they can be improved. It goes further than traditional profilers only in showing which parts are the big problems and worth focusing on, not in avoiding creating problems that are difficult to fix in the first place.
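(For reference, and going from memory of coz’s API, so treat the exact macro and invocation as assumptions: you mark throughput progress points in the code and then run the binary under coz, which estimates how much speeding up each code region would improve overall progress.)

```c
#include <coz.h>  /* header shipped by the coz causal profiler (assumed name) */

static void handle_one_item(void) {
    /* ... the work whose throughput you care about ... */
    COZ_PROGRESS;  /* progress point: "one unit of work completed" */
}

int main(void) {
    for (int i = 0; i < 1000000; i++)
        handle_one_item();
    return 0;
}
```

The program is then run under something like `coz run --- ./your_program` (again, from memory), and coz’s virtual-speedup experiments report which lines are worth optimizing.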

So having a plan in mind from early on, for how you will be able to speed it up, is better.

5 Likes

I’ll add that I agree with @gonzo’s point if you’re not planning on keeping the first version of the code. If you build something with the intention of throwing it away just to learn, then I don’t think Casey’s point applies.

Where Casey’s point really applies (especially in industry) is when a first draft becomes the only version and a rewrite never happens: we can effectively get stuck with the slow thing because “there’s not enough time” or “it works well enough”. In those circumstances, I think designing for performance needs to be a front-and-center concern.

2 Likes

I think we are talking about two different scopes here.

For small-scope / quick & dirty things, “make it work / right / fast” means (to me) what everybody (here) thinks: iterate over your codebase, without any major changes in the design, until it works / it is correct / it is fast enough.

For large-scope / long-term codebases, the full iterations should also be larger-scoped: I would rather build a quick prototype (by making it work / right / fast), and then go back, throw away the whole v1 and redesign it with all the lessons learned.

Once you have done this enough times within a given domain (as I am sure Casey has with games / engines), the first “prototype” phase of “make it work / right / fast” can certainly be skipped – you are applying your hard-earned experience from previous years / projects.

Many people seem to equate “make it work / right / fast” with “don’t care about your design, the hardware will make up for any shortcomings”. That is not how I interpret “make it work / right / fast”.

1 Like

This is also a matter of interpretation. When I say “profile” I don’t necessarily mean “use a low-level sampling profiler”; I mean “measure”. It could mean using printf to quickly identify (large-scale) bottlenecks, or it could mean using an actual low-level profiler. And again, if the scope is large enough and the system merits it, this profiling might lead me to decide “good-bye v1, let’s redesign these parts of the system for v2”.
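For what it’s worth, the crudest useful form of that kind of measurement is just a pair of timestamps around a coarse-grained step; a minimal sketch (the step name is a placeholder):

```c
#include <stdio.h>
#include <time.h>

/* Placeholder for whatever large-scale step is being measured. */
static void do_expensive_step(void) { /* ... */ }

int main(void) {
    struct timespec t0, t1;

    /* Crude "printf profiling": wall-clock the step and print it.
     * Often enough to spot large-scale bottlenecks before reaching
     * for a real sampling profiler. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do_expensive_step();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("do_expensive_step: %.3f ms\n", ms);
    return 0;
}
```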