Emitting advanced opcodes vs. compatibility

What would you go for in a situation where you want to ship binaries to end users and you want advanced CPU instructions (SIMD) to be used whenever they’re available, yet without breaking the binary for users without them?

Say you’re developing an application that does a lot of matrix / vector multiplication. It could really benefit from those advanced SIMD instructions modern CPUs have. So you use @Vector, @mulAdd and all that good stuff Zig provides (yay!).
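For concreteness, such a primitive might look something like this (a minimal sketch in recent Zig; matVec4 and the column-major layout are just illustrative choices):

```zig
const std = @import("std");

const Vec4 = @Vector(4, f32);

// 4x4 matrix * 4-vector, column-major: each matrix column is one SIMD vector.
// With AVX/FMA enabled, @mulAdd lowers to fused multiply-add instructions.
fn matVec4(cols: [4]Vec4, v: [4]f32) Vec4 {
    var acc: Vec4 = @splat(0.0);
    inline for (0..4) |i| {
        const s: Vec4 = @splat(v[i]);
        acc = @mulAdd(Vec4, cols[i], s, acc);
    }
    return acc;
}

test "identity matrix times vector" {
    const id = [4]Vec4{
        .{ 1, 0, 0, 0 },
        .{ 0, 1, 0, 0 },
        .{ 0, 0, 1, 0 },
        .{ 0, 0, 0, 1 },
    };
    const out = matVec4(id, .{ 1, 2, 3, 4 });
    try std.testing.expectEqual(@as(f32, 2), out[1]);
}
```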

However, if you build for a CPU that has AVX, AVX2, FMA, etc., your binary won’t run on older / low-end CPUs (Pentiums, Celerons) that don’t have the goodies.

I guess you could keep several versions of the matrix multiplication, etc., functions in your code, one compiled for advanced CPUs and one for less advanced ones, and call them via pointers, probably. But that would add a level of indirection, slowing the calls, and the compiler would not be able to inline functions called through pointers.

So I wonder - is there a “best practices” approach to this?

1 Like

Some time ago I didn’t yet know that the Zig compiler uses the specific instructions of the CPU it is running on… and reported a bug :slight_smile:, see

2 Likes

Thank you for posting.
My goal, though, is to be able to emit a universal “best-of-both-worlds” binary.

I think currently there are only two options:

  • compile for generic target and ship this binary
  • let people compile themselves if they want more performance
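For reference, here’s what those two options look like on the command line (main.zig is a hypothetical entry point; the flags themselves are standard Zig CLI):

```sh
# Option 1: ship a baseline build that runs on any x86_64 CPU
zig build-exe main.zig -O ReleaseFast -target x86_64-linux -mcpu=baseline

# Option 2: a user rebuilds for their own machine
zig build-exe main.zig -O ReleaseFast -mcpu=native
```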

Well, I can see a third option:

Gather all performance-sensitive “primitives”, like matrix x matrix multiplication, matrix x vector multiplication, etc., in a dedicated Zig source file (file1.zig). Make a copy of this file (as file2.zig) and rename the functions in the second file so that the names don’t clash.

Build your project targeting the “lowest common denominator” CPU, but for the second file, target a “fancy” CPU with the SIMD instructions that improve performance; see the build sketch below.
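A build.zig sketch of that setup (API roughly as in Zig 0.12/0.13; names like file2.zig and simd_v3 are just this thread’s examples). file2.zig would export its functions with export fn so the baseline module can declare them extern and take their addresses:

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const optimize = b.standardOptimizeOption(.{});

    // Lowest common denominator: generic x86_64, no AVX.
    const baseline = b.resolveTargetQuery(.{ .cpu_arch = .x86_64 });

    // Fancy: x86-64-v3 adds AVX, AVX2, FMA, BMI, etc.
    const fancy = b.resolveTargetQuery(.{
        .cpu_arch = .x86_64,
        .cpu_model = .{ .explicit = &std.Target.x86.cpu.x86_64_v3 },
    });

    // Main program (imports file1.zig), built for the baseline CPU.
    const exe = b.addExecutable(.{
        .name = "app",
        .root_source_file = b.path("src/main.zig"),
        .target = baseline,
        .optimize = optimize,
    });

    // The renamed copies of the primitives, built for the fancy CPU
    // as a separate object and linked into the same binary.
    const simd = b.addObject(.{
        .name = "simd_v3",
        .root_source_file = b.path("src/file2.zig"),
        .target = fancy,
        .optimize = optimize,
    });
    exe.addObject(simd);

    b.installArtifact(exe);
}
```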

When your program starts, check CPU instruction availability at runtime. If the necessary SIMD instructions are available, initialize function pointers with the addresses of the functions in the second file. If not, initialize these pointers with the addresses of the functions in the first file.

From that point on, call the functions via the function pointers only; do not call them directly.
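Here’s a self-contained sketch of the runtime side. Both implementations live in one file only to keep the example short; in the real setup they would come from file1.zig and file2.zig as described above. The CPUID snippet checks leaf 7 / EBX bit 5 for AVX2; a production check should also verify OSXSAVE and XGETBV so the OS actually saves YMM state:

```zig
const std = @import("std");

// Stand-ins for the two builds of the same primitive; in the real setup
// these would come from file1.zig (baseline) and file2.zig (x86_64_v3).
fn dotBaseline(a: []const f32, b: []const f32) f32 {
    var sum: f32 = 0;
    for (a, b) |x, y| sum += x * y;
    return sum;
}

fn dotFancy(a: []const f32, b: []const f32) f32 {
    // Same algorithm; when compiled for x86_64_v3 the optimizer
    // is free to use AVX/FMA here.
    var sum: f32 = 0;
    for (a, b) |x, y| sum = @mulAdd(f32, x, y, sum);
    return sum;
}

// The one level of indirection: a global function pointer, set once at startup.
var dot: *const fn ([]const f32, []const f32) f32 = &dotBaseline;

const CpuidLeaf = struct { eax: u32, ebx: u32, ecx: u32, edx: u32 };

fn cpuid(leaf: u32, subleaf: u32) CpuidLeaf {
    var eax: u32 = undefined;
    var ebx: u32 = undefined;
    var ecx: u32 = undefined;
    var edx: u32 = undefined;
    asm volatile ("cpuid"
        : [_] "={eax}" (eax),
          [_] "={ebx}" (ebx),
          [_] "={ecx}" (ecx),
          [_] "={edx}" (edx),
        : [_] "{eax}" (leaf),
          [_] "{ecx}" (subleaf),
    );
    return .{ .eax = eax, .ebx = ebx, .ecx = ecx, .edx = edx };
}

// CPUID leaf 7, subleaf 0: EBX bit 5 = AVX2 (x86/x86_64 only, of course).
fn hasAvx2() bool {
    return (cpuid(7, 0).ebx & (1 << 5)) != 0;
}

pub fn main() void {
    if (hasAvx2()) dot = &dotFancy; // pick the fast path once, at startup
    const a = [_]f32{ 1, 2, 3, 4 };
    const b = [_]f32{ 4, 3, 2, 1 };
    std.debug.print("dot = {d}\n", .{dot(&a, &b)});
}
```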

2 Likes

You know what, I just had some shower thoughts, and perhaps the best option would be to ship two binaries, one for low-end CPUs, one for advanced CPUs.

I would do a runtime CPU check at startup, either telling the user to get the “simple” version if advanced instructions are not found, or to get the “advanced” version if they’re running the “simple” version on an advanced CPU.
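In the “simple” binary that check could be as small as this (hasAvx2 as in the CPUID sketch above, stubbed out here for brevity):

```zig
const std = @import("std");

// See the cpuid()-based implementation sketched earlier in the thread.
fn hasAvx2() bool {
    return false; // stub
}

pub fn main() void {
    if (hasAvx2()) {
        std.debug.print(
            "Your CPU supports AVX2; consider the 'advanced' build for better performance.\n",
            .{},
        );
    }
    // ... proceed with the baseline code paths ...
}
```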

Making them compile the software would be a deal-breaker for non-IT people, I’m afraid.

1 Like

There’s an issue for doing feature detection at startup and selecting the appropriate version of a function based on whether AVX is supported or not, for example. IIRC the proposal has also already been accepted.

1 Like

Yes, this is pretty much what I want :slight_smile:

However, this function versioning would still have to be implemented via function pointers “under the hood”, right?
I mean, I hope I’m missing something and there’s a better way.

No, there are a few ways to implement it. For example, each call to the multi-versioned functions could be patched on program startup to point to the appropriate one.

Here’s the GCC documentation on how they do it. In summary it’s with IFUNCs. There are pros and cons to this way of doing it. It’s probably a safe null hypothesis, but I’m sure we’ll at least consider some other possibilities before committing to it.

2 Likes

Hm, self-modifying code. I like this, but I wonder how modern operating systems / AV software would view such binaries.

1 Like

BTW, what would be a good baseline x86_64 CPU to target?
I don’t think keeping 32-bit CPU compatibility for computationally intensive applications is prudent.

So, I was considering Pentium4 or Athlon64, but it looks like they do have mutually incompatible features.

Pentium4 has MMX, which Athlon64 lacks (or seems to lack, according to Zig; see the update below).
Athlon64 sports “3dnowa” and “64bit”, which are not found in Pentium4.

Because of the 3dnow / 3dnowa instructions, found neither in old Intel CPUs nor in any modern CPU, Athlon64 doesn’t seem to be a good baseline CPU.

On the other hand, the MMX instructions found in Pentium4 (as well as in P3, P2 and pentium_mmx, going backwards) appear on the AMD side only in k6, are missing in k6_2, k6_3 and so on, and only reappear at znver1 (1st-gen Ryzen). This is according to the Zig compiler; see the update below.

So if I go with Pentium4, I risk being incompatible with a broad swath of AMD processors.

x86_64, which looks like it could be “A generic CPU with 64-bit extensions”, lists MMX as enabled.

So, what is left? The simple, non-MMX pentium?

Update:
Strangely enough, while https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html lists MMX for many AMD CPUs, and AMD claim MMX compatibility for the Athlon64, std.Target.Cpu.Arch.x86_64 does not list mmx as enabled for CPUs between k6 and znver1.
Is it because MMX is slow on AMD? I’m speculating here; I have no idea.

Wikipedia lists MMX support starting with the first version of generic x86-64.
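If you want to poke at these tables yourself, Zig’s target database is queryable from ordinary code (API as in recent std; model names follow LLVM). One caveat that may explain the discrepancy: a model’s raw feature list doesn’t include features pulled in through dependencies (in LLVM, 3dnow implies mmx, if I remember right), while toCpu() expands them:

```zig
const std = @import("std");

pub fn main() void {
    const Entry = struct { name: []const u8, model: *const std.Target.Cpu.Model };
    const entries = [_]Entry{
        .{ .name = "pentium4", .model = &std.Target.x86.cpu.pentium4 },
        .{ .name = "athlon64", .model = &std.Target.x86.cpu.athlon64 },
        .{ .name = "x86_64", .model = &std.Target.x86.cpu.x86_64 },
        .{ .name = "znver1", .model = &std.Target.x86.cpu.znver1 },
    };
    for (entries) |e| {
        // Raw per-model list vs. the set with implied features populated.
        const raw = std.Target.x86.featureSetHas(e.model.features, .mmx);
        const full = std.Target.x86.featureSetHas(e.model.toCpu(.x86_64).features, .mmx);
        std.debug.print("{s}: mmx raw={} expanded={}\n", .{ e.name, raw, full });
    }
}
```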

Update 2:
-mcpu=pentium4 causes my f32 vector-manipulating code to break compilation with
LLVM ERROR: 64-bit code requested on a subtarget that doesn't support it!.
At first I couldn’t fathom what made the compiler think I “requested 64-bit code”; presumably it’s because pentium4 is a 32-bit model (it lacks the 64bit feature), so it can’t serve as the CPU of an x86_64 target at all.

Anyway, I guess x86_64 should be OK for the baseline target, and something like x86_64_v3 (or v2) for the advanced target.