Emitting advanced opcodes vs. compatibility

What would you go for in a situation where you want to ship binaries to end users and you want advanced CPU instructions (SIMD) to be used whenever they’re available, yet without breaking the binary for users without them?

Say you’re developing an application that does a lot of matrix / vector multiplication. It could really benefit from those advanced SIMD instructions modern CPUs have. So you use @Vector, @mulAdd and all that good stuff Zig provides (yay!).
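For concreteness, such a primitive might look something like this (a minimal sketch in recent Zig; matVec4 and the column-major layout are just illustrative choices):

```zig
const std = @import("std");

const Vec4 = @Vector(4, f32);

// 4x4 matrix * 4-vector, column-major: each matrix column is one SIMD vector.
// With AVX/FMA enabled, @mulAdd lowers to fused multiply-add instructions.
fn matVec4(cols: [4]Vec4, v: [4]f32) Vec4 {
    var acc: Vec4 = @splat(0.0);
    inline for (0..4) |i| {
        const s: Vec4 = @splat(v[i]);
        acc = @mulAdd(Vec4, cols[i], s, acc);
    }
    return acc;
}

test "identity matrix times vector" {
    const id = [4]Vec4{
        .{ 1, 0, 0, 0 },
        .{ 0, 1, 0, 0 },
        .{ 0, 0, 1, 0 },
        .{ 0, 0, 0, 1 },
    };
    const out = matVec4(id, .{ 1, 2, 3, 4 });
    try std.testing.expectEqual(@as(f32, 2), out[1]);
}
```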

However, if you build for a CPU that has AVX, AVX2, FMA, etc., your binary won’t run on older / low-end CPUs (Pentiums, Celerons) that don’t have the goodies.

I guess you could keep several versions of the matrix multiplication, etc., functions in your code, one compiled for advanced CPUs and one for less advanced ones, and call them via pointers, probably. But that would add a level of indirection, slowing the calls, and the compiler would not be able to inline functions called through pointers.

So I wonder - is there a “best practices” approach to this?

1 Like

Some time ago I didn’t yet know that the Zig compiler uses the specific instructions of the CPU it is running on… and reported a bug :slight_smile:, see

2 Likes

Thank you for posting.
My goal, though, is to be able to emit a universal “best-of-both-worlds” binary.

I think currently there are only two options:

  • compile for generic target and ship this binary
  • let people compile themselves if they want more performance
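For reference, here’s what those two options look like on the command line (main.zig is a hypothetical entry point; the flags themselves are standard Zig CLI):

```sh
# Option 1: ship a baseline build that runs on any x86_64 CPU
zig build-exe main.zig -O ReleaseFast -target x86_64-linux -mcpu=baseline

# Option 2: a user rebuilds for their own machine
zig build-exe main.zig -O ReleaseFast -mcpu=native
```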

Well, I can see a third option:

Gather all performance-sensitive “primitives”, like matrix x matrix multiplication, matrix x vector multiplication, etc., in a dedicated Zig source file (file1.zig). Make a copy of this file (as file2.zig) and rename the functions in the second file so that the names don’t clash.

Build your project targeting the “lowest common denominator” CPU, but for the second file, target a “fancy” CPU with the SIMD instructions that improve performance; see the build sketch below.
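A build.zig sketch of that setup (API roughly as in Zig 0.12/0.13; names like file2.zig and simd_v3 are just this thread’s examples). file2.zig would export its functions with export fn so the baseline module can declare them extern and take their addresses:

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const optimize = b.standardOptimizeOption(.{});

    // Lowest common denominator: generic x86_64, no AVX.
    const baseline = b.resolveTargetQuery(.{ .cpu_arch = .x86_64 });

    // Fancy: x86-64-v3 adds AVX, AVX2, FMA, BMI, etc.
    const fancy = b.resolveTargetQuery(.{
        .cpu_arch = .x86_64,
        .cpu_model = .{ .explicit = &std.Target.x86.cpu.x86_64_v3 },
    });

    // Main program (imports file1.zig), built for the baseline CPU.
    const exe = b.addExecutable(.{
        .name = "app",
        .root_source_file = b.path("src/main.zig"),
        .target = baseline,
        .optimize = optimize,
    });

    // The renamed copies of the primitives, built for the fancy CPU
    // as a separate object and linked into the same binary.
    const simd = b.addObject(.{
        .name = "simd_v3",
        .root_source_file = b.path("src/file2.zig"),
        .target = fancy,
        .optimize = optimize,
    });
    exe.addObject(simd);

    b.installArtifact(exe);
}
```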

When your program starts, check CPU instruction availability at runtime. If the necessary SIMD instructions are available, initialize function pointers with the addresses of the functions in the second file. If not, initialize these pointers with the addresses of the functions in the first file.

From that point on, call the functions via the function pointers only; do not call them directly.
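Here’s a self-contained sketch of the runtime side. Both implementations live in one file only to keep the example short; in the real setup they would come from file1.zig and file2.zig as described above. The CPUID snippet checks leaf 7 / EBX bit 5 for AVX2; a production check should also verify OSXSAVE and XGETBV so the OS actually saves YMM state:

```zig
const std = @import("std");

// Stand-ins for the two builds of the same primitive; in the real setup
// these would come from file1.zig (baseline) and file2.zig (x86_64_v3).
fn dotBaseline(a: []const f32, b: []const f32) f32 {
    var sum: f32 = 0;
    for (a, b) |x, y| sum += x * y;
    return sum;
}

fn dotFancy(a: []const f32, b: []const f32) f32 {
    // Same algorithm; when compiled for x86_64_v3 the optimizer
    // is free to use AVX/FMA here.
    var sum: f32 = 0;
    for (a, b) |x, y| sum = @mulAdd(f32, x, y, sum);
    return sum;
}

// The one level of indirection: a global function pointer, set once at startup.
var dot: *const fn ([]const f32, []const f32) f32 = &dotBaseline;

const CpuidLeaf = struct { eax: u32, ebx: u32, ecx: u32, edx: u32 };

fn cpuid(leaf: u32, subleaf: u32) CpuidLeaf {
    var eax: u32 = undefined;
    var ebx: u32 = undefined;
    var ecx: u32 = undefined;
    var edx: u32 = undefined;
    asm volatile ("cpuid"
        : [_] "={eax}" (eax),
          [_] "={ebx}" (ebx),
          [_] "={ecx}" (ecx),
          [_] "={edx}" (edx),
        : [_] "{eax}" (leaf),
          [_] "{ecx}" (subleaf),
    );
    return .{ .eax = eax, .ebx = ebx, .ecx = ecx, .edx = edx };
}

// CPUID leaf 7, subleaf 0: EBX bit 5 = AVX2 (x86/x86_64 only, of course).
fn hasAvx2() bool {
    return (cpuid(7, 0).ebx & (1 << 5)) != 0;
}

pub fn main() void {
    if (hasAvx2()) dot = &dotFancy; // pick the fast path once, at startup
    const a = [_]f32{ 1, 2, 3, 4 };
    const b = [_]f32{ 4, 3, 2, 1 };
    std.debug.print("dot = {d}\n", .{dot(&a, &b)});
}
```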

2 Likes

You know what, I just had some shower thoughts, and perhaps the best option would be to ship two binaries, one for low-end CPUs, one for advanced CPUs.

I would do a runtime CPU check at startup, either telling the user to get the “simple” version if advanced instructions are not found, or to get the “advanced” version if they’re running the “simple” version on an advanced CPU.
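In the “simple” binary that check could be as small as this (hasAvx2 as in the CPUID sketch above, stubbed out here for brevity):

```zig
const std = @import("std");

// See the cpuid()-based implementation sketched earlier in the thread.
fn hasAvx2() bool {
    return false; // stub
}

pub fn main() void {
    if (hasAvx2()) {
        std.debug.print(
            "Your CPU supports AVX2; consider the 'advanced' build for better performance.\n",
            .{},
        );
    }
    // ... proceed with the baseline code paths ...
}
```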

Making them compile the software would be a deal-breaker for non-IT people, I’m afraid.

1 Like

There’s an issue for doing feature detection at startup and selecting the appropriate version of a function based on whether AVX is supported or not, for example. IIRC the proposal has also already been accepted.

1 Like

Yes, this is pretty much what I want :slight_smile:

However, this function versioning would still have to be implemented via function pointers “under the hood”, right?
I mean, I hope I’m missing something and there’s a better way.

No, there are a few ways to implement it. For example, each call to the multi-versioned functions could be patched on program startup to point to the appropriate one.

Here’s the GCC documentation on how they do it. In summary it’s with IFUNCs. There are pros and cons to this way of doing it. It’s probably a safe null hypothesis, but I’m sure we’ll at least consider some other possibilities before committing to it.

2 Likes

Hm, self-modifying code. I like this, but I wonder how modern operating systems / AV software would view such binaries.

1 Like

BTW, what would be a good baseline x86_64 CPU to target?
I don’t think keeping 32-bit CPU compatibility for computationally intensive applications is prudent.

So, I was considering Pentium4 or Athlon64, but it looks like they do have mutually incompatible features.

Pentium4 has MMX, which Athlon64 lacks (or seems to lack, according to Zig; see the update below).
Athlon64 sports “3dnowa” and “64bit”, which are not found in Pentium4.

Because of the 3dnow / 3dnowa instructions, found neither in old Intel CPUs nor in any modern CPU, Athlon64 doesn’t seem to be a good baseline CPU.

On the other hand, the MMX instructions found in Pentium4 (as well as in P3, P2 and pentium_mmx, going backwards) appear on the AMD side only in k6, are missing in k6_2, k6_3 and so on, and only reappear at znver1 (1st-gen Ryzen). This is according to the Zig compiler; see the update below.

So if I go with Pentium4, I risk being incompatible with a broad swath of AMD processors.

x86_64, which looks like it could be “A generic CPU with 64-bit extensions”, lists MMX as enabled.

So, what is left? The simple, non-MMX pentium?

Update:
Strangely enough, while https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html lists MMX for many AMD CPUs, and AMD claim MMX compatibility for the Athlon64, std.Target.Cpu.Arch.x86_64 does not list mmx as enabled for CPUs between k6 and znver1.
Is it because MMX is slow on AMD? I’m speculating here; I have no idea.

Wikipedia lists MMX support starting with the first version of generic x86-64.
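If you want to poke at these tables yourself, Zig’s target database is queryable from ordinary code (API as in recent std; model names follow LLVM). One caveat that may explain the discrepancy: a model’s raw feature list doesn’t include features pulled in through dependencies (in LLVM, 3dnow implies mmx, if I remember right), while toCpu() expands them:

```zig
const std = @import("std");

pub fn main() void {
    const Entry = struct { name: []const u8, model: *const std.Target.Cpu.Model };
    const entries = [_]Entry{
        .{ .name = "pentium4", .model = &std.Target.x86.cpu.pentium4 },
        .{ .name = "athlon64", .model = &std.Target.x86.cpu.athlon64 },
        .{ .name = "x86_64", .model = &std.Target.x86.cpu.x86_64 },
        .{ .name = "znver1", .model = &std.Target.x86.cpu.znver1 },
    };
    for (entries) |e| {
        // Raw per-model list vs. the set with implied features populated.
        const raw = std.Target.x86.featureSetHas(e.model.features, .mmx);
        const full = std.Target.x86.featureSetHas(e.model.toCpu(.x86_64).features, .mmx);
        std.debug.print("{s}: mmx raw={} expanded={}\n", .{ e.name, raw, full });
    }
}
```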

Update 2:
-mcpu=pentium4 causes my f32 vector-manipulating code to break compilation with
LLVM ERROR: 64-bit code requested on a subtarget that doesn't support it!.
At first I couldn’t fathom what made the compiler think I “requested 64-bit code”; presumably it’s because pentium4 is a 32-bit model (it lacks the 64bit feature), so it can’t serve as the CPU of an x86_64 target at all.

Anyway, I guess x86_64 should be OK for the baseline target, and something like x86_64_v3 (or v2) for the advanced target.