I have this struct down here, which works fine.
But I'm wondering whether it's possible to replace the two fields with a single i32 value to speed up the operations, putting mg into the lower 16 bits and eg into the higher 16 bits.
Doing that, I'd need two extra functions: `fn mg(self: Pair) i16` and `fn eg(self: Pair) i16`.
The math operations (`inc`, `dec`, `sub`, `add`) will be called many, many millions of times.
The two new functions much less often (around 0.1% of that).
That said: mg and eg will never overflow or underflow (by design).
So I'm wondering whether there will be a speedup before diving into the bit tricks…
Or will this be "SIMDed" automatically by Zig?
I don't think vectorizing would help here: 32 bits comfortably fit into a single general-purpose register, so the overhead of setting up the SIMD registers and extracting only the needed 32 bits probably outweighs any benefit.
Looking at the generated assembly, the compiler seems to be pretty smart about everything and basically already does what you've suggested: it's using a single register and just shifting things around a bit. `inc` and `dec` look pretty optimal. A similar approach for `add` and `sub` might be a little faster or slower than the bitmasking the compiler does, but you'd have to benchmark that; it's probably equally fast.
I would look into performing operations on multiple pairs in batches if you want a potential SIMD speedup, though the compiler probably already does that for you if you perform an operation in a loop.
This seems like a good use case for a packed struct. Note that it differs from extern struct in memory layout: the field layout of packed struct is logically low to high, which means it is always equivalent to `(@as(i32, eg) << 16) | mg`. extern struct is always laid out from low byte to high byte, so it may be represented differently from packed struct on big-endian systems.
That’s called SWAR (SIMD within a register), usually you’d keep one bit reserved between items to stop overflow from propagating into the next item:
It gets tricky for underflows, though… and whether it's worth the hassle for just two operations is very questionable: as long as there are no interdependencies, a modern superscalar CPU should be able to run a couple of such simple operations in parallel anyway.