What differences in performance can I expect from using either vectors or arrays?

markus · October 3, 2024, 5:36pm

I am trying to make a little game using raylib and made my own vector type with functions such as mul, add, magnitude, etc.
The type is structured like this:

pub fn Vector(comptime T: type, comptime length: comptime_int) type {
    return struct {
        const Self = @This();

        vec: [length]T,

        pub fn init(vals: [length]T) Self {
            return .{ .vec = vals };
        }

        pub fn splat(val: T) Self {
            var new: Self = undefined;
            @memset(&new.vec, val);
            return new;
        }

        pub fn add(self: Self, other: Self) Self {
            var new: Self = undefined;
            for (&new.vec, self.vec, other.vec) |*nv, sv, ov|
                nv.* = sv + ov;
            return new;
        }
        
        ...
    };
}

So my questions are:

1. How well can I expect the optimizer to work on this compared to an @Vector type? The functions are all very simple, almost all like add in terms of structure.
2. What tradeoffs are there between vectors and arrays? I am quite familiar with the language but not too familiar with how the compiler works with these types, or whether there are any restrictions in terms of functionality.
3. A function such as splat: does it first allocate stack space for new which it then copies to the return destination, or does it just “directly write to the return destination without copying, mutating all the memory in place”?
4. Should I inline the for loop in add? I have no information on why or why not besides my gut feeling, which as everybody knows is never a good indicator when it comes to perf.

IntegratedQuantum · October 3, 2024, 6:20pm

1. I really cannot answer that. It would be best to just check godbolt for that. Generally I’d say vectors work slightly better when the operations map well to the hardware (e.g. addition, multiplication, …), but in complex cases like @mod (hardware only supports @rem for floats) the compiler generates poorly optimized code.
2. The disadvantage of vectors vs arrays is probably their alignment in memory.
   An array always has the alignment of its element type. Its size is then just the element size times the length. In the case of [3]f32 the alignment is 4 and the size is 12.
   Vectors seem to be aligned to their size rounded up to the next power of 2. So for example @Vector(3, f32) has an alignment of 16. Because of that it will occupy 16 bytes in memory, despite only using 12 of them. This allows them to be loaded into the CPU more efficiently, but it does greatly increase memory usage for odd vector lengths.
3. Splat directly maps to a hardware function. It is not storing a value on the stack, it can just broadcast the value directly into a vector register in a single instruction.
4. My intuition also tells me to use the inline variant, but I don’t think it matters.
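
To check the size and alignment numbers from the second answer yourself, a minimal sketch along these lines should do. The names Arr3 and Vec3 are only illustrative, and the vector numbers depend on the target, so they are printed rather than assumed:

```zig
const std = @import("std");

const Arr3 = [3]f32;
const Vec3 = @Vector(3, f32);

pub fn main() void {
    // Arrays: alignment of the element type, size = length * element size.
    std.debug.print("[3]f32:          size={d} align={d}\n", .{ @sizeOf(Arr3), @alignOf(Arr3) });
    // Vectors: usually rounded up to a power of two. The exact values are
    // target-dependent, which is why they are only printed here.
    std.debug.print("@Vector(3, f32): size={d} align={d}\n", .{ @sizeOf(Vec3), @alignOf(Vec3) });
}
```

On a typical x86_64 target the second line is where you would expect to see the 16-byte figures mentioned above.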

One advantage I personally see in @Vector is that you can use operators on it. In my opinion this makes them so much easier to use than the array variant.
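
As a rough illustration of the operator point, here is a sketch of what add, mul and magnitude could look like on a builtin vector. Vec3 and the magnitude helper are hypothetical names, not taken from the original code:

```zig
const std = @import("std");

const Vec3 = @Vector(3, f32);

// Euclidean length: element-wise multiply, horizontal add, square root.
fn magnitude(v: Vec3) f32 {
    return @sqrt(@reduce(.Add, v * v));
}

pub fn main() void {
    const a = Vec3{ 1, 2, 2 };
    const b: Vec3 = @splat(2); // builtin splat; the result type comes from the annotation

    const sum = a + b; // element-wise addition, no helper function needed
    const scaled = a * b; // element-wise multiplication

    std.debug.print("sum[0]={d} scaled[1]={d} |a|={d}\n", .{ sum[0], scaled[1], magnitude(a) });
}
```

With arrays each of these operations needs its own loop or helper function, which is the ergonomic difference being described.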

markus · October 3, 2024, 7:42pm

Thanks for the info! Regarding 3, I didn’t express myself clearly. I wasn’t referring to the builtin but to the function I defined in the code snippet. My question thus also applies to init and every other kind of init function that first creates something, modifies it and then returns it as a copy.

markus · October 3, 2024, 7:55pm

Also one more question: if I, for example, make an array of an @Vector, will that be packed or will it still have the padding?
I think the latter is more likely, since the padding is there so the vector can be loaded into a register without having to load only a part of it, but I don’t know. Asking can’t hurt.
I think I may have just answered my own question though.

IntegratedQuantum · October 4, 2024, 6:52am

All those small functions will most likely be inlined by the compiler.

It will always add the padding. It will only be packed if you use a packed struct.
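
A small sketch to confirm the padding answer on your own target. The second half shows one possible workaround that is an assumption of this example rather than something suggested in the thread: store tightly packed [3]f32 values and coerce them to vectors only for the arithmetic. addPacked is a hypothetical helper name:

```zig
const std = @import("std");

const Vec3 = @Vector(3, f32);

// Assumed workaround: keep plain arrays in storage, use vectors for the math.
// Arrays and vectors of the same length and element type coerce to each other.
fn addPacked(a: [3]f32, b: [3]f32) [3]f32 {
    const va: Vec3 = a;
    const vb: Vec3 = b;
    return va + vb;
}

pub fn main() void {
    // An array's size is its length times the element size, so any padding
    // inside @Vector(3, f32) is repeated for every element.
    std.debug.print("one vector={d} bytes, [4]vector={d} bytes, [4][3]f32={d} bytes\n", .{
        @sizeOf(Vec3), @sizeOf([4]Vec3), @sizeOf([4][3]f32),
    });

    const r = addPacked(.{ 1, 2, 3 }, .{ 4, 5, 6 });
    std.debug.print("{d} {d} {d}\n", .{ r[0], r[1], r[2] });
}
```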

markus · October 4, 2024, 6:53am

Thank you, your answers answered some of my longest lasting questions lol
I appreciate the help!

Validark · October 4, 2024, 9:55am

You can always ask the compiler about the alignment of types by doing @compileLog(@alignOf(T)) for some type T. As @IntegratedQuantum said, types have an alignment to make it easier to load, both in terms of address calculation and in terms of the CPU having to do extra work under the hood for loads on pointers which are not multiples of the number of bytes you intend to load.

E.g. if I want to load a 64-byte vector and my pointer is not 64-byte aligned, the CPU is going to have to pull data from multiple cache lines and that process introduces a small penalty in many cases.
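
A hedged sketch of both points: the commented-out block is the @compileLog query, and the function over-aligns its backing storage so a 64-byte load cannot straddle two cache lines. Vec16 and sumOverAligned are illustrative names, and the 64-byte cache-line size is an assumption about the target:

```zig
const std = @import("std");

const Vec16 = @Vector(16, f32); // 64 bytes of data

// Uncomment to have the compiler print the alignment at compile time.
// Note: a build that contains @compileLog is reported as failed on purpose,
// so this is a query tool, not something to leave in.
// comptime {
//     @compileLog(@alignOf(Vec16));
// }

pub fn sumOverAligned() f32 {
    // Request a 64-byte boundary for the backing array (assumed cache-line size).
    var buf: [16]f32 align(64) = undefined;
    @memset(&buf, 1.0);
    std.debug.assert(@intFromPtr(&buf) % 64 == 0);

    const v: Vec16 = buf; // array-to-vector coercion
    return @reduce(.Add, v);
}

pub fn main() void {
    std.debug.print("alignOf(Vec16)={d} sum={d}\n", .{ @alignOf(Vec16), sumOverAligned() });
}
```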

As far as relying on auto-vectorization goes, you just have to look at the assembly that gets generated. It might be good, it might be okay, or it might be abysmal. If you don’t look at the assembly, you really have no idea.
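
One way to make that comparison concrete is to put the loop-based version and an explicit @Vector version side by side and look at what each compiles to in an optimized build, for example on godbolt. A minimal sketch, with addArray and addVector as hypothetical names:

```zig
const std = @import("std");

// Array version: relies on the optimizer to auto-vectorize the loop.
pub fn addArray(a: [4]f32, b: [4]f32) [4]f32 {
    var out: [4]f32 = undefined;
    for (&out, a, b) |*o, x, y| o.* = x + y;
    return out;
}

// Vector version: asks for SIMD explicitly through @Vector.
pub fn addVector(a: [4]f32, b: [4]f32) [4]f32 {
    const va: @Vector(4, f32) = a;
    const vb: @Vector(4, f32) = b;
    return va + vb; // the vector result coerces back to [4]f32
}

pub fn main() void {
    const a = [4]f32{ 1, 2, 3, 4 };
    const b = [4]f32{ 10, 20, 30, 40 };
    std.debug.print("{any}\n{any}\n", .{ addArray(a, b), addVector(a, b) });
}
```

Both functions return the same values. Whether they end up as the same instructions is exactly what has to be read off the assembly.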

If I were you, I would just use Zig’s builtin types, and add helper functions as needed to get the assembly you want. But ultimately it boils down to aesthetic preferences.

You can use LLVM intrinsics like so: st4.zig · GitHub

Note that in some cases the signature of an LLVM intrinsic does not conform to the C ABI, and you can fix that issue by defining the callconv as something else. I think .unknown is fine but I don’t recall offhand. There’s also inline assembly for coercing the compiler into giving you the exact instructions you want, but that sometimes backfires and it will do other stupid things for some reason, I guess just because inline assembly is somewhat opaque to the compiler. If anybody has any questions about intrinsics, assembly, or wants help with performance design or techniques, contact me here or on Discord (.validark). Or you can make a post on here or on Discord and @ me so I’ll see it.

markus · October 4, 2024, 5:47pm

For now I’ll focus more on getting something to run than on optimizing the heck out of it. This is some really good and appreciated advice, though. @compileLog(...) seems obvious, I don’t know why I didn’t think of it. Also I am aware of the fact that without looking at the asm, you never really know. Regardless, I appreciate the reminder as this can never be mentioned enough.

Do you by any chance know of some website that lists the Intel asm instructions and their respective number of cycles?

IntegratedQuantum · October 4, 2024, 5:59pm

Here you go: https://www.agner.org/optimize/instruction_tables.pdf

Validark · October 4, 2024, 8:29pm

Nowadays I don’t use Agner Fog’s tables as much. I primarily use uops.info

There is also https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html which can be good if you want to search for an intrinsic

markus · October 4, 2024, 8:43pm

Thank y’all for all the help, genuinely priceless. Sometimes I make tiny projects and just focus a bit on performance, so I’ll definitely come back to these resources the next time I do so. Also, a lot of this info is just sort of foundational to everyday programming, even when you are not thinking about it. I am specifically referring to the inlining of the init functions here.