Suggested SIMD vector length is slower?

In the following code, if I comment out the call to simd.suggestVectorLength(T) and use the hard-coded value 32, the run time is nearly 3x faster (60ms vs 175ms). Is this expected?
(Mac M1, Zig 0.14.0-dev.367+a57479afc, ReleaseFast)

const std = @import("std");
const heap = std.heap;
const mem = std.mem;
const print = std.debug.print;
const simd = std.simd;
const time = std.time;

fn unrollSimdMatrixMultiply(
    comptime T: type,
    c: anytype,
    a: anytype,
    b: anytype,
) void {
    const n = a.len;
    // const vec_len = simd.suggestVectorLength(T) orelse @panic("No SIMD?");
    const vec_len = 32;

    for (c, a) |*c_row, a_row| {
        var j: u32 = 0;

        while (j <= n - vec_len) : (j += vec_len) {
            for (0..n) |k| {
                const u: @Vector(vec_len, T) = b[k][j..][0..vec_len].*;
                const y: @Vector(vec_len, T) = c_row.*[j..][0..vec_len].*;
                const w: @Vector(vec_len, T) = @splat(a_row[k]);
                const slice: [vec_len]T = (u * w) + y;
                @memcpy(c_row.*[j .. j + vec_len], &slice);
            }
        }

        while (j < n) : (j += 1) {
            for (0..n) |k| {
                c_row.*[j] += a_row[k] * b[k][j];
            }
        }
    }
}

fn Matrix(comptime T: type) type {
    return struct {
        m: [][]T,

        const Self = @This();

        fn init(
            allocator: mem.Allocator,
            n: usize,
            v: T,
        ) !Self {
            var to_free: usize = 0;
            var m = try allocator.alloc([]T, n);
            errdefer {
                for (0..to_free) |i| allocator.free(m[i]);
                allocator.free(m);
            }

            for (0..n) |i| {
                m[i] = try allocator.alloc(T, n);
                to_free += 1;
                for (0..n) |j| m[i][j] = v;
            }

            return .{ .m = m };
        }

        fn deinit(self: *Self, allocator: mem.Allocator) void {
            for (self.m) |row| allocator.free(row);
            allocator.free(self.m);
        }
    };
}

pub fn main() !void {
    var gpa = heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const T = u32;
    const n = 500;

    var a = try Matrix(T).init(allocator, n, 1);
    defer a.deinit(allocator);
    var b = try Matrix(T).init(allocator, n, 1);
    defer b.deinit(allocator);
    var c = try Matrix(T).init(allocator, n, 0);
    defer c.deinit(allocator);

    var timer = try time.Timer.start();
    unrollSimdMatrixMultiply(T, c.m, a.m, b.m);
    print("{} {d} ms\n", .{ c.m[50][50], timer.lap() / time.ns_per_ms });
}

Hmm, just printed out the suggested vec length and it’s 4. Looking up the info for the M1, it says it has 128-bit NEON, so shouldn’t the suggestion be 16?

On my CPU I am getting 32 = @sizeOf(T=u32)=4 * suggestVectorLength(T=u32)=8
And it takes 99ms.

const vec_len = @sizeOf(T) * (simd.suggestVectorLength(T) orelse @panic("No SIMD?"));

EDIT: but this is wrong; the correct one is @Vector(4, u32).

const vec_len = simd.suggestVectorLength(T) orelse @panic("No SIMD?");

And this takes 155ms!!!

This works, thanks @dimdin ! So if I understand correctly, the function returns the number of “slots” (lanes), each slot holding one value of the given type, i.e. as many bytes as that type uses.
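For example, a minimal sketch to check that, assuming a 128-bit NEON target like the M1 (the numbers will differ on other CPUs):

const std = @import("std");
const simd = std.simd;

pub fn main() void {
    // Lane count = vector register bits / element bits, so on 128-bit NEON
    // these should come out to 16, 4 and 2 respectively.
    const lanes_u8: usize = simd.suggestVectorLength(u8) orelse 0;
    const lanes_u32: usize = simd.suggestVectorLength(u32) orelse 0;
    const lanes_f64: usize = simd.suggestVectorLength(f64) orelse 0;
    std.debug.print("u8: {d}, u32: {d}, f64: {d}\n", .{ lanes_u8, lanes_u32, lanes_f64 });
}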

Now we’re both confused! lol

That was Debug (CPU: i7-14700).
In ReleaseFast I am getting 14ms (vec_len=32) / 22ms (vec_len=4).
:face_with_spiral_eyes:


So the suggested vec length is correct, but using a larger vec length than the hardware actually has space for performs a lot better. Interesting… Seems like this parameter has to be fine-tuned with actual benchmarking then.
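Something like this could serve as a rough benchmark harness. A sketch only: it assumes the multiply is reworked to take the vector length as a comptime parameter (multiplyWithVecLen is a hypothetical name) and that this runs inside main from the code above:

// Rough sketch; assumes a hypothetical variant of the multiply that takes
// the vector length as a comptime parameter, e.g.
//   fn multiplyWithVecLen(comptime T: type, comptime vec_len: usize,
//                         c: anytype, a: anytype, b: anytype) void { ... }
inline for (.{ 4, 8, 16, 32, 64 }) |vec_len| {
    // Reset c to zero so every candidate does the same amount of work.
    for (c.m) |row| @memset(row, 0);
    var timer = try time.Timer.start();
    multiplyWithVecLen(T, vec_len, c.m, a.m, b.m);
    print("vec_len={d}: {d} ms\n", .{ vec_len, timer.lap() / time.ns_per_ms });
}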

I am not sure that the suggested vector length is correct.

See the source of simd.suggestVectorLengthForCpu.
For aarch64 there is this comment:

// SVE allows up to 2048 bits in the specification, as of 2022 the most powerful machine has implemented 512-bit
// I think is safer to just be on 128 until is more common
// TODO: Check on this return when bigger values are more common

Yes, that is correct, because suggestVectorLength takes the total bit width of the vector register and divides it by the bit width of the element type:

return @divExact(vector_bit_size, element_bit_size);
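
So for 128-bit NEON and T = u32 that is @divExact(128, 32) == 4 lanes; the 16 from earlier would only be the answer for u8 (128 / 8).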

As you increase the vector length, your algorithm becomes more cache-friendly.
b[k][j..][0..vec_len] will always need to load a new cache line as k increases.
But k is your inner loop, so you are paying a cache miss on every iteration.
If you increase the vector length, more work is done within the k loop, so that cache miss is amortized and less severe.

Now in this case you can just swap the loops to get rid of the inefficient memory accesses (another 3× speedup):

    for (c, a) |*c_row, a_row| {
        for (0..n) |k| {
            var j: u32 = 0;
            while (j <= n - vec_len) : (j += vec_len) {
                const u: @Vector(vec_len, T) = b[k][j..][0..vec_len].*;
                const y: @Vector(vec_len, T) = c_row.*[j..][0..vec_len].*;
                const w: @Vector(vec_len, T) = @splat(a_row[k]);
                c_row.*[j..][0..vec_len].* = (u * w) + y;
            }
            while (j < n) : (j += 1) {
                c_row.*[j] += a_row[k] * b[k][j];
            }
        }
    }

Also note that I removed your useless @memcpy.
Now it still seems that the larger vectors are faster (10 ms vs 9 ms), but this is not significant because it could also come from differences in code alignment.


Just a side note: if you want truly “realtime” (pun intended) or mission-critical applications, just do not use CPUs with data/instruction caches; they are by their nature unpredictable, in the sense that it is extremely hard to estimate how long a given piece of code will take to execute.

This is excellent, thanks! I guess the fine-tuning then moves from fitting the SIMD register size to fitting the cache line size.
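
If it helps, std.atomic.cache_line should give a compile-time, per-target estimate of the cache line size (a conservative constant, not a runtime query), so a quick sketch along these lines shows how many elements fit per line:

const std = @import("std");

pub fn main() void {
    // std.atomic.cache_line is a comptime estimate of the cache line size
    // for the target (e.g. 64 bytes on typical x86_64).
    const line: usize = std.atomic.cache_line;
    std.debug.print("cache line: {d} bytes, u32 per line: {d}\n", .{
        line,
        line / @sizeOf(u32),
    });
}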