Massive slowdown with Vectors as well as a slightly different loop, can't figure out why

Full code at bottom.

I have a nested array of arrays of ints and am iterating over the pairwise combinations of the inner arrays and doing some math. My first attempt simply iterated over the length of a nested array and did element-wise operations. The commented-out code is a branchless version that replaces the branches with arithmetic on bools cast to ints; I was testing whether it was faster (the compiler seems to do that optimization already, at least with -O ReleaseFast):

const a = arrs.getRow(i);
const b = arrs.getRow(j);
var sum_a: u16 = 0;
var sum_b: u16 = 0;
for (0..arrs.height) |x| {
    if (a[x] > b[x]) {
        sum_a += 1;
    } else if (a[x] < b[x]) {
        sum_b += 1;
    }
    // const a_gt_b = a[x] > b[x];
    // const b_gt_a = a[x] < b[x];
    // sum_a += 1 * @intFromBool(a_gt_b);
    // sum_b += 1 * @intFromBool(b_gt_a);
}

This takes about 1 millisecond on my 9800X3D. I then tried a Vector version, but this was a lot slower:

const a: @Vector(8000, u8) = arrs.getRow(i)[0..8000].*;
const b: @Vector(8000, u8) = arrs.getRow(j)[0..8000].*;
const a_gt_b: @Vector(8000, u16) = @intFromBool(a > b);
const b_gt_a: @Vector(8000, u16) = @intFromBool(a < b);
const sum_a: u16 = @reduce(.Add, a_gt_b);
const sum_b: u16 = @reduce(.Add, b_gt_a);

Unfortunately, this takes about 19ms. I was also curious whether @reduce uses SIMD as well (like the < and > operators do), so after the SIMD compares I summed with a for loop instead, but that was wildly slow, taking around 33s:

var sum_a2: u16 = 0;
var sum_b2: u16 = 0;
for (0..arrs.height) |x| {
    sum_a2 += a_gt_b[x];
    sum_b2 += b_gt_a[x];
}

I can’t see why this loop would be so terribly slow. The fastest version I posted at the top does a similar for loop the same number of times with similar operations.

I’ve tried to view the assembly of the SIMD version and the SIMD/loop version to see what is different from the fast one, but am having a hard time getting rid of all debug info etc, so the assembly is ginormous and hard to parse.

Any advice on what’s up here? Thanks.
I’ve been running zig build-exe -O ReleaseFast src/main.zig && ./main on v0.15.0-dev.516+abbead1fb and here is my full code:

const std = @import("std");
const print = std.debug.print;
const Allocator = std.mem.Allocator;

pub fn Array2D(comptime T: type) type {
    return struct {
        const Self = @This();
        mem: []T,
        width: u32,
        height: u32,

        pub fn init(allocator: Allocator, width: u32, height: u32) !Self {
            return .{
                .mem = try allocator.alloc(T, width * height),
                .width = width,
                .height = height,
            };
        }

        pub fn deinit(self: Self, allocator: Allocator) void {
            allocator.free(self.mem);
        }

        pub fn get(self: Self, x: usize, y: usize) T {
            std.debug.assert(x < self.width and y < self.height);
            return self.mem[x * self.height + y];
        }

        pub fn getRow(self: Self, x: usize) []T {
            std.debug.assert(x < self.width);
            return self.mem[x * self.height ..][0..self.height];
        }

        pub fn set(self: Self, x: usize, y: usize, val: T) void {
            self.mem[x * self.height + y] = val;
        }
    };
}

pub fn main() !void {
    const prng = std.crypto.random;
    var arena = std.heap.ArenaAllocator.init(std.heap.smp_allocator);
    defer arena.deinit();
    const alloc = arena.allocator();
    const width = 100;
    const height = 8000;
    var arrs = try Array2D(u8).init(alloc, width, height);
    for (0..100) |i| {
        for (0..8000) |j| {
            arrs.set(i, j, std.Random.int(prng, u8));
        }
    }
    var timer = try std.time.Timer.start();
    var idx: u16 = 0;
    var outputs: [20_000]f32 = undefined;
    for (0..arrs.width) |i| {
        const a: @Vector(8000, u8) = arrs.getRow(i)[0..8000].*;
        for (0..arrs.width) |j| {
            const b: @Vector(8000, u8) = arrs.getRow(j)[0..8000].*;
            const a_gt_b: @Vector(8000, u16) = @intFromBool(a > b);
            const b_gt_a: @Vector(8000, u16) = @intFromBool(a < b);
            // const sum_a: u16 = @reduce(.Add, a_gt_b);
            // const sum_b: u16 = @reduce(.Add, b_gt_a);

            var sum_a2: u16 = 0;
            var sum_b2: u16 = 0;
            for (0..arrs.height) |x| {
                sum_a2 += a_gt_b[x];
                sum_b2 += b_gt_a[x];
            }
            // try std.testing.expect(sum_a == sum_a2);

            // const a = arrs.getRow(i);
            // const b = arrs.getRow(j);
            // var sum_a: u16 = 0;
            // var sum_b: u16 = 0;
            // for (0..arrs.height) |x| {
            //     const a_gt_b = a[x] > b[x];
            //     const b_gt_a = a[x] < b[x];
            //     sum_a += 1 * @intFromBool(a_gt_b);
            //     sum_b += 1 * @intFromBool(b_gt_a);
            //     // if (a[x] > b[x]) {
            //     //     sum_a += 1;
            //     // } else if (a[x] < b[x]) {
            //     //     sum_b += 1;
            //     // }
            // }
            const norm_a = @as(f32, @floatFromInt(sum_a2)) / @as(f32, @floatFromInt(sum_a2 + sum_b2));
            const norm_b = @as(f32, @floatFromInt(sum_b2)) / @as(f32, @floatFromInt(sum_a2 + sum_b2));

            outputs[idx] = norm_a;
            outputs[idx + 1] = norm_b;
            idx += 2;
        }
    }
    const end: f64 = @floatFromInt(timer.read());

    std.debug.print("{d}\n", .{end / 1_000_000_000.0});
}

Large vectors are not well supported by Zig/LLVM.
They generate a large number of instructions, which generally makes things slower.
Generally you shouldn’t use vectors larger than what your target hardware supports. You can use std.simd.suggestVectorLength(u16) (formerly suggestVectorSize) to figure out the optimal length.
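
For example, something along these lines (an untested sketch, not drop-in code: countGreater is a made-up helper, the lane count falls back to 16 if std can't suggest one, and a scalar tail handles lengths that aren't a multiple of the lane count):

const std = @import("std");

fn countGreater(a: []const u8, b: []const u8) struct { a: u16, b: u16 } {
    std.debug.assert(a.len == b.len);
    // Let std pick a lane count that fits the target; fall back to 16 lanes.
    const lanes = comptime (std.simd.suggestVectorLength(u8) orelse 16);
    var sum_a: u16 = 0;
    var sum_b: u16 = 0;
    var x: usize = 0;
    // Walk the rows in hardware-sized chunks instead of one 8000-wide vector.
    while (x + lanes <= a.len) : (x += lanes) {
        const va: @Vector(lanes, u8) = a[x..][0..lanes].*;
        const vb: @Vector(lanes, u8) = b[x..][0..lanes].*;
        const gt: @Vector(lanes, u16) = @intFromBool(va > vb);
        const lt: @Vector(lanes, u16) = @intFromBool(va < vb);
        sum_a += @reduce(.Add, gt);
        sum_b += @reduce(.Add, lt);
    }
    // Scalar tail for any leftover elements (u16 sums are fine for rows of 8000).
    while (x < a.len) : (x += 1) {
        sum_a += @intFromBool(a[x] > b[x]);
        sum_b += @intFromBool(b[x] > a[x]);
    }
    return .{ .a = sum_a, .b = sum_b };
}

That keeps each compare and @reduce within a single hardware register instead of asking LLVM to lower an 8000-lane vector.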

Is there value in nesting the arrays one level further, where the new deepest level is the original 8000-length vector, but split into N smaller vectors according to the suggested vector size?

You might find that your original loop is already being automatically vectorized:

And, to elaborate on what @stratts said, the vectorization done by the compiler is often better than the vectorization done by hand. There are usually multiple ways to vectorize a loop, but we humans usually only see one of them, which sometimes isn’t the best. When you manually vectorize the code, the compiler follows your instructions literally; it doesn’t do its own vector optimizations.

That is exactly what they are suggesting.
As others have said, you should check that the compiler didn’t vectorise your loop already.

As a side note: watch out with 2D arrays. Access can be terribly slow as well.

Any advice on the best way to view the assembly? I was having a hard time getting rid of all debug symbols etc. My .asm files were massive and hard to parse.

Is it best to just use 1D arrays and offset arithmetic?

I don’t dare use 2D arrays anymore. There’s a chess programming thread about it somewhere. I had the same problem and moved to 1D only.
(edit: some accesses triggered huge @memcpy’s)

This is that thread.

But this bug exists regardless of whether the array is one- or multi-dimensional, so I don’t really get why it would make you avoid 2D arrays.

Oh… Was it when the array is in a struct? I forgot the details.

In certain situations when the array is big and not accessed via a pointer (explicitly).

My knowledge about that problem isn’t super in-depth, but I think this workaround works reliably when you encounter the problem:

I faced this problem not so long ago, and the reason is that when you access an array with an index that is computed and could have side effects on the array, the array is first copied in memory and then the index is computed (related issue).

If you access the array via a pointer, only the pointer is copied to memory, avoiding the huge memcpy.
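
Something like this is what I mean, as far as I understand it (untested sketch with made-up names; whether the copy actually happens depends on the compiler version and the exact access pattern):

const std = @import("std");

var state: struct { big: [8000]u8 } = .{ .big = [_]u8{0} ** 8000 };

fn computeIndex() usize {
    // A runtime-computed index; because this call could in principle mutate
    // `state`, the compiler may copy `state.big` to a temporary before indexing.
    return 42;
}

pub fn main() void {
    // Can emit a large @memcpy of the whole array:
    const direct = state.big[computeIndex()];

    // Workaround: index through a pointer, so only the pointer is materialized:
    const ptr = &state.big;
    const via_ptr = ptr[computeIndex()];

    std.debug.print("{} {}\n", .{ direct, via_ptr });
}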

The bug is quite strange, I must say. Good array access would be in my version 0.0.1 when building a language.

I am just starting to look into Zig myself, but when I want to quickly see what changed in my code, I emit two assembly files: one for the original and one for the new version. Usually I use -O ReleaseSmall because it outputs the least code, but I am not sure how different it will be from -O ReleaseFast.
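
For reference, the commands look something like this (if I am remembering the flags right: -fstrip drops debug info and -fno-emit-bin skips writing the binary):

zig build-exe -O ReleaseSmall -fstrip -fno-emit-bin -femit-asm=old.s src/main.zig
# swap in the changed code, then:
zig build-exe -O ReleaseSmall -fstrip -fno-emit-bin -femit-asm=new.s src/main.zig
diff old.s new.s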