Understanding SIMD

Hey everyone, I was trying to understand SIMD a bit better and played around with vectors in Zig.

I have the following two examples and noticed that I don't really get much of a performance improvement. It seems the compiler is vectorizing the non-SIMD example already (because it is a struct of arrays?). My knowledge of assembly is not that great, though, so I have trouble validating that. Am I using SIMD the right way, or is this just a wrong use case?

The SIMD example:

const std = @import("std");

pub const Entity = struct {
    pos_x: f32,
    pos_y: f32,
    vel_x: f32,
    vel_y: f32,
};
pub const EntityList = std.MultiArrayList(Entity);

const SIMD_REGISTER_SIZE: u32 = std.simd.suggestVectorLength(f32).?;
const VF32 = @Vector(SIMD_REGISTER_SIZE, f32);
const VG: VF32 = @splat(-9.81);

fn update(entity_list: EntityList, dt: f32) void {
    const entity_slice = entity_list.slice();

    const N = SIMD_REGISTER_SIZE;
    const vdt: VF32 = @splat(dt);

    var i: usize = 0;
    // Note: any tail elements past the last full chunk of N are not updated here.
    while (i + N <= entity_slice.len) : (i += N) {
        // Load SIMD chunks
        const px: VF32 = entity_slice.items(.pos_x)[i..][0..N].*;
        const py: VF32 = entity_slice.items(.pos_y)[i..][0..N].*;
        const vx: VF32 = entity_slice.items(.vel_x)[i..][0..N].*;
        const vy: VF32 = entity_slice.items(.vel_y)[i..][0..N].*;

        // Store back
        entity_slice.items(.pos_x)[i..][0..N].* = px + vx * vdt;
        entity_slice.items(.pos_y)[i..][0..N].* = py + (vy + VG) * vdt;
    }
}

The reference example:

const std = @import("std");

pub const Entity = struct {
    pos_x: f32,
    pos_y: f32,
    vel_x: f32,
    vel_y: f32,
};
pub const EntityList = std.MultiArrayList(Entity);

fn update(entity_list: EntityList, dt: f32) void {
    const entity_slice = entity_list.slice();
    for (
        entity_slice.items(.pos_x),
        entity_slice.items(.pos_y),
        entity_slice.items(.vel_x),
        entity_slice.items(.vel_y),
    ) |*pos_x, *pos_y, vel_x, vel_y| {
        pos_x.* += vel_x * dt;
        pos_y.* += (vel_y + -9.81) * dt;
    }
}
1 Like

If the instructions have super long names, it's probably SIMD.

5 Likes

Compilers are really good at vectorizing loops where:

  • the control flow of the loop does not depend on the loop body (e.g. no if (condition()) break),
  • loop iterations are independent of each other (e.g. no x[i] = x[i - 1] * 10;),
  • the loop body always takes the same control flow (e.g. no ifs in the body).

Your example ticks all the boxes, so it is well optimized.

Sometimes compilers can even overcome the problems mentioned above, but it's not reliable.
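To make the conditions concrete, here is a minimal C sketch of mine (names and code are my own, not from the thread) of a loop that meets all three points: the trip count is known before the loop starts, iterations are independent, and the body is branch-free. Compilers such as GCC and Clang will typically auto-vectorize this shape at higher optimization levels.

```c
#include <stddef.h>

/* Trip count fixed up front, no cross-iteration dependencies, no branches
   in the body: exactly the shape auto-vectorizers like. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}
```

The restrict qualifiers also tell the compiler the two arrays don't overlap, which removes another reason it might refuse to vectorize.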

You may ask: why are these a problem for the compiler?

  • If an early exit from the loop is possible, vectorizing can cause the code to access memory it shouldn't have accessed. The classic example is searching for the 0 byte in a zero-terminated string: the compiler has no choice but to do dumb scalar iteration, since reading past the 0 byte can cause problems. Compiler Explorer
  • The compiler would need to know a closed formula to overcome this. Here, for example, it doesn't know the Fibonacci sequence (Compiler Explorer), but it does know about arithmetic sequences (Compiler Explorer).
  • SIMD stands for Single Instruction, Multiple Data, so if the instructions are different on different iterations you can't really use it. There are some SIMD instructions (masking, blends) that can be used to emulate divergent control flow, but not always.
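Two of these points in C sketch form (my own illustrative functions, not the code behind the Compiler Explorer links): the terminator search from the first bullet, where the data-dependent break forbids wide loads, and an arithmetic-sequence sum from the second bullet, which the optimizer can replace with its closed formula.

```c
#include <stddef.h>
#include <stdint.h>

/* First bullet: the break condition depends on the data, so the compiler
   must not read past the 0 byte -- a wide vector load could touch an
   unmapped page. */
size_t my_strlen(const char *s) {
    size_t i = 0;
    while (s[i] != 0) i++;
    return i;
}

/* Second bullet: a loop-carried sum over an arithmetic sequence... */
uint64_t sum_loop(uint64_t n) {
    uint64_t s = 0;
    for (uint64_t i = 0; i < n; i++) s += i;
    return s;
}

/* ...which the optimizer can rewrite as the closed formula.
   (Exact for all n: when n == 0 the product is 0.) */
uint64_t sum_closed(uint64_t n) {
    return n * (n - 1) / 2;
}
```

A Fibonacci recurrence has no such per-iteration-independent form the compiler recognizes, which is why that loop stays scalar in the linked example.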
5 Likes

Yes, the compiler will auto-vectorize any trivial loop like this, and it is rather difficult to do better with manual vectorization.
Honestly, the biggest gains of SIMD here come from the memory layout, which you have already laid out perfectly for the compiler to optimize in this case.

To give you an example where manual SIMD is more relevant, I'd suggest looking at std.mem.eql.
The original code looked something like this:

for (a, b) |a_elem, b_elem| {
    if (a_elem != b_elem) return false;
}

Here the compiler isn't allowed to access memory that wouldn't be accessed by the unoptimized function, since that could in the worst case cause a page fault and crash the program.
That means that if you know it is legal to access that memory, you can write much faster code.
The manual SIMD code roughly looks like this (plus a ton of extra code to handle unaligned inputs and odd lengths):

    for (0..(a.len - 1) / Scan.size) |i| {
        const a_chunk: Scan.Chunk = @bitCast(a[i * Scan.size ..][0..Scan.size].*);
        const b_chunk: Scan.Chunk = @bitCast(b[i * Scan.size ..][0..Scan.size].*);
        if (@reduce(.Or, a_chunk != b_chunk)) return false;
    }
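The same chunked-comparison idea can be sketched in portable C (identifiers are mine; memcpy stands in for Zig's @bitCast and keeps the loads legal even for unaligned pointers): compare a machine word at a time, then fall back to bytes for the tail.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Compare 8-byte chunks at a time; the trailing byte loop handles the
   part that doesn't fill a whole word. */
bool bytes_eql(const unsigned char *a, const unsigned char *b, size_t n) {
    size_t i = 0;
    for (; i + sizeof(uint64_t) <= n; i += sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb) return false;
    }
    for (; i < n; i++)
        if (a[i] != b[i]) return false;
    return true;
}
```

Unlike the trick described above, this version never reads past the end of either buffer, which is why it needs the scalar tail loop at all.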
4 Likes

Thanks for the replies! Very insightful. So from what I understand, I should just try to keep my loops tight and branchless if possible, and if I can't, I should try to incorporate SIMD manually? It feels like an "it depends" topic and is obviously not as straightforward as I thought.

1 Like

I'm also trying to learn SIMD. What I do is adapted from Using SIMD to Tell if the High Bit is Set, but with a build option to use SIMD so I can easily compare performance with and without it. Sometimes it's faster, sometimes not.


var remaining = input;
if (build_options.use_simd) if (comptime std.simd.suggestVectorLength(u8)) |vector_len| {
    // >= so an exactly full final chunk still takes the SIMD path
    while (remaining.len >= vector_len) {
        const chunk: @Vector(vector_len, u8) = remaining[0..vector_len].*;
        // do something
        remaining = remaining[vector_len..];
    }
}

// Scalar loop for the tail that doesn't fill a whole vector.
for (0..remaining.len) |i| {
    remaining[i] = …;
}
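The shape of that pattern, in a C sketch of my own (a hypothetical operation, setting the high bit of every byte): process full fixed-size chunks first, then finish with a scalar tail loop.

```c
#include <stddef.h>

enum { CHUNK = 16 }; /* stand-in for suggestVectorLength */

/* Set the high bit of every byte: chunked main loop plus scalar tail. */
void set_high_bits(unsigned char *buf, size_t n) {
    size_t i = 0;
    for (; i + CHUNK <= n; i += CHUNK) {
        /* fixed-width inner loop the compiler can turn into vector ops */
        for (size_t j = 0; j < CHUNK; j++)
            buf[i + j] |= 0x80;
    }
    for (; i < n; i++) /* scalar tail */
        buf[i] |= 0x80;
}
```

Gating the chunked path behind a build option, as in the post above, makes it easy to benchmark both paths against each other.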

2 Likes

That’s actually a pretty good idea!