SIMD: is there an "equivalent" to _mm_shuffle_ep*()?

Mathias · November 23, 2023, 1:11am

Hi everyone,
The compiler intrinsic (for C/C++) _mm_shuffle_ep*() performs vector shuffling with a mask that is only known at run time. Zig’s @shuffle() requires a comptime-known mask.
Is there any way I can re-order elements within a vector in Zig with a run-time known mask?
-Thanks

squeek502 · November 23, 2023, 1:56am

Might be helpful to provide a specific C/C++ code example that you’d like to get working in Zig.

Mathias · November 23, 2023, 4:57am

Well… I was trying to recreate what Andreas Fredriksson presented in this link (see 27:43):
https://www.reddit.com/r/programming/comments/qy2hp1/andreas_fredriksson_context_is_everything/

He’s loading text (JSON) into a SIMD register, removing the whitespace characters (compressing the contents), then writing it out to a buffer.

The lookup tables contain 16-byte arrays (or vectors) that hold the mask needed for the shuffle.
Each mask consists of the integer indexes of the non-whitespace elements in the vector.
In my implementation, hinted at below, the mask will have -1 in all places where the result of the shuffle ought to have 0’s.

The below lines of Zig would correspond to the line in the video where he’s declaring and initializing the variable c0.

    const zero_vector: @Vector(16, u8) = @splat(0);
    const left_shuffle = @shuffle(u8, chs, zero_vector, mask_from_tbl_lookup);
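
For reference, a minimal scalar sketch of what the run-time shuffle is expected to compute for one 16-byte register, assuming Zig 0.11 or later. The function name `shuffleBytes`, the sample text, and the hand-written mask are made up for illustration; in the real code the mask would come from the lookup table. Note that -1 stored in a byte is 0xFF, whose high bit is set, which is also how _mm_shuffle_epi8 is told to write a 0.

```zig
const std = @import("std");

/// Scalar model of _mm_shuffle_epi8 for 16 bytes: a mask byte with its
/// high bit set yields 0, otherwise its low 4 bits select a byte of `src`.
fn shuffleBytes(src: [16]u8, mask: [16]u8) [16]u8 {
    var out: [16]u8 = undefined;
    for (mask, 0..) |m, i| {
        out[i] = if ((m & 0x80) != 0) 0 else src[m & 0x0F];
    }
    return out;
}

pub fn main() void {
    // Sample 16-byte chunk of JSON (illustrative input only).
    const chs: [16]u8 = "{ \"a\" : [1, 2] }".*;
    // Indexes of the non-whitespace bytes, packed to the front.
    // The remaining lanes are 0xFF (-1), so they become 0.
    const mask = [16]u8{ 0, 2, 3, 4, 6, 8, 9, 10, 12, 13, 15, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };
    const compacted = shuffleBytes(chs, mask);
    // Only the first 11 bytes carry text; the rest are zeros.
    std.debug.print("{s}\n", .{compacted[0..11]});
}
```

A scalar loop like this is only a reference for checking results; it does not give the speed of a single shuffle instruction.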

The tables are generated at compile time, that’s no problem. The problem is that the index I need to look up the mask in the table has to be created at run time, as I parse the text/JSON.
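
To make the comptime/run-time split concrete, here is a small sketch assuming Zig 0.11 or later. It is not the table layout from the talk: `keep_table`, `wsBits`, and `maskFor` are hypothetical names, and the table covers 8-byte groups so that it stays at 256 entries. The table is built at compile time, while the index into it only exists at run time.

```zig
const std = @import("std");

/// Hypothetical table: entry `bits` lists the indexes of the bytes to keep
/// in an 8-byte group whose whitespace bitmap is `bits`, padded with 0xFF.
const keep_table: [256][8]u8 = blk: {
    @setEvalBranchQuota(20_000);
    var table: [256][8]u8 = undefined;
    for (&table, 0..) |*row, bits| {
        row.* = [_]u8{0xFF} ** 8;
        const b: u8 = @intCast(bits);
        var n: usize = 0;
        for (0..8) |i| {
            const shift: u3 = @intCast(i);
            if (((b >> shift) & 1) == 0) {
                row[n] = @intCast(i);
                n += 1;
            }
        }
    }
    break :blk table;
};

/// Run-time step: bit i is set when byte i of the group is whitespace.
fn wsBits(group: [8]u8) u8 {
    var bits: u8 = 0;
    for (group, 0..) |c, i| {
        const is_ws = c == ' ' or c == '\t' or c == '\n' or c == '\r';
        if (is_ws) bits |= @as(u8, 1) << @as(u3, @intCast(i));
    }
    return bits;
}

/// The mask is selected at run time, so it cannot be passed to @shuffle.
fn maskFor(group: [8]u8) [8]u8 {
    return keep_table[wsBits(group)];
}

pub fn main() void {
    const group: [8]u8 = "a b\tc  d".*;
    std.debug.print("{any}\n", .{maskFor(group)});
}
```

In a SIMD version the bitmap would typically come from a vector compare rather than a byte loop; the loop here only keeps the sketch portable.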

Sorry, this is a terribly complicated example. Maybe I should just embed the assembly?

squeek502 · November 23, 2023, 5:21am

I’m very inexperienced with SIMD, so I could be totally off base, but since @shuffle takes 4 arguments rather than 2, maybe something like @shuffle(u8, chs, mask_from_tbl_lookup, ???) would work, as long as there’s something that ??? could be to match the functionality of _mm_shuffle_epi8.

Mathias · November 23, 2023, 5:29am

Shuffle’s 2nd and 3rd parameters are the vectors from which the elements of the result are taken. The fourth parameter is a comptime-known mask, a vector of indexes (what my tables contain) that specify which source element goes into each position of the result. For example, if the mask was [1, 2, -1], then the first element of the result would be index 1 from the first vector parameter. Negative indexes select elements from the second vector (the 3rd parameter), with -1 selecting its first element.
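
A small sketch of those semantics, assuming Zig 0.11 or later; `shuffleDemo` and the sample vectors are made up for illustration. The mask has to be a comptime-known constant, which is the restriction this thread is about.

```zig
const std = @import("std");

/// Hypothetical demo: picks a[1], a[2], then b[0] via the mask [1, 2, -1].
fn shuffleDemo(a: @Vector(4, u8), b: @Vector(4, u8)) @Vector(3, u8) {
    // Non-negative entries index `a`; negative entries index `b`,
    // where -1 is b[0], -2 is b[1], and so on.
    const mask = @Vector(3, i32){ 1, 2, -1 };
    return @shuffle(u8, a, b, mask);
}

pub fn main() void {
    const a = @Vector(4, u8){ 'a', 'b', 'c', 'd' };
    const b = @Vector(4, u8){ 'w', 'x', 'y', 'z' };
    const r: [3]u8 = shuffleDemo(a, b);
    std.debug.print("{s}\n", .{&r}); // prints "bcw"
}
```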

squeek502 · November 23, 2023, 5:31am

Right, ignore me. Hopefully someone with some SIMD experience can chime in.

Validark · November 24, 2023, 5:56am

First-class language support is not here yet, but it’s planned.
Relevant issue: Indexing arrays with vectors (gather) · Issue #12815 · ziglang/zig · GitHub

If all else fails, you can use inline assembly. Here’s some code adapted from simdjzon. You’ll probably need to fix it up a little but hopefully this will get you started:

    const std = @import("std");
    const builtin = @import("builtin");
    const assert = std.debug.assert;

    const u8x32 = @Vector(32, u8);
    const u8x64 = @Vector(64, u8);
    const u64x4 = @Vector(4, u64);
    const u32x4 = @Vector(4, u32);
    const u8x16 = @Vector(16, u8);
    const u8x8 = @Vector(8, u8);

    pub const chunk_len = switch (builtin.cpu.arch) {
        .x86_64 => 32,
        .aarch64, .aarch64_be => 16,
        else => 16,
    };
    pub const Chunk = @Vector(chunk_len, u8);

    // end from https://gist.github.com/sharpobject/80dc1b6f3aaeeada8c0e3a04ebc4b60a
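    // On x86_64, Chunk is 32 bytes, so this emits the AVX2 form of vpshufb
    // (like _mm256_shuffle_epi8), which shuffles within each 128-bit lane.
    pub fn mm_shuffle_epi8(x: Chunk, mask: Chunk) Chunk {
        return asm (
            \\vpshufb %[mask], %[x], %[out]
            : [out] "=x" (-> Chunk),
            : [x] "x" (x), // plain input operand; "+" is only meaningful on outputs
              [mask] "x" (mask),
        );
    }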
    }

    // https://developer.arm.com/architectures/instruction-sets/intrinsics/vqtbl1q_s8
    // Note the operand order of tbl: here `mask` is the table and `x` holds the indexes.
    pub fn lookup_16_aarch64(x: u8x16, mask: u8x16) u8x16 {
        return asm (
            \\tbl  %[out].16b, {%[mask].16b}, %[x].16b
            : [out] "=&x" (-> u8x16),
            : [x] "x" (x),
              [mask] "x" (mask),
        );
    }

    fn lookup_chunk(comptime a: [16]u8, b: Chunk) Chunk {
        switch (builtin.cpu.arch) {
            .x86_64 => return mm_shuffle_epi8(a ** (chunk_len / 16), b),
            .aarch64, .aarch64_be => return lookup_16_aarch64(b, a),
            else => {
                var r: Chunk = @splat(0);
                for (0..chunk_len) |i| {
                    const c = b[i];
                    assert(c <= 0x0F);
                    r[i] = a[c];
                }
                return r;

                // var r: Chunk = @splat(0);
                // for (0..16) |i| {
                //     inline for ([2]comptime_int{ 0, 16 }) |o| {
                //         if ((b[o + i] & 0x80) == 0) {
                //             r[o + i] = a[o + b[o + i] & 0x0F];
                //         }
                //     }
                // }
                // return r;
            },
        }
    }

Mathias · November 24, 2023, 8:15am

Thank you @Validark. This is helpful. I was able to use the shuffle function above.