Shuffle @Vector with variable control

There is an accepted proposal Indexing arrays with vectors (gather) #12815 which if i understand correctly would allow one to shuffle (or permute) an @Vector with a runtime control.

However I cannot find any active development on that issue. Meantime the posted workaround on that page is no longer working with the latest zig version.

I tried to implement a runtime shuffle in C for my situation (x86_64 with instruction set up to AVX2) to then use it as a library in zig.

I found it was not entirely trivial; For avx2 there are only 2 intrinsics that allow a cross-lane permute with a runtime variable: _mm256_permutevar8x32_ps and _mm256_permutevar8x32_epi32. So shuffling anything else requires more work. Some examples show below.

__m256d my_shuffle256_D(__m256d V, __m256i mask) {
  // Runtime shuffle of 4xdouble by using intrinsic for 8xfloat
  // Need to double the mask indices and create the neighbouring index
  // to keep the 2xf32 slots together that form an f64
  __m256i masklo = _mm256_add_epi64(mask, mask);
  __m256i maskhi = _mm256_add_epi64(masklo, _mm256_set1_epi64x(1));

  // Shift result left, OR them together with original,
  __m256i shufmask = _mm256_or_si256(masklo, _mm256_slli_epi64(maskhi, 32));

  // do the shuffle
  __m256d out = (__m256d)(_mm256_permutevar8x32_ps((__m256)V, shufmask));

  return out;
}

__m256i my_shuffle256_B(__m256i in, __m256i index) {
  // Runtime shuffle of 32*byte
  // create second vector with values from the other lane
  __m256i in_hihi = _mm256_permute2x128_si256(in, in, 2 << 4 | 2 << 0);
  __m256i in_lolo = _mm256_permute2x128_si256(in, in, 1 << 4 | 1 << 0);

  // shuffle hi and lo
  __m256i ins = _mm256_shuffle_epi8(in_hihi, index);
  __m256i nis = _mm256_shuffle_epi8(in_lolo, index);

  // blend values from correct section
  __m256i mask = _mm256_cmpgt_epi8(index, _mm256_set1_epi8(0x0F));
  __m256i out = _mm256_blendv_epi8(ins, nis, mask);
  return out;
}

This approach however currently doesn’t work as translate-c is unable to process the required <immintrin.h> header (as asked in my other question here Translate C lib using <immintrin.h>).

I created an issue at translate-c, but does anybody have any other ideas how to do this?

If you’re trying to do this in zig, you might try some things mentioned in @Validark’s article: Eine Kleine Vectorized Classification - Validark's Blog

EDIT: looking again at what you’re trying to do, i may have been hasty in my response without understanding what your code does. apologies for that. that said, i’ve found lots of helpful info wrt algorithms, simd and zig in Validark’s blog. maybe check out some of the other articles if the one i linked isn’t helpful, you might find something interesting there.

Yeah, I’ve seen his blog and some of his presentations on the Utah Zig channel. Very interesting stuff. That would be able to do the same as what i’m trying to do and be slightly more integrated in zig i think, although it just passing straight to llvm.

Main (skill) issue on my side is i don’t know where to find the

extern fn @"llvm.x86.avx2.pshuf.b"(@Vector(32, u8), @Vector(32, u8)) @Vector(32, u8);

and other functions and their signatures, so it was just easier to go to the C functions for me.

Maybe if some can point me to the place where i can look these up would be nice.

The LLVM intrinsics for x86 (and other architectures in their respective files) are defined here: llvm-project/llvm/include/llvm/IR/IntrinsicsX86.td at main · llvm/llvm-project · GitHub

The llvm.x86.avx2.pshuf.b is this one for instance: llvm-project/llvm/include/llvm/IR/IntrinsicsX86.td at bb20724ecb743be0d67270934830510f0149ed15 · llvm/llvm-project · GitHub

Search for llvm.x86.avx2.pshuf.b Found this list of avx2 which does contain pshufb.

I haven’t tried many of them but i’d guess stuff from here should also be available
LLVM Language Reference Manual — LLVM 23.0.0git documentation.

Ok, that is a start.

But i still don’t know how to translate:

def int_x86_avx2_pshuf_b : GCCBuiltin<“__builtin_ia32_pshufb256”>,
Intrinsic<[llvm_v32i8_ty], [llvm_v32i8_ty,
llvm_v32i8_ty], [IntrNoMem]>;

into:

extern fn @"llvm.x86.avx2.pshuf.b"(@Vector(32, u8), @Vector(32, u8)) @Vector(32, u8);

Also that page does not seem to have the permute operations i need (but i just did a very quick search so i maybe wrong)

edit: Although i have other algorithms that use pshufb, for this one i’m only interested in translating the permute (VPERMPS and VPERMQ) and blend functions. For pshufb i can just copy Validarks example. (pshufb doesn’t do a lane crossing shuffle)

the general pattern seems to be:
def it_x86_name : GCCBuiltin<...>, Intrinsic<[Output], [Param1, Param2, ...], [Memory stuff]>;
Then there’s the comment at the top of the block that says // All intrinsics start with "llvm.x86.".

So for the name int_x86_avx2_pshuf_b, you’d do llvm.x86., then the name of the intrinsic with . instead of _, so llvm.x86.avx2.pshuf.b

llvm_v32i8_ty is “32 item vector of int8”, which is equivalent to @Vector(32, u8)/@Vector(32, i8) (bitwise equivalent).

Another example translation (not related to pshufb, i just picked one at random):

int_x86_avx512_vpermilvar_ps_512 becomes llvm.x86.avx512.vpermilvar.ps.512
The return value llvm_v16f32_ty becomes @Vector(16, f32)
The params llvm_v16f32_ty, llvm_v16i32_ty become @Vector(16, f32), @Vector(16, u32) (or i32 for the second)
So the result would be

extern fn @"llvm.x86.avx512.vpermilvar.ps.512"(@Vector(16, f32), @Vector(16, u32)) @Vector(16, f32);

(maybe someone motivated enough could write an automated tool for this, the whole process seems pretty mechanical)

Nice!

Thank you all

const std = @import("std");

extern fn @"llvm.x86.avx2.permps"(@Vector(8, f32), @Vector(8, i32)) @Vector(8, f32);

const V = @Vector(4, f64);
const I = @Vector(4, i64);

export fn my_shuffle256_D(in: V, index: I) V {
  const masklo = index + index;
  const maskhi = masklo + @as(I, @splat(1));
  const shufmask = masklo | (maskhi << @splat(32));
  const out = @"llvm.x86.avx2.permps"(@bitCast(in), @bitCast(shufmask));

  return @bitCast(out);
}
.LCPI0_0:
        .quad   4294967296
example.my_shuffle256_D:
push    rbp
mov     rbp, rsp
vpaddq  ymm2, ymm1, ymm1
vpsllq  ymm1, ymm1, 33
vpor    ymm1, ymm1, ymm2
vpbroadcastq    ymm2, qword ptr [rip + .LCPI0_0]
vpor    ymm1, ymm1, ymm2
vpermd  ymm0, ymm1, ymm0
pop     rbp
ret
3 Likes

I wish zig would have a proper @select that can do runtime permutations. Currently it doesn’t support those because some platform don’t, but I’d rather have this replaced by a for loop on some platform that not being able to write the code all together.

I’ve also tried implementing this on aarch64 but the instruction requires two following registers, which no inline assembly dialect supports, so I had to harcode register names.

Yeah, as i am the sole consumer of my code i can be quite selfish in targeting my own platform.

Maybe if your system has an equivalent to setr you can do something like the following. (please note the first comment)

__m256i my_shuffle256_QW3(__m256i V, __m256i index) {
// dont use, gives awfull assembly code

// clang-format off
return _mm256_setr_epi32(V[index[0] + 1], V[index[0]],
                         V[index[1] + 1], V[index[1]],
                         V[index[2] + 1], V[index[2]],
                         V[index[3] + 1], V[index[3]]);
// clang-format on
}

Also posting the solution for shuffling bytes:

extern fn @"llvm.x86.avx2.pshuf.b"(@Vector(32, i8), @Vector(32, i8)) @Vector(32, i8);
extern fn @"llvm.x86.avx2.pblendvb"(@Vector(32, i8), @Vector(32, i8), @Vector(32, i8)) @Vector(32, i8);

const V32x8i = @Vector(32, i8);
const I32x8i = @Vector(32, i8);

export fn my_shuffle256_B(in: V32x8i, index: I32x8i) V32x8i {
  const in_hihi = @shuffle(i8,in,undefined,@Vector(32, i8){ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 });
  const in_lolo = @shuffle(i8,in,undefined,@Vector(32, i8){ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 });

  const inhi = @"llvm.x86.avx2.pshuf.b"(in_hihi, index);
  const inlo = @"llvm.x86.avx2.pshuf.b"(in_lolo, index);

  // need the minus to get from int=1 to a mask, which needs msb set
  const mask = -@as(I32x8i, @intFromBool(index > @as(I32x8i, @splat(0x0F))));
  const out = @"llvm.x86.avx2.pblendvb"(inhi, inlo, mask);

  return out;
}
.LCPI0_0:
        .zero   32,15
example.my_shuffle256_B:
        push    rbp
        mov     rbp, rsp
        vpermq  ymm2, ymm0, 68
        vpermq  ymm0, ymm0, 238
        vpshufb ymm2, ymm2, ymm1
        vpshufb ymm0, ymm0, ymm1
        vpcmpgtb        ymm1, ymm1, ymmword ptr [rip + .LCPI0_0]
        vpblendvb       ymm0, ymm2, ymm0, ymm1
        pop     rbp
        ret

Both produce the same assembly as the C versions. This one is easier to extend to wider vectors.

I guess this is the most efficient way to shuffle a vector with a variable control (runtime control) at least for x86_64 with avx2 only.

2 Likes

The extern fn @"llvm.x86.avx2.pblendvb"(@Vector(32, i8), @Vector(32, i8), @Vector(32, i8)) @Vector(32, i8);Can be replaced by just @selectas it takes a runtime control.

extern fn @"llvm.x86.avx2.pshuf.b"(@Vector(32, i8), @Vector(32, i8)) @Vector(32, i8);

const V32x8i = @Vector(32, i8);
const I32x8i = @Vector(32, i8);

export fn my_shuffle256_B2(in: V32x8i, index: I32x8i) V32x8i {
  const in_hihi = @shuffle(i8, in, undefined, @Vector(32, i8){ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 });
  const in_lolo = @shuffle(i8, in, undefined, @Vector(32, i8){ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 });

  const inhi = @"llvm.x86.avx2.pshuf.b"(in_hihi, index);
  const inlo = @"llvm.x86.avx2.pshuf.b"(in_lolo, index);

  const out = @select(i8, index > @as(I32x8i, @splat(0x0F)), inlo, inhi);

  return out;
}

Assembly output: :slight_smile:

my_shuffle256_B2 = example.my_shuffle256_B

LLVM doesn’t support a runtime shufflevector statement. Zig is just using LLVM’s vector support.

Here are some stubs you might find useful.

const std = @import("std");

fn aarch64_tbl1(table: @Vector(16, u8), indices: @Vector(16, u8)) @Vector(16, u8) {
    return struct {
        extern fn @"llvm.aarch64.neon.tbl1"(@Vector(16, u8), @Vector(16, u8)) @Vector(16, u8);
    }.@"llvm.aarch64.neon.tbl1"(table, indices);
}

fn tbl4(table_part_1: @Vector(16, u8), table_part_2: @Vector(16, u8), table_part_3: @Vector(16, u8), table_part_4: @Vector(16, u8), indices: @Vector(16, u8)) @Vector(16, u8) {
    return struct {
        extern fn @"llvm.aarch64.neon.tbl4"(@TypeOf(table_part_1), @TypeOf(table_part_2), @TypeOf(table_part_3), @TypeOf(table_part_4), @TypeOf(indices)) @TypeOf(indices);
    }.@"llvm.aarch64.neon.tbl4"(table_part_1, table_part_2, table_part_3, table_part_4, indices);
}

// This one is for 32-bit arm
fn vtbl2(table_part_1: @Vector(8, u8), table_part_2: @Vector(8, u8), indices: @Vector(8, u8)) @Vector(8, u8) {
    // comptime assert(builtin.cpu.arch == .arm and std.Target.arm.featureSetHas(builtin.cpu.features, .neon));

    return struct {
        extern fn @"llvm.arm.neon.vtbl2"(@TypeOf(table_part_1), @TypeOf(table_part_2), @TypeOf(indices)) @TypeOf(table_part_1);
    }.@"llvm.arm.neon.vtbl2"(table_part_1, table_part_2, indices);
}

It’s funny that while LLVM doesn’t, clang the C/C++ frontend does. (as does gcc)

__builtin_shufflevector is comptime too. It’s the equivalent of Zig’s @shuffle

Did you read the link? Clang does not emit the llvm shufflevector in case there’s runtime mask as second argument for the builtin.

Oh, I see. You’re right that there is a version of the builtin that emits a series of extract and insert operations that the x86-64 backend can reconstruct into permutes sometimes. I’m not sure why they are using the term “mask” to refer to indices though. You can emit those same instructions in Zig 0.15 too. (Godbolt)

export fn foo(x: @Vector(4, u32), y: @Vector(4, u32)) @Vector(4, u32) {
    var result: @Vector(4, u32) = undefined;
    inline for (0..4) |i|
        result[i] = x[y[i]];
    return result;
}

This code is not good for obvious reasons though, so I think we need a better solution regardless.

I think @shuffle should accept runtime mask and do best effort when that is not supported by hw directly.