Shuffle @Vector with variable control

ODON1 · June 1, 2026, 12:43pm

There is an accepted proposal Indexing arrays with vectors (gather) #12815 which if i understand correctly would allow one to shuffle (or permute) an @Vector with a runtime control.

However I cannot find any active development on that issue. Meantime the posted workaround on that page is no longer working with the latest zig version.

I tried to implement a runtime shuffle in C for my situation (x86_64 with instruction set up to AVX2) to then use it as a library in zig.

I found it was not entirely trivial; For avx2 there are only 2 intrinsics that allow a cross-lane permute with a runtime variable: _mm256_permutevar8x32_ps and _mm256_permutevar8x32_epi32. So shuffling anything else requires more work. Some examples show below.

__m256d my_shuffle256_D(__m256d V, __m256i mask) {
  // Runtime shuffle of 4xdouble by using intrinsic for 8xfloat
  // Need to double the mask indices and create the neighbouring index
  // to keep the 2xf32 slots together that form an f64
  __m256i masklo = _mm256_add_epi64(mask, mask);
  __m256i maskhi = _mm256_add_epi64(masklo, _mm256_set1_epi64x(1));

  // Shift result left, OR them together with original,
  __m256i shufmask = _mm256_or_si256(masklo, _mm256_slli_epi64(maskhi, 32));

  // do the shuffle
  __m256d out = (__m256d)(_mm256_permutevar8x32_ps((__m256)V, shufmask));

  return out;
}

__m256i my_shuffle256_B(__m256i in, __m256i index) {
  // Runtime shuffle of 32*byte
  // create second vector with values from the other lane
  __m256i in_hihi = _mm256_permute2x128_si256(in, in, 2 << 4 | 2 << 0);
  __m256i in_lolo = _mm256_permute2x128_si256(in, in, 1 << 4 | 1 << 0);

  // shuffle hi and lo
  __m256i ins = _mm256_shuffle_epi8(in_hihi, index);
  __m256i nis = _mm256_shuffle_epi8(in_lolo, index);

  // blend values from correct section
  __m256i mask = _mm256_cmpgt_epi8(index, _mm256_set1_epi8(0x0F));
  __m256i out = _mm256_blendv_epi8(ins, nis, mask);
  return out;
}

This approach however currently doesn’t work as translate-c is unable to process the required <immintrin.h> header (as asked in my other question here Translate C lib using <immintrin.h>).

I created an issue at translate-c, but does anybody have any other ideas how to do this?

Travis · June 2, 2026, 5:29am

If you’re trying to do this in zig, you might try some things mentioned in @Validark’s article: Eine Kleine Vectorized Classification - Validark's Blog

EDIT: looking again at what you’re trying to do, i may have been hasty in my response without understanding what your code does. apologies for that. that said, i’ve found lots of helpful info wrt algorithms, simd and zig in Validark’s blog. maybe check out some of the other articles if the one i linked isn’t helpful, you might find something interesting there.

ODON1 · June 2, 2026, 6:44am

Yeah, I’ve seen his blog and some of his presentations on the Utah Zig channel. Very interesting stuff. That would be able to do the same as what i’m trying to do and be slightly more integrated in zig i think, although it just passing straight to llvm.

Main (skill) issue on my side is i don’t know where to find the

extern fn @"llvm.x86.avx2.pshuf.b"(@Vector(32, u8), @Vector(32, u8)) @Vector(32, u8);

and other functions and their signatures, so it was just easier to go to the C functions for me.

Maybe if some can point me to the place where i can look these up would be nice.

falsepattern · June 2, 2026, 6:58am

The LLVM intrinsics for x86 (and other architectures in their respective files) are defined here: llvm-project/llvm/include/llvm/IR/IntrinsicsX86.td at main · llvm/llvm-project · GitHub

The llvm.x86.avx2.pshuf.b is this one for instance: llvm-project/llvm/include/llvm/IR/IntrinsicsX86.td at bb20724ecb743be0d67270934830510f0149ed15 · llvm/llvm-project · GitHub

Travis · June 2, 2026, 7:11am

Search for llvm.x86.avx2.pshuf.b Found this list of avx2 which does contain pshufb.

I haven’t tried many of them but i’d guess stuff from here should also be available
LLVM Language Reference Manual — LLVM 23.0.0git documentation.

ODON1 · June 2, 2026, 7:14am

Ok, that is a start.

But i still don’t know how to translate:

def int_x86_avx2_pshuf_b : GCCBuiltin<“__builtin_ia32_pshufb256”>,
Intrinsic<[llvm_v32i8_ty], [llvm_v32i8_ty,
llvm_v32i8_ty], [IntrNoMem]>;

into:

extern fn @"llvm.x86.avx2.pshuf.b"(@Vector(32, u8), @Vector(32, u8)) @Vector(32, u8);

Also that page does not seem to have the permute operations i need (but i just did a very quick search so i maybe wrong)

edit: Although i have other algorithms that use pshufb, for this one i’m only interested in translating the permute (VPERMPS and VPERMQ) and blend functions. For pshufb i can just copy Validarks example. (pshufb doesn’t do a lane crossing shuffle)

falsepattern · June 2, 2026, 7:28am

the general pattern seems to be:
def it_x86_name : GCCBuiltin<...>, Intrinsic<[Output], [Param1, Param2, ...], [Memory stuff]>;
Then there’s the comment at the top of the block that says // All intrinsics start with "llvm.x86.".

So for the name int_x86_avx2_pshuf_b, you’d do llvm.x86., then the name of the intrinsic with . instead of _, so llvm.x86.avx2.pshuf.b

llvm_v32i8_ty is “32 item vector of int8”, which is equivalent to @Vector(32, u8)/@Vector(32, i8) (bitwise equivalent).

Another example translation (not related to pshufb, i just picked one at random):

github.com/llvm/llvm-project

llvm/include/llvm/IR/IntrinsicsX86.td

6f44c750a


      
          def int_x86_avx512_vpermilvar_ps_512 :
                ClangBuiltin<"__builtin_ia32_vpermilvarps512">,
                DefaultAttrsIntrinsic<[llvm_v16f32_ty],
                                      [llvm_v16f32_ty, llvm_v16i32_ty], [IntrNoMem]>;

int_x86_avx512_vpermilvar_ps_512 becomes llvm.x86.avx512.vpermilvar.ps.512
The return value llvm_v16f32_ty becomes @Vector(16, f32)
The params llvm_v16f32_ty, llvm_v16i32_ty become @Vector(16, f32), @Vector(16, u32) (or i32 for the second)
So the result would be

extern fn @"llvm.x86.avx512.vpermilvar.ps.512"(@Vector(16, f32), @Vector(16, u32)) @Vector(16, f32);

(maybe someone motivated enough could write an automated tool for this, the whole process seems pretty mechanical)

ODON1 · June 2, 2026, 9:07am

Nice!

Thank you all

const std = @import("std");

extern fn @"llvm.x86.avx2.permps"(@Vector(8, f32), @Vector(8, i32)) @Vector(8, f32);

const V = @Vector(4, f64);
const I = @Vector(4, i64);

export fn my_shuffle256_D(in: V, index: I) V {
  const masklo = index + index;
  const maskhi = masklo + @as(I, @splat(1));
  const shufmask = masklo | (maskhi << @splat(32));
  const out = @"llvm.x86.avx2.permps"(@bitCast(in), @bitCast(shufmask));

  return @bitCast(out);
}

.LCPI0_0:
        .quad   4294967296
example.my_shuffle256_D:
push    rbp
mov     rbp, rsp
vpaddq  ymm2, ymm1, ymm1
vpsllq  ymm1, ymm1, 33
vpor    ymm1, ymm1, ymm2
vpbroadcastq    ymm2, qword ptr [rip + .LCPI0_0]
vpor    ymm1, ymm1, ymm2
vpermd  ymm0, ymm1, ymm0
pop     rbp
ret

gwenzek · June 2, 2026, 9:32am

I wish zig would have a proper @select that can do runtime permutations. Currently it doesn’t support those because some platform don’t, but I’d rather have this replaced by a for loop on some platform that not being able to write the code all together.

I’ve also tried implementing this on aarch64 but the instruction requires two following registers, which no inline assembly dialect supports, so I had to harcode register names.

ODON1 · June 2, 2026, 9:38am

Yeah, as i am the sole consumer of my code i can be quite selfish in targeting my own platform.

Maybe if your system has an equivalent to setr you can do something like the following. (please note the first comment)

__m256i my_shuffle256_QW3(__m256i V, __m256i index) {
// dont use, gives awfull assembly code

// clang-format off
return _mm256_setr_epi32(V[index[0] + 1], V[index[0]],
                         V[index[1] + 1], V[index[1]],
                         V[index[2] + 1], V[index[2]],
                         V[index[3] + 1], V[index[3]]);
// clang-format on
}

ODON1 · June 3, 2026, 11:41am

Also posting the solution for shuffling bytes:

extern fn @"llvm.x86.avx2.pshuf.b"(@Vector(32, i8), @Vector(32, i8)) @Vector(32, i8);
extern fn @"llvm.x86.avx2.pblendvb"(@Vector(32, i8), @Vector(32, i8), @Vector(32, i8)) @Vector(32, i8);

const V32x8i = @Vector(32, i8);
const I32x8i = @Vector(32, i8);

export fn my_shuffle256_B(in: V32x8i, index: I32x8i) V32x8i {
  const in_hihi = @shuffle(i8,in,undefined,@Vector(32, i8){ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 });
  const in_lolo = @shuffle(i8,in,undefined,@Vector(32, i8){ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 });

  const inhi = @"llvm.x86.avx2.pshuf.b"(in_hihi, index);
  const inlo = @"llvm.x86.avx2.pshuf.b"(in_lolo, index);

  // need the minus to get from int=1 to a mask, which needs msb set
  const mask = -@as(I32x8i, @intFromBool(index > @as(I32x8i, @splat(0x0F))));
  const out = @"llvm.x86.avx2.pblendvb"(inhi, inlo, mask);

  return out;
}

.LCPI0_0:
        .zero   32,15
example.my_shuffle256_B:
        push    rbp
        mov     rbp, rsp
        vpermq  ymm2, ymm0, 68
        vpermq  ymm0, ymm0, 238
        vpshufb ymm2, ymm2, ymm1
        vpshufb ymm0, ymm0, ymm1
        vpcmpgtb        ymm1, ymm1, ymmword ptr [rip + .LCPI0_0]
        vpblendvb       ymm0, ymm2, ymm0, ymm1
        pop     rbp
        ret

Both produce the same assembly as the C versions. This one is easier to extend to wider vectors.

I guess this is the most efficient way to shuffle a vector with a variable control (runtime control) at least for x86_64 with avx2 only.

ODON1 · June 4, 2026, 1:05pm

The extern fn @"llvm.x86.avx2.pblendvb"(@Vector(32, i8), @Vector(32, i8), @Vector(32, i8)) @Vector(32, i8);Can be replaced by just @selectas it takes a runtime control.

extern fn @"llvm.x86.avx2.pshuf.b"(@Vector(32, i8), @Vector(32, i8)) @Vector(32, i8);

const V32x8i = @Vector(32, i8);
const I32x8i = @Vector(32, i8);

export fn my_shuffle256_B2(in: V32x8i, index: I32x8i) V32x8i {
  const in_hihi = @shuffle(i8, in, undefined, @Vector(32, i8){ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 });
  const in_lolo = @shuffle(i8, in, undefined, @Vector(32, i8){ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 });

  const inhi = @"llvm.x86.avx2.pshuf.b"(in_hihi, index);
  const inlo = @"llvm.x86.avx2.pshuf.b"(in_lolo, index);

  const out = @select(i8, index > @as(I32x8i, @splat(0x0F)), inlo, inhi);

  return out;
}

Assembly output:

my_shuffle256_B2 = example.my_shuffle256_B

Validark · June 15, 2026, 11:05pm

LLVM doesn’t support a runtime shufflevector statement. Zig is just using LLVM’s vector support.

Here are some stubs you might find useful.

const std = @import("std");

fn aarch64_tbl1(table: @Vector(16, u8), indices: @Vector(16, u8)) @Vector(16, u8) {
    return struct {
        extern fn @"llvm.aarch64.neon.tbl1"(@Vector(16, u8), @Vector(16, u8)) @Vector(16, u8);
    }.@"llvm.aarch64.neon.tbl1"(table, indices);
}

fn tbl4(table_part_1: @Vector(16, u8), table_part_2: @Vector(16, u8), table_part_3: @Vector(16, u8), table_part_4: @Vector(16, u8), indices: @Vector(16, u8)) @Vector(16, u8) {
    return struct {
        extern fn @"llvm.aarch64.neon.tbl4"(@TypeOf(table_part_1), @TypeOf(table_part_2), @TypeOf(table_part_3), @TypeOf(table_part_4), @TypeOf(indices)) @TypeOf(indices);
    }.@"llvm.aarch64.neon.tbl4"(table_part_1, table_part_2, table_part_3, table_part_4, indices);
}

// This one is for 32-bit arm
fn vtbl2(table_part_1: @Vector(8, u8), table_part_2: @Vector(8, u8), indices: @Vector(8, u8)) @Vector(8, u8) {
    // comptime assert(builtin.cpu.arch == .arm and std.Target.arm.featureSetHas(builtin.cpu.features, .neon));

    return struct {
        extern fn @"llvm.arm.neon.vtbl2"(@TypeOf(table_part_1), @TypeOf(table_part_2), @TypeOf(indices)) @TypeOf(table_part_1);
    }.@"llvm.arm.neon.vtbl2"(table_part_1, table_part_2, indices);
}

Cloudef · June 16, 2026, 2:10am

It’s funny that while LLVM doesn’t, clang the C/C++ frontend does. (as does gcc)

github.com/llvm/llvm-project

Confusing/missing documentation for __builtin_shufflevector

opened 04:23PM - 23 Dec 22 UTC

lawben

documentation clang:codegen

I've recently been playing around with `__builtin_shufflevector` (and `__builtin…_shuffle` in GCC). GCC's `_builtin_shuffle` allows me to pass a vector and a mask `__builtin_shuffle(vec, mask);`. The [LLVM documentation on `_builtin_shufflevector`](https://clang.llvm.org/docs/LanguageExtensions.html#builtin-shufflevector) says that clang requires two vectors and then indixes to shuffle from them. Only signature in documentation: > ```__builtin_shufflevector(vec1, vec2, index1, index2, ...)``` Strictly following this signature and the examples further down in the documentation severely limits the use of `__builtin_shufflevector`, as all indexes must be known at compile time. However, the documentation omits the fact that `vec2` can be the mask and not just an input. I only found this by chance while looking at the [code that emits the vector shuffle code](https://github.com/llvm/llvm-project/blob/28b52abeec3c90c9826803c2caa03ed2096a7830/clang/lib/CodeGen/CGExprScalar.cpp#L1643-L1679). So a valid signature would also be: ```__builtin_shufflevector(vec, mask); // vec and mask must have same type``` With this signature, the use is a lot closer to GCC's `__builtin_shuffle`, which allows me to write the following code snipptet to support both compilers. ```cpp // 16 Byte vector of 4x uint32_t using VecT __attribute__((vector_size(16))) = uint32_t; VecT shuffle_vector(VecT vec, VecT mask) { #if GCC_COMPILER return __builtin_shuffle(vec, mask); #else return __builtin_shufflevector(vec, mask); #endif } ``` Here is a short [godbolt example](https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGISQDspK4AMngMmAByPgBGmMQgAGykAA6oCoRODB7evv5BaRmOAmER0SxxCcl2mA5ZQgRMxAQ5Pn6Btpj2xQwNTQSlUbHxSbaNza15HQrjA%2BFDFSOJAJS2qF7EyOwc5gDM4cjeWADUJrtuMYQKAPQz6KaW1gB0CGfYJhoAgh%2BfXhlGxwAarVdmZVJJjgB9SFMAgEYh4GJeAiYaFQAButSIxEhGQAXpgIFwVstlqddgARY5ecIEUGQghnKxfH7A5Cg8HHBQILxUKj0SGYhwkCBsjkQoWkIEgsEQlhMBQAazJJgCzO%2BZn2VGOECwNAi6Ag0IA4pE5G5oSqzIlzIljmAwHr5oboYdDMBLcsfscfcdiJgCBsGFDIUi6I4GLieXz6BBJcd5Uqvbt1Xsukpvb7/YHiMHoWHaBGo7z%2BZghdi47UpYnlUyfmmGPgqPWAhT619aQmmOEICq1ZmfWLZccheSqaqLFwNFKzNPjrs55INKq2ynTl9fdL2cOa2PTmrdjOj8cuCumevPpuh5z1gQUsi99ySwLyyL4zXk%2BrN9drschAh1lodBjhSBFBBAY4lylBdoLnWcBy5Ah0BAFBb3JNwzjcY5b3vRkAFYLGXPDx3OTDTjMMwpXMMwL03Oi6MwsicOREwCNPYj0LI6iqIohD6M3Rjzmw5FcNYiwzFYkiMKE7jyIkjd%2BPowSsOY/CLF2STOKEu4UNcWg6xZVsOFWWhODw3g/A4LRSFQTgMMeSwuXWTZMHI3YeFIAhNGM1ZFRAXZdieALgpC0LklMjhJAs7ybM4XgFBAacvKs4zSDgWAYEQVCWBSOh4nISg0ByvKEjdIwuDwjRpywdE8C2QE8EwAB3AB5FJGE4DyaELeIEogGIYsuZhiAAT063girYQQWoYWgxpS0gsHlIxxAW/B/TqTEEoWzBVFqZFtms2kuhi2hEWIJoRo8LAYvhPAWHG1Z%2BSYYAFAa5q2o67heH4QQRDEdgpBkQRFBUdQFt0SiDCMFBrGsfREQSyBVlQFIem2gBaFrdmODH5U2V5KVEBRRCwAxFVc3G7jOCkHisSxZ3irosSyFxG0mPwuGCRtBnKSo9EKTIBA5gX0iFhheeGBIuZqOoBD6CZPDaPRZZ6BW5jKKWVdmEWZdmSXFml1YFGcrYJBMszooW2yOGOVQAA5EgxxIITK4ATzwp4NC9nVcEIEg3K4ZZeGSrRSVIPyAqC0KY4C/ROCi0gHsq0hLOsm34sSzzvNWdKstUgqICK3L6GISJWG2B2nZd443Y9r2vd4TB8GxPBkK537hFEcQgc70G1BiyHSCai6Uke%2BOOHM1OYptlqRIfVBtSr53Xeh92KobjQdQ8YrS7csxg%2BzlLw4QTAmCwBJewnxPk%2BnNPeAz2ws9DnyI/8wLY9j8LOF2K307io%2BYc0qZTzhAJAxcSqFwgaXEAwAuAUUWmWOqmB3qtXapZLq4ZeqUAGgtIal1xqkEmowAgM05oxSWu6Va1l1osy2jFXa%2B0UQxWOhFayZ0YgXVGtdQ6IcEQPW%2Bk9Awr1UGfQwT9WQ/0e7SD7koAeEM9Br1hg5GwHCkZX1RujTgWMcZ4yaMgQmtMFSky6EwCmPoMbUyJnDBm1lVaswgK4XW3N0AG35lzQWPRnGeKyG4kYMtmZy16DrJWeQAndHqPreYfN/FjH6M4mY/Q/FGzWBsM2QcJ5T3vrFW2y8a7AGQMgE8Zgng0QgH7bEgdD4v3DpHD%2Bn9QrX14Lfae1sAEJSSjnYB8AwEgALhQIuqBd7xHLmwTgeSIQFKKfA0pTcW4kDbnoTuUjAYyNkP3cG1ldBcxHkwMegjMl/wfpwOed4F5L0divY4UzimlO3kMku8QqkhxzqsU%2B58RhXwijfEAKdsmPw6YA1%2BdTo4NOChPX%2BrT/4cBecfC2HAzBHJybCsOqxMTEAyM4SQQA%3D%3D) that shows that this vector+mask signature is actually correct. It would be really helpful to add this to the documentation to make clear that clang also supports runtime masks in the shuffle.

Validark · June 16, 2026, 8:39am

__builtin_shufflevector is comptime too. It’s the equivalent of Zig’s @shuffle

Cloudef · June 16, 2026, 9:56am

Did you read the link? Clang does not emit the llvm shufflevector in case there’s runtime mask as second argument for the builtin.

Validark · June 16, 2026, 10:59am

Oh, I see. You’re right that there is a version of the builtin that emits a series of extract and insert operations that the x86-64 backend can reconstruct into permutes sometimes. I’m not sure why they are using the term “mask” to refer to indices though. You can emit those same instructions in Zig 0.15 too. (Godbolt)

export fn foo(x: @Vector(4, u32), y: @Vector(4, u32)) @Vector(4, u32) {
    var result: @Vector(4, u32) = undefined;
    inline for (0..4) |i|
        result[i] = x[y[i]];
    return result;
}

This code is not good for obvious reasons though, so I think we need a better solution regardless.

Cloudef · June 16, 2026, 12:34pm

I think @shuffle should accept runtime mask and do best effort when that is not supported by hw directly.