Scatter, gather, reduce w/ @Vector

maedoc · September 19, 2023, 3:25pm

hi, I am exploring a bit the use of Zig’s Vector which I initially thought would be great for some algorithms I’m using ISPC for, but ran immediately into some ops which didn’t fit with the existing support. As a simple example, according to godbolt, a kernel in ISPC such as

uniform float sum_square_at(uniform float *num, varying int idx) {
    varying float num_idx = num[idx];
    return reduce_add(num_idx);
}

results in fairly compact assembly compared to an equivalent C++ snippet forced to use SIMD, and the intent is clearer IMO.

I was curious if Zig’s Vector will get similar sets of operations or if inline assembly would be the required approach?

AndrewCodeDev · September 19, 2023, 5:48pm

Hello @maedoc - good question.

I did some searching through the github issues to check. I found better search results when I looked up “SIMD” instead of “Vector” but you’ll probably find examples of both.

Primarily, we’re talking about proposals here. A big chunk of the Zig effort at the moment is centered around things like incremental compilation, but there is a few open proposals for vectorized operations. One, for instance, is:

github.com/ziglang/zig

Add packing and unpacking selection operations for Vectors

opened 12:36AM - 24 May 23 UTC

Validark

proposal

This is a proposal to add `@packSelect` and `@unpackSelect`. These operations ar…e analogous to pdep/pext/@extractBits/@depositBits (see: https://github.com/ziglang/zig/issues/14995), except this proposal is for Vectors, not fixed-length bitvectors (i.e. integers). # packSelect I define `@packSelect(mask: @Vector(VEC_SIZE, bool), vector: @Vector(VEC_SIZE, VEC_TYPE))` which packs the vector into the left-hand side according to `mask`. It's basically like `pext` but it operates on vector lanes instead of bits. This is equivalent to `VPCOMPRESS` on new x86_64 machines. However, even without `VPCOMPRESS` support, this is a very common operation that can be performed in a wide variety of ways. Here are some stackoverflow questions about this: https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask https://stackoverflow.com/questions/28735461/shift-elements-to-the-left-of-a-simd-register-based-on-boolean-mask https://stackoverflow.com/questions/25074197/compact-avx2-register-so-selected-integers-are-contiguous-according-to-mask https://stackoverflow.com/questions/7886628/optimizing-array-compaction ## Motivating example When parsing a file, we often want to copy some of a file over to a buffer. Let's say we are reading a JSON file into vectors of size `64` and the first `64` characters are `"name": "Validark", "is_programmer": true, "favorite_color": "re`. Here is this information as Zig code: ```zig const VEC_SIZE = 64; const buffer = "\"name\": \"Validark\", \"is_programmer\": true, \"favorite_color\": \"re"; ``` Let's say we want to copy all of the characters between quotes into a buffer. For simplicity we are going to assume there are no escaped quotation marks within quoted strings. Here is the simplified inner loop of the first iteration: ```zig const vec: @Vector(VEC_SIZE, u8) = buf[0..VEC_SIZE].*; // generate a bitmask where all tabs in the vector correspond to a 1 in the bitmask const quote_mask = @bitCast(std.meta.Int(.unsigned, VEC_SIZE), vec == @splat(VEC_SIZE, @as(u8, '"'))); // create a mask where all the characters between quotes and the open quote are marked. const quoted_and_open_quote_mask = prefix_xor(quote_mask); // unset (open) quotes so the mask only has the bits between quotes set const quoted_mask = quoted_and_open_quote_mask & ~quote_mask; // Pack the quoted strings into the front of the vector const packed_vector = @packSelect(@bitCast(@Vector(VEC_SIZE, bool), quoted_mask), vec); // Write the quoted strings into some `chars` buffer. chars[0..VEC_SIZE].* = packed_vector; // Make sure to allocate at least VEC_SIZE extra slots in chars! // Advance the chars slice by how many valid characters we wrote. chars = chars[@popCount(quoted_mask)..]; ``` Here are the vectors/bitvectors: ``` LSB<-------------------------------------------------------->MSB vec: "name": "Validark", "is_programmer": true, "favorite_color": "re quote_mask: 1000010010000000010010000000000000100000000100000000000000100100 quoted_mask: 0111100001111111100001111111111111000000000011111111111111000011 packed_vec: nameValidarkis_programmerfavorite_colorre....................... ``` <details> <summary>Click here for more information on the prefix_xor operation</summary> ## prefix_xor See this article: https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/ Here is an implementation of it with https://github.com/ziglang/zig/issues/9631: ```zig /// Given a bitmask, returns a bitmask where every pair of 1's is filled in. /// The rightmost bit of each pair is set, but the leftmost is unset. /// e.g.: /// MSB<-------------------------------------------------------->LSB /// bitmask: 0000000001000000010000010000010000010000010000000100000100000000 /// returns: 0000000000111111110000001111110000001111110000000011111100000000 fn prefix_xor(bitmask: anytype) @TypeOf(bitmask) { const all_ones = std.math.maxInt(@TypeOf(bitmask)) + std.math.minInt(@TypeOf(bitmask)); return @mulCarryless(bitmask, all_ones); } ``` Here is another implementation that does not rely on carryless multiply. Hopefully one day LLVM will know this is equivalent when carryless multiply is not supported (or very slow) on a particular machine: ```zig fn prefix_xor(bitmask: anytype) @TypeOf(bitmask) { var x = bitmask; inline for (0..(@typeInfo(std.math.Log2Int(@TypeOf(bitmask))).Int.bits)) |i| x ^= x << comptime (1 << i); return x; } ``` (no guarantees when passing in integers that are not powers of 2) --- </details> ## What about the fill values? In case it comes in handy for optimization purposes, it might be a good idea to make it undefined what the non-packed values are in a packed vector. Hopefully that also means it is undefined what bytes will end up in `chars[@popCount(quoted_mask)..VEC_SIZE];` in the code above. Even writing nothing at all to those bytes should be valid in the case above. x86_64 has a variant which fills the rest of the vector with values from another source and a variant which fills it with zeroes. It *could* be nice to be able to specify which behavior you want. E.g., one could pass in `src`, or `@splat(VEC_SIZE, @as(u8, 0))`, or `@splat(VEC_SIZE, @as(u8, undefined))` if you aren't relying on any particular behavior. However, this effect can already be achieved by creating a mask with `std.simd.iota(u8, VEC_SIZE) >= @splat(VEC_SIZE, @as(u8, @popCount(quoted_mask)))` and then doing a `@select` to move either `src` or `@splat(VEC_SIZE, @as(u8, 0))` in the right places. Hopefully the optimizer will be smart enough at some point to know that that pattern still only needs 1 `VPCOMPRESSB` on AVX512_VBMI2 + AVX512VL x86_64 machines. Alternately, `@packSelect` could be made to take in a scalar value with which to fill in the empty spaces, but I think this might lead to someone filling with 0's, then doing a `@select` which turns all the 0's into the element from `src`. That would be very bad for the optimizer because it would then have to prove that 0's could not have been selected by `@packSelect`, which would be impossible to prove in cases like the example given above. Hence, I think making it fill with undefined's is the best move. ## Other uses Here are some more problems that can be solved with `@packSelect`: http://0x80.pl/notesen/2019-01-05-avx512vbmi-remove-spaces.html https://lemire.me/blog/2017/04/10/removing-duplicates-from-lists-quickly/ Here is a fun snippet that would print the indices in a vector where tabs occur: ```zig const vec: @Vector(VEC_SIZE, u8) = buf[0..VEC_SIZE].*; const tabs = vec == @splat(VEC_SIZE, @as(u8, '\t')); const num_tabs = std.simd.countTrues(tabs); // could also use @popCount with @bitCast const tab_indices = @packSelect(tabs, std.simd.iota(u8, VEC_SIZE)); for (@as([VEC_SIZE]u8, tab_indices)[0..num_tabs]) |index| { std.debug.print("index: {}\n", .{index}); } ``` Daniel Lemire has some articles on how to efficiently iterate over set bits too. Hopefully code like the above could one day be optimized as well as the C++ code which uses intrinsics: https://lemire.me/blog/2022/05/10/faster-bitset-decoding-using-intel-avx-512/ https://lemire.me/blog/2022/05/06/fast-bitset-decoding-using-intel-avx-512/ https://lemire.me/blog/2019/05/15/bitset-decoding-on-apples-a12/ # unpackSelect The second part of this proposal is for `@unpackSelect`, which corresponds to `VPEXPAND` on x86_64, which is basically like PDEP but operates on vector lanes rather than bits. It's the opposite of `@packSelect`, so in the example above, you could spread out the bytes in the `packed_vec` back into the same positions as in `vec` by doing `@unpackSelect(@bitCast(@Vector(VEC_SIZE, bool), quoted_mask), packed_vec)`. Again, even without direct `VPEXPAND` support this operation can be done in a number of ways on different architectures. Note: I am using the same signature as `@packSelect` above. Here is a stackoverflow with one method given for accomplishing this operation: https://stackoverflow.com/questions/48174640/avx2-expand-contiguous-elements-to-a-sparse-vector-based-on-a-condition-like-a ## Uses `@unpackSelect` is useful for a few situations I can think of: 1. When you want to make more room in a vector - e.g. [adding escape characters to a string](https://lemire.me/blog/2022/09/14/escaping-strings-faster-with-avx-512/) - "apple" -> \\"apple\\" 2. When you want to place [successive values in a vector into specific places in another vector](https://stackoverflow.com/questions/48174640/avx2-expand-contiguous-elements-to-a-sparse-vector-based-on-a-condition-like-a) - e.g. Let's say you have `vec1` containing `[a, b, c, d, e, f, g, h]`, and a `mask` that indicates you want to convert `vec1` into `[_, b, c, _, e, _, _, h]`, with each `_` replaced by successive values in `vec2` which contains `[1, 2, 3, 4, 5, 6, 7, 8]`. You need to get a vector with `[1, _, _, 2, _, 3, 4, _]` so a `@select` can use `mask` and `vec1` to produce `[1, b, c, 2, e, 3, 4, h]`. The missing vector can be generated with `@unpackSelect(@bitCast(@Vector(VEC_SIZE, bool), mask), vec2)`. ## What about the fill values? I think the same logic applies to `@unpackSelect` as `@packSelect` that it should be undefined what the non-relevant values are. ## Other uses More problems solvable with `@unpackSelect`/`VPEXPAND`: http://0x80.pl/notesen/2022-01-24-avx512vbmi2-varuint.html https://zeux.io/2022/09/02/vpexpandb-neon-z3/ --- Bikeshedding welcome.

If you don’t see what you’re looking for, then I’d say that there isn’t a plan at this moment unless there’s more internal conversation about it.

There’s a lot of inline assembly in the Thread module, so there’s definitely precedence for having inline assembly calls for standard utilities. An issue that people seem to be having in that particular link I posted is whether or not the proposal would qualify as a builtin function. Since @Vector deals with builtins quite extensively, I can see this as an ongoing point of contention.

What are your top utilities that you would like to see added? Also, have you considered doing C bindings for the functionality that you would like? I see that ISPC is a C extension, so I can’t speak to how easy that would be in this case.

tensorush · September 19, 2023, 7:25pm

Here’re two examples of how that kernel could be written in Zig right now:

const std = @import("std");

const DIM = 4;
const Mat = [DIM]Vec;
const Vec = @Vector(DIM, f64);
const VecIdxComp = @Vector(DIM, i32);
const VecIdx = @Vector(DIM, VecIdxUint);
const VecIdxUint = std.simd.VectorIndex(Vec);

fn sumSquareAtComp(num: Mat, comptime idx: VecIdxComp) f64 {
    const mask = comptime idx * std.simd.repeat(DIM, [_]i32{ 1, -1 });
    const num_idx1 = @shuffle(f64, num[0], num[1], std.simd.extract(mask, 0, 2));
    const num_idx2 = @shuffle(f64, num[2], num[3], std.simd.extract(mask, 2, 2));
    const num_idx = std.simd.join(num_idx1, num_idx2);
    return @reduce(.Add, num_idx);
}

fn sumSquareAt(num: Mat, idx: VecIdx) f64 {
    var num_idx: Vec = undefined;
    for (num, 0..) |sub_num, i| {
        num_idx[i] = sub_num[idx[i]];
    }
    return @reduce(.Add, num_idx);
}

test "sumSquareAt & sumSquareAtComp" {
    const num = Mat{
        @splat(1.0),
        @splat(12.0),
        @splat(123.0),
        @splat(1234.0),
    };
    const idx = std.simd.iota(VecIdxUint, DIM);
    try std.testing.expectEqual(@as(f64, 1370.0), sumSquareAt(num, idx));
    try std.testing.expectEqual(@as(f64, 1370.0), sumSquareAtComp(num, idx));
}

The simpler sumSquareAt function already works on inputs with arbitrary number of dimensions, while sumSquareAtComp doesn’t scale as easily, plus, it requires the idx (input index vector) to consist of i32 and be known at compile-time.

maedoc · September 20, 2023, 8:26pm

I should have been looking through the proposals indeed and I somehow missed the std.simd module completely (how/where are docs on that sort of thing?), so I will do some of that catching up…

that’s an interesting suggestion I hadn’t thought; ISPC is a way to do CUDA/OpenCL style code for the CPU w/o a runtime all while generating very compact assembly (e.g. better than autovec in Intel’s commercial compilers), and similarly to Zig, it makes attempts to ensure integration with C/C++ is easy: once the ISPC code is compiled it is immediately callable via C ABI, so Zig could just call specific bits. I still like the idea of not needing a second compiler, if possible.

I’m not versed enough with Zig and its history but there may be some useful lessons to grab from ISPC since it’s worked out a fairly complete & usable SIMD story built on/with LLVM.

Thanks for the examples, but there’s a compile error on godbolt with Zig v0.10, what version do I need to compile those? I agree that I completely missed @reduce(.Add but the @shuffle doesn’t seem like a scalable approach for larger vector sizes. That’s why I was looking for an equivalent of num[idx] which would be a (performance-warning-worthy) gather op for which ISPC issues vgatherqps (which neither GCC nor Clang will do).

Thanks for the replies and I will keep reading!

AndrewCodeDev · September 20, 2023, 10:43pm

That’s a pain-point we have at the moment. The standard library is full of utilities (people often don’t realize how much is actually in there… it’s quite extensive). Let me grab a few threads for you here (expand the subsections to see links)…

There’s always the “std” documentation itself (experimental): Zig Documentation

I’m sure there’s more but I think that’s probably a good start to understand some of the thinking towards the standard library.

In the words of Professor Farnsworth “A man can dream… a man can dream…”

That said, you could probably add a build step for this - I haven’t done much of multi-stage building/linking yet (and heads up, we get more questions about that by an order of magnitude than anything else). At the very least, there’s a good amount of conversation regarding the build system and now some popular youtubers are starting to go in on tutorials so hopefully we’ll break even at some point.

This may be beneath your level, but here it is for the sake of argument:

tensorush · September 21, 2023, 4:11am

I compiled those examples with Zig v0.11.0. I agree @shuffle doesn’t cut it here, I decided to include it just as a showcase. Currently Zig doesn’t have a gather function like the one ISPC uses, it would definitely be a worthy addition and a more appropriate function to use in that example.

maedoc · September 21, 2023, 7:00am

I really appreciate the effort to pull up all those links, I’ll definitely dig in.

I will definitely try this, since it would be only a handful of handy functions (maybe a slippery slope let’s see).

Once I have a good read through open proposals I may take a stab at it.

jnordwick · October 19, 2023, 10:27am

I’d stick wit the op names or else it is a pain to find them in the docs and you always wonder why they changed the name - is it the same or different?