This is a proposal to add `@packSelect` and `@unpackSelect`. These operations ar…e analogous to pdep/pext/@extractBits/@depositBits (see: https://github.com/ziglang/zig/issues/14995), except this proposal is for Vectors, not fixed-length bitvectors (i.e. integers).
# packSelect
I define `@packSelect(mask: @Vector(VEC_SIZE, bool), vector: @Vector(VEC_SIZE, VEC_TYPE))` which packs the vector into the left-hand side according to `mask`. It's basically like `pext` but it operates on vector lanes instead of bits. This is equivalent to `VPCOMPRESS` on new x86_64 machines. However, even without `VPCOMPRESS` support, this is a very common operation that can be performed in a wide variety of ways. Here are some stackoverflow questions about this:
https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask
https://stackoverflow.com/questions/28735461/shift-elements-to-the-left-of-a-simd-register-based-on-boolean-mask
https://stackoverflow.com/questions/25074197/compact-avx2-register-so-selected-integers-are-contiguous-according-to-mask
https://stackoverflow.com/questions/7886628/optimizing-array-compaction
## Motivating example
When parsing a file, we often want to copy some of a file over to a buffer. Let's say we are reading a JSON file into vectors of size `64` and the first `64` characters are `"name": "Validark", "is_programmer": true, "favorite_color": "re`. Here is this information as Zig code:
```zig
const VEC_SIZE = 64;
const buffer = "\"name\": \"Validark\", \"is_programmer\": true, \"favorite_color\": \"re";
```
Let's say we want to copy all of the characters between quotes into a buffer. For simplicity we are going to assume there are no escaped quotation marks within quoted strings. Here is the simplified inner loop of the first iteration:
```zig
const vec: @Vector(VEC_SIZE, u8) = buf[0..VEC_SIZE].*;
// generate a bitmask where all tabs in the vector correspond to a 1 in the bitmask
const quote_mask = @bitCast(std.meta.Int(.unsigned, VEC_SIZE), vec == @splat(VEC_SIZE, @as(u8, '"')));
// create a mask where all the characters between quotes and the open quote are marked.
const quoted_and_open_quote_mask = prefix_xor(quote_mask);
// unset (open) quotes so the mask only has the bits between quotes set
const quoted_mask = quoted_and_open_quote_mask & ~quote_mask;
// Pack the quoted strings into the front of the vector
const packed_vector = @packSelect(@bitCast(@Vector(VEC_SIZE, bool), quoted_mask), vec);
// Write the quoted strings into some `chars` buffer.
chars[0..VEC_SIZE].* = packed_vector;
// Make sure to allocate at least VEC_SIZE extra slots in chars!
// Advance the chars slice by how many valid characters we wrote.
chars = chars[@popCount(quoted_mask)..];
```
Here are the vectors/bitvectors:
```
LSB<-------------------------------------------------------->MSB
vec: "name": "Validark", "is_programmer": true, "favorite_color": "re
quote_mask: 1000010010000000010010000000000000100000000100000000000000100100
quoted_mask: 0111100001111111100001111111111111000000000011111111111111000011
packed_vec: nameValidarkis_programmerfavorite_colorre.......................
```
<details>
<summary>Click here for more information on the prefix_xor operation</summary>
## prefix_xor
See this article: https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/
Here is an implementation of it with https://github.com/ziglang/zig/issues/9631:
```zig
/// Given a bitmask, returns a bitmask where every pair of 1's is filled in.
/// The rightmost bit of each pair is set, but the leftmost is unset.
/// e.g.:
/// MSB<-------------------------------------------------------->LSB
/// bitmask: 0000000001000000010000010000010000010000010000000100000100000000
/// returns: 0000000000111111110000001111110000001111110000000011111100000000
fn prefix_xor(bitmask: anytype) @TypeOf(bitmask) {
const all_ones = std.math.maxInt(@TypeOf(bitmask)) + std.math.minInt(@TypeOf(bitmask));
return @mulCarryless(bitmask, all_ones);
}
```
Here is another implementation that does not rely on carryless multiply. Hopefully one day LLVM will know this is equivalent when carryless multiply is not supported (or very slow) on a particular machine:
```zig
fn prefix_xor(bitmask: anytype) @TypeOf(bitmask) {
var x = bitmask;
inline for (0..(@typeInfo(std.math.Log2Int(@TypeOf(bitmask))).Int.bits)) |i|
x ^= x << comptime (1 << i);
return x;
}
```
(no guarantees when passing in integers that are not powers of 2)
---
</details>
## What about the fill values?
In case it comes in handy for optimization purposes, it might be a good idea to make it undefined what the non-packed values are in a packed vector. Hopefully that also means it is undefined what bytes will end up in `chars[@popCount(quoted_mask)..VEC_SIZE];` in the code above. Even writing nothing at all to those bytes should be valid in the case above. x86_64 has a variant which fills the rest of the vector with values from another source and a variant which fills it with zeroes. It *could* be nice to be able to specify which behavior you want. E.g., one could pass in `src`, or `@splat(VEC_SIZE, @as(u8, 0))`, or `@splat(VEC_SIZE, @as(u8, undefined))` if you aren't relying on any particular behavior. However, this effect can already be achieved by creating a mask with `std.simd.iota(u8, VEC_SIZE) >= @splat(VEC_SIZE, @as(u8, @popCount(quoted_mask)))` and then doing a `@select` to move either `src` or `@splat(VEC_SIZE, @as(u8, 0))` in the right places. Hopefully the optimizer will be smart enough at some point to know that that pattern still only needs 1 `VPCOMPRESSB` on AVX512_VBMI2 + AVX512VL x86_64 machines. Alternately, `@packSelect` could be made to take in a scalar value with which to fill in the empty spaces, but I think this might lead to someone filling with 0's, then doing a `@select` which turns all the 0's into the element from `src`. That would be very bad for the optimizer because it would then have to prove that 0's could not have been selected by `@packSelect`, which would be impossible to prove in cases like the example given above. Hence, I think making it fill with undefined's is the best move.
## Other uses
Here are some more problems that can be solved with `@packSelect`:
http://0x80.pl/notesen/2019-01-05-avx512vbmi-remove-spaces.html
https://lemire.me/blog/2017/04/10/removing-duplicates-from-lists-quickly/
Here is a fun snippet that would print the indices in a vector where tabs occur:
```zig
const vec: @Vector(VEC_SIZE, u8) = buf[0..VEC_SIZE].*;
const tabs = vec == @splat(VEC_SIZE, @as(u8, '\t'));
const num_tabs = std.simd.countTrues(tabs); // could also use @popCount with @bitCast
const tab_indices = @packSelect(tabs, std.simd.iota(u8, VEC_SIZE));
for (@as([VEC_SIZE]u8, tab_indices)[0..num_tabs]) |index| {
std.debug.print("index: {}\n", .{index});
}
```
Daniel Lemire has some articles on how to efficiently iterate over set bits too. Hopefully code like the above could one day be optimized as well as the C++ code which uses intrinsics:
https://lemire.me/blog/2022/05/10/faster-bitset-decoding-using-intel-avx-512/
https://lemire.me/blog/2022/05/06/fast-bitset-decoding-using-intel-avx-512/
https://lemire.me/blog/2019/05/15/bitset-decoding-on-apples-a12/
# unpackSelect
The second part of this proposal is for `@unpackSelect`, which corresponds to `VPEXPAND` on x86_64, which is basically like PDEP but operates on vector lanes rather than bits. It's the opposite of `@packSelect`, so in the example above, you could spread out the bytes in the `packed_vec` back into the same positions as in `vec` by doing `@unpackSelect(@bitCast(@Vector(VEC_SIZE, bool), quoted_mask), packed_vec)`. Again, even without direct `VPEXPAND` support this operation can be done in a number of ways on different architectures. Note: I am using the same signature as `@packSelect` above. Here is a stackoverflow with one method given for accomplishing this operation:
https://stackoverflow.com/questions/48174640/avx2-expand-contiguous-elements-to-a-sparse-vector-based-on-a-condition-like-a
## Uses
`@unpackSelect` is useful for a few situations I can think of:
1. When you want to make more room in a vector
- e.g. [adding escape characters to a string](https://lemire.me/blog/2022/09/14/escaping-strings-faster-with-avx-512/)
- "apple" -> \\"apple\\"
2. When you want to place [successive values in a vector into specific places in another vector](https://stackoverflow.com/questions/48174640/avx2-expand-contiguous-elements-to-a-sparse-vector-based-on-a-condition-like-a)
- e.g. Let's say you have `vec1` containing `[a, b, c, d, e, f, g, h]`, and a `mask` that indicates you want to convert `vec1` into `[_, b, c, _, e, _, _, h]`, with each `_` replaced by successive values in `vec2` which contains `[1, 2, 3, 4, 5, 6, 7, 8]`. You need to get a vector with `[1, _, _, 2, _, 3, 4, _]` so a `@select` can use `mask` and `vec1` to produce `[1, b, c, 2, e, 3, 4, h]`. The missing vector can be generated with `@unpackSelect(@bitCast(@Vector(VEC_SIZE, bool), mask), vec2)`.
## What about the fill values?
I think the same logic applies to `@unpackSelect` as `@packSelect` that it should be undefined what the non-relevant values are.
## Other uses
More problems solvable with `@unpackSelect`/`VPEXPAND`:
http://0x80.pl/notesen/2022-01-24-avx512vbmi2-varuint.html
https://zeux.io/2022/09/02/vpexpandb-neon-z3/
---
Bikeshedding welcome.