I was implementing a voxel ray tracer over sparse 64-trees, and part of it requires me to expand a u4 to a u16 (and a u16 to a u64), where each input bit maps to a 0b0000 or 0b0001 nibble in the output.
(e.g. 0b1101 → 0b0001000100000001)
A more “traditional” implementation might be something like the following:
fn bitExpand(x: u4) u16 {
    var y: u16 = 0;
    const xu16: u16 = @intCast(x);
    inline for (0..4) |i| {
        y |= ((xu16 >> i) & 1) << (i * 4);
    }
    return y;
}
but I also figured out an approach using vectors:
fn bitExpandVec(x: u4) u16 {
    const v4x1: @Vector(4, u1) = @bitCast(x);
    const v4x4: @Vector(4, u4) = @intCast(v4x1);
    return @bitCast(v4x4);
}
I wonder which one will perform better in practice. How exactly does Zig deal with vectors of elements smaller than a byte, and could it enjoy the benefits of SIMD when, for example, expanding u16 to u64?
All help appreciated!
Legume
November 8, 2025, 10:48am
On my CPU (znver5) with ReleaseFast optimization the first version generates this code (output from llvm-mca)
Iterations: 100
Instructions: 1300
Total Cycles: 604
Total uOps: 1400
Dispatch Width: 6
uOps Per Cycle: 2.32
IPC: 2.15
Block RThroughput: 2.3
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 1 1.00 * push rbp
1 0 0.17 mov rbp, rsp
1 0 0.17 mov eax, edi
1 1 0.25 and eax, 15
1 1 0.50 shl edi, 9
2 2 0.25 lea ecx, [8*rax]
1 1 0.25 or ecx, eax
1 1 0.50 shl eax, 6
1 1 0.25 or eax, edi
1 1 0.25 or eax, ecx
1 1 1.00 and eax, 4369
1 5 0.33 * pop rbp
1 5 0.50 U ret
The second version generates this instead
Iterations: 100
Instructions: 1700
Total Cycles: 605
Total uOps: 2000
Dispatch Width: 6
uOps Per Cycle: 3.31
IPC: 2.81
Block RThroughput: 4.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 1 1.00 * push rbp
1 0 0.17 mov rbp, rsp
1 1 0.50 kmovd k0, edi
1 0 0.17 vpmovm2d xmm0, k0
1 1 0.50 vpsrld xmm0, xmm0, 31
2 1 1.00 vpextrd eax, xmm0, 1
2 1 1.00 vpextrd edx, xmm0, 2
1 1 1.00 vmovd ecx, xmm0
1 1 0.50 shl eax, 4
1 1 0.50 shl edx, 8
1 1 0.25 or eax, ecx
1 1 0.25 or edx, eax
2 1 1.00 vpextrd eax, xmm0, 3
1 1 0.50 shl eax, 12
1 1 0.25 or eax, edx
1 5 0.33 * pop rbp
1 5 0.50 U ret
At least on my CPU, then, the first one has better latency and puts less pressure on the instruction cache, so it should be faster, although I can't say by how much.
For such small problems (u4 → u16) it is possible to brute-force your way to a fast version by trying out a few random small functions with variable parameters.
e.g. I did this and arrived at (520*x ^ 65*x) & 0b1000100010001, which should be around 5 cycles of latency in total on AMD's Zen architecture.
For such small functions it also makes sense to consider a lookup table. I'd suggest benchmarking a few things and then picking what works best for you.
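For reference, a comptime-generated table for the u4 → u16 case could look like this (a sketch with names of my own choosing, not code from the thread):

```zig
const std = @import("std");

// All 16 possible expansions, computed once at compile time.
const expand_table: [16]u16 = blk: {
    var table: [16]u16 = undefined;
    for (&table, 0..) |*entry, i| {
        var y: u16 = 0;
        for (0..4) |bit| {
            // Move bit `bit` of the index to bit `4*bit` of the entry.
            y |= @as(u16, @truncate((i >> bit) & 1)) << (bit * 4);
        }
        entry.* = y;
    }
    break :blk table;
};

fn bitExpandLut(x: u4) u16 {
    return expand_table[x];
}
```

The table is only 32 bytes, so it fits comfortably in one cache line; whether the load beats a handful of ALU ops is exactly the kind of thing to benchmark.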
Also if you are willing to dive into inline assembly, then carryless multiplication or pext could likely solve this problem much faster.
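For the record, the expansion direction is pdep (parallel bit deposit); pext goes the other way. A sketch using Zig inline assembly, assuming an x86-64 CPU with BMI2 (the function name is mine, and this is untested beyond that assumption):

```zig
// Deposit the 16 bits of `x` into every 4th bit of a u64.
// Requires x86-64 with BMI2; will not assemble or run elsewhere.
fn bitExpandPdep(x: u16) u64 {
    return asm ("pdep %[mask], %[src], %[dst]"
        : [dst] "=r" (-> u64),
        : [src] "r" (@as(u64, x)),
          [mask] "r" (@as(u64, 0x1111111111111111)),
    );
}
```

On Zen 3 and later, pdep/pext are single-uop, 3-cycle instructions, so this should be hard to beat when the target supports it.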
For the larger one, u16 → u64, I'd suggest taking a look at this blog post on bit-interleaving: How fast can you bit-interleave 32-bit integers? – Daniel Lemire's blog. There is also a follow-up on it that uses SIMD.
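For completeness, the u16 → u64 spread can also be done without special instructions using the classic shift-and-mask cascade, the same family of tricks the linked post discusses (this version and its name are mine):

```zig
const std = @import("std");

// Spread bit i of the input to bit 4*i of the output,
// halving the group spacing at each step.
fn bitExpand16(x: u16) u64 {
    var v: u64 = x;
    v = (v | (v << 24)) & 0x000000FF000000FF; // bytes, 32 bits apart
    v = (v | (v << 12)) & 0x000F000F000F000F; // nibbles, 16 bits apart
    v = (v | (v << 6)) & 0x0303030303030303; // bit pairs, 8 bits apart
    v = (v | (v << 3)) & 0x1111111111111111; // single bits, 4 bits apart
    return v;
}
```

Four shift/or/and rounds, all on scalar registers, so it is portable and branch-free.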
Oh this just inspired me to come up with this solution:
fn bitExpand(x: u4) u16 {
    const map1: u16 = 0b0000_0000_0100_0001;
    const map2: u16 = 0b0000_0010_0000_1000;
    return ((map1 * x) | (map2 * x)) & 0x1111;
}
It looks so obvious now that I have seen it
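If anyone wants to convince themselves, the multiplication trick can be checked exhaustively against the shift-based version over all 16 inputs (a quick sketch, function names are mine):

```zig
const std = @import("std");

fn bitExpandMul(x: u4) u16 {
    const map1: u16 = 0b0000_0000_0100_0001; // taps at bits 0 and 6
    const map2: u16 = 0b0000_0010_0000_1000; // taps at bits 3 and 9
    return ((map1 * x) | (map2 * x)) & 0x1111;
}

fn bitExpandShift(x: u4) u16 {
    var y: u16 = 0;
    const xu16: u16 = x;
    inline for (0..4) |i| {
        y |= ((xu16 >> i) & 1) << (i * 4);
    }
    return y;
}
```

The taps are split across two multipliers so that no partial product carries into the bits kept by the 0x1111 mask: within each product the shifted copies of x (4 bits wide) never overlap.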
Should I mark this as solved? I opted for using the @select builtin where the u16 to u64 expansion would have been used.
The article @IntegratedQuantum shared is indeed fascinating; I wonder how the pdep instruction can be used in Zig in a cross-platform way.
Edit: btw, after some quick and sketchy benchmarks, the implementations with bit-shifting and multiplication perform similarly in ReleaseFast, but the one using a comptime-generated lookup table actually has a worse max time.
There’s an accepted proposal to add builtins for that (ziglang/zig, opened 18 Mar 2023, labeled proposal, accepted):
Of all the bit manipulation instructions available on a platform like x86_64, most of them are optimizations over fundamental operations.
https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set
For example, the BLSR x86_64 instruction is equivalent to `x & (x - 1)`, and can easily be written inline or put in a function at the discretion of the programmer. We would never need a `@blsr` builtin because the fundamental operation of resetting the lowest set bit is already in the language.
Some of the other operations are newer fundamental operations and are builtin to Zig already, through `@popCount`, `@ctz` and `@clz`. I would also consider `@bitReverse` to be a fundamental operation, even though for some reason x86_64 lacks an instruction for it.
`pdep` and `pext` on the other hand, are not in the language, (relatively) not trivial to implement manually, and to my knowledge will only appear in the output binary as PDEP/PEXT by using inline assembly. In my view `pdep` and `pext` are now fundamental bitwise operations, just like the aforementioned operations.
Since they were first proposed by Yedidya Hilewitz and Ruby B. Lee [[1](https://www.lirmm.fr/arith18/papers/hilewitz-PerformingBitManipulations.pdf), [2](http://palms.ee.princeton.edu/PALMSopen/hilewitz06FastBitCompression.pdf)] they [have seen a myriad of uses](https://stackoverflow.com/questions/69966389/what-are-assembly-instructions-like-pext-actually-used-for). Perhaps one of the biggest uses that I am aware of is in implementing the `select` function of `rank`/`select` data structures, which are the fundamental operations of [succinct data structures](https://en.wikipedia.org/wiki/Succinct_data_structure) and certain other bitwise structures like the [Counting Quotient Filter](https://users.cs.utah.edu/~pandey/courses/cs6530/fall22/papers/htfilters/3035918.3035963.pdf) (used for Approximate Membership Queries). [Here is an article](https://www.adrian.idv.hk/2018-11-05-pbj17-fastx86/) on how to implement `select` using `pdep`.
They are also used to implement bitboards, which are important for chess programming or sudoku solvers: https://news.ycombinator.com/item?id=19137798
More praise of `pdep` and `pext`: https://news.ycombinator.com/item?id=20205743
I have personally used `pdep` recently to check that delimiting characters are in the proper positions with SIMD. Specifically, I had a file that was of the format `word\t123<digits>123\nword2\t456<digits2>456\n`, etc. I moved 64 byte chunks into a vector, got a bitmap for where the '\t' and '\n' characters are, or'ed them together, and used `pdep` to extract every other set bit, which I can then check against the newline or tab mask to make sure the file alternates between tabs and newlines as delimiters. To grab every other set bit, one can do:
```zig
inline fn maskEvenBits(bitstring: u64) u64 {
return pdep(0xaaaaaaaaaaaaaaaa, bitstring);
}
```
Here is an excerpt of my code:
```zig
const vec: @Vector(VEC_SIZE, u8) = buf[0..VEC_SIZE].*;
// Note: we cannot just match digit characters because those characters can also appear in the string section.
const tab_mask = @bitCast(std.meta.Int(.unsigned, VEC_SIZE), vec == @splat(VEC_SIZE, @as(u8, '\t')));
const newline_mask = @bitCast(std.meta.Int(.unsigned, VEC_SIZE), vec == @splat(VEC_SIZE, @as(u8, '\n')));
const newline_tab_mask = tab_mask | newline_mask;
// Validate that the file alternates between tab and newlines.
if (maskEvenBits(newline_tab_mask) != if (state == 0) tab_mask else newline_mask)
return error.InvalidFile;
// Note: state is flipped depending on whether the 64-byte chunk is supposed to be matching strings or digits at the start of the next loop
```
Conveniently, I have heard that LLVM supports `pdep` and `pext` as fundamental operations. If Zig becomes the new C for the next few decades, supporting operations like `pdep` and `pext` is important for bit twiddlers to ~~choose~~ love Zig. It is also important from the language's perspective to define the set of fundamental bit manipulations, which should be supported on new ISA's.
Also important: https://github.com/ziglang/zig/issues/9631
Sorry to bother you fellows again; I just figured I should share where I am using the code from this thread.
It returns a bitmask of the sub-region within a 4x4x4 cube between points p1 and p2.
Also, by some very inaccurate benchmarking, it turns out using a lookup table is probably about 10% faster.
pub fn subReg(p1: vec(3,2), p2: vec(3,2)) u64 {
    // comptime
    const map16, const map64 = comptime scope: {
        var res1: [16]u16 = undefined;
        var res2: [16]vec(4,16) = undefined;
        for (0..16) |i| {
            const pred16: u16 = 0b1001001001;
            const pred64: u64 = 0b0010000000000000_0100000000000000_1000000000000001;
            // Split pred16's taps into {0,6} and {3,9} so that neither
            // partial product carries into the bits kept by the 0x1111 mask.
            res1[i] = (((pred16 & 0x041) * i) | ((pred16 & 0x208) * i)) & 0x1111;
            res2[i] = @bitCast(pred64 * i & 0x0001000100010001);
        }
        break :scope .{ res1, res2 };
    };
    // runtime
    const masks = (@as(vec(3,4), @splat(0xF)) << @min(p1, p2)) &
        (@as(vec(3,4), @splat(0xF)) >> (vec(3,2){ 3, 3, 3 } - @max(p1, p2)));
    var res: vec(4,16) = @splat(masks[0]);
    res *= @splat(map16[masks[1]]);
    res *= map64[masks[2]];
    return @bitCast(res);
}
where the vec and u helpers are defined as:
fn vec(elms: comptime_int, bits: comptime_int) type {
return @Vector(elms, u(bits));
}
fn u(bits: comptime_int) type {
return @Type(.{.int = .{.bits = bits, .signedness = .unsigned}});
}