Vector of sub-byte-sized integers behavior

I was implementing a voxel ray tracer over sparse 64-trees, and part of it requires me to expand a u4 to a u16 (and a u16 to a u64), where each input bit maps to a 0b0000 or 0b0001 nibble in the output
(e.g. 0b1101 → 0b0001000100000001).

A more “traditional” implementation might be something like the following:

fn bitExpand(x: u4) u16 {
    var y: u16 = 0;
    const xu16: u16 = @intCast(x);
    inline for (0..4) |i| {
        // move bit i of the input to bit i*4 of the output
        y |= (xu16 >> i & 1) << i * 4;
    }
    return y;
}

but I also figured out an approach using vectors:

fn bitExpandVec(x: u4) u16 {
    const v4x1: @Vector(4, u1) = @bitCast(x);
    const v4x4: @Vector(4, u4) = @intCast(v4x1);
    return @bitCast(v4x4);
}

I wonder which one will perform better in practice? How exactly does Zig deal with vectors of elements smaller than a byte, and could it enjoy the benefits of SIMD when, for example, expanding a u16 to a u64?

All help appreciated!

On my CPU (znver5) with ReleaseFast optimization, the first version generates this code (output from llvm-mca):

Iterations:        100
Instructions:      1300
Total Cycles:      604
Total uOps:        1400

Dispatch Width:    6
uOps Per Cycle:    2.32
IPC:               2.15
Block RThroughput: 2.3


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     1.00           *            push	rbp
 1      0     0.17                        mov	rbp, rsp
 1      0     0.17                        mov	eax, edi
 1      1     0.25                        and	eax, 15
 1      1     0.50                        shl	edi, 9
 2      2     0.25                        lea	ecx, [8*rax]
 1      1     0.25                        or	ecx, eax
 1      1     0.50                        shl	eax, 6
 1      1     0.25                        or	eax, edi
 1      1     0.25                        or	eax, ecx
 1      1     1.00                        and	eax, 4369
 1      5     0.33    *                   pop	rbp
 1      5     0.50                  U     ret

The second version generates this instead

Iterations:        100
Instructions:      1700
Total Cycles:      605
Total uOps:        2000

Dispatch Width:    6
uOps Per Cycle:    3.31
IPC:               2.81
Block RThroughput: 4.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     1.00           *            push	rbp
 1      0     0.17                        mov	rbp, rsp
 1      1     0.50                        kmovd	k0, edi
 1      0     0.17                        vpmovm2d	xmm0, k0
 1      1     0.50                        vpsrld	xmm0, xmm0, 31
 2      1     1.00                        vpextrd	eax, xmm0, 1
 2      1     1.00                        vpextrd	edx, xmm0, 2
 1      1     1.00                        vmovd	ecx, xmm0
 1      1     0.50                        shl	eax, 4
 1      1     0.50                        shl	edx, 8
 1      1     0.25                        or	eax, ecx
 1      1     0.25                        or	edx, eax
 2      1     1.00                        vpextrd	eax, xmm0, 3
 1      1     0.50                        shl	eax, 12
 1      1     0.25                        or	eax, edx
 1      5     0.33    *                   pop	rbp
 1      5     0.50                  U     ret

So at least on my CPU the two run in almost the same number of cycles, but the first one needs fewer instructions and uops and puts less pressure on the instruction cache, so it should be slightly faster, although I can’t say by how much.


For such small problems (u4 → u16) it is possible to just brute-force your way to a fast version by trying out a few random small functions with variable parameters.
e.g. I did this and arrived at (520*x ^ 65*x) & 0b1000100010001, which should be around 5 cycles of latency in total on AMD’s Zen architecture.
For small functions it also makes sense to consider a lookup table. I’d suggest benchmarking a few things and then picking what works best for you.
Also, if you are willing to dive into inline assembly, carry-less multiplication or pext could likely solve this problem much faster.
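To illustrate the brute-force idea, here is a throwaway sketch of my own (in C, since the trick is language-neutral): it checks the posted formula exhaustively against a shift-based reference, and wrapping that inner loop in two more loops over candidate constants is all the "search" amounts to. All names here are mine, not from the post.

```c
#include <stdint.h>

/* reference: move bit i of x to bit 4*i */
static uint16_t expand_ref(uint16_t x) {
    uint16_t y = 0;
    for (int i = 0; i < 4; i++)
        y |= (uint16_t)(((x >> i) & 1u) << (4 * i));
    return y;
}

/* the posted candidate: (520*x ^ 65*x) & 0b1000100010001 */
static uint16_t expand_mul(uint16_t x) {
    return (uint16_t)(((520u * x) ^ (65u * x)) & 0x1111u);
}

/* exhaustive check over all 16 inputs */
static int formula_holds(void) {
    for (uint16_t x = 0; x < 16; x++)
        if (expand_mul(x) != expand_ref(x))
            return 0;
    return 1;
}
```

The reason 65 and 520 work: 65*x = x + (x << 6) and 520*x = (x << 3) + (x << 9) each place two non-overlapping copies of x, so no carries occur, and the mask picks exactly one input bit at each of positions 0, 4, 8, 12.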

For the larger one, u16 → u64, I’d suggest taking a look at this blog post on bit-interleaving: How fast can you bit-interleave 32-bit integers? – Daniel Lemire's blog. There is also a follow-up on it that uses SIMD.
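For reference, the u16 → u64 expansion can also be done with a handful of scalar shift-and-mask steps, in the spirit of the techniques in that post. This is my own sketch of the classic "interleave with zeros" spread (the constants follow the standard Morton-code pattern), not code from the article:

```c
#include <stdint.h>

/* Spread bit i of a 16-bit value to bit 4*i of a 64-bit value by
 * halving the group size at every step (classic Morton-style spread). */
static uint64_t spread4(uint16_t v) {
    uint64_t x = v;
    x = (x | (x << 24)) & 0x000000FF000000FFull; /* bytes -> 32 apart  */
    x = (x | (x << 12)) & 0x000F000F000F000Full; /* nibbles -> 16 apart */
    x = (x | (x <<  6)) & 0x0303030303030303ull; /* pairs -> 8 apart    */
    x = (x | (x <<  3)) & 0x1111111111111111ull; /* bits -> 4 apart     */
    return x;
}
```

Four shifts, four ORs, and four ANDs in a pure dependency chain, so it maps cleanly onto either scalar registers or one SIMD lane per element.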


Oh this just inspired me to come up with this solution:

fn bitExpand(x: u4) u16 {
    const map1: u16 = 0b0000_0000_0100_0001;
    const map2: u16 = 0b0000_0010_0000_1000;
    // & binds tighter than | in Zig, so mask the whole union explicitly
    return ((map1 * x) | (map2 * x)) & 0x1111;
}

It looks so obvious now that I have seen it :slight_smile:


Should I mark this as solved? I opted for using the @select builtin where the u16 to u64 expansion would have been used.
The article @IntegratedQuantum shared is indeed fascinating; I wonder how the pdep instruction can be used in Zig in a cross-platform way.

Edit: btw, after some quick and sketchy benchmarks, the impls with bit-shifting and multiplication perform similarly in ReleaseFast, but the impl using a comptime-generated lookup table actually has a worse max time.

There’s an accepted proposal to add builtins for that:
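In the meantime, pdep’s semantics can be modeled portably (though slowly) in software. A sketch in C of the textbook bit-deposit loop (my own code; on x86 with BMI2 the whole thing is one instruction via the `_pdep_u64` intrinsic):

```c
#include <stdint.h>

/* Software model of x86 pdep: deposit the low bits of src into the
 * positions of the set bits of mask, lowest set bit first. */
static uint64_t pdep_soft(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bb = 1; mask != 0; bb <<= 1) {
        uint64_t lowest = mask & (0 - mask); /* isolate lowest set bit */
        if (src & bb)
            result |= lowest;
        mask &= mask - 1;                    /* clear that bit */
    }
    return result;
}
```

With mask 0x1111 this performs the u4 → u16 expansion from this thread, and with mask 0x1111111111111111 the u16 → u64 one.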


Sorry to bother you fellows again, just figured I should share where I am using the code in this thread :smiley:
It returns a bitmask of the sub-region within a 4x4x4 cube between points p1 and p2.

Also, by some very inaccurate benchmarking, it turns out using a lookup table is probably about 10% faster.

pub fn subReg(p1: vec(3,2), p2: vec(3,2)) u64 {
    // comptime
    const map16, const map64 = comptime scope: {
        var res1: [16]u16 = undefined;
        var res2: [16]vec(4,16) = undefined;
        for (0..16) |i| {
            const pred16: u16 = 0b1001001001;
            const pred64: u64 = 0b0010000000000000_0100000000000000_1000000000000001;
            // split pred16 into {1,64} and {8,512} so the shifted copies of i
            // can't carry into each other, then mask the whole union
            res1[i] = (((pred16 & 0b0001000001) * i) | ((pred16 & 0b1000001000) * i)) & 0x1111;
            res2[i] = @bitCast(pred64 * i & 0x0001000100010001);
        }
        break :scope .{ res1, res2 };
    };

    // runtime
    const masks = (@as(vec(3,4), @splat(0xF)) << @min(p1, p2)) &
                  (@as(vec(3,4), @splat(0xF)) >> (vec(3,2) {3, 3, 3} - @max(p1, p2)));

    var res: vec(4,16) = @splat(masks[0]);
    res *= @splat(map16[masks[1]]);
    res *= map64[masks[2]];

    return @bitCast(res);
}

where vec(,) and u() are:

fn vec(elms: comptime_int, bits: comptime_int) type {
    return @Vector(elms, u(bits));
}

fn u(bits: comptime_int) type {
    return @Type(.{.int = .{.bits = bits, .signedness = .unsigned}});
}
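For testing code like subReg, a brute-force scalar reference is handy. Here is a sketch of one in C (my own code; the bit layout index = x + 4*y + 16*z is my reading of how the multiplies above lay out the cube):

```c
#include <stdint.h>

/* Brute-force reference: set the bit of every cell of a 4x4x4 cube that
 * lies inside the box spanned by p1 and p2 (inclusive on both ends).
 * Assumed bit layout: bit index = x + 4*y + 16*z. */
static uint64_t sub_reg_ref(const int p1[3], const int p2[3]) {
    int lo[3], hi[3];
    for (int a = 0; a < 3; a++) {
        lo[a] = p1[a] < p2[a] ? p1[a] : p2[a];
        hi[a] = p1[a] > p2[a] ? p1[a] : p2[a];
    }
    uint64_t m = 0;
    for (int z = lo[2]; z <= hi[2]; z++)
        for (int y = lo[1]; y <= hi[1]; y++)
            for (int x = lo[0]; x <= hi[0]; x++)
                m |= 1ull << (x + 4 * y + 16 * z);
    return m;
}
```

Checking the fast version against something like this over all 4096 corner pairs is cheap and catches carry or masking mistakes immediately.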