Understanding arbitrary bit-width integers

The Zig language supports arbitrary bit-widths for integers. I see that this can be useful, for instance, to specify the boundaries of a parameter: e.g. an i7 only goes from -64 to 63.

However, besides giving a “more precise typing system”, what’s the purpose / application? To my understanding, an i7 will still occupy a full byte in memory (packed structs seem to be an exception here)? Does it, for example, help in arithmetic operations (performance-wise) to know there are only so many bits to consider?
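A quick check (my own sketch) seems to back up the full-byte part:

const std = @import("std");

test "i7 range and ABI size" {
    // Outside of packed structs, an i7 still occupies one full byte.
    try std.testing.expectEqual(1, @sizeOf(i7));
    try std.testing.expectEqual(-64, std.math.minInt(i7));
    try std.testing.expectEqual(63, std.math.maxInt(i7));
}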

p.s. I just started learning zig for fun, and normally, I don’t dabble on the bit-level :wink:

I use u31 pretty often because it can coerce to an i32 as well as a usize without requiring explicit casting.
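For example (a minimal sketch of my own, not from the original post):

const std = @import("std");

test "u31 coerces to i32 and usize" {
    const items = [_]u8{ 1, 2, 3 };
    const index: u31 = 2;
    const signed: i32 = index; // every u31 value fits in an i32
    const byte = items[index]; // u31 coerces to usize for indexing
    try std.testing.expectEqual(@as(i32, 2), signed);
    try std.testing.expectEqual(@as(u8, 3), byte);
}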

Apart from that, the compiler could theoretically do more optimizations on them. For example, a ?u31 or !u31 can be packed into 4 bytes (the spare bit can hold the null or error tag), whereas a ?u32 or !u32 requires 8 bytes.
But that optimization isn’t implemented yet.
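You can see the current state with @sizeOf (a quick sketch of mine):

const std = @import("std");

pub fn main() void {
    // With the optimization, ?u31 could keep the null tag in the spare
    // 32nd bit and fit in 4 bytes; without it, both optionals report
    // the same size.
    std.debug.print("?u31: {} bytes\n", .{@sizeOf(?u31)});
    std.debug.print("?u32: {} bytes\n", .{@sizeOf(?u32)});
}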

And then I think it’s also useful when you need big numbers. I have used u128 a couple of times when u64 was too small.
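One typical case (my example, not necessarily the poster’s): the full product of two u64 values needs up to 128 bits.

fn fullProduct(a: u64, b: u64) u128 {
    // Widen before multiplying so the result can never overflow.
    return @as(u128, a) * @as(u128, b);
}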

Additionally, low-bit-width numbers are sometimes useful for their overflow behavior:

var inu8: u8 = 0; // example starting value
inu8 = (inu8 + 1) % 16; // wrap manually at 16
var inu4: u4 = 0; // example starting value
inu4 +%= 1; // wrapping addition does the same - much simpler to use

A couple of things we use oddly-sized integers for at TigerBeetle:


Sometimes you have to use integers with a bit width of less than 8, and this is forced by the compiler.
Consider this code:

    pub fn init(a: Allocator, ctx_len: u5) !BitPredictor {
        var bp = BitPredictor{};
        bp.p0 = try a.alloc(u16, @as(u32, 1) << ctx_len);
        @memset(bp.p0, P0MAX / 2);
        return bp;
    }

Any ctx_len type wider than u5 could potentially produce an out-of-range shift, so the compiler performs some smart checks. Let’s try using u6 instead of u5. We’ll get this error:

src/bit-predictor.zig:17:49: error: expected type 'u5', found 'u6'
        bp.p0 = try a.alloc(u16, @as(u32, 1) << ctx_len);
                                                ^~~~~~~
src/bit-predictor.zig:17:49: note: unsigned 5-bit int cannot represent all possible unsigned 6-bit values

The compiler deduced that a u32 can be shifted left by at most 31 positions, so the shift amount must be a u5 (or a narrower integer).
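The same rule holds in isolation (a minimal sketch of mine):

const std = @import("std");

test "shift amount for a u32 must fit in a u5" {
    const x: u32 = 1;
    const amount: u5 = 31; // u5 covers exactly 0..31, every legal shift for a u32
    try std.testing.expectEqual(@as(u32, 1) << 31, x << amount);
    // std.math.Log2Int(T) computes this shift-amount type generically.
    try std.testing.expect(std.math.Log2Int(u32) == u5);
}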


Thanks a lot for your replies and examples! I haven’t actively coded in a language before that gave me so much control - still getting used to it. More control means more responsibility, I guess :wink:

Non-zero or non-negative?

Here’s a neat trick you can pull at comptime with the help of arbitrary-width integers:

Suppose you want to assign a number to a set of functions based on their signatures. Functions with different arguments or return values would get different numbers. Functions with the same arguments and return values would get the same number.

First, the sample input:

const ns = struct {
    fn apple(arg1: u32, arg2: u32) void {
        _ = arg1;
        _ = arg2;
    }

    fn orange(arg1: u32, arg2: u32) u32 {
        return arg1 + arg2;
    }

    fn banana(arg1: u32, arg2: u32) void {
        _ = arg1;
        _ = arg2;
    }
};

As you can see, apple and banana have the same signature. If you run this code:

std.debug.print("apple: {s}\n", .{@typeName(@TypeOf(ns.apple))});
std.debug.print("orange: {s}\n", .{@typeName(@TypeOf(ns.orange))});
std.debug.print("banana: {s}\n", .{@typeName(@TypeOf(ns.banana))});

you would get:

apple: fn(u32, u32) void
orange: fn(u32, u32) u32
banana: fn(u32, u32) void

Now, the code for the counter:

const counter = create: {
    comptime var next = 0;

    break :create struct {
        fn get(comptime anything: anytype) comptime_int {
            _ = anything;
            const slot = next;
            next += 1;
            return slot;
        }
    };
};

Due to comptime memoization, counter.get() will only increment the counter if the argument given is something it hasn’t seen before. Since @typeName() returns the same text string for apple and banana, you should get the same number, right?

const apple_slot = counter.get(@typeName(@TypeOf(ns.apple)));
const orange_slot = counter.get(@typeName(@TypeOf(ns.orange)));
const banana_slot = counter.get(@typeName(@TypeOf(ns.banana)));
std.debug.print("{d} {d} {d}\n", .{ apple_slot, orange_slot, banana_slot });

Output:

0 1 2

Nope. This is because strings are fat pointers. Two identical strings stored at different memory locations will be considered different by Zig. Here’s where arbitrary bit-width integers come in. By converting strings into giant integers, we can force Zig to compare at comptime the actual data that the pointers point to:

fn signature(comptime f: anytype) comptime_int {
    const name = @typeName(@TypeOf(f));
    comptime var int = 0;
    inline for (name) |c| {
        int = (int << 8) | @as(comptime_int, @intCast(c));
    }
    return int;
}

const apple_slot = counter.get(signature(ns.apple));
const orange_slot = counter.get(signature(ns.orange));
const banana_slot = counter.get(signature(ns.banana));
std.debug.print("\n{d} {d} {d}\n", .{ apple_slot, orange_slot, banana_slot });

Result:

0 1 0

Mind officially blown! :open_mouth: Thanks for sharing this.

You can also get this functionality by using the type itself, unless I am missing something.

const apple_slot = counter.get(@TypeOf(ns.apple));
const orange_slot = counter.get(@TypeOf(ns.orange));
const banana_slot = counter.get(@TypeOf(ns.banana));
std.debug.print("{d} {d} {d}\n", .{ apple_slot, orange_slot, banana_slot });

Output:

0 1 0

Curious. I didn’t realize that @typeName() would give you a different pointer even when the input is the same.

std.debug.print("{d} {d}\n", .{
    @intFromPtr(@typeName(@TypeOf(ns.apple)).ptr),
    @intFromPtr(@typeName(@TypeOf(ns.apple)).ptr),
});

2164496 2164514

In any event, in the original code I was using substrings of the @typeName() as the key. The code above was simplified from that.


I think this is worth linking here, @cancername - a great example of why, at the moment, arbitrary bit widths come with a downside (a performance penalty): LLVM seems “confused”

Issue on github: LLVM: Non power of two integer arithmetic emits slow assembly · Issue #19616 · ziglang/zig · GitHub

I wouldn’t call them smart. I hit this too often: I know the value, the compiler tries to be smart, and I just wind up with casts all over the place to placate it.

How do you get compiler-enforced non-zero numbers with bit widths?