Generation of Bernoulli bits with prob = p

Hi everyone, I have been learning Zig for the last few weeks, and I would like feedback on some code below. I want to make it as fast as possible.

To give some context:
In my application, I need to populate a BitSet with i.i.d. binary variables, each with probability prob.

My first approach was something like this:

for (0..bitset.bit_length) |i| {
    // threshold chosen so that P(uniform u32 < int_thresh) == prob
    const int_thresh: u32 = @intFromFloat(prob * std.math.maxInt(u32));
    const val = rng.random().int(u32) < int_thresh;
    bitset.setValue(i, val);
}

where rng is a Xoshiro256 instance.

After some initial testing I found that this loop was a performance bottleneck. The alternative I came up with uses the following algorithm:

// fun(p), where p has binary expansion 0.b_{n-1}b_{n-2}...b_0:
//  acc = false
//  for digit d from least significant (b_0) to most significant (b_{n-1}):
//      if d == 1:
//          acc = acc | coinFlip
//      else:
//          acc = acc & coinFlip
//  return acc

where coinFlip gives you an unbiased binary variable. This works because OR-ing acc with a fair coin maps P(acc) = q to (1 + q)/2 and AND-ing maps it to q/2, so feeding in the digits of p from least to most significant leaves P(acc) = 0.b_{n-1}...b_0 = p.
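
In scalar form, just to illustrate the recurrence (bernoulliOnce is an illustrative name here; it treats p as an n-digit binary fraction digits / 2^n, processed least significant digit first):

fn bernoulliOnce(n: usize, digits: u64, random: std.Random) bool {
    var acc = false;
    var bits = digits;
    var i: usize = 0;
    while (i < n) : (i += 1) {
        const coin = random.boolean();
        // a 1 digit ORs: P(acc) = q becomes (1 + q)/2; a 0 digit ANDs: q becomes q/2
        acc = if (bits & 1 == 1) (acc or coin) else (acc and coin);
        bits >>= 1;
    }
    return acc;
}

The vectorized version runs this same digit loop across a whole buffer, one rng.next() per digit per u64 word.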
Finally, in Zig:

/// Generates (supposedly) i.i.d. random Bernoullis with probability prob
pub fn randomBitsU64(prob: f32, buf: []u64, rng: anytype) void {
    assert(prob >= 0.0 and prob <= 1.0);

    if (prob == 1.0) {
        @memset(buf, std.math.boolMask(u64, true));
        return; // without this, the fill loop below overwrites the mask
    }
    const frexp = floatDec(prob);
    var bits = frexp.significand;
    var exp = frexp.exponent;

    // drop trailing zero digits (no-ops while acc is false) and the lowest
    // 1 digit, which simply replaces acc with a fresh coin flip:
    bits = bits >> (@ctz(bits) + 1);
    for (buf) |*v| {
        v.* = rng.next();
    }
    // remaining significand digits, least significant first
    while (bits > 0) : (bits >>= 1) {
        if (bits % 2 == 1) {
            for (buf) |*v| {
                v.* |= rng.next();
            }
        } else {
            for (buf) |*v| {
                v.* &= rng.next();
            }
        }
    }
    // the exponent contributes the leading zero digits after the binary point
    while (exp < -1) : (exp += 1) {
        for (buf) |*v| {
            v.* &= rng.next();
        }
    }
}

I did some quick benchmarking and it seems that the second approach can be 10x faster.

My questions:

  1. Did I miss anything? Was my initial approach inappropriate?
  2. Did I do anything ‘non-idiomatic’ in my solution? Is there room for further improvements?
  3. I did notice that using std.Random.float to get a [0, 1) uniform variable was quite slow. Maybe too many layers of indirection? float() → int() → fill() → next()?

Thanks a lot for the help!

PS: For floatDec I took inspiration from the standard library’s frexp:

/// Result type of floatDec: sign, unbiased exponent, and significand with
/// the implicit leading bit made explicit.
pub fn FloatDec(comptime T: type) type {
    return struct {
        sign: u1,
        exponent: CExpInt,
        significand: SigInt,

        const bits: comptime_int = @typeInfo(T).Float.bits;
        const exp_bits: comptime_int = std.math.floatExponentBits(T);
        const frac_bits: comptime_int = std.math.floatFractionalBits(T);
        const mant_bits: comptime_int = std.math.floatMantissaBits(T);

        const Int: type = std.meta.Int(.unsigned, bits);
        const ExpInt: type = std.meta.Int(.unsigned, exp_bits);
        const CExpInt: type = std.meta.Int(.signed, exp_bits + 1);
        const FracInt: type = std.meta.Int(.unsigned, frac_bits);
        const SigInt: type = std.meta.Int(.unsigned, frac_bits + 1);
    };
}

/// Decomposes a float into sign, unbiased exponent, and significand.
pub fn floatDec(fl: anytype) FloatDec(@TypeOf(fl)) {
    const T = @TypeOf(fl);
    const FT = FloatDec(T);

    const exp_bias: comptime_int = (1 << (FT.exp_bits - 1)) - 1;
    const v: std.meta.Int(.unsigned, FT.bits) = @bitCast(fl);

    const sign: u1 = @truncate(v >> (FT.bits - 1));
    const exponent: FT.ExpInt = @truncate(v >> FT.mant_bits);
    const frac: FT.FracInt = @truncate(v);

    // the implicit leading bit is present unless the number is subnormal
    const imp_bool: FT.SigInt = @as(FT.SigInt, @intFromBool(exponent != 0)) << FT.frac_bits;
    const significand: FT.SigInt = @as(FT.SigInt, frac) + imp_bool;
    const cexp: FT.CExpInt = exponent;
    return FT{ .sign = sign, .exponent = cexp - exp_bias, .significand = significand };
}
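
For example, for f32, 0.75 = 1.5 × 2^-1 decomposes like this:

test "floatDec on 0.75" {
    const d = floatDec(@as(f32, 0.75));
    try std.testing.expectEqual(@as(u1, 0), d.sign);
    try std.testing.expectEqual(@as(i9, -1), d.exponent);
    // 1.5 with the implicit bit made explicit: 0b11 << 22
    try std.testing.expectEqual(@as(u24, 0xC00000), d.significand);
}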

Did you build for ReleaseFast when you ran your benchmark? That makes a big difference.


Thanks, it indeed made a big difference.

This is my (crude) benchmarking code:

pub fn randomBitSetSlow(prob: f32, buf: []u64, rng: anytype) std.bit_set.DynamicBitSetUnmanaged {
    var bitset = std.bit_set.DynamicBitSetUnmanaged{ .bit_length = buf.len * 64, .masks = buf.ptr };
    for (0..bitset.bit_length) |i| {
        const int_thresh: u32 = @intFromFloat(prob * std.math.maxInt(u32));
        const val = rng.random().int(u32) < int_thresh;
        bitset.setValue(i, val);
    }
    return bitset;
}

pub fn randomBitSetFast(prob: f32, buf: []u64, rng: anytype) std.bit_set.DynamicBitSetUnmanaged {
    randomBitsU64(prob, buf, rng);
    return std.bit_set.DynamicBitSetUnmanaged{ .bit_length = buf.len * 64, .masks = buf.ptr };
}

const prob: f32 = 0.4;
var buf: [500000]u64 = undefined;
var rng = std.Random.DefaultPrng.init(42);

const start_slow = std.time.milliTimestamp();
_ = stats.randomBitSetSlow(prob, &buf, &rng);
const end_slow = std.time.milliTimestamp();

var setBits: f32 = 0;
for (buf) |u| {
    setBits += @floatFromInt(@popCount(u));
}
const meanBits = setBits / (buf.len * 64);

try stdout.print("exec: {any} ms, mean bits: {any}\n", .{ end_slow - start_slow, meanBits });

const start_fast = std.time.milliTimestamp();
_ = stats.randomBitSetFast(prob, &buf, &rng);
const end_fast = std.time.milliTimestamp();

var setBits2: f32 = 0;
for (buf) |u| {
    setBits2 += @floatFromInt(@popCount(u));
}
const meanBits2 = setBits2 / (buf.len * 64);

try stdout.print("exec: {any} ms, mean bits: {any}\n", .{ end_fast - start_fast, meanBits2 });

When running without the optimization flag, I get
exec: 1186 ms, mean bits: 3.999825e-1 (slow)
exec: 94 ms, mean bits: 3.998426e-1 (fast)

with -Doptimize=ReleaseFast, I get
exec: 101 ms, mean bits: 3.999825e-1 (slow)
exec: 14 ms, mean bits: 3.998426e-1 (fast)

So I still get an order of magnitude speedup with the second method, but both are quite fast! Thanks!

Just a heads-up in case anyone wants to use the function I added: there is undefined behaviour if the prob argument is 0.5.
That’s because of

bits = bits >> (@ctz(bits) + 1); // panic: shift amount is greater than the type size

For prob = 0.5 the significand is exactly 1 << 23, so @ctz(bits) + 1 equals the full 24-bit width of the significand type. The shift should be done in two steps instead:

bits >>= @ctz(bits);
bits >>= 1;
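
To see the edge case concretely:

test "prob = 0.5 hits the full shift width" {
    const d = floatDec(@as(f32, 0.5));
    // the old single shift computed @ctz(bits) + 1 == 24 == @bitSizeOf(u24)
    try std.testing.expectEqual(@as(u24, 1 << 23), d.significand);
    try std.testing.expectEqual(@as(u5, 23), @ctz(d.significand));
}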

Final function:

pub fn randomBitsU64(prob: f32, buf: []u64, rng: anytype) void {
    assert(prob >= 0.0 and prob <= 1.0);

    if (prob == 0.0) {
        // needs its own early return: @ctz(0) below would also hit the type width
        @memset(buf, 0);
        return;
    }
    if (prob == 1.0) {
        @memset(buf, std.math.boolMask(u64, true));
        return; // don't fall through to the fill loop
    }
    const frexp = floatDec(prob);
    var bits = frexp.significand;
    var exp = frexp.exponent;
    // has to do the shift in two operations to avoid undefined behaviour
    bits >>= @ctz(bits);
    bits >>= 1;
    for (buf) |*v| {
        v.* = rng.next();
    }
    while (bits > 0) : (bits >>= 1) {
        if (bits % 2 == 1) {
            for (buf) |*v| {
                v.* |= rng.next();
            }
        } else {
            for (buf) |*v| {
                v.* &= rng.next();
            }
        }
    }
    while (exp < -1) : (exp += 1) {
        for (buf) |*v| {
            v.* &= rng.next();
        }
    }
}
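
A quick sanity test for the edge cases (this assumes the early returns above):

test "randomBitsU64 edge cases" {
    var rng = std.Random.DefaultPrng.init(0);
    var buf: [4]u64 = undefined;
    randomBitsU64(0.0, &buf, &rng);
    for (buf) |v| try std.testing.expectEqual(@as(u64, 0), v);
    randomBitsU64(1.0, &buf, &rng);
    for (buf) |v| try std.testing.expectEqual(std.math.boolMask(u64, true), v);
    randomBitsU64(0.5, &buf, &rng); // previously panicked on the shift
}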

There’s a clever mathematical solution here using something like a Huffman code. For example, if p = 1/16, you can map the sequence 1111 => 1 and every other codeword (0, 10, 110, 1110) => 0 in your output sequence. This allows you to generate N i.i.d. Bernoulli random variables with Shannon entropy H in something like HN bits.

This scheme would translate 10010111011101001111 into
10 => 0
0 => 0
10 => 0
1110 => 0
1110 => 0
10 => 0
0 => 0
1111 => 1

So instead of generating 8 random u32s = 256 random bits to produce 8 Bernoulli RVs, this uses a single u20 (less than 1/12 of the input).

See section 2.4: https://www.inference.org.uk/itprnn/book.pdf
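
For p = 1/16 the decoder is tiny; here is a sketch (bernoulliSixteenth is just an illustrative name):

fn bernoulliSixteenth(random: std.Random) u1 {
    // codewords 0, 10, 110, 1110 decode to 0 and 1111 decodes to 1, so this
    // consumes between 1 and 4 uniform bits per output bit instead of 32
    var ones: u3 = 0;
    while (ones < 4) : (ones += 1) {
        // any 0 among the first four flips completes a zero-codeword
        if (!random.boolean()) return 0;
    }
    return 1; // four 1s in a row: probability (1/2)^4 = 1/16
}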
