Hash alternative for a blunt array

In my chess program I keep track of how many ‘nodes’ are spent on a particular move, to make my time management smarter.

nodes_spent: [4096]u64;

This array is updated for each (root) move, indexed by combining the from and to squares (each a u6).
So the index is a u12.
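For reference, packing two u6 squares into a u12 index is just a shift and an OR; a sketch (the function name is mine, not from the engine):

```zig
const std = @import("std");

/// Combine the from and to squares (each a u6) into a single u12 index.
fn moveIndex(from: u6, to: u6) u12 {
    return (@as(u12, from) << 6) | to;
}
```

Every (from, to) pair gets a distinct slot, which is why the table needs 4096 entries even though only root moves ever touch it.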

I find this quite disturbing because most entries are never used. In fact the maximum number of used entries is 224, and in most cases it is only 20-40.

What would be the best (fastest) hashmap replacement?
I was thinking about

nodes_spent: std.AutoHashMapUnmanaged(u12, u64);

The hash map in the standard library is pretty efficient for small tables like that. Most specialized hash tables are optimized for large tables, so they are not applicable here. However, unless you have thousands of these in your engine, why not just accept the 32 KiB of wasted memory and enjoy the O(1) access?

Mostly because I already have a LOT of other stuff in memory.
And because the nodes_spent is only updated when the recursive search routine is at the “root” I was thinking about changing it. Not sure about the performance implications.

Maybe you can get away with something like this?

const std = @import("std");

const Iterations = 50_000_000;
const KeyCount = 224;

pub const Map = struct {
    const LaneCount = 16;
    const Vec = @Vector(LaneCount, u16);

    keys: [256]u16 = undefined,
    values: [256]u64 = undefined,
    len: u16 = 0,

    pub inline fn clear(self: *Map) void {
        self.len = 0;
    }

    pub inline fn add(self: *Map, key: u16, delta: u64) void {
        const vk: Vec = @splat(key);
        var i: usize = 0;
        while (i + LaneCount <= self.len) : (i += LaneCount) {
            const block: Vec = self.keys[i..][0..LaneCount].*; // array -> vector coercion, no alignment cast needed
            const mask = block == vk;

            if (@reduce(.Or, mask)) {
                const idx = i + @ctz(@as(u16, @bitCast(mask)));
                self.values[idx] += delta;
                return;
            }
        }

        while (i < self.len) : (i += 1) {
            if (self.keys[i] == key) {
                self.values[i] += delta;
                return;
            }
        }

        self.keys[self.len] = key;
        self.values[self.len] = delta;
        self.len += 1;
    }

    pub inline fn get(self: *Map, key: u16) ?u64 {
        const vk: Vec = @splat(key);

        var i: usize = 0;
        while (i + LaneCount <= self.len) : (i += LaneCount) {
            const block: Vec = self.keys[i..][0..LaneCount].*; // array -> vector coercion, no alignment cast needed
            const mask = block == vk;

            if (@reduce(.Or, mask)) {
                const idx = i + @ctz(@as(u16, @bitCast(mask)));
                return self.values[idx];
            }
        }

        while (i < self.len) : (i += 1) {
            if (self.keys[i] == key)
                return self.values[i];
        }

        return null;
    }
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const alloc = gpa.allocator();

    var prng = std.Random.DefaultPrng.init(0);
    const rnd = prng.random();

    var keys: [KeyCount]u12 = undefined;
    for (&keys) |*k| {
        k.* = rnd.int(u12);
    }

    var simd_map = Map{};
    var timer = try std.time.Timer.start();

    var i: usize = 0;
    while (i < Iterations) : (i += 1) {
        const k = keys[i % KeyCount];
        simd_map.add(k, 1);
    }
    const simd_ns = timer.read();

    var map: std.AutoHashMapUnmanaged(u12, u64) = .empty;
    defer map.deinit(alloc);
    try map.ensureTotalCapacity(alloc, KeyCount);

    timer.reset();

    i = 0;
    while (i < Iterations) : (i += 1) {
        const k = keys[i % KeyCount];
        const entry = map.getOrPutAssumeCapacity(k);
        entry.value_ptr.* += 1;
    }

    const hash_ns = timer.read();

    std.debug.print(
        \\Iterations: {}
        \\Keys: {}
        \\SIMD map:  {} ns ({d:.2} ns/iter)
        \\Hash map:  {} ns ({d:.2} ns/iter)
        \\Speedup:   {d:.2}x
        \\
    , .{
        Iterations,
        KeyCount,
        simd_ns,
        @as(f64, @floatFromInt(simd_ns)) / Iterations,
        hash_ns,
        @as(f64, @floatFromInt(hash_ns)) / Iterations,
        @as(f64, @floatFromInt(hash_ns)) / @as(f64, @floatFromInt(simd_ns)),
    });
}

test "simple" {
    var map = Map{};
    const my_key = [3]u12{ 33, 52, 25 };
    const my_val = [3]u64{ 1, 2, 3 };

    for (my_key, my_val) |k, v| {
        map.add(k, v);
    }

    for (my_key, my_val) |k, v| {
        try std.testing.expect(map.get(k).? == v);
    }
}

Iterations: 50000000
Keys: 224
SIMD map:  194760994 ns (3.90 ns/iter)
Hash map:  366807991 ns (7.34 ns/iter)
Speedup:   1.88x

Just off the top of my head, a succinct data structure fits the bill. Your array is basically a sparse array. A succinct data structure can pack sparse arrays into much smaller space while maintaining ~O(1) access.

Given a bitmask and a dense ArrayList, you can do roughly:

bitmask: [4096]u1
values: ArrayList(u64)

if bitmask[i] == 0
    return NotFound
r = rank(bitmask, i)
return values.get(r)

The bitmask takes up 512 bytes and values takes up only the u64 entries actually in use. That can give you close to a 64x reduction in space.
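The rank() in the pseudocode above is essentially a popcount over a prefix of the bitmask. A minimal sketch (not the author's implementation; I'm assuming the 4096-bit mask is stored as 64 u64 words, and the names are mine):

```zig
const std = @import("std");

/// 4096-bit occupancy mask stored as 64 u64 words; `values` holds only
/// the used entries, dense and ordered by index.
const SuccinctNodes = struct {
    bits: [64]u64 = [_]u64{0} ** 64,
    values: []const u64 = &.{},

    /// rank: number of set bits strictly below `idx`.
    fn rank(self: *const SuccinctNodes, idx: u12) usize {
        const word: usize = idx >> 6;
        var r: usize = 0;
        for (self.bits[0..word]) |w| r += @popCount(w);
        const rem: u6 = @truncate(idx);
        if (rem != 0)
            r += @popCount(self.bits[word] & ((@as(u64, 1) << rem) - 1));
        return r;
    }

    fn get(self: *const SuccinctNodes, idx: u12) ?u64 {
        const bit = @as(u64, 1) << @as(u6, @truncate(idx));
        if (self.bits[idx >> 6] & bit == 0) return null;
        return self.values[self.rank(idx)];
    }
};
```

A real implementation would add a layer of precomputed block counts so rank() doesn't scan all 64 words; for a 4096-bit mask even the scan is only a cache line or two.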

When I get the time, I’ll put out a Zig implementation.


I think it depends on your requirements: if the only thing you care about is random access, then the wasted memory is usually irrelevant. The hash maps in the std library are auto-growing if you want them to have a dynamic size.

If you want fast iteration then I can see how having a large buffer that’s mostly empty is an issue. If the hashmaps provided by std don’t cut it for you then you can experiment with a swisstable map.

As promised, I’ve put down a Zig implementation for a succinct data structure for 4096 of u64 data items in the following link. It has enough tests to show that it’s working fine.

It’s a simple implementation with rank() only (no select() implementation, as there’s no use case for it here). It uses only one layer of prefix-sum blocks, as the data set is a small 4096-element set. Succinct data structures are usually for huge data sets with multiple customized layers; in this case one layer of blocks works well.

I haven’t run benchmarks yet. I expect it to be comparable to hash tables, as most operations are O(1). It does 3 memory reads on get(). Most hash tables do 2 memory reads at best on lookup (one read on the index, one on the data), and more when there are collisions. I expect it to be worse than direct array access, which needs only 1 memory read.

When it comes to size, yes, you can have your cake and eat it, too. It’s very compact, close to the information-theoretic lower bound, and it’s very fast: O(1) all around. Succinct data structures are amazing.


Nice. Thank you very much. I will try this during my tests.
I will also try a hashmap, just to compare the nodes per second.

Great work @ww520, I’m a big fan of bit-twiddling for fun and profit. And Zig makes it fun ^_^!

That might be optimal given the smallish-power-of-two constraint on the structure, but I wanted to also link to Russ Cox’s sparse array article, which, as he points out, is an old trick but a good one.

The big advantage of using the technique linked above: resetting the structure is a matter of clearing one value, which in practice will be in a register. If the workload of the set involves using one at a time in a hot loop, then clearing memory can dominate, and this approach to a sparse array eliminates that.

It also doesn’t need to be zeroed before use, which makes it fairly choice as a stack data structure as well: just bump the stack pointer up, set a couple registers to point into the slice, set a counter to zero, and go to town.
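For concreteness, here is a sketch of that sparse-set trick applied to the nodes_spent case (my own adaptation, assuming at most 224 distinct root moves between clears; not tested against an engine):

```zig
const std = @import("std");

/// Briggs–Torczon sparse set. `sparse` never needs to be zeroed:
/// membership is validated by the round trip sparse[key] -> dense[...] == key,
/// and dense entries below `len` are always initialized, so garbage in
/// `sparse` can never produce a false positive.
const SparseNodes = struct {
    sparse: [4096]u16 = undefined,
    dense: [224]u16 = undefined,
    nodes: [224]u64 = undefined,
    len: u16 = 0,

    /// O(1) reset: no memory is cleared.
    fn clear(self: *SparseNodes) void {
        self.len = 0;
    }

    fn add(self: *SparseNodes, key: u12, delta: u64) void {
        const s = self.sparse[key];
        if (s < self.len and self.dense[s] == key) {
            self.nodes[s] += delta; // key already present
            return;
        }
        // insert a new key at the end of the dense arrays
        self.sparse[key] = self.len;
        self.dense[self.len] = key;
        self.nodes[self.len] = delta;
        self.len += 1;
    }
};
```

Iteration is also trivially dense: just walk dense[0..len] / nodes[0..len], which is the 20-40 live entries rather than all 4096 slots.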


Thanks for the paper. It’s good to see earlier techniques on the subject. Resetting the array length to clear the array is a neat trick. The sparse array technique described by Russ Cox (from Briggs and Torczon’s paper) is an interesting idea: it keeps bi-directional indices between the sparse index array and the dense array. In terms of space, it requires two words (forward and backward indices) for every index in the index space, i.e. N x 2 words for a [0, N) index range.

Succinct data structures are mostly a recent development. There are a number of approaches, and new ones come up from time to time. The prefix-sum approach I used requires 1 bit for every index in the index space and is relatively easy to implement. Other approaches, like Elias-Fano encoding, achieve even crazier compression. The amazing thing is that all of them support O(1) query and O(1) iteration. The only downside is that they are static structures, i.e. you cannot add/remove items after creation. The prefix-sum approach, while not having the most optimal compression, still allows dynamic add/remove at a cost: adding items in order is O(1), but adding out-of-order items is ~O(number of items).


Two of whatever width you need; for 4096 that’s realistically two bytes each. I wouldn’t want to stuff five u12s per eight bytes, four u16s seem more than acceptable.

But to your general observation, it’s absolutely a space-for-time tradeoff, and it’s hard to guess in advance how those are going to turn out. 8192 bytes is two pages usually, but well less than one on my laptop, and small-seeming details of how the hot loop progresses affect whether or not it’s a good tradeoff to make.

I play in this arena myself, as it happens. I have a few tweaks to RuneSet in the pipeline, one of which is, believe it or not, a space-for-time tradeoff :slight_smile:

Succinctness does tend to lead to that property, but if they’re fast enough to work with then just treat them as immutable. Immutability is better when we can get away with it anyway, Rich Hickey is right about the consequences of “place-oriented programming”, even though under the hood, if you will, there’s no other kind.

I do have to remind myself sometimes that the point isn’t to make the smallest possible data structure (except when there are papers to write; then it’s very much the point), it’s to make efficient ones, and compression / succinctness is but a good means to that end. Prefix-sum sounds like it lands somewhere on the Pareto frontier; it just depends on whether the application can inhabit that region.


It’s very impressive to come up with a novel way to do membership tests on UTF-8 in RuneSet. I can’t say I know much about the UTF-8 encoding; I know the variable-length code points make things difficult.

One thing comes to mind. I’m not sure if this is relevant. A difficult thing on dealing with UTF-8 is the access to random positions of a string, due to variable length character encoding. Some succinct data techniques might be helpful in this regard since we’re on the topic.

Create a bit array in parallel to the bytes of the UTF-8 string, 1 bit per 1 byte. Set a bit to 1 for the beginning of a code point, 0 for the rest of the bytes of a code point. E.g. for a string with 3 code points, [s1, a, b], [s2, x, y, z], [s3, w] => [1, 0, 0], [1, 0, 0, 0], [1, 0] => 100100010.

Let’s call the bit array B. select(B, i) gives the bit position of the i-th 1-bit. E.g. select(B, 0) = 0, select(B, 1) = 3, select(B, 2) = 7. The bit position is the byte position of the starting byte of the code point. We’ve just located the beginning of the code point by position! This allows random access to any position of a UTF-8 string in O(1), as the bit array is a succinct structure with an O(1) select() operation.
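A linear-scan sketch of select() over such a bit array, stored as u64 words (a real succinct select layers precomputed block counts on top to reach O(1); the names here are mine):

```zig
const std = @import("std");

/// select(bits, i): bit position of the i-th set bit (0-based),
/// or null if there are fewer than i+1 set bits.
fn select(bits: []const u64, i: usize) ?usize {
    var remaining = i;
    for (bits, 0..) |w, word_idx| {
        const pc = @popCount(w);
        if (remaining < pc) {
            // target bit is inside this word: clear the lowest set bit
            // `remaining` times, then the answer is the new lowest set bit
            var word = w;
            var r = remaining;
            while (r > 0) : (r -= 1) word &= word - 1;
            return word_idx * 64 + @ctz(word);
        }
        remaining -= pc;
    }
    return null;
}
```

With the post's example B = 100100010, select finds byte offsets 0, 3, and 7 for the three codepoint starts.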

You don’t need a bit array to find the beginning of a codepoint (see quote below).

And if you want to get the codepoint index, I think it would be easier to just put fence-posts as separators into a big string to separate it into regions of smaller codepoint sequences (for example for batched parallel processing in a linear forward way). As Wikipedia notes (UTF-8 - Wikipedia):

… it is self-synchronizing so searches for short strings or characters are possible; and the start of a code point can be found from a random position by backing up at most 3 bytes. The values chosen for the lead bytes means sorting a list of UTF-8 strings puts them in the same order as sorting UTF-32 strings.

The UTF-8 bytes already contain enough information to find the codepoint starts.
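As a sketch of that self-synchronizing property: continuation bytes always match 0b10xxxxxx, so backing up to a codepoint start is a two-line loop (the function name is mine):

```zig
/// Back up from an arbitrary byte offset to the start of its codepoint.
/// UTF-8 continuation bytes match 0b10xxxxxx, so this steps back at most 3 bytes.
fn codepointStart(s: []const u8, pos: usize) usize {
    var i = pos;
    while (i > 0 and (s[i] & 0xC0) == 0x80) : (i -= 1) {}
    return i;
}
```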

Finding the codepoint index from a random position could be done by just updating a counter while iterating from the previous fence-post; basically the fence-posts could serve as an index structure to skip part way into the string and then iterate the remaining code points.

I think a bit array could still be useful while transforming data or to mark special things / properties about the string (reminds me of Hyper-optimizing the Zig Parser). Another thing to note is that you could still be in the middle of a grapheme cluster, so just finding the beginning of a code point isn’t enough.

I am not very experienced with unicode stuff, the conversation just reminded me of fragments I read here and there.


Ah, I didn’t know UTF-8 has a fence-post encoding that identifies the starting byte of a code point by searching backward for it. Learned something today.

The companion bit array is a succinct-structure technique to add “fence posts” to data segments of varying lengths and quickly jump to a segment by position.

Actually, wait. Although a UTF-8 string has self-identifying fence-post bytes marking the start of each code point, we still don’t know how far to jump before searching for the fence-post byte. We don’t know how many code points (with their varying lengths) to skip to locate a character by position. Say we want to jump to the 4200th character of a string: how do we know how many bytes to skip to start the search? The bit array might still be useful.


This was considered a serious problem for some reason; Python 3 totally mutilated its string type in order to support O(1) access to the nth character of a string.

But this turns out to be very nearly useless. No one needs the 50th arbitrary character in a string; they need the end of the sentence, the next occurrence of bleh, or a fragment which will fit in the available amount of PDF / terminal screen / whatever. But “whatever the nth codepoint is, I need that, and I need it in constant time” is just not a useful property to maintain.

On the bright side, your plan for how to do it is cheap (one bit per byte of overhead, not bad) and sensible, which I can’t say for most “we need this string to have random access!!” decisions I’ve seen in the wild. It just turns out not to be as handy as programmers who grew up on ASCII were inclined to think it would be.

I don’t think it’s constant time though, not without an infinite-width POPCNT. Might be faster than just looking for it, if there were enough accesses to amortize creating the bitmap.
