Math perfect hash question

ericlang · June 7, 2025, 10:31pm

I am quite bad at real math. Maybe someone knows an answer to this…
If not I will also join some math forum
After a day of trial and error I resigned. It must be possible with some binomial trick.
Then I tried one hour with chat gpt, that thing is even more dumb than I am.
Any math wizards here?

I need a perfect hash for a (encoded) string.
In my program characters are encoded to u6 like A = 1, B = 2 etc.
I would like (instead of using a hash table) to map a string to a number and use an array.
The order does not matter so AABD wil map to the same number as DBAA.
And I need the reverse function as well!

fn to_number(str: std.BoundedArray(u6, 7) u32 {
}

fn from_number(u: u32) std.BoundedArray(u6, 7) {
}

Also it would be nice to have to number of combinations possible as a function.

fn max_combinations(max_len: u3) u32 {}

Numbers will probably map to something like:

A = 0 or 1
AA = 1 or 2
AAA = 2 or 3
...
ZZZZZZZ = some_big_number

Retro_Dev · June 7, 2025, 11:32pm

This seems like a problem I would enjoy working on. Is this alphabetic data only? Also, are the “words” a maximum of 7 letters each? If so, there’s roughly ~33.3 bits of entropy right there. It might work for you to simply convert the “base 27” (the extra symbol is representing a termination) to binary, hoping that you don’t run into collisions.

Of course, you said “perfect hash” here. It’s not possible for 34 bits to fit into 32 bits, so you need to know what inputs are (or are not) possible for this function. The cheapest way out would be to use something like Andrew Kelley’s Comptime Perfect Hashing. If you want a general purpose tool for finding a collision-free mapping between a string and an integer, then gperf is what you want.

Before I keep rambling, do you already know the set of all possible encoded strings? Or can they be anything.

ericlang · June 7, 2025, 11:47pm

Ah actually… It is for a min length of 1 and a max length of 6 which will be less huge. I think 7 is too long to fit inside a ‘normal’ sized array.
The table is meant to be a precalculated one with heuristic values for each possible scrabble-rack after a player has made his move on the board.
This heuristic is helping my scrabble engine with evaluations.
That calculation is a heavy one, comparing all possible racks against all possible words and determining how good or bad it is.

It is alphabetical data only. Some languages have too much letters in the alphabet. For this case I just need a real hash table. But for 26 letters an array would hugely speedup things.

I will have a look at your 2 links.

Edit: So the exact number of possibilities is known.

Retro_Dev · June 8, 2025, 12:22am

So 26 symbols, at a length of [1, 6]. The length will need to be encoded somehow. For simplicity, I’ll use variable length encoding. Each letter can be encoded as a base-27 digit, with a 0 digit as the “terminator”.

Here’s some code that I think should do the trick:

EDIT: woops, I think that the decoder reverses the data. I’ll fix that rn.
EDIT 2: fixed & loosely tested

const std = @import("std");
const assert = std.debug.assert;
const Rack = std.BoundedArray(u6, 6);

fn numberFromRack(rack: Rack) u32 {
    assert(rack.len >= 1 and rack.len <= 6);
    var num: u32 = 0;

    for (0..rack.len) |idx| {
        const rev_idx = rack.len - idx - 1;
        const letter = rack.buffer[rev_idx];
        assert(letter <= 25);
        num = num * 27 + letter + 1;
    }

    return num;
}

fn rackFromNumber(encoded: u32) Rack {
    var rack: Rack = .{ .len = 0 };
    var num: u32 = encoded;

    while (num != 0) : (num /= 27) {
        const digit: u6 = @intCast(num % 27);
        assert(digit != 0);
        rack.buffer[rack.len] = digit - 1;
        rack.len += 1;
    }

    assert(rack.len >= 1 and rack.len <= 6);
    return rack;
}

test "conversions" {
    const rack_list: []const Rack = &.{
        .{ .len = 1, .buffer = .{ 0, 0, 0, 0, 9, 0 } },
        .{ .len = 2, .buffer = .{ 0, 1, 0, 0, 0, 0 } },
        .{ .len = 6, .buffer = .{ 0, 1, 2, 3, 4, 5 } },
        .{ .len = 4, .buffer = .{ 3, 2, 1, 9, 0, 0 } },
    };

    for (rack_list) |rack| {
        const num = numberFromRack(rack);
        const decoded = rackFromNumber(num);

        try std.testing.expectEqualSlices(
            u6,
            rack.buffer[0..rack.len],
            decoded.buffer[0..decoded.len],
        );
    }
}

EDIT 3: Out of curiosity, why is the BoundedArray in your first example code using a u6? if this were to just be alphabetic data -26 letters - then I think u5 would work. Am I missing something?

ericlang · June 8, 2025, 8:19am

Oh! That code looks insanely short. I’ll check it out and let you know here.

There is this list of languages for which scrabble was made.
Most of them are representable with u5. But some have even 46 characters! So I chose u6.
There is some wiki with the scrabble distributions.

Afrikaans
Arabic
Bulgarian
Catalan
Croatian
Czech
Danish
Dutch
English
Estonian
Faroese
Finnish
French
German
Hungarian
Icelandic
Indonesian
Irish
Italian
Latin
Latvian
Malagasy
Malay
Norwegian
Polish
Portuguese
Romanian
Scottish Gaelic
Slovak
Slovenian
Swedish
Turkish
Welsh
Greek
Hebrew
Russian
Ukrainian

ericlang · June 8, 2025, 9:27am

Ok it produces codes.
But the math problem is it should be an array without any holes in it.
When we run this:

    for (1..30) |i|
    {
        std.debug.print("magicnummer: {} -> ", .{i});
        const magic: u32 = @truncate(i);
        const rack = rackFromNumber(magic);
        std.debug.print("rack slice: {any} \n", .{ rack.slice()});
    }

We get this:

// 1 to 23 ok
magicnummer: 24 -> rack slice: { 23 }
magicnummer: 25 -> rack slice: { 24 }
magicnummer: 26 -> rack slice: { 25 }
magicnummer: 27 -> thread 10368 panic: reached unreachable code

Also BC and CB should produce the same code. When we run this:

pub fn check() void {
    const rack_list: []const Rack = &.{
        .{ .len = 2, .buffer = .{ 2, 1, 0, 0, 0, 0 } },
        .{ .len = 2, .buffer = .{ 1, 2, 0, 0, 0, 0 } },
    };

    for (rack_list) |rack| {
        const num = numberFromRack(rack);
        std.debug.print("{any} -> {}\n", .{rack.slice(), num});
    }
}

We get:

{ 2, 1 } -> 57
{ 1, 2 } -> 83

(edit: but that could just be a matter of sorting before decoding)

Justus2308 · June 8, 2025, 1:59pm

I don’t have a solution, but you can map your problem to ranking/unranking permutations of a multiset.

Multisets because you can represent your problem as a set/array [n]u3 where n is the number of unique characters you have +1 to represent no character. Then you can generate a unique mset for each permutation by just adding to the respective counters for each element in your permutation. Maybe std.enums.BoundedEnumMultiset is useful here.

The amount of possible permutations is the multiset coefficient ((n k)).

Now you just need to find a bijection from mset permutations to [0, ((n 7))]
There are papers out there about this problem, but I‘m not knowledgable enough in combinatorics to give a definitive answer here.

I‘ve found this: Zamboch: Ranking permutations of multiset - more details and this: Theoretical Aspects of Computing -- ICTAC 2011: 8th International Colloquium, Johannesburg, South Africa, August 31 -- September 2, 2011, Proceedings | SpringerLink (p. 218 if you have some way to access this)

Maybe there is also some way to avoid factoring in permutations that contain the non-existing character in the middle of their sequence since you will never encounter them.
I guess you could build 7 different sets and discriminate based on the length of your string first and just set n to the amount of characters you have. Then you could just concatenate your bijections to [0…((n k))] and add the appropriate offset.

LucasSantos91 · June 8, 2025, 3:59pm

const std = @import("std");

fn Character(
    comptime character_count: comptime_int,
) type {
    return std.math.IntFittingRange(
        0,
        character_count + 1, // +1 for whitespace
    );
}

fn maxIndex(
    comptime word_length: comptime_int,
    comptime character_count: comptime_int,
) comptime_int {
    const a: comptime_int = std.math.powi(usize, character_count + 1, word_length) catch unreachable;
    return a - 1;
}

fn Index(
    comptime word_length: comptime_int,
    comptime character_count: comptime_int,
) type {
    return std.math.IntFittingRange(0, maxIndex(word_length, character_count));
}

fn hash(
    comptime word_length: comptime_int,
    comptime character_count: comptime_int,
    word: [word_length]Character(character_count),
) Index(word_length, character_count) {
    const adjusted_character_count = character_count + 1; // 1 for whitespace
    var count: [adjusted_character_count]Character(character_count) = .{0} ** adjusted_character_count;
    for (word) |char| count[char] += 1;
    const Result = Index(word_length, character_count);
    var result: Result = 0;
    var base: Result = 1;
    for (count[1..], 1..) |this_char_count, character| {
        const first_term = character * base;
        const pow = std.math.powi(Result, adjusted_character_count, this_char_count) catch unreachable;
        base *= pow;
        const multiplier = @divExact(pow - 1, character_count);
        const to_add = first_term * multiplier;
        result += @intCast(to_add);
    }
    return result;
}

const test_namespace = struct {
    const test_character_count = 26;
    const test_word_length = 7;
    fn testHash(word: []const u8) void {
        var word_adjusted: [test_word_length]std.math.IntFittingRange(0, character_count) = undefined;
        for (word, word_adjusted[0..word.len]) |src, *dst| {
            dst.* = @intCast(src - 'A' + 1);
        }
        @memset(word_adjusted[word.len..], 0);
        const index = hash(test_word_length, test_character_count, word_adjusted);
        std.debug.print("{}\n", .{index});
    }
};

pub fn main() void {
    const testHash = test_namespace.testHash;
    testHash("");
    testHash("A");
    testHash("AABC");
    testHash("BCAA");
    testHash("ZZZZZZZ");
}

output:

0 
1
60535
60535
10460353202

godbolt

Explanation:
Like @Retro_Dev mentioned, we can encode the index as a base-27 number:

A A B Z 0 0 0
=
[1] [1] [2] [26] [0] [0] [0]

Each one of the number in brackets is a single digit, in a base-27 number. The number is in little-endian for convenience. To convert to decimal:

[1] [1] [2] [26] [0] [0] [0]
=
1 * 27^0 + 1 * 27^1 + 2 * 27^2 + 26 * 27^3 + 0 * 27^4 + 0 * 27^5 + 0 * 27^6

Since order doesn’t matter, we can order the word lexicographically. Empty space counts as 0.
Note that:

A A A 0 0 0 0
=
1 * 27^0 + 1 * 27^1 + 1 * 27^2

Which is the sum of a geometric progression, with first element 1 * 27^0, ratio 27 and number of elements 3.

1 * 27^0 + 1 * 27^1 + 1 * 27^2
=
1 * 27^0 *(27^3 - 1)/(27 - 1)

In the implementation, we don’t actually waste time ordering, we just count the ocurrence of each digit.

Retro_Dev · June 9, 2025, 7:55am

ok, I did it. Turns out that I had been thinking about the issue a bit wrong. Now everything is densely packed, and still pretty simple. This doesn’t work with alphabets with more than 26 symbols though - Let me know if you really care about that feature and I’ll adjust how this works. (The maximum number of symbols that an alphabet can have for this method would be 40, if you are still constrained to fitting inside of a u32)

EDIT: I forgot you said “Also BC and CB should produce the same code.” - That’s a problem for another day. In this code they are distinct.

EDIT 2: More optimal code: The Zig Pastebin

const Rack = std.BoundedArray(u5, 6);

fn numberFromRack(rack: Rack) u29 {
    assert(rack.len >= 1 and rack.len <= 6);
    var num: u29 = 0;

    for (0..rack.len) |idx| {
        const rev_idx = rack.len - idx - 1;
        const letter = rack.buffer[rev_idx];
        assert(letter < 26);
        num = num * 26 + letter;
    }

    return num + encodeLen(@intCast(rack.len));
}

fn rackFromNumber(encoded: u29) Rack {
    const len = decodeLen(encoded);
    var rack: Rack = .{ .len = len };
    var num: u29 = encoded - encodeLen(len);

    for (0..len) |idx| {
        rack.buffer[idx] = @intCast(num % 26);
        num = num / 26;
    }
    assert(num == 0);

    return rack;
}

fn decodeLen(encoded: u29) u3 {
    return switch (encoded) {
        0...25 => 1,
        26...701 => 2,
        702...18277 => 3,
        18278...475253 => 4,
        475254...12356629 => 5,
        12356630...321272405 => 6,
        else => unreachable,
    };
}

fn encodeLen(len: u3) u29 {
    return switch (len) {
        1 => 0,
        2 => 26,
        3 => 702,
        4 => 18278,
        5 => 475254,
        6 => 12356630,
        else => unreachable,
    };
}

test "every possible encoding" {
    for (0..321272406) |num| {
        const rack = rackFromNumber(@intCast(num));
        const decoded = numberFromRack(rack);
        try std.testing.expectEqual(num, decoded);
    }
}

ericlang · June 9, 2025, 8:34am

Thank you all very much for the posts, links, code, explanations!
Probably I have no time coming 2 weeks because I’m going to Holland. After that I’ll be back.

ericlang · June 9, 2025, 6:36pm

On my last free day today dived a bit into all kinds of permutations and the links provided above…

A little extension on the problem which will probably hugely compress the permutation - but makes it even more complex - is the scrabble letter distribution.
Most letters are very restricted in how often they can occur.
In the meanwhile I also asked a math wizard I know.

index :  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
letter:  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
distr :  9  2  2  4  12 2  3  2  9  1  1  4  2  6  8  2  1  6  4  6  4  2  2  1  2  1
max   :  6  2  2  4  6  2  3  2  6  1  1  4  2  6  6  2  1  6  4  6  4  2  2  1  2  1

I gave it a new name: constrained multiset permutation

Retro_Dev · June 9, 2025, 9:56pm

This problem would very well fit a bit-level adaptive arithmetic coder… It would perfectly handle the distributions of letters, and the probability ranges of certain letters could be adjusted to assume that only a “sorted” input was provided (meaning that AB and BC would have the same value) - this would make the encoding more compact. The problem with this method is that it is 1. slower, 2. hard to reason about, and 3. it might have gaps. I’ll think about this a bit longer, I think if I were to write an encoder/decoder for this, it may be ~250 LOC.