Code review - unicode string trie

Hello again!

I’m hoping y’all would be willing to review some beginner code! I’m working through Crafting Interpreters (kinda), currently on string interning, and I built a Unicode code point aware string trie which can return a known u32-sized integer for each distinct string.

Any comments would be much appreciated, thank you!

const std = @import("std");
const rune = @import("code_point").CodePoint;
const RuneIterator = @import("code_point").Iterator;

pub const Node = struct {
    alloc: std.mem.Allocator,
    terminal: bool,
    children: std.AutoHashMap(rune, Node),

    pub fn init(self: *Node, terminal: bool, alloc: std.mem.Allocator) void {
        self.alloc = alloc;
        self.terminal = terminal;
        self.children = std.AutoHashMap(rune, Node).init(alloc);
    }

    pub fn deinit(self: *Node) void {
        var iter = self.children.iterator();
        while (iter.next()) |entry| {
            entry.value_ptr.deinit();
        }

        self.children.deinit();
    }

    pub fn insert(self: *Node, iterator: *RuneIterator) !*Node {
        // if there's another character, keep walking the trie
        if (iterator.next()) |r| {
            const result = try self.children.getOrPut(r);
            if (!result.found_existing) {
                result.value_ptr.init(false, self.alloc);
            }

            return result.value_ptr.insert(iterator);
        }

        // if we hit the end of the iterator, we're a terminal
        else {
            self.terminal = true;
            return self;
        }
    }
};

const Self = @This();

alloc: std.mem.Allocator,
index: std.AutoArrayHashMap(*Node, void),
root: *Node,

pub fn init(alloc: std.mem.Allocator) !Self {
    const root = try alloc.create(Node);
    root.init(false, alloc);

    return .{
        .alloc = alloc,
        .index = std.AutoArrayHashMap(*Node, void).init(alloc),
        .root = root,
    };
}

pub fn deinit(self: *Self) void {
    self.root.deinit();
    self.alloc.destroy(self.root);

    self.index.deinit();
}

pub fn insert(self: *Self, iterator: *RuneIterator) !u32 {
    const node = try self.root.insert(iterator);
    const entry = try self.index.getOrPut(node);

    return @truncate(entry.index);
}

pub fn insert_string(self: *Self, input: []const u8) !u32 {
    var iterator = RuneIterator{ .bytes = input };
    return self.insert(&iterator);
}

test "trie stuff" {
    var trie = try Self.init(std.testing.allocator);
    defer trie.deinit();

    const idxa = try trie.insert_string("hello");
    const idxb = try trie.insert_string("goodbye");
    const idxc = try trie.insert_string("hello");

    try std.testing.expect(idxa == idxc);
    try std.testing.expect(idxa != idxb);
}

I think it looks good. The only thing that caught my eye was the @truncate; why not @intCast which would let you know if data would be lost?

A question that popped in my head is: How would you search for a string in this trie?
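
For concreteness, I’d picture a lookup as a read-only walk of the same structure, something like this untested sketch against the Node above (contains is a name I’m making up here):

pub fn contains(self: *const Self, input: []const u8) bool {
    var iterator = RuneIterator{ .bytes = input };
    var node: *const Node = self.root;

    // walk one code point at a time, mirroring insert
    while (iterator.next()) |r| {
        node = node.children.getPtr(r) orelse return false;
    }

    // only strings that were actually inserted end on a terminal node
    return node.terminal;
}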

I personally would not give the Nodes an allocator reference. They don’t actually use it except to pass down to other structures.

The most direct reason is that you could pass in allocators as arguments when you need them and spare having tons of references to the same allocator.

The other reason is that this encourages an object oriented design. Each object exists as its own island that can make syscalls to get more memory. I think of it this way - what is the value of a Node without the Trie? If there isn’t one, it shouldn’t have its own allocator. It should be controlled by a parent that knows where to put it and the memory that it presides over.
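
Concretely, the Node-level insert could take the allocator as a parameter instead (an untested sketch of the suggestion, assuming the alloc field is removed from Node):

pub fn insert(self: *Node, alloc: std.mem.Allocator, iterator: *RuneIterator) !*Node {
    if (iterator.next()) |r| {
        const result = try self.children.getOrPut(r);
        if (!result.found_existing) {
            // the child uses the caller's allocator; nothing is stored per node
            result.value_ptr.* = .{
                .terminal = false,
                .children = std.AutoHashMap(rune, Node).init(alloc),
            };
        }

        return result.value_ptr.insert(alloc, iterator);
    }

    self.terminal = true;
    return self;
}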

I don’t think I really need to search the trie; this is really just for getting unique indices for shortish strings; think of it as a deduplicating set, rather than a lookup. Inserting a string into the trie returns the node if it’s already terminal, which is mostly a search right now anyway.

Per @truncate: the array hashmap returns the index as usize, but in my VM it’s important that it’s u32 or smaller. I can’t imagine needing more than a u32’s worth of indices, so I’m just chopping off the high bits.

This is great feedback, thank you! I’ll be back with an update!

@AndrewCodeDev @dude_the_builder

thanks both of you for your feedback; I’ve got a cleaned up version which passes a reference to the trie along the insert call chain, and removes the allocator reference from the nodes.

I also added an assert above the @truncate to hopefully clarify my intention behind using that rather than @intCast.

Thanks again!

trie.zig:

//! a table which interns strings to a unique index
//! inserting a string will return a unique number per distinct string

const std = @import("std");
const rune = @import("code_point").CodePoint;
const RuneIterator = @import("code_point").Iterator;

const Self = @This();

root: *Node,
count: usize,
index: std.AutoArrayHashMap(*Node, void),
alloc: std.mem.Allocator,

/// initialize the trie and index set
/// the root node is heap-allocated with `alloc`; deinit frees it
pub fn init(alloc: std.mem.Allocator) !Self {
    const root = try alloc.create(Node);
    root.init(alloc);

    return .{
        .alloc = alloc,
        .index = std.AutoArrayHashMap(*Node, void).init(alloc),
        .root = root,
        .count = 0,
    };
}

/// clean up the trie, its nodes, and the index
pub fn deinit(self: *Self) void {
    self.root.deinit();
    self.alloc.destroy(self.root);

    self.index.deinit();
}

/// insert the given UTF-8 encoded string into the trie, and return its unique entry index
/// returns an error if the input is empty
pub fn insert(self: *Self, input: []const u8) !u32 {
    if (input.len == 0)
        return error.EmptyString;

    var iterator = RuneIterator{ .bytes = input };
    const node = try self.root.insert(self, &iterator);
    const entry = try self.index.getOrPut(node);

    // fits in a u32
    std.debug.assert(entry.index < 0x100000000);
    return @truncate(entry.index);
}

const Node = struct {
    terminal: bool,
    children: std.AutoHashMap(rune, Node),

    pub fn init(self: *Node, alloc: std.mem.Allocator) void {
        self.terminal = false;
        self.children = std.AutoHashMap(rune, Node).init(alloc);
    }

    pub fn deinit(self: *Node) void {
        var iter = self.children.iterator();
        while (iter.next()) |entry| {
            entry.value_ptr.deinit();
        }

        self.children.deinit();
    }

    pub fn insert(self: *Node, trie: *Self, iterator: *RuneIterator) !*Node {
        // if there's another character, keep walking the trie
        if (iterator.next()) |r| {
            const result = try self.children.getOrPut(r);
            if (!result.found_existing) {
                trie.count += 1;
                result.value_ptr.init(trie.alloc);
            }

            return result.value_ptr.insert(trie, iterator);
        }

        // if we hit the end of the iterator, we're a terminal
        else {
            self.terminal = true;
            return self;
        }
    }
};

test "trie stuff" {
    var trie = try Self.init(std.testing.allocator);
    defer trie.deinit();

    const idxa = try trie.insert("hello");
    const idxb = try trie.insert("goodbye");
    const idxc = try trie.insert("hello");

    try std.testing.expect(idxa == idxc);
    try std.testing.expect(idxa != idxb);
}

test "no empty strings" {
    var trie = try Self.init(std.testing.allocator);
    defer trie.deinit();

    if (trie.insert("")) |_| {
        try std.testing.expect(false);
    } else |err| {
        try std.testing.expect(err == error.EmptyString);
    }
}

Looks better!

I like the assert - it shows that you are intentionally truncating.

One thing that is arguably clearer is to use std.math.maxInt(u32) in the assert, because then you don’t need to comment the magic number:

std.debug.assert(entry.index <= std.math.maxInt(u32));

Anyhow, it’s looking nice - what are your thoughts about it so far? In terms of self-code review, what was the most challenging part of getting this going in Zig?

Honestly, I’m enjoying the heck out of it; I’m a Rubyist by trade, and don’t get into “systems” languages very often, though I have the same amount of C as anyone who grew up in the 90s does :slight_smile:

Rust seems interesting, but I would reach for OCaml if I wanted something shaped like an ML, and every time I go to write something in C I remember how annoyed I get at it (mostly at the complete lack of modularity/namespacing).

My favorite part of Zig so far is really that there’s like 7 concepts and the rest falls into place pretty easily; “structs are modules are namespaces” in particular feels really good to me.

comptime didn’t really faze me (I’ve done enough Common Lisp to have dealt with “this macro needs this function but it’s defined at runtime”).

The real driver to pick Zig up again (I played around in 0.9.0) was the automatic docs site getting put up. Being able to dig in and read the source of std made me realize that it’s really really not that complicated, which was a pleasant surprise.

e.g. after reading the source of the intrusive doubly_linked_list struct, I realized that not only would implementing my own subset be easy, but that it pretty much reads as you’d expect!

I’m certain there are places in std that are super complicated (gpa, probably, etc) but from what I’ve seen Zig has a penchant for clarity that makes it seem like nbd.


One open question, actually; say I have this code:

hash: std.ArrayList(Something),

pub fn init(alloc: std.mem.Allocator) Self {
  return .{
    .hash = std.ArrayList(Something).init(alloc),
  };
}

Note that I’m calling std.ArrayList(Something) twice: do those return different type instances?

I recognize that the code has the behavior I’m looking for, but it makes me wonder if it would be better to add a const SomethingList = std.ArrayList(Something), so that it’s not called twice.

On the one hand, I could see the compiler returning the same type for both calls, but that feels like more magic than I’d expect from Zig; i.e. that would imply some kind of comptime caching which I haven’t seen in the docs, and would change my understanding of comptime, which is very “Common Lisp macro expansion time” right now.

Anyway, thank you!

Nope - those resolve to the same type. Comptime calls with identical arguments are memoized, so std.ArrayList(Something) gives back the exact same type in both places; it’s like writing i32 in two separate spots. I get why that could be confusing since it’s a function returning a type. That’s all comptime though.
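
You can check that claim directly in a tiny test (a sketch, nothing beyond std):

const std = @import("std");

test "repeated generic instantiation yields the same type" {
    // both calls resolve at comptime to the exact same type
    try std.testing.expect(std.ArrayList(u8) == std.ArrayList(u8));
}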

Interesting post though - I’m generally interested in how people get around to Zig.

std.testing.expectError
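
That is, the “no empty strings” test above can be collapsed to something like this (a sketch using the same setup as your test):

test "no empty strings" {
    var trie = try Self.init(std.testing.allocator);
    defer trie.deinit();

    // fails unless insert("") returns exactly error.EmptyString
    try std.testing.expectError(error.EmptyString, trie.insert(""));
}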

Observation: in a safety-checked build, this is the same thing that @intCast will do. The difference is that if entry.index ends up too large in ReleaseFast/ReleaseSmall, @truncate will do something predictable, but @intCast is in undefined behavior land. It might do what @truncate does, but it’s a bad idea to expect anything out of undefined behavior.

The question you want to answer is: what should happen here? Truncation will wrap around when the high bits are dropped; with @intCast, all bets are off. If you want the error condition to persist in every build mode, you can test with an if statement and panic. If you think no realistic program will ever have 2^32 - 1 strings, which is a respectable choice, you can use either @truncate with the assert, or @intCast.
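
The if-and-panic option might look something like this (a sketch, not from the posted code):

// keep the range check in every build mode, then cast knowing the value fits
if (entry.index > std.math.maxInt(u32)) @panic("intern table overflowed u32 indices");
return @intCast(entry.index);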

I would favor an @intCast here, personally; I like @truncate better when it’s valid for the truncation to take place. But my intention is to help familiarize you with how these things work together, and get you thinking about what happens to the code when assertions go away.

Just to clarify, what’s the benefit of storing codepoints here? Unless I’m missing something, as long as you know that the strings always use the same encoding, storing a trie composed of e.g. UTF-8 bytes would work just as well.
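
Concretely, that would mean keying the children on bytes and recursing on a slice instead of a code point iterator (an untested sketch against the second version’s Node, assuming init is adjusted for the u8 key):

children: std.AutoHashMap(u8, Node),

pub fn insert(self: *Node, trie: *Self, bytes: []const u8) !*Node {
    // end of the string: this node becomes a terminal
    if (bytes.len == 0) {
        self.terminal = true;
        return self;
    }

    // otherwise descend on the next raw UTF-8 byte
    const result = try self.children.getOrPut(bytes[0]);
    if (!result.found_existing) {
        trie.count += 1;
        result.value_ptr.init(trie.alloc);
    }

    return result.value_ptr.insert(trie, bytes[1..]);
}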

Either way, though, you might be interested in the technique described here:

There’s also std.math.cast which will return null on overflow in all optimization modes.
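
For example (a sketch of how it could slot into insert, with a made-up error name):

// null means the usize index does not fit in a u32
const small = std.math.cast(u32, entry.index) orelse return error.TooManyInternedStrings;
return small;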

@mnemnion fantastic feedback all around, but this in particular I very much appreciate, and I am grateful for your help :slight_smile:

@squeek502 that’s… a good point; and thank you for that link, hash contexts seem really powerful, I have to think about what it means to be able to swap out equality when searching…

between your making me rethink needing to worry about breaking on code points (at least in backing storage), the link’s example of a buffer + hashtable for indexing, and this blog post about path compression, I’m going to give this another shot.

thank you for your time!

This is frankly a great point about casting integers that isn’t considered enough. I’ve worked in situations where you only want the top or bottom bits of an integer and then it makes sense to shift and truncate because it is specifically truncation that should occur.

For typical purposes, we really should be promoting @intCast because that’s what’s actually intended here - “give me the same numeric value, but in a smaller int because I know that it will be properly represented”.

I’ve changed my opinion on this because I agree with @mnemnion.
