StringHashMap

ericlang · June 2, 2025, 8:43pm

Still some confusion here with my first real hashmap adventures
(yes i had a lot of memory corruptions but solved them and learned something).

Do I need to free the keys of the hashmap?

var map: StringHashMap(u8) = .empty;

fn add(name: []const u8, value: u8) {
    map.put(name, value);
}

add("hello", 42);

mnemnion · June 2, 2025, 8:45pm

Yes.

Maybe. Definitely the hash map will not do it for you.

One reason being that the strings could be in static memory, and that would segfault.

ericlang · June 2, 2025, 8:55pm

So I would need to make a new copy of the parameter, put that copy into the hashmap and free the key when dropping the hashmap. To be ultra sure…?

mnemnion · June 3, 2025, 2:23am

You need to decide what owns the slices.

If it’s the StringHashMap, so be it: iterate the keys() and free them, then deinit the HashMap.

If it’s something else, don’t do that. Have the something else do it instead.

Example: diffz has a line-diff mode, and how it works is fairly clever: it hashes every line in both files, and converts them to a list of unique characters (UTF-8 encoded). It then run the Myer’s diff on just the characters, and iterates the character diffs, retrieving the lines from the HashMap. Result: line-based diffs.

That hash map doesn’t own the slices, so it doesn’t free them. They’re just views into the source buffers. The final diffs have copies, although that isn’t strictly necessary, but it means you can free the file buffers once the diffing is done and nothing bad will happen.

ericlang · June 3, 2025, 9:25am

Very tricky
You just put in some “hello” parameter and before you know …armageddon

gonzo · June 3, 2025, 3:43pm

FWIW, for my Advent of Code solutions in zig, I use an interned string table. This is basically:

An ArrayList that holds all strings (and they are dynamically allocated copies of the slices passed in); the position in this table becomes the string id.
A StringHashMap which maps from each string to its position in the array (its id).

With this in place, I can easily compare two strings by comparing their positions in the ArrayList. When the string table is being deinited, I release the memory used by each string.

This works really well for my use cases. Cheers.

ericlang · June 3, 2025, 4:21pm

Thx. I was also thinking about that. Currently it is working without mistakes. Better structures for future.

mnemnion · June 3, 2025, 4:22pm

These are great, confirmed. I’m translating a program which uses them, and it’s a lifesaver.

You have independently derived the StringArrayHashMap, more or less. This is my whole interning implementation:

const StrSafe = struct {
    safe: StringArrayHashMap(void),
    allocator: Allocator,

    pub fn init(allocator: Allocator) StrSafe {
        return .{ .allocator = allocator, .safe = .empty };
    }

    pub fn find(strsafe: *StrSafe, key: []const u8) ?[]const u8 {
        return strsafe.safe.getKey(key);
    }

    pub fn intern(strsafe: *StrSafe, k: []const u8) ![]const u8 {
        if (strsafe.safe.getKey(k)) |key| {
            return key;
        }
        const dupe = try strsafe.allocator.dupe(u8, k);
        try strsafe.safe.put(strsafe.allocator, dupe, {});
        return dupe;
    }
};

(The safe field actually uses the Unmanaged variant, I’m phasing in the new default to minimize code changes later)

gonzo · June 3, 2025, 6:49pm

Not for the first time… It is not a surprise that the solution space for these things quickly converges to one or two good ideas. Can’t claim any flashes of genius, but it is good to get confirmation that others have walked the same path. Cheers!

Sze · June 3, 2025, 7:20pm

There also was this old (clever) implementation that was used within the WASM namespace for a while (that still could be used for other things):

github.com/ziglang/zig

src/link/Wasm.zig

94cee4fb2


      
          pub const StringTable = struct {
              /// Table that maps string offsets, which is used to de-duplicate strings.
              /// Rather than having the offset map to the data, the `StringContext` holds all bytes of the string.
              /// The strings are stored as a contigious array where each string is zero-terminated.
              string_table: std.HashMapUnmanaged(
                  u32,
                  void,
                  std.hash_map.StringIndexContext,
                  std.hash_map.default_max_load_percentage,
              ) = .{},
              /// Holds the actual data of the string table.
              string_data: std.ArrayListUnmanaged(u8) = .{},
          
              /// Accepts a string and searches for a corresponding string.
              /// When found, de-duplicates the string and returns the existing offset instead.
              /// When the string is not found in the `string_table`, a new entry will be inserted
              /// and the new offset to its data will be returned.
              pub fn put(table: *StringTable, allocator: Allocator, string: []const u8) !u32 {
                  const gop = try table.string_table.getOrPutContextAdapted(
                      allocator,

This file has been truncated. show original

The benefit of that variant is that it doesn’t need to use/store any slices internally, it just stores the start position of specific null terminated keys.

Here is a related zig news post

Ahh, now it is still used here:

github.com/ziglang/zig

src/link/StringTable.zig

80170d017


      
          buffer: std.ArrayListUnmanaged(u8) = .empty,
          table: std.HashMapUnmanaged(u32, void, StringIndexContext, std.hash_map.default_max_load_percentage) = .empty,
          
          pub fn deinit(self: *Self, gpa: Allocator) void {
              self.buffer.deinit(gpa);
              self.table.deinit(gpa);
          }
          
          pub fn insert(self: *Self, gpa: Allocator, string: []const u8) !u32 {
              const gop = try self.table.getOrPutContextAdapted(gpa, @as([]const u8, string), StringIndexAdapter{
                  .bytes = &self.buffer,
              }, StringIndexContext{
                  .bytes = &self.buffer,
              });

ericlang · June 3, 2025, 8:32pm

For short strings (and small maps) we can use at boundedarray as well though take care that they hash correctly.

vulpesx · June 4, 2025, 11:27am

What @gonzo described is not a StringArrayHashmap, for the purposes of string interning, it is no different from using a StringHashMap, the only difference is how efficient it is to iterate through the stored keys/values due to how it stores that data.

@mnemnion is using the first pointer given to the hash map as the ID, this can be done with a StringHashMap as well.
@gonzo is using an index into an ArrayList of copies of the keys as the ID. This offers the ability to use a smaller type for the ID.
They don’t specify if they are using an ArrayList(u8) or ArrayList([]const u8)

Both are interning strings, but they are different.

gonzo · June 4, 2025, 1:50pm

I am using an std.ArrayList([]const u8). Here is my code, in case anyone is curious.

ericlang · June 4, 2025, 2:37pm

O M G

all these years, all these puzzels…