Help moving data from array to hashmap

Hi! zig beginner here. I’m trying to build a .env file reader in Zig.
When I run this program with some debug print statements, it only stores the last key - val pair in the hashmap. I believe the problem is when I clear my temporary arrays, maybe the hashmap only points at the data and thus I loose my entry. Is that so?

Thank you !

const std = @import("std");
const Location = enum { InKey, InVal };
const SyntaxError = error{ MalformedKey, MalformedVal };
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    var arena = std.heap.ArenaAllocator.init(gpa.allocator());
    defer arena.deinit();
    var storage = std.StringHashMap([]const u8).init(arena.allocator());
    const file = try std.fs.cwd().openFile(".env", .{});
    const text = try file.readToEndAlloc(arena.allocator(), 1000);
    var loc = Location.InKey;
    var key_temp = std.ArrayList(u8).init(arena.allocator());
    var val_temp = std.ArrayList(u8).init(arena.allocator());
    for (text) |char| {
        if (char == '=') {
            if (loc == .InVal) {
                return SyntaxError.MalformedKey;
            }
            loc = Location.InVal;
            continue;
        }
        switch (loc) {
            .InKey => try key_temp.append(char),
            .InVal => try val_temp.append(char),
        }
        if (char == '\n') {
            if (loc == .InKey) {
                return SyntaxError.MalformedVal;
            }
            try storage.put(key_temp.items, val_temp.items);
            key_temp.clearRetainingCapacity();
            val_temp.clearRetainingCapacity();
            loc = .InKey;
        }
    }
    var map_iterator = storage.iterator();
    while (map_iterator.next()) |entry| {
        std.debug.print("key = {s}, value = {s}\n", .{ entry.key_ptr.*, entry.value_ptr.* });
    }
}

It seems like you’ve answered your own question. The hashmap does not copy or manage the keys in any way. You could use .toOwnedSlice() on your temporary arraylists and then add the resulting slices as keys. Then you would need to iterate the keys and free each one individually if you don’t want to leak memory.

2 Likes

Thank you! Forgive my ignorance, but how would I free the keys?
by doing .toOwnedSlice() am I moving the data from the arena to some other location?

Honestly, since you’re allocating with an arena, everything will be freed with the defer arena.deinit() in the block.

3 Likes

toOwnedSlice creates an allocation of the exact size of the arraylist’s current length and moves the elements from the arraylists allocated buffer into it. This clears the arraylist and resets it’s length to zero, so you can reuse it in the next iteration.

I want to expand on the mental model for why it’s important to always be thinking about where are the bytes. I think that the OP had this mental model:

  1. Let me append a bunch of chars into an array list.
  2. When I’ve collected enough chars to get my (key, value) pair, I’ll add this pair to the hashmap.
  3. try storage.put(key_temp.items, val_temp.items) is going to place this pair into the hashmap. <-- it’s here that the model falls apart
  4. Let me clear the ArrayList for the key and value because I’ve got a new line, go back to step 1.

See, .put doesn’t do anything with regards to changing the ownership of the bytes. The ArrayList still owns the bytes. The hashmap only contains a slice to the ArrayList.items. This is why you need to use .toOwnedSlice() so that there is a new chunk of memory that the hashmap can be the owner of. This is also why you’re only seeing the last line in the hashmap, because, from the perspective of the hashmap, you’re adding in the same key and value every time – the slices to the key_temp and val_temp items, you’re inserting the same pointer from the ArrayList.items for each key and value. If you encountered a long enough env line, you might get another key because the ArrayList had to resize, which would change .items.

In this use-case, there’s no need to worry about leaks because it’s all going to be cleaned up by arena.deinit(), but otherwise you would need to iterate through the hashmap and free all keys and values:

var map_iterator = storage.iterator();
while (map_iterator.next()) |entry| {
    std.debug.print("key = {s}, value = {s}\n", .{ entry.key_ptr.*, entry.value_ptr.* });
    allocator.free(entry.key_ptr);
    allocator.free(entry.value_ptr);
}
// now you deinit the hashmap itself, which also has it's own 
// separate allocator for managing memory for key/value slices
storage.deinit();

As you can see from the std docs, you’ve got to handle the freeing of keys and values yourself

3 Likes

Well that cleared quite a few misconceptions! Thank you for your response

this is incorrect, the hashmap does not compare keys directly it compares the hash of them and stores that hash in its own memory, so it will see different keys every time.
further the length is stored with the pointer in the hashmap, so if the length of the keys change they wont be the same, same goes for the values

Ah, you’re right. Of course the hashmap is hashing the string.

I’m trying to find out where in the source the hashmap is storing the length of the key. I’m looking at getOrPutAssumeCapacityAdapted. Does the wyhash somehow keep information about the length? (Silly me, of course the key is a slice which contains the len and mem.eql is doing the len comparison.)

When I first thought about it, I figured the code from OP was only printing out a single line when iterating through the hashmap but after running the code myself, I see multiple lines. The hashmap is indeed creating new hashes for each new key that is being inserted because the string will be different due to these different lengths. But the pointer part of the slice to the key will remain the same which is why only the last line will print (with some artifacts for some lines lines because clearRetainingCapacity doesn’t zero the buffer, it only sets .items.len = 0).

1 Like

i could be wrong about details but i think i understand the gist of it.
HashMapUnmanaged is where the meat of the implementation is.
It stores everything in one big buffer that looks something like
header|[cap]metadata|[cap]keys|[cap]values| each section is aligned to their type so there may be padding between sections.
Header contains a [*] to the start of keys and values sections aswell as the capacity.
the hashmap stores a [*] to the metadata section and uses pointer arithmatic to get the header which is used to get the keys and values.
metadata stores upper 7 bits of the hash as those will probably never be the same between keys, aswell as a flag to signify if the key,value at the corresponding index is used.
It used the lower 32 bits modulo the capacity as the index to store the key, value and metadata. if metadata indicates its used it compares the upper 7 bits of the hash to determin if the keys are the same, if they arent moves to the next slot until it finds an unused slot or the 7 bits match

2 Likes