Get hash of entry in HashMap

I have a JSON file I’m trying to read in: a list of ~1000 structs (all the same type) with four []const u8 fields, and I want to store those fields in a SQL table. Many entries have repeated data in some of their fields, so rather than taking the data straight from file → table, I check and/or add the data to a HashMap in between, so each SQL statement only needs to be run once for a particular piece of data. Either way, before I add the data to the SQL table, I hash it to make a unique ID in the table for each piece of data, and I realized that I’m hashing the data twice as I do this: once for the HashMap and once for the SQL table. While maybe not the most important issue to fix, I wanted to see if I could do something about it.

My first thought was to create a wrapper around a HashMap that would hash once, return that hash whenever I ran something like put or getOrPut, and also make a Context that would make the hash function of the wrapped HashMap the identity function (my hashes are u64s anyway). This doesn’t seem terrible, but I haven’t convinced myself it’s worth the trouble. My second thought was to see if there’s any way of getting the hash back after a get or getOrPut, but I couldn’t find any (obvious) way of doing that. My third thought was just to ask whether I’m making a mountain out of a molehill, or whether anyone has any tips :grin:

I should mention that while the file I’m reading has ~1000 entries, I expect to read anywhere from 3-100 of these files, so reducing the amount of work needed for that initial read seems valuable (but more importantly, it’s a fun little problem to try to overcome).

yes, according to my understanding of the source (lib/std/hash_map.zig), all (managed) hash maps have a ctx field whose type has a public declaration called hash, which returns the hash. all unmanaged hash maps have putContext and getOrPutContext functions that you can call instead to supply your own context.
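A minimal sketch of the identity-context idea from the question, assuming a recent Zig std; the names IdentityContext, PrehashedMap, and dedupe are mine, not std APIs:

```zig
const std = @import("std");

// Context whose hash is the identity function: keys are already u64 hashes.
const IdentityContext = struct {
    pub fn hash(_: IdentityContext, key: u64) u64 {
        return key;
    }
    pub fn eql(_: IdentityContext, a: u64, b: u64) bool {
        return a == b;
    }
};

// Map from a precomputed hash to the original bytes.
const PrehashedMap = std.HashMap(
    u64,
    []const u8,
    IdentityContext,
    std.hash_map.default_max_load_percentage,
);

// Hash once, then reuse the same u64 for both the map and the SQL id.
fn dedupe(map: *PrehashedMap, data: []const u8) !u64 {
    const h = std.hash.Wyhash.hash(0, data);
    const gop = try map.getOrPut(h);
    if (!gop.found_existing) gop.value_ptr.* = data;
    return h;
}
```

Since the key is the hash itself, nothing is hashed twice: the u64 returned from dedupe is what goes into the SQL statement.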


by the way, i just grokked something in understanding that file: unmanaged hash maps require you to manage storage of the Context separately if it is not zero-sized, but for a zero-sized Context type they can and will construct one themselves without you having to carry it around, by calling getOrPutContext (for instance) with the ctx argument as undefined. Super neat!


to close a loop I unintentionally left open: if you’re using an auto hash map, you can get its context type by calling the AutoContext(K: type) function in hash_map.zig.

Have you considered organizing the database and SQL queries in such a way that all the deduplication and ID assignment happens in the database?

For example, it sounds a bit like you have 4 fields that often contain the same data, so you want those to be foreign keys that link to that data. You would effectively have 4 tables for the 4 different fields; those tables would have uniqueness constraints so that you only store one row for each unique text. The table for the entries would then have an entry-id primary key, maybe a group-id (to identify the file it came from), and 4 foreign keys to the field-specific data.
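A sketch of that schema in SQLite-flavored SQL; all table and column names here are made up for illustration:

```sql
-- One lookup table per field; UNIQUE guarantees one row per distinct text.
CREATE TABLE field_a (id INTEGER PRIMARY KEY, data TEXT NOT NULL UNIQUE);
CREATE TABLE field_b (id INTEGER PRIMARY KEY, data TEXT NOT NULL UNIQUE);
CREATE TABLE field_c (id INTEGER PRIMARY KEY, data TEXT NOT NULL UNIQUE);
CREATE TABLE field_d (id INTEGER PRIMARY KEY, data TEXT NOT NULL UNIQUE);

CREATE TABLE entries (
    id       INTEGER PRIMARY KEY,
    group_id INTEGER,              -- which file the entry came from
    a_id     INTEGER NOT NULL REFERENCES field_a(id),
    b_id     INTEGER NOT NULL REFERENCES field_b(id),
    c_id     INTEGER NOT NULL REFERENCES field_c(id),
    d_id     INTEGER NOT NULL REFERENCES field_d(id)
);
```

With this layout the database assigns the IDs via the integer primary keys, so the application never needs to hash anything itself.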

You might even be able to combine everything into a single CTE (common table expression) using a WITH statement and INSERT ... ON CONFLICT (field_data) DO NOTHING RETURNING id AS somefield for the fields, followed by an INSERT for the entry.
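In PostgreSQL-flavored SQL that combined statement might look roughly like the sketch below (two fields shown; I haven't verified it against any particular SQLite version). One wrinkle worth noting: ON CONFLICT DO NOTHING returns no row when the text already exists, so a common trick is DO UPDATE with the same value so that RETURNING always yields the id:

```sql
WITH a AS (
    INSERT INTO field_a (data) VALUES ($1)
    ON CONFLICT (data) DO UPDATE SET data = EXCLUDED.data
    RETURNING id
), b AS (
    INSERT INTO field_b (data) VALUES ($2)
    ON CONFLICT (data) DO UPDATE SET data = EXCLUDED.data
    RETURNING id
)
INSERT INTO entries (a_id, b_id)
SELECT a.id, b.id FROM a, b
RETURNING id;
```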

I wonder how it would perform if you made a prepared statement for that insert query, used a single transaction per file, and within that transaction did one invocation of the insert statement per entry. You basically could readFileAlloc the file, parse it, and send one big transaction that returns you a list of entry ids. At least that would avoid many separate, manual hashings of the data.
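A rough Zig sketch of that per-file flow; db, db.exec, and db.insert stand in for whichever SQLite binding is in use and are not real APIs — only the std.fs and std.json calls are from the standard library:

```zig
const std = @import("std");

const Entry = struct {
    a: []const u8,
    b: []const u8,
    c: []const u8,
    d: []const u8,
};

// Hypothetical: `db` wraps a SQLite connection with a prepared insert.
fn loadFile(allocator: std.mem.Allocator, db: anytype, path: []const u8) !void {
    // Read the whole file into memory at once.
    const bytes = try std.fs.cwd().readFileAlloc(allocator, path, 16 * 1024 * 1024);
    defer allocator.free(bytes);

    const parsed = try std.json.parseFromSlice([]Entry, allocator, bytes, .{});
    defer parsed.deinit();

    // One transaction per file, one prepared-statement invocation per entry.
    try db.exec("BEGIN");
    for (parsed.value) |entry| {
        try db.insert(entry);
    }
    try db.exec("COMMIT");
}
```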

With key constraints you could also make sure your data relations are described by the database itself, instead of only partially and mostly managed by the external application.

The part I can’t quite predict without doing some experiments myself is whether there would be any bottlenecks in how the data is passed to SQLite, but considering that SQLite runs in-process it could work quite well, especially if all the data for a file can be read into memory at once, so that you can use a single transaction.


Hmm, it seems SQLite doesn’t support using RETURNING within subqueries or CTEs, so it would require a lot more queries. That might be OK if it’s all in the same transaction, but the single CTE per entry does not seem possible. I guess I was thinking of PostgreSQL’s WITH queries, which allow a lot more.

Re-reading, I realize you didn’t say which flavor of SQL you use, so depending on that there may be different options. For example, with PostgreSQL or SQLite you can also process JSON directly (though I only have experience using JSON with PostgreSQL).
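For example, in SQLite the json_each table-valued function (from the built-in JSON support) can expand a JSON array directly inside a query; the field and table names below are made up to match the earlier schema sketch:

```sql
-- ?1 is the raw JSON text of the whole file: an array of objects.
INSERT INTO field_a (data)
SELECT DISTINCT json_extract(value, '$.a')
FROM json_each(?1)
ON CONFLICT (data) DO NOTHING;
```

Run once per field table, this would deduplicate an entire file's worth of data in a handful of statements instead of one per entry.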