ZiglangSet: a general purpose and comprehensive Set datastructure for Zig

deckarep · May 7, 2024, 6:56pm

Hello,

First time posting here. I am the original author of the popular Go-based set GitHub - deckarep/golang-set: A simple, battle-tested and generic set type for the Go language. Trusted by Docker, 1Password, Ethereum and Hashicorp. and thought I’d take a crack at an implementation for Zig.

This is intended to be modeled after the components built in the standard library and offer the idioms and expectations of what you would expect in a good performing container type.

I am standing on the shoulder of giants however as this is built on top of the AutoHashMap but with the clear intention of this having all the common set operations that you would expect in other languages.

I welcome any and all constructive feedback! Contributions are also welcome and encouraged!

I hope someone finds it useful.

Cheers!

-Deckarep

pierrelgol · May 7, 2024, 7:47pm

Welcome to the forum, and congratulation, this looks really good

deckarep · May 7, 2024, 8:03pm

Thank you for the welcome! I appreciate the encouraging words…and I’m happy to be here and learn more and excited for Zig and the community.

=D

pierrelgol · May 7, 2024, 11:35pm

Your welcome it’s always nice to see new members If you have any trouble feel free to post and everyone will try to help you out.

deckarep · May 8, 2024, 12:34am

Excellent, will do!

dude_the_builder · May 8, 2024, 12:39am

Very, very nice! I was looking at the code and see that many methods require iterating over the elements. I recalled that the doc comments for std.HashMap state:

/// If iterating over the table entries is a strong usecase 
/// and needs to be fast, prefer the alternative 
/// `std.ArrayHashMap`.

So I wonder if std.ArrayHashMap would work and even provide better performance in this case?

deckarep · May 8, 2024, 3:37am

Hey, I know you @dude_the_builder!

If I have the right person, I’ve literally watched probably around 20 of your Zig videos on Youtube and even some a few times over. What a fantastic series you have.

You bring up a valid point about considering ArrayHashMap as the backend. In fact, I need to investigate the viability of having the user of the ZiglangSet select which backend they want to use depending on their needs.

The one thing that is usually the case with set datastructures is a lot of it ends up favoring membership checks which implies lookups which is why I started out with the HashMap as the backend.

It’s true that a handful of methods do need to iterate, which will usually occur when you are in the Set composition stages of code…but then it’s somewhat expected that after this stage you have stability (where your sets are now mostly static).

From there you would basically do membership checks going forward. But, I will say this may not be true for every use case out there. Perhaps some use cases favor iteration in their workload. In that case the ArrayHashMap might be the way to go so I’ll investigate this.

And seriously thank you for the encouraging words!

dude_the_builder · May 8, 2024, 11:01am

Hey, there’s only one Dude the Builder. lol Thank you for the positive feedback on the videos. I’m currently starting a new series on Zig that will hopefully try to keep up with the latest changes in the language. Let’s see how it goes.

I think being able to switch the backend hash map type is an excellent idea. Making the ZiglangSet adaptable to specific use cases; that’s powerful indeed. And the way those data structures are designed in the standard library, I think their APIs are very similar so it could be a drop-in replacement in many cases.

deckarep · May 8, 2024, 11:45am

This sounds great and I will look into this for sure. I like the idea of having flexibility and exploiting the choices around this.

Thanks for the thoughtful suggestion and I look forward to your new video series!

mholub · May 8, 2024, 1:15pm

One small suggestion is to maybe add *AssumeCapacity methods which other Zig data structures usually have?
like:
https://ziglang.org/documentation/master/std/#std.array_list.ArrayListAligned.addOneAssumeCapacity

deckarep · May 8, 2024, 1:59pm

Excellent, I’m definitely looking for this type of feedback and I see no reason why this shouldn’t have it.

Will do!

teknico · May 13, 2024, 3:04pm

Thank you so much for golang-set, it’s been instrumental for a file management tool I wrote. I’m going to RIIZ so ziglang-set will be very welcome.

Indeed. My tool builds sets of file paths, then iterates over each member of each set.

deckarep · May 13, 2024, 4:48pm

I am happy to hear this! I rarely get personal thanks so this means a lot. I appreciate the kind words and I’m glad it’s been serving you well.

Hopefully the Zig version will be a nice addition to the Zig ecosystem.

nyc · May 26, 2024, 1:51pm

I hate those with a passion. I think they are a huge mistake. That is a single branch that is going to properly predicted everytime it matters (it is isn’t you are already going to allocate so it doesn’t matter in that case).

It shows a poorly designed APi and kitchen sink approach.

It does not help performance over just writing better code, and it doubles the API size.

nyc · May 26, 2024, 1:58pm

They don’t help performance one bit. The single capacity test you are going around is going to predicted properly every time it matters so only cost the instruction slot. The time is isn’t predicted (when you do need to grow), the cost to grow the struct will swamp it 10000x fold.

its optimizing for the wrong thing and papers over API problems. I work on ultra high performance code daily and we never do that stuff (nanoseconds matter). Spend time on better cache usage, bulk operations, almost anything else will give you a better return.

deckarep · May 26, 2024, 2:55pm

I see your point.

I added this for consistency with the APIs in the Zig stdlib because I’m erroring on the side of consistency.

But I see this function as a micro-optimization for when you’ve reserved a large capacity of memory up front and then perhaps in a tight loop are adding items per loop iteration. In that case you get rid of capacity checks entirely.

But a few more points on this.

If people really care they should benchmark and prove out the performance for their workload.
One could argue that any hash-based data structure already is not super cache friendly to begin with but they are one of the best data structures ever invented and like everything else in life it’s all about trade-offs.

Just my 2 cents but I see your point.

Sze · June 4, 2024, 8:25pm

13 posts were split to a new topic: API Design

Cloudef · June 12, 2024, 5:30am

For non hash-based bounded sets std.enums has bunch of different kind of sets and maps as well fyi.