Data structure libraries

booniepepper · July 5, 2023, 10:32am

Hi friends!

I’m happy with the toys I get from std.zig but I also want to expand the toolbox.

Is anyone already doing work on Zig data structure libraries? If not… I might give it a go soon.

I’m most interested in the moment in getting the data structures I personally want: Trie, some “persistent” things (a la functional programming), and a configurable cache with TTLs and side-loading… But if I really get momentum I might get on a kick and do all the Red/Black trees and priority-queue kinda stuff too

booniepepper · July 5, 2023, 6:24pm

Also if anyone has crazy ideas for data structures, I’d like to hear them.

ComptimeStringMap from the std library has me thinking there will be ways to make special case data structures when they’re known at compile time

AndrewCodeDev · July 5, 2023, 6:30pm

I was thinking about pitching in an implementation of a free-list allocator to the standard library because I was surprised to see we don’t have one. Caching allocators are also really important for data science because arena-style allocators make too many adjustments to be performant (calculating offsets dynamically for different byte alignments etc).

Also, for working with the GPU, they are a must because alignment is usually going to be very symmetric and done in large chunks - the PCIE bus is really slow if you have to call for multiple allocations. So, caching allocators are important and can wrap other existing components effectively.

Here’s an example of a caching allocator from the Torch library that I was going to loosely base my ideas on - this would probably never get standardized but the free-list probably could: cutorch/lib/THC/THCCachingAllocator.cpp at master · torch/cutorch · GitHub

AndrewCodeDev · July 5, 2023, 6:32pm

I also think we need a good deque implementation. They’re super helpful for doing rotations - the array list is close to being there because they do allow for popping items at a generic index but the closer to the front of the list the pop occurs, the more shifts have to take place.

I’ve seen deques implemented as chained blocks of contiguous memory and that seems to work fairly well but that’s an implementation detail.

booniepepper · July 5, 2023, 7:31pm

There’s std.TailQueue as a doubly-linked list. Is there something in particular you’d want in addition? More optimization, or an abstraction with allocation handled and no Node juggling?

booniepepper · July 5, 2023, 7:33pm

The caching allocator looks interesting. I hadn’t gotten as far as thinking through custom allocators that are well-suited for the data structures themselves… I think there’s a ton of potential there

AndrewCodeDev · July 5, 2023, 9:48pm

Well the deque I was thinking about was a contiguous container much like the one in the C++ STL instead of a linked list, but funny you mention TailQueue because I was just reading that last night

julngomz · July 5, 2023, 11:57pm

Since you’re talking about Data Structures and Functional implementations, I would love to see a Finger-Tree data structure implementation. Finger tree - Wikipedia

booniepepper · July 8, 2023, 6:00am

I had never come across a FingerTree before and it looks amazing

I’m going to start to sketching some stuff up here with @AndrewCodeDev:

Kinda just winging it right now, but I think just from this thread and a brief side convo we’re coming from slightly different backgrounds… which is great! I think that has the potential to really push us to explain these things to each other (and hopefully also in comments etc) as we’re going along

phsch122 · July 8, 2023, 10:51am

I really like the idea of functional persistent data structures in Zig.

The F# standard library has some very elegant implementations of those that could be ported. Clojure too, of course, but the code is far less approchable in my opinion.

The problem is that all implementations (that I know) rely on garbage collection.

The easiest workarounds would either be using arena allocators only and keep all nodes or manual reference counting.

The latter shouldn’t create too much performance overhead, but it’s tricky to make it thread safe. And safely sharing data between threads is one of the main advantages of persistent data structures.
Also, since every operation creates a new instance and there are no destructors, you would have to call deinit() or something similar on each of them.

Does anyone know what the best or most idiomatic solution would be?

dude_the_builder · July 8, 2023, 11:00am

I think the idiomatic way as per many structures in the standard library is to let the user of your structures pass in the allocator. Then you can use that allocator as a foundation for internal implementation details such as using arena allocators when needed.

booniepepper · July 9, 2023, 12:20am

I agree generally that’s idiomatic

Andrew has some intuitions around the ability to provide optimized allocators for the data structures

What I’ve been pondering in the back of my head is how much research and literature there even is on memory-optimizing the usage of some of the structures I want to introduce. A lot that I’ve seen assumes garbage collection if it even goes into memory usage beyond the Big O scaling properties. Like, I wonder if this is a relatively unexplored area