I’ve been working on an EDN (think JSON) parser and I’m thinking of storing the AST structure like this:
values: std.ArrayListUnmanaged(Value),
/// When set, the corresponding value is the parent of the next Value.
parents: std.DynamicBitSetUnmanaged,
/// When set, the corresponding value has a next sibling. The sibling can be
/// found in next_sibling_indexes at the count of preceding set bits.
siblings: std.DynamicBitSetUnmanaged,
/// A dense array with len == siblings.count(): one entry per set bit in
/// siblings, holding the next sibling's index into values.
next_sibling_indexes: std.ArrayListUnmanaged(u32),
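To make the "count of preceding set bits" lookup concrete, here is a minimal, language-neutral sketch in Python (the real code uses Zig's bitsets; `FlatTree` and its fields are hypothetical names for illustration). The key idea is that the dense array is indexed by the *rank* of the sibling bit:

```python
# Sketch of the bitset + dense-array sibling lookup described above.
# Hypothetical names; assumes one sibling bit per value.

class FlatTree:
    def __init__(self, values, sibling_bits, next_sibling_indexes):
        self.values = values
        self.sibling_bits = sibling_bits                  # one bit per value
        self.next_sibling_indexes = next_sibling_indexes  # dense: len == popcount

    def next_sibling(self, i):
        """Return the index of value i's next sibling, or None."""
        if not self.sibling_bits[i]:
            return None
        # rank = count of set bits strictly before index i
        rank = sum(self.sibling_bits[:i])
        return self.next_sibling_indexes[rank]

# values "a", "b", "c" where a -> b -> c is a sibling chain:
tree = FlatTree(values=["a", "b", "c"],
                sibling_bits=[1, 1, 0],
                next_sibling_indexes=[1, 2])
assert tree.next_sibling(0) == 1
assert tree.next_sibling(1) == 2
assert tree.next_sibling(2) is None
```

A linear `sum` makes the rank query O(n); word-at-a-time popcounts (as a real bitset would use) make it much cheaper in practice.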
I’m doing this with a two-pass parser: a dry, measuring pass that allocates everything at once, then a second pass that uses appendAssumeCapacity(). I like this because it works at comptime or at runtime with fixed buffers.
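The two-pass shape can be sketched like this (a toy Python illustration of the idea, not the actual parser; `measure` and `parse_into` are hypothetical names). Pass 1 only counts, pass 2 writes into storage that was sized up front and never grows, which is the analogue of Zig's appendAssumeCapacity():

```python
# Toy input: a flat list of already-tokenized atoms.

def measure(tokens):
    # Pass 1: compute how many values the result will hold.
    return len(tokens)

def parse_into(tokens, out):
    # Pass 2: fill preallocated storage in place; no allocation here.
    n = 0
    for tok in tokens:
        assert n < len(out), "capacity was sized by the measuring pass"
        out[n] = tok
        n += 1
    return out

tokens = ["1", "2", "3"]
buf = [None] * measure(tokens)   # single up-front allocation
assert parse_into(tokens, buf) == ["1", "2", "3"]
```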
I was originally storing the tree structure as parents and siblings ArrayList(u32)s, where values, parents, and siblings all have the same length. But that's quite a waste, since most parent and sibling entries are null.
I thought this bitsets + dense array idea would be a good way to save memory. I’m not sure it’ll perform better, but it seems like a nice, compact way to represent a tree structure. Anyone have any thoughts about this approach? I’m not sure what to call it. I originally got the idea of storing the tree structure as parent and sibling arrays from an AI response.
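A back-of-envelope comparison shows why the compact layout saves memory. The numbers below are purely illustrative (not measurements): assume n values, of which s have a next sibling. The old layout pays two u32s per value; the new one pays two bits per value plus one u32 per set sibling bit:

```python
# Illustrative sizes only; n and s are made-up example figures.
n, s = 10_000, 500

dense_u32_bytes = 2 * n * 4        # parents + siblings as ArrayList(u32)
bitset_bytes = 2 * (n // 8)        # two one-bit-per-value bitsets
dense_index_bytes = s * 4          # next_sibling_indexes: one u32 per set bit
compact_bytes = bitset_bytes + dense_index_bytes

assert dense_u32_bytes == 80_000   # 80 KB for the all-u32 layout
assert compact_bytes == 4_500      # 4.5 KB for bitsets + dense array
```

The win grows with how sparse the sibling/parent relationships are; for deeply nested data with many siblings, the dense index array claws some of it back.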
Another thing to consider: how awkward is it to work with this data? I’m currently providing a Parser.Result.iterator(), which seems OK, but it's certainly not as nice as a for loop over std.json.Value.array.items. I have started using this format a little and it seems workable.
I haven’t stored parse results this way before. It seems nice, and I wanted to ask if you all agree.
This reminds me of a pointerless octree, so I would call it a pointerless tree.
Normally I think you would only use such structures if you actually need to compress the data on disk or something, because they are rather awkward to use, especially if you consider inserting new values. But in-memory compression also makes sense in some situations.
But if your goal is primarily to learn things, it surely is a fun exercise to figure out such a data structure. Other than that, I probably wouldn’t use it in practice, given that it is somewhat awkward and memory savings shouldn’t matter for an AST, which you normally throw away shortly afterwards anyway.
I guess one big drawback here is that siblings are stored as a linked list, with no random access like you'd get with a slice.
It is a fun exercise, but I ended up here because it simplified parsing allocation: I didn’t need to allocate slices for list/map entries during the second pass.
The most useful feedback I can give you here is that an AST is an abstract data type. Meaning you can conceptually separate the implementation from the API, and probably should.
That way you can write it as the simplest possible thing which will actually work (which is probably allocating nodes when you need them), and if you decide later that you need an efficiency boost, you can hide the additional complexity which comes with most of those strategies behind the already-working API of the simpler initial implementation.
Practically speaking, this is a matter of putting every AST interaction behind member functions, even if it’s currently as simple as accessing a field. Making the trivial ones inline will make this strategy zero-cost.
So
for (ast.children()) |child| { ... }
Rather than
for (ast.child_slice) |child| { ... }
Even if all children() does is return the slice of child pointers. That kind of thing.
Thanks for the feedback. I agree about doing the simplest thing, and that was originally allocating slices for containers in my parseList(). But I wanted to support parsing into static buffers, and that seemed easier when Parser.Result is flattened and all fields (values, children, siblings) have the same length instead of being nested. This way my second pass doesn’t allocate and can be used at comptime or at runtime with fixed buffers.
The reason I posted here was to see if there are other ways to achieve this. One idea I had is to use an ArrayHashMap for my sibling linked list, which would let me avoid all the null entries of my current ArrayList approach. But I couldn’t figure out how to init one with fixed buffers à la ArrayList.initBuffer() and appendAssumeCapacity(), since it needs to allocate in rehash().
I have no real experience in low-level parsing, but I did write an EDN parser a while back (check out zeal), and it seems like your method should be faster. Could you provide a benchmark (maybe against std.json or the Clojure reader)?
I don’t know how I missed zeal. Thanks for sharing it!
I haven’t benchmarked my parser yet. I did start working on that and hope to have something to share soon. It will be interesting to see how it does vs std.json, at least. I’m also curious to see how this two-pass approach with a different data structure does, too.
It looks like this is currently a little faster than std.json. I’ve saved 64KB.json to /tmp/msft.json and added a JSON-to-EDN converter to my project, and here I’m timing part of that. I know it's not the best benchmark, but I’ve put timers in main right around the JSON parse and the EDN parse to isolate them. Here is one run:
Yes, and it’s using the data-oriented design technique of storing indices into a dense array to reduce pointer chasing. These are basically the same thing, just viewed as a data structure versus an algorithm.
It seemed to me you’re converging on a similar design to what it uses, so it’s worth a close look. A lot of design attention has been put into making it efficient.