I’ve been working on an EDN (think JSON) parser and I’m thinking of storing the AST structure like this:
values: std.ArrayListUnmanaged(Value),
/// When set, the corresponding value is the parent of the next Value.
parents: std.DynamicBitSetUnmanaged,
/// When set, the corresponding value has a next sibling. The sibling can be
/// found in next_sibling_indexes at the count of preceding set bits.
siblings: std.DynamicBitSetUnmanaged,
/// A dense array with len == siblings.count(): one entry per set bit in
/// siblings, holding the next sibling's index into values.
next_sibling_indexes: std.ArrayListUnmanaged(u32),
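To make the "count of preceding set bits" lookup concrete, here is a minimal, language-neutral sketch in Python (the real code uses Zig's bitsets; `FlatTree` and its fields are hypothetical names for illustration). The key idea is that the dense array is indexed by the *rank* of the sibling bit:

```python
# Sketch of the bitset + dense-array sibling lookup described above.
# Hypothetical names; assumes one sibling bit per value.

class FlatTree:
    def __init__(self, values, sibling_bits, next_sibling_indexes):
        self.values = values
        self.sibling_bits = sibling_bits                  # one bit per value
        self.next_sibling_indexes = next_sibling_indexes  # dense: len == popcount

    def next_sibling(self, i):
        """Return the index of value i's next sibling, or None."""
        if not self.sibling_bits[i]:
            return None
        # rank = count of set bits strictly before index i
        rank = sum(self.sibling_bits[:i])
        return self.next_sibling_indexes[rank]

# values "a", "b", "c" where a -> b -> c is a sibling chain:
tree = FlatTree(values=["a", "b", "c"],
                sibling_bits=[1, 1, 0],
                next_sibling_indexes=[1, 2])
assert tree.next_sibling(0) == 1
assert tree.next_sibling(1) == 2
assert tree.next_sibling(2) is None
```

A linear `sum` makes the rank query O(n); word-at-a-time popcounts (as a real bitset would use) make it much cheaper in practice.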
I’m doing this with a two-pass parser: a dry, measuring pass that allocates everything at once, then a second pass that uses appendAssumeCapacity(). I like this because it works at comptime or at runtime with fixed buffers.
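The two-pass shape can be sketched like this (a toy Python illustration of the idea, not the actual parser; `measure` and `parse_into` are hypothetical names). Pass 1 only counts, pass 2 writes into storage that was sized up front and never grows, which is the analogue of Zig's appendAssumeCapacity():

```python
# Toy input: a flat list of already-tokenized atoms.

def measure(tokens):
    # Pass 1: compute how many values the result will hold.
    return len(tokens)

def parse_into(tokens, out):
    # Pass 2: fill preallocated storage in place; no allocation here.
    n = 0
    for tok in tokens:
        assert n < len(out), "capacity was sized by the measuring pass"
        out[n] = tok
        n += 1
    return out

tokens = ["1", "2", "3"]
buf = [None] * measure(tokens)   # single up-front allocation
assert parse_into(tokens, buf) == ["1", "2", "3"]
```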
I was originally storing the tree structure as parents and siblings ArrayList(u32)s, where values, parents, and siblings all have the same length. But that's quite a waste, since most parent and sibling entries are null.
I thought this bitsets + dense array idea would be a good way to save memory. I’m not sure it’ll perform better, but it seems like a nice, compact way to represent a tree structure. Anyone have any thoughts about this approach? I’m not sure what to call it. I originally got the idea of storing the tree structure as parent and sibling arrays from an AI response.
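A back-of-envelope comparison shows why the compact layout saves memory. The numbers below are purely illustrative (not measurements): assume n values, of which s have a next sibling. The old layout pays two u32s per value; the new one pays two bits per value plus one u32 per set sibling bit:

```python
# Illustrative sizes only; n and s are made-up example figures.
n, s = 10_000, 500

dense_u32_bytes = 2 * n * 4        # parents + siblings as ArrayList(u32)
bitset_bytes = 2 * (n // 8)        # two one-bit-per-value bitsets
dense_index_bytes = s * 4          # next_sibling_indexes: one u32 per set bit
compact_bytes = bitset_bytes + dense_index_bytes

assert dense_u32_bytes == 80_000   # 80 KB for the all-u32 layout
assert compact_bytes == 4_500      # 4.5 KB for bitsets + dense array
```

The win grows with how sparse the sibling/parent relationships are; for deeply nested data with many siblings, the dense index array claws some of it back.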
Another thing to consider: how awkward is it to work with this data? I’m currently providing a Parser.Result.iterator(), which seems OK, but it's certainly not as nice as a for loop over std.json.Value.array.items. I have started using this format a little and it seems workable.
I haven’t stored parse results this way before. It seems nice, and I wanted to ask if you all agree.
This reminds me of a pointerless octree, so I would call it a pointerless tree.
Normally I think you would only use such structures if you actually need to compress the data on disk or something, because they are rather awkward to use, especially if you consider inserting new values. But in-memory compression also makes sense in some situations.
But if your goal is primarily to learn things, it surely is a fun exercise to figure out such a data structure. Other than that, I probably wouldn’t use it in practice, given that it is somewhat awkward and memory savings shouldn’t matter for an AST, which you normally throw away shortly afterwards anyway.
I guess one big drawback here is that siblings are stored as a linked list, with no random access like you'd get with a slice.
It is a fun exercise, but I ended up here because it simplified parsing allocation: I didn’t need to allocate slices for list/map entries during the second pass.
The most useful feedback I can give you here is that an AST is an abstract data type. Meaning you can conceptually separate the implementation from the API, and probably should.
That way you can write it as the simplest possible thing which will actually work (which is probably allocating nodes when you need them), and if you decide later that you need an efficiency boost, you can hide the additional complexity which comes with most of those strategies behind the already-working API of the simpler initial implementation.
Practically speaking, this is a matter of putting every AST interaction behind member functions, even if it’s currently as simple as accessing a field. Making the trivial ones inline will make this strategy zero-cost.
So
for (ast.children()) |child| { ... }
Rather than
for (ast.child_slice) |child| { ... }
Even if all children() does is return the slice of child pointers. That kind of thing.
Thanks for the feedback. I agree about doing the simplest thing, and that was originally allocating slices for containers in my parseList(). But I wanted to support parsing into static buffers, and that seemed easier when Parser.Result is flattened and all fields (values, children, siblings) have the same length instead of being nested. This way my second pass doesn’t allocate and can be used at comptime or at runtime with fixed buffers.
The reason I posted here was to see if there are other ways to achieve this. One idea I had is to use an ArrayHashMap for my sibling linked list, which would let me avoid all the null entries of my current ArrayList approach. But I couldn’t figure out how to init one with fixed buffers à la ArrayList.initBuffer() and appendAssumeCapacity(), since it needs to allocate in rehash().
I have no real experience in low-level parsing, but I did write an EDN parser a while back (check out zeal), and it seems like your method should be faster. Could you provide a benchmark (maybe against std.json or the Clojure reader)?
I don’t know how I missed zeal. Thanks for sharing it!
I haven’t benchmarked my parser yet. I did start working on that and hope to have something to share soon. It will be interesting to see how it does vs std.json, at least. I’m also curious to see how this two-pass approach with a different data structure does, too.
It looks like this is currently a little faster than std.json. I’ve saved 64KB.json to /tmp/msft.json and added a JSON-to-EDN converter to my project, and here I’m timing part of that. I know it's not the best benchmark, but I’ve put timers in main right around the JSON parse and the EDN parse to isolate them. Here is one run:
Yes, and it’s using the data-oriented design technique of storing indices into a dense array to reduce pointer chasing. These are basically the same thing, just viewed as a data structure versus an algorithm.
It seemed to me you’re converging on a similar design to what it uses, so it’s worth a close look. A lot of design attention has been put into making it efficient.