I writing a small tool that’s going to be processing hundreds of thousand of structs (items). Each one of these items has a string identifier associated with it. Of course, the strings are variable length and completely irrelevant for most of the processing that I’m going to be doing, so I’m thinking:
It’s a string so let’s just store the slice to it in the item. That gives nice constant size items allowing me to have a big array of them. The actual strings can be on the heap, or in an arena.
I could improve on that by changing the slice to an index. That index would lookup in an array of slices. Doing that shrinks my item a little, and helps serialization by getting rid of the pointer.
Rather than just having the strings on the heap, they could be in a large grow-able contiguous buffer. That would remove the pointer completely and have everything index based / easily serializable. Could also add some capacity heuristics to avoid an allocation call for every string.
I’m starting to think GrowableBuffer is just an ArrayList of u8, right? Maybe with some capacity heuristics around it to avoid lots of reallocation. …or would I just be abusing it?
This feels like it should be a fairly common pattern. StringArray or ObjectArray … whenever you want an index-able list of variable sized objects. Does anything like this already exist with a name I haven’t thought of?
My writing style made my meaning ambiguous. I agree the estimation makes things better.
I was referring to using ArrayList(u8) as an unbounded dynamic size buffer and dropping in objects of other types into it. Is that abuse? Feels a bit hacky to me.
At that point it becomes more like an Allocator (need to do alignment, etc), and instead you could use an existing arena allocator. However, then you need to store the pointer rather than an index in the referring struct.
What you’re proposing with the growable buffer, which as you’ve correctly said is basically just an ArrayList(u8), is generally known as a string table. It’s used in a bunch of things like even the executable you run(assuming Linux).
It’s then also quite easy to deduplicate the strings which is nice if some values are very common. Basically just like a hashmap to an index.
Although getting rid of pointers does help if you want to serialize the data, you still might want to consider a data structure that pins the memory (guarantee it doesn’t get reallocated), as it’s a really nice property when interacting with code that expects strings - you don’t have to worry about a realloc (e.g. from growing an arraylist) invalidating a pointer.
There used to be a Segmented List in the standard library (it’s here now). It’s nice because it’s still (reasonably) easy to serialize while guaranteeing any pointer to data it owns will never be invalidated by adding more strings to it.
There have been a bunch of posts about this data structure in the past, because it is quite neat in the way it combines the backing array with a hash set that indexes into that backing array (automatically adding de-duplication).
The only thing that needs to be mentioned is that it uses zero terminated strings, so you can’t use it to store strings that contain zeroes, for that you would have to store the length explicitly like with your OffsetSlice (but personally I would use u32 for the strings unless they really can be huge strings).
In my code I also added a fromOwnedSlice method that way I can easily store the backing array in a file and read it in again restoring my string table from the backing array.
Yes and ArrayList already allocates by reserving a growing capacity to avoid too many reallocations, only if you have measured that those in-frequent reallocations are still a problem, I would consider something like a segmented list.
If the max size and max count of the strings is known, you could use a packed struct instead to encode the length and offset in a u32:
const TableSlice= packed struct (u32) {
len: u11, // could be smaller
offset: u21, // these are just off the top of my head
};
I was interning file names for an LSP thing so I just counted the length of the longest file path in my home directory, and the number of all code files on my computer, doubled both, and used that to determine what a reasonable size for each field was.
you could also store the length in the string table itself as a sort of header before each string as a variable-length encoded u32. This way for most short strings <= 127 bytes it’s as small as a null terminator, and in the worst case of huge strings 5 bytes, if the string is longer than 2^28-1 bytes. This way the table slice/offset type can be a u32-backed enum without having to juggle max string length vs max capacity tradeoffs.