So I have noticed this in a few different situations, and I tried to simplify it as much as I could to reproduce it here. Essentially this code will work with the page allocator, but with the testing allocator or the GPA it will cause a deadlock on Windows and an unreachable-code panic on Linux (after 4 or 5 lines of random input). If I use an arena allocator backed by the GPA it won't crash, but some lines will be missing from the HashMap (replaced by ¬¬¬¬¬¬¬¬). It also crashes in a test block when using the testing allocator for "bytes" and the GPA for the HashMap. Additionally, it crashes when using the GPA for bytes and the page allocator for the HashMap.
The program below reads console input into an ArrayList "bytes" and then stores each line as a slice in a HashMap, with the key indicating the starting position of each line in the total bytes array. The issue depends only on which allocator I'm using for the bytes array; for the HashMap it doesn't seem to matter. So basically I'm just trying to understand why this is crashing.
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer std.debug.assert(gpa.deinit() == .ok);
    const allocator = gpa.allocator();

    const stdin = std.io.getStdIn();
    var reader = stdin.reader();

    //var bytes = std.ArrayList(u8).init(std.heap.page_allocator);
    var bytes = std.ArrayList(u8).init(allocator);
    defer bytes.deinit();

    var cache = std.AutoHashMap(usize, []const u8).init(allocator);
    defer cache.deinit();

    var pos: usize = 0;
    while (reader.streamUntilDelimiter(bytes.writer(), '\n', 1000) != error.EndOfStream) {
        try cache.put(pos, bytes.items[pos..bytes.items.len]);
        pos = bytes.items.len;

        var it = cache.iterator();
        while (it.next()) |kv| {
            std.debug.print("{d}:{s}\n", .{ kv.key_ptr.*, kv.value_ptr.* });
        }
    }
}
It’s because you’re storing slices into the bytes ArrayList, and those pointers get invalidated when the ArrayList resizes. When the ArrayList needs to grow its capacity, it first calls Allocator.resize, which attempts to resize the allocation in place. If it can’t do so, it creates a new allocation and frees the old one - now your stored slices point to freed memory. The reason it doesn’t crash with the page allocator is that the page allocator rounds every allocation up to a whole page, so the list is usually able to resize in place within that page. The arena allocator doesn’t free old allocations until you call deinit on it, so your old memory sticks around after a resize and you don’t get a crash.
In a situation like this there are two options: store indices into your list instead of pointers, or allocate enough memory up front and never resize the list. Option 1 has the advantage that you can choose a type smaller than usize for your offset/length, which can save memory compared to storing a slice.
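Option 1 might look something like this (a minimal sketch; the Span type and its u32 fields are my own choice here, not from your code). The key point is that indices stay valid across resizes, because you re-slice through the current bytes.items every time:

```zig
const std = @import("std");

// Hypothetical Span type: indices into `bytes`, safe across resizes.
const Span = struct { start: u32, len: u32 };

fn spanSlice(bytes: []const u8, s: Span) []const u8 {
    return bytes[s.start .. s.start + s.len];
}

test "spans survive ArrayList resizes" {
    const allocator = std.testing.allocator;

    var bytes = std.ArrayList(u8).init(allocator);
    defer bytes.deinit();

    var cache = std.AutoHashMap(usize, Span).init(allocator);
    defer cache.deinit();

    const lines = [_][]const u8{ "first", "second", "third" };
    var pos: usize = 0;
    for (lines, 0..) |line, i| {
        try bytes.appendSlice(line);
        // Store start + length instead of a pointer into `bytes`.
        try cache.put(i, Span{ .start = @intCast(pos), .len = @intCast(line.len) });
        pos = bytes.items.len;
    }

    // Re-slicing through the current `bytes.items` is always valid,
    // even if the list reallocated while growing.
    try std.testing.expectEqualStrings("second", spanSlice(bytes.items, cache.get(1).?));
}
```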
Okay, that's fair enough. Originally I was using option 1, because I wanted to extract specific sequences of characters from the input and use them later, so I was just using an index and a length to calculate the slice when I needed it. But then I thought, hey, I could just calculate all the slices in one go and call them directly when I needed them. When I got into trying that, I started having problems. I was tempted to just use the page allocator, but from what you're saying that might cause issues if the input grows larger than the allocation unit. So I'm not sure exactly what I will do now, but I'm thinking option 2 will be fine, since I might as well go ahead and set a maximum file/input size and just allocate that up front. If it's just the single allocation when the program loads, I don't think there will be any noticeable impact on performance. Thanks for the explanation.
I believe this wouldn’t work for the use case in the OP, since they are relying on particular []u8 slices being contiguous, and SegmentedList(u8) couldn’t guarantee any two bytes in the list are contiguous.
I was looking into using other list types earlier. One issue is that there is no writer function on the other lists, which means there is nothing for streamUntilDelimiter to stream to. So either way it looks like I am starting with the ArrayList. I am thinking to go ahead and use the fixed buffer allocator for the input to set a maximum size, which should be more stable and might make it easier to convert the input to one of the other list types if I have to.
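For reference, a minimal sketch of that setup (the buffer size here is arbitrary). All allocations come out of the fixed buffer, so exceeding it fails with error.OutOfMemory rather than silently reallocating somewhere else:

```zig
const std = @import("std");

test "ArrayList backed by a FixedBufferAllocator" {
    // All allocations come out of this stack buffer; exceeding it
    // fails with error.OutOfMemory instead of moving the data.
    var buf: [4096]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&buf);

    var bytes = std.ArrayList(u8).init(fba.allocator());
    defer bytes.deinit();

    try bytes.appendSlice("some input line\n");
    try std.testing.expectEqualStrings("some input line\n", bytes.items);
}
```

One caveat: a FixedBufferAllocator can only resize its most recent allocation in place, so this works best when the input list is the only thing allocating from it.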
If the memory of a list is not guaranteed to be contiguous, you can’t take arbitrary slices of it. If the memory of a list is guaranteed to be contiguous, you can’t rely on the memory being stable unless you know it will never need to resize.
It might be helpful to look at how ArrayList(u8) is used to store strings in How to use hash map contexts to save memory when doing a string table - Zig NEWS (this technique is used for-real in Zig here). Basically, it puts NUL-terminated strings into the ArrayList(u8) and then stores the index of the first byte to be able to refer to the string / retrieve it (note: this means that your strings cannot contain embedded NUL characters).
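A rough sketch of that technique (simplified from the article; a real string table also adds a hash map keyed by offset so it can deduplicate):

```zig
const std = @import("std");

test "NUL-terminated strings in an ArrayList(u8), referenced by offset" {
    const allocator = std.testing.allocator;

    // All string bytes live in one buffer; strings are NUL-terminated.
    var buf = std.ArrayList(u8).init(allocator);
    defer buf.deinit();

    // Each string is identified by the offset of its first byte.
    var offsets = std.ArrayList(u32).init(allocator);
    defer offsets.deinit();

    for ([_][]const u8{ "foo", "bar" }) |s| {
        try offsets.append(@intCast(buf.items.len));
        try buf.appendSlice(s);
        try buf.append(0); // terminating NUL
    }

    // Recover a string from its offset by scanning to the NUL.
    const start = offsets.items[1];
    const str = std.mem.sliceTo(buf.items[start..], 0);
    try std.testing.expectEqualStrings("bar", str);
}
```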
EDIT: On looking closer at the OP, StringTable from here might be exactly what you’re looking for.
This is true for the SegmentedList. However, I was also looking into streaming the bytes into something like the SinglyLinkedList, which I was thinking might be useful at some point. From what I can tell, though, ArrayList is still the best option here, since it is faster to iterate over directly, whereas the linked lists seem more suited to situations with a lot of random insertions and deletions.
So I did read through this article previously. It's pretty close, but in this case from what I can tell the idea is to look up a reference to the string using the string literal itself, whereas I am trying to look it up using an index or an enum, and to basically have the slice, like items[start..start + len], already calculated for every string I need to refer to. I suppose it might be possible to have another set which has all those slices calculated, and use that to refer back to the lookup table. But there's also the issue of how to handle duplicate string values, like how to know which index belongs to each identical string.
StringTable from here might be exactly what you’re looking for.
This also seems to be exactly what I am trying to do. However, I think that similar to the previous article, you can only have one index per string: if you try to add another string with the same value, it simply returns the offset of the existing one. Also, I am trying to do everything in as conventional and primitive a way as possible, so I would prefer to call it outside of a wasm context, like using std.StringTable.
I am thinking now it might be worth using a HashMapUnmanaged. The previous references both rely on this type, so it might be worth doing some custom hash map at this stage, since I should be able to reuse that code and optimize it for different situations. It will take some time, though, since I'm not quite far enough along in my understanding of Zig to write one, or even to know what I would do with one.
This is basically what I was doing originally. It works okay, but I found it a little bit cumbersome to use the slice syntax to access the string, when I was hoping I could just refer to the value in the table directly. i.e. I was really hoping to have
and to have a table of all the slices already calculated from the lengths, so I wouldn't have to take the slice positions or lengths into account when iterating through the data. It would just be accessing the starting index and then immediately seeing the []const u8 paired with it. This is when I started running into problems with the allocators. The idea is that storing an extra set of pointers is trivial, but copying any bytes over directly should be a last resort. So I had the idea of storing the slices directly.
This would be impossible: each byte is guaranteed to be discontiguous from any other, since they are stored in fully separate Nodes.
I think you might be overestimating how long it takes to calculate the length of a NUL-terminated string.
All identical strings use the same index. You can get an index from a string with getOffset, and a string from an index with get.
I’m suggesting copying the StringTable implementation into your project. It’s a supported use-case since std.hash_map provides StringIndexContext and StringIndexAdapter for exactly this purpose.
Is this meant to be a negative? I’m confused about what you’re trying to do, then. De-duplicating is the point of StringTable (and would be the point of getting a HashMap involved at all).
If you just want a list of strings, then use ArrayList([]const u8) and Allocator.dupe the string before inserting it into the list (and also make sure to loop over the list and free each string when deiniting the list).
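For instance (a minimal sketch; the scratch buffer stands in for whatever reused input buffer the strings originally come from):

```zig
const std = @import("std");

test "duped strings in an ArrayList([]const u8)" {
    const allocator = std.testing.allocator;

    var list = std.ArrayList([]const u8).init(allocator);
    // Each stored string was allocated with dupe, so free them all on deinit.
    defer {
        for (list.items) |s| allocator.free(s);
        list.deinit();
    }

    var scratch: [32]u8 = undefined;
    for ([_][]const u8{ "one", "two" }) |line| {
        // Simulate a reused input buffer: the bytes at `tmp` will be
        // overwritten on the next iteration, so copy before storing.
        const tmp = scratch[0..line.len];
        @memcpy(tmp, line);
        try list.append(try allocator.dupe(u8, tmp));
    }

    try std.testing.expectEqualStrings("two", list.items[1]);
}
```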
Thanks. That does explain a lot. Originally when I posted I was just trying to see why the example I gave was crashing, which has already been explained. Also, that example is not directly related to what I'm working on specifically, which is a rudimentary interpreter. In addition to that, there are some other issues I need to address which are beyond the scope of this thread. But I think I have enough information now to go and reorganize and approach the problem from a fresh perspective.
So I guess I did find the answer, just for some closure: it was to use ensureTotalCapacity on the bytes ArrayList to set the maximum size of the input. I had to try it with the fixed buffer allocator to see how it works. But I got to thinking, what if I want to set the maximum size based on a config file variable? The fixed buffer allocator's backing array needs a comptime-known length, so it won't work. But with ensureTotalCapacity I should be able to pass a runtime variable, and I can keep using the gpa.
By setting the maximum size this way, I can build an ArrayList of []const u8 slices for whatever I need from the input without copying anything over, and the pointers are guaranteed to be valid.
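For reference, a sketch of that approach (max_input stands in for the config value; the pointer comparison shows the buffer never moves as long as the capacity isn't exceeded):

```zig
const std = @import("std");

test "ensureTotalCapacity pins the buffer" {
    const allocator = std.testing.allocator;

    // In the real program this would come from a config file at runtime.
    const max_input: usize = 4096;

    var bytes = std.ArrayList(u8).init(allocator);
    defer bytes.deinit();
    try bytes.ensureTotalCapacity(max_input);

    const before = bytes.items.ptr;
    try bytes.appendSlice("hello world");
    // As long as we stay under `max_input` bytes, the list never
    // reallocates, so slices into `bytes.items` remain valid.
    try std.testing.expect(bytes.items.ptr == before);
}
```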