A few months ago I doscovered Zig and I like it. I started to learn, also by following the posts on Ziggit.
I want to iterate over the lines of a text in memory where the lines can end with LF, CR or CRLF, but not with LFCR. So, I wrote an iterator:
pub const LineIterator = struct {
buffer: []const u8,
index: usize = 0,
const Self = @This();
pub fn init(buffer: []const u8) Self {
return .{ .buffer = buffer };
}
pub fn next(self: *Self) ?[]const u8 {
if (self.index == self.buffer.len) return null;
const slice = self.buffer[self.index..];
const n = for (slice, 0..) |c, j| {
if (c == '\n' or c == '\r') break j;
} else {
self.index += slice.len;
return slice;
};
self.index += n + 1;
if (slice[n] == '\r' and slice.len > n+1 and slice[n+1] == '\n')
self.index += 1;
return slice[0..n];
}
};
What is the convention for init / deinit ?
Is it okay if my struct has an init but no deinit? Or should I rather replace init by a function outside of the struct ? (like e.g. std.mem.SplitIterator)
The question you want to ask is: what scenarios need to be supported?
If the LineIterator will always be used with a stack-allocated buffer, or if it’s always ok to create a buffer on the heap and then defer a free of the buffer, there’s no need for a separate deinit function.
But if you want LineIterator to work with a heap-allocated buffer, and you want to return LineIterators from a function, you’ll probably want both an init function which receives a number of bytes and an Allocator, and also a deinit which takes an Allocator and uses it to free the underlying buffer. It’s a bad habit to make calling sites keep track of what fields are pointing to heap memory and then free those fields manually. But again, if the buffer itself is always either an array on the stack, or exists as its own variable in scope which can be handler with defer free(buffer);, then there’s no compelling reason to have a deinit.
If the LineIterator might even end up somewhere where the original Allocator isn’t available, now you might start thinking in terms of having a LineIteratorUnmanaged, with a LineIterator which has a LineIteratorUnmanaged as a field, as well as a pointer to the Allocator which allocated the buffer. This doesn’t look to me like a data structure where that would be very helpful, but you might consider the middle ground with an init and deinit function, so that the LineIterator doesn’t have to be consumed in the same scope in which it’s created.
I’m going to play bad-cop here a little bit and say that in general, iterators shouldn’t be in the business of allocating or managing memory. They should be focused on traversing through a data structure that handles memory itself that can spawn iterators to said memory.
That’s a reasonable meta, sure. I meant it more generally than that, as when one should or shouldn’t have a deinit function. I agree that it should be rare for an iterator to survive longer than the while loop that consumes it. And for this application it doesn’t make sense for the iterator to be doing allocation in any case, since the data is already going to exist.
A follow-up observation: I wouldn’t call a []const u8.buffer, which generally means something which is going to temporarily hold data in a mutable way. Someone reading your code like, say, me, might mistake the purpose of that field on this basis. .text or .lines will make it clearer what’s going on.
I’d say that most structs which have a .buffer are going to want a way to free that buffer’s memory. But you’re right, an iterator shouldn’t (generally) be passed around between functions, or be responsible for the memory it’s traversing.
Thanks for your helpful replies. I understand that the buffer must not be freed or go out of scope as long as the iterator is used, or or even any of the lines it returned.
I think this also applies to the std.mem.SplitIterator which I tried to use first:
const text = "a\nb\n";
var iter1 = std.mem.splitAny(u8, text: "\n\r");
var iter2 = std.mem.splitSequence(u8, text, "\r\n");
But I needed a combination of these two delimiter types.
The field is called .buffer bacause I used the code of SplitIterator as a “template”. But you @mnemnion are right, I will rename it to .text (for u8).
This is understandable, because the std.mem functions are type generic, so they need a generic word for the contiguous memory they operate on.
It would be a useful convention, I think, if .buffer always meant []T (including specific T, such as u8, when that makes sense) and .region for []const T.
But I don’t think it’s all that important. If we zoom out a bit and look at the whole standard library, many chunks of memory are starting life as a mutable, then being passed as constant to various functions, so a case could be made for using a consistent name for both.