I’ve been going through the 2017 Advent of Code as a way to learn some Zig. Many of the problems involve reading files line by line. I quickly found a pattern for doing so with std.io.bufferedReader, but thought it would be an interesting exercise to factor out some of the boilerplate. You can see my solution on Stack Overflow.
Unfortunately, it has a memory corruption issue that I’ve been having trouble tracking down. The full code for my iterator implementation is here and also in the last Stack Overflow link. It works fine for reading the lines of small files (presumably ones smaller than the 4K buffer), and also if all I do is read the lines. But if I allocate memory while iterating through the lines of a large file, everything breaks.
Here’s my code; there are also steps for a full repro at the bottom. Any pointers on where the memory corruption is happening would be greatly appreciated! (Also general feedback!)
$ git clone https://github.com/danvk/aoc2023.git
$ cd aoc2023/aoc2017
$ git checkout memory-corruption-repro
$ zig build
$ ./zig-out/bin/main day12 day12/sample.txt
(works fine on a small input file, smaller than the buffer)
$ ./zig-out/bin/main day12 day12/input.txt
(crashes after a few lines on a larger file trying to parse what looks like corrupted memory)
std.io.BufferedReader.reader() returns a Reader with a context of *Self (meaning a pointer to the std.io.BufferedReader). In your case, this is a pointer to the stack-allocated buf_reader, which gets invalidated after getBufferedReader returns.
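To make that concrete, here’s a minimal sketch of the failure mode (the exact signature of getBufferedReader is assumed, not taken from the original code):

const std = @import("std");

// Sketch of the bug: buf_reader lives in this function's stack frame,
// and the returned Reader stores a *pointer* to it. That pointer is
// invalid the moment the function returns.
fn getBufferedReader(file: std.fs.File) std.io.BufferedReader(4096, std.fs.File.Reader).Reader {
    var buf_reader = std.io.bufferedReader(file.reader());
    return buf_reader.reader(); // context is &buf_reader: dangles after return
}

Reads through the dangling Reader may appear to work until later calls (like an allocation) reuse that stack memory, which matches the symptoms above.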
Not totally related to the problem, but I’d personally ditch this whole attempt at abstraction, as IMO it only provides an illusion of convenience. Just sticking the logic directly where you need it would make all of this much more straightforward:
// no need for heap allocation if the buffer size is known/unchanging
var buf: [4096]u8 = undefined;
var file = try std.fs.cwd().openFile(filename, .{});
defer file.close();
var raw_reader = file.reader();
var buf_reader = std.io.bufferedReader(raw_reader);
var in_stream = buf_reader.reader();
while (try in_stream.readUntilDelimiterOrEof(&buf, '\n')) |line| {
    // do stuff
}
After working through the rest of 2017’s Advent of Code, I understand and appreciate @squeek502’s answer much better. The *Self context is, indeed, the problem. To fix it, I need to make sure that buf_reader has its final memory address before I create the reader. This means lazily initializing it in the ReadByLineIterator struct.
There’s also no need for the @TypeOf shenanigans. These types all have accessible names. After cleaning all that up and also taking the advice to stack-allocate the buffer, I’m left with something much more straightforward that doesn’t have the memory corruption problem (code below).
Given the simplified code, I’m curious about the assertion that this only “provides an illusion of convenience.” Now that it works, doesn’t it actually provide that convenience for the caller? Is there any downside to using this iterLines function?
const std = @import("std");

const ReaderType = std.fs.File.Reader;
const BufReaderType = std.io.BufferedReader(4096, ReaderType);
const BufReaderReaderType = BufReaderType.Reader;

pub const ReadByLineIterator = struct {
    file: std.fs.File,
    reader: ReaderType,
    buf_reader: BufReaderType,
    stream: ?BufReaderReaderType,
    buf: [4096]u8,

    pub fn next(self: *@This()) !?[]u8 {
        // Create the stream lazily, so that buf_reader has its final
        // memory address before we take a pointer to it.
        if (self.stream == null) {
            self.stream = self.buf_reader.reader();
        }
        if (self.stream) |stream| {
            return stream.readUntilDelimiterOrEof(&self.buf, '\n');
        }
        unreachable;
    }

    pub fn deinit(self: *@This()) void {
        self.file.close();
    }
};

// Iterate over the lines in the file using a buffered reader.
// Caller is responsible for calling deinit() on the returned iterator when done.
pub fn iterLines(filename: []const u8) !ReadByLineIterator {
    var file = try std.fs.cwd().openFile(filename, .{});
    var reader = file.reader();
    var buf_reader = std.io.bufferedReader(reader);
    return ReadByLineIterator{
        .file = file,
        .reader = reader,
        .buf_reader = buf_reader,
        .stream = null,
        .buf = undefined,
    };
}
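For reference, calling it looks something like this (a minimal sketch; the filename is just an example, and note that the iterator must not be copied once next() has been called, since stream points into buf_reader):

var it = try iterLines("day12/sample.txt");
defer it.deinit();
while (try it.next()) |line| {
    std.debug.print("{s}\n", .{line});
}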
It does provide some convenience, but it also makes it harder to adapt to different use-cases.
Let’s say some AoC problem for sure has lines longer than 4096 bytes. If you still wanted to use your abstraction, then you’d need to be able to customize the buf size, which would be a bit complicated (see the sketch after these two examples). Without the abstraction, it’s just changing the size of the array directly.
Let’s say some AoC problem uses space-delimiting instead of newline-delimiting. If you still wanted to use your abstraction, you’d then need to be able to customize the delimiter. Without the abstraction, it’s just changing the '\n' directly.
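For instance, here’s a sketch of what parameterizing both might look like; this generic version is hypothetical, not code from the thread:

const std = @import("std");

// Hypothetical: buffer size as a comptime parameter, delimiter as a field.
// The struct becomes a comptime-generated type that every caller must
// instantiate, and iterLines would need the same extra parameters.
pub fn ReadByLineIterator(comptime buf_size: usize) type {
    const BufReaderType = std.io.BufferedReader(buf_size, std.fs.File.Reader);
    return struct {
        file: std.fs.File,
        buf_reader: BufReaderType,
        stream: ?BufReaderType.Reader,
        buf: [buf_size]u8,
        delimiter: u8,

        pub fn next(self: *@This()) !?[]u8 {
            if (self.stream == null) {
                self.stream = self.buf_reader.reader();
            }
            return self.stream.?.readUntilDelimiterOrEof(&self.buf, self.delimiter);
        }

        pub fn deinit(self: *@This()) void {
            self.file.close();
        }
    };
}

Every call site now has to spell out something like ReadByLineIterator(8192) and pass a delimiter, which is exactly the ceremony being weighed against “just changing the '\n' directly.”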
It’s a tradeoff, and in this case I personally don’t think the abstraction provides much benefit. In my experience, a bit of repeating yourself is something that Zig kind of embraces; not really in terms of the language itself, but in general Zig lends itself towards ‘just sticking the code where you need it.’
True - I would say that this is a product of a few things. First, we don’t have function overloading; once a function is named swap, its parameter types are set, and any dispatch over different types has to happen inside the function. Second, allocation awareness tends to create situations where you must directly think about the size of things - it’s very different from the mentality of “throw container x at it and it should just work”.
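As a small illustration (my sketch, not from the thread): without overloading, a single swap covers all types by taking the type as an explicit comptime parameter, which is also the shape of std.mem.swap in the standard library:

const std = @import("std");

// One `swap` for every type: dispatch happens through the comptime type
// parameter rather than through overloaded signatures.
fn swap(comptime T: type, a: *T, b: *T) void {
    const tmp = a.*;
    a.* = b.*;
    b.* = tmp;
}

test "swap two u8s" {
    var x: u8 = 1;
    var y: u8 = 2;
    swap(u8, &x, &y);
    try std.testing.expect(x == 2 and y == 1);
}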
To @squeek502’s point, I strongly recommend you ask yourself whether the code you are going to use is likely to change.
I’m a fan of progressively moving towards abstraction as opposed to starting at the general case and then working your way back to the problem you currently have (with caveats of course - I would say this sentiment is generally true for utilities). If at some point, you realize you need more flexibility, consider making some of your constants parameters or working with allocators for dynamic sizing options.
That said, if the code you have written absolutely solves the problem you currently face (and will for the foreseeable future), there’s nothing wrong with using code that directly does one thing.