Help debugging memory corruption while reading a file with a buffered reader and iterator

Hello,

I’ve been going through the 2017 Advent of Code as a way to learn some Zig. Many of the problems involve reading files line-by-line. I quickly found a pattern for doing so with std.io.bufferedReader, but thought it would be an interesting exercise to factor out some of the boilerplate. You can see my solution on Stack Overflow.

Unfortunately, it has a memory corruption issue that I’ve been having trouble tracking down. The full code for my iterator implementation is here and also in the last Stack Overflow link. It works fine for reading the lines in small files (presumably smaller than the 4k buffer) and also if I just read the lines. But if I allocate memory while iterating through the lines of a large file, then :boom: everything breaks.

Here’s my code; there are also steps for a full repro at the bottom. Any pointers on where the memory corruption is happening would be greatly appreciated! (Also general feedback!)

// util.zig
pub fn iterLines(filename: []const u8, allocator: std.mem.Allocator) !ReadByLineIterator(@TypeOf(getBufferedReader(std.fs.cwd().openFile(filename, .{}) catch unreachable))) {
    var file = try std.fs.cwd().openFile(filename, .{});
    var buf_reader = getBufferedReader(file);
    return readByLine(allocator, &file, buf_reader);
}

pub fn getBufferedReader(file: std.fs.File) @TypeOf(blk: {
    var raw_reader = file.reader();
    var buf_reader = std.io.bufferedReader(raw_reader);
    var in_stream = buf_reader.reader();
    break :blk in_stream;
}) {
    var raw_reader = file.reader();
    var buf_reader = std.io.bufferedReader(raw_reader);
    var in_stream = buf_reader.reader();
    return in_stream;
}

fn ReadByLineIterator(comptime ReaderType: type) type {
    return struct {
        allocator: std.mem.Allocator,
        file_to_close: *std.fs.File,
        reader: ReaderType,
        buffer: []u8,

        pub fn deinit(self: @This()) void {
            self.allocator.free(self.buffer);
            self.file_to_close.close();
        }

        pub fn next(self: *@This()) !?[]const u8 {
            return self.reader.readUntilDelimiterOrEof(self.buffer, '\n');
        }
    };
}

pub fn readByLine(allocator: std.mem.Allocator, fileToClose: *std.fs.File, reader: anytype) !ReadByLineIterator(@TypeOf(reader)) {
    var buffer = try allocator.alloc(u8, 4096);
    std.debug.print("buffer: {*}\n", .{buffer.ptr});
    return .{
        .allocator = allocator,
        .reader = reader,
        .buffer = buffer,
        .file_to_close = fileToClose,
    };
}

Full repro:

$ git clone https://github.com/danvk/aoc2023.git
$ cd aoc2023/aoc2017
$ git checkout memory-corruption-repro
$ zig build
$ ./zig-out/bin/main day12 day12/sample.txt
(works fine on a small input file, smaller than the buffer)
$ ./zig-out/bin/main day12 day12/sample.txt
(crashes after a few lines on a larger file trying to parse what looks like corrupted memory)
    var file = try std.fs.cwd().openFile(filename, .{});

file is a stack-local variable, meaning that it may be overwritten once you leave the function.

    return readByLine(allocator, &file, buf_reader);

You pass the address of file to the readByLine function. If this address is present in the return value, this would explain the corruption.

pub fn readByLine(allocator: std.mem.Allocator, fileToClose: *std.fs.File, reader: anytype) !ReadByLineIterator(@TypeOf(reader)) {
    ...
    return .{
        ...
        .file_to_close = fileToClose,
    };
}

A quick analysis of the function reveals that the file pointer is indeed part of the return value.
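To make the failure mode concrete, here's a minimal stand-alone illustration (not from the thread) of why a pointer to a stack local goes bad:

```zig
const std = @import("std");

// Minimal illustration: `x` lives in dangling()'s stack frame, so the
// returned pointer refers to memory that is reclaimed on return.
fn dangling() *i32 {
    var x: i32 = 42;
    return &x; // dereferencing this after the function returns is undefined behavior
}

pub fn main() void {
    const p = dangling();
    // Any later function call can reuse the dead stack slot, silently
    // corrupting *p -- the same mechanism as the `&file` pointer stored
    // in the iterator above.
    std.debug.print("{}\n", .{p.*});
}
```

The corruption is intermittent for exactly this reason: the stale memory only changes once something else reuses that part of the stack.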

You have two options here:

  1. Since the File seems to contain just a handle, it should be safe to pass it by value instead.
  2. Otherwise you’d need to allocate the File somewhere, instead of storing it on the stack.

Thanks for the suggestions @IntegratedQuantum. I’ve tried both but unfortunately I’m still seeing the exact same results.

  1. Changing File to pass-by-value (diff):
pub fn readByLine(allocator: std.mem.Allocator, fileToClose: std.fs.File, reader: anytype) !ReadByLineIterator(@TypeOf(reader)) {
    var buffer = try allocator.alloc(u8, 4096);
    return .{
        .allocator = allocator,
        .reader = reader,
        .buffer = buffer,
        .file_to_close = fileToClose,
    };
}
  2. Allocating File on the heap (diff). I’m not sure if this is the right way to do it!
pub fn iterLines(filename: []const u8, allocator: std.mem.Allocator) !ReadByLineIterator(@TypeOf(getBufferedReader(std.fs.cwd().openFile(filename, .{}) catch unreachable))) {
    var stack_file = try std.fs.cwd().openFile(filename, .{});
    var heap_file = try allocator.create(std.fs.File);
    heap_file.* = stack_file;
    var buf_reader = getBufferedReader(heap_file.*);
    return readByLine(allocator, heap_file, buf_reader);
}


fn ReadByLineIterator(comptime ReaderType: type) type {
    return struct {
        allocator: std.mem.Allocator,
        file_to_close_and_free: *std.fs.File,
        reader: ReaderType,
        buffer: []u8,

        pub fn deinit(self: @This()) void {
            self.allocator.free(self.buffer);
            self.file_to_close_and_free.close();
            self.allocator.destroy(self.file_to_close_and_free);
        }

        pub fn next(self: *@This()) !?[]const u8 {
            return self.reader.readUntilDelimiterOrEof(self.buffer, '\n');
        }
    };
}

The same problem exists in getBufferedReader:

std.io.BufferedReader.reader() returns a Reader with a context of *Self (meaning a pointer to the std.io.BufferedReader). In your case, this is a pointer to the stack-allocated buf_reader which gets invalidated after getBufferedReader returns.


Not totally related to the problem, but I’d personally ditch this attempt at abstraction entirely, as IMO it only provides an illusion of convenience. Just sticking the logic directly where you need it would make all of this much more straightforward:

// no need for heap allocation if the buffer size is known/unchanging
var buf: [4096]u8 = undefined;

var file = try std.fs.cwd().openFile(filename, .{});
defer file.close();

var raw_reader = file.reader();
var buf_reader = std.io.bufferedReader(raw_reader);
var in_stream = buf_reader.reader();

while (try in_stream.readUntilDelimiterOrEof(&buf, '\n')) |line| {
    // do stuff
}

After working through the rest of 2017’s Advent of Code, I understand and appreciate @squeek502’s answer much better. The *Self context is, indeed, the problem. To fix it, I need to make sure that buf_reader has its final memory address before I create the reader. This means lazily initializing it in the ReadByLineIterator struct.

There’s also no need for the @TypeOf shenanigans. These types all have accessible names. After cleaning all that up and also taking the advice to stack-allocate the buffer, I’m left with something much more straightforward that doesn’t have the memory corruption problem (code below).

Given the simplified code, I’m curious about the assertion that this only “provides an illusion of convenience.” Now that it works, doesn’t it actually provide that convenience for the caller? Is there any downside to using this iterLines function?

const std = @import("std");

const ReaderType = std.fs.File.Reader;
const BufReaderType = std.io.BufferedReader(4096, ReaderType);
const BufReaderReaderType = BufReaderType.Reader;

pub const ReadByLineIterator = struct {
    file: std.fs.File,
    reader: ReaderType,
    buf_reader: BufReaderType,
    stream: ?BufReaderReaderType,
    buf: [4096]u8,

    pub fn next(self: *@This()) !?[]u8 {
        if (self.stream == null) {
            self.stream = self.buf_reader.reader();
        }
        if (self.stream) |stream| {
            return stream.readUntilDelimiterOrEof(&self.buf, '\n');
        }
        unreachable;
    }

    pub fn deinit(self: *@This()) void {
        self.file.close();
    }
};

// Iterate over the lines in the file using a buffered reader.
// Caller is responsible for calling deinit() on returned iterator when done.
pub fn iterLines(filename: []const u8) !ReadByLineIterator {
    var file = try std.fs.cwd().openFile(filename, .{});
    var reader = file.reader();
    var buf_reader = std.io.bufferedReader(reader);

    return ReadByLineIterator{
        .file = file,
        .reader = reader,
        .buf_reader = buf_reader,
        .stream = null,
        .buf = undefined,
    };
}
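For completeness, calling it looks something like this. This is a sketch; the `util.zig` import and the input path are my assumptions about how the project is laid out, not something from the post:

```zig
const std = @import("std");
const util = @import("util.zig"); // assuming the iterator lives in util.zig

pub fn main() !void {
    // Hypothetical input path; substitute the actual puzzle input.
    var it = try util.iterLines("day12/sample.txt");
    defer it.deinit();

    // next() is only ever called once `it` has its final stack address,
    // which is what makes the lazy stream initialization safe.
    while (try it.next()) |line| {
        std.debug.print("{s}\n", .{line});
    }
}
```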

It does provide some convenience, but it also makes it harder to adapt to different use-cases.

  • Let’s say some AoC problem for sure has lines longer than 4096 bytes. If you still wanted to use your abstraction, then you’d need to be able to customize the buf size which would be a bit complicated. Without the abstraction, it’s just changing the size of the array directly.
  • Let’s say some AoC problem uses space-delimiting instead of newline-delimiting. If you still wanted to use your abstraction, you’d then need to be able to customize the delimiter. Without the abstraction, it’s just changing the '\n' directly.
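That said, if you did want to keep the abstraction, both knobs could become comptime parameters. A rough sketch, reusing the lazily-initialized design from the post above (the name `DelimitedIterator` is hypothetical):

```zig
const std = @import("std");

// Hypothetical generalization: buffer size and delimiter as comptime parameters.
pub fn DelimitedIterator(comptime buf_size: usize, comptime delimiter: u8) type {
    return struct {
        file: std.fs.File,
        buf_reader: std.io.BufferedReader(buf_size, std.fs.File.Reader),
        stream: ?std.io.BufferedReader(buf_size, std.fs.File.Reader).Reader = null,
        buf: [buf_size]u8 = undefined,

        pub fn next(self: *@This()) !?[]u8 {
            // Initialize lazily so the Reader's context pointer refers to this
            // struct's final address rather than a temporary stack copy.
            if (self.stream == null) self.stream = self.buf_reader.reader();
            return self.stream.?.readUntilDelimiterOrEof(&self.buf, delimiter);
        }

        pub fn deinit(self: *@This()) void {
            self.file.close();
        }
    };
}
```

Whether that flexibility is worth the extra machinery is exactly the tradeoff being discussed here.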

It’s a tradeoff, and in this case I personally don’t think the abstraction provides much benefit. In my experience, a bit of repeating yourself is something that Zig kind of embraces: not really in terms of the language, but in general Zig lends itself towards ‘just sticking the code where you need it.’


True - I would say that this is a product of a few things. First, we don’t have function overloading; once a function is named swap, its parameters are set and internally you have to create the dispatch. Second, allocation awareness tends to create situations where you must directly think about the size of things - it’s very different than the mentality of “throw container x at it and it should just work”.

To @squeek502’s point, I strongly recommend you ask yourself if the code you are going to use is likely going to change.

I’m a fan of progressively moving towards abstraction as opposed to starting at the general case and then working your way back to the problem you currently have (with caveats of course - I would say this sentiment is generally true for utilities). If at some point, you realize you need more flexibility, consider making some of your constants parameters or working with allocators for dynamic sizing options.

That said, if the code you have written absolutely solves the problem you currently face (and will for the foreseeable future), there’s nothing wrong with using code that directly does one thing.
