Io.Reader tracking line and column

For a text-file parser it would be nice to have a std.Io.Reader that keeps track of the current line and column number, which could then be used for diagnostics.

As a template I used std.Io.Reader.Hashed from the standard library, and I think I have found a solution that works. But I am not 100% confident, and I would appreciate it if someone with good knowledge of the new Io.Reader interface could have a look at the code.

I found it a bit tricky (especially for the column) because the vtable.stream function is sometimes called to fill the reader's own buffer and sometimes to stream directly to a destination writer. I may well have missed a case.

This is the code (copied from here):

const std = @import("std");
const Reader = std.Io.Reader;
const Writer = std.Io.Writer;
const Limit = std.Io.Limit;

/// Track position, line number and column.
const TextPosition = struct {
    in: *Reader,
    reader: Reader,
    /// Total bytes pulled from `in` so far.
    pos: usize = 0,
    /// Offset just past the most recently seen '\n'.
    nl_pos: usize = 0,
    /// Value of nl_pos before the current buffer contents were read.
    nl_prebuf: usize = 0,
    /// Number of '\n' bytes seen so far.
    nl_count: usize = 0,

    pub fn init(in: *Reader, buffer: []u8) TextPosition {
        return .{
            .in = in,
            .reader = .{
                .vtable = &.{ .stream = stream },
                .buffer = buffer,
                .seek = 0,
                .end = 0,
            },
        };
    }

    fn stream(r: *Reader, w: *Writer, limit: Limit) Reader.StreamError!usize {
        const t: *TextPosition = @alignCast(@fieldParentPtr("reader", r));
        const data = limit.slice(try w.writableSliceGreedy(1));
        var vec: [1][]u8 = .{data};
        const n = try t.in.readVec(&vec);

        // When w wraps this reader's own buffer, this stream call is a
        // buffer refill: remember where the last newline was before the
        // newly read data.
        if (data.ptr == r.buffer.ptr) t.nl_prebuf = t.nl_pos;
        for (data[0..n]) |c| {
            t.pos += 1;
            if (c == '\n') {
                t.nl_count += 1;
                t.nl_pos = t.pos;
            }
        }

        w.advance(n);
        return n;
    }

    /// Byte offset (0-based) of the next byte to be consumed.
    fn position(t: *const TextPosition) usize {
        return t.pos - t.reader.bufferedLen();
    }
    /// 0-based
    fn line(t: *const TextPosition) usize {
        const r = &t.reader;
        return t.nl_count - std.mem.count(u8, r.buffer[r.seek..r.end], "\n");
    }
    /// 0-based
    fn column(t: *const TextPosition) usize {
        const r = &t.reader;
        // Nothing buffered: every byte streamed so far has been consumed.
        if (r.bufferedLen() == 0)
            return t.pos - t.nl_pos;
        // The most recently consumed newline is still in the buffer.
        if (std.mem.lastIndexOfScalar(u8, r.buffer[0..r.seek], '\n')) |i| {
            return r.seek - i - 1;
        } else {
            // No newline has been consumed from the current buffer;
            // fall back to the newline position recorded before the fill.
            return t.pos - t.nl_prebuf - r.bufferedLen();
        }
    }
};
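
And here is roughly how I use it (an untested sketch, assumed to live in the same file as TextPosition; the in-memory source and buffer size are just for illustration):

pub fn main() !void {
    const source = "first line\nsecond line\n";
    var fixed = Reader.fixed(source);

    var buf: [16]u8 = undefined;
    var tp = TextPosition.init(&fixed, &buf);

    // Consume through tp.reader; line()/column() describe the
    // position of the next unconsumed byte.
    while (tp.reader.takeByte()) |_| {
        // ... hand the byte to the parser ...
    } else |err| switch (err) {
        error.EndOfStream => {},
        error.ReadFailed => return error.ReadFailed,
    }

    std.debug.print("ended at line {d}, column {d}\n", .{ tp.line(), tp.column() });
}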

I would put that logic in your parser, since you’re already working with the file’s contents there.

Having it in the reader adds a layer of indirection that may prevent devirtualisation.

I haven’t done any testing to back that up. 🙂

Yes, it adds a layer of indirection, but probably no additional syscalls.

My idea was to simplify the parser. There are more than ten places in my parser where different Reader methods are called (take…, peek…, toss, stream…), so the line/column logic would have to go into each of them. Alternatively it could be centralized in two or three parser functions, but that would also add a layer of indirection.
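
For illustration, such a centralized helper might look like this (a hypothetical sketch; the Parser type and its fields are made up, and the peek/toss/stream-style calls would each need similar treatment):

const std = @import("std");

const Parser = struct {
    reader: *std.Io.Reader,
    line: usize = 0,
    col: usize = 0,

    /// Every byte the parser consumes goes through here, so the
    /// line/column bookkeeping lives in one place.
    fn takeByteTracked(p: *Parser) !u8 {
        const c = try p.reader.takeByte();
        if (c == '\n') {
            p.line += 1;
            p.col = 0;
        } else {
            p.col += 1;
        }
        return c;
    }
};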

Maybe you should reconsider your parser’s structure.
Usually a scanner (aka lexer) performs the task of dividing the sequence of input bytes into tokens, and the parser consumes those tokens.
You can then store the byte offset, or the line/col, inside each token: e.g. a token could be a struct consisting of the token type (an enum), a slice (or a copy) of the text, and the byte offset or line/col info; see the sketch below.
If you are writing an interpreter (not a compiler), consider keeping the whole source in memory; then a slice suffices to store the token text, and you don’t need to allocate a copy of it.
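
A minimal sketch of such a token type (the names are illustrative):

const Token = struct {
    tag: Tag,
    /// Slice into the source kept in memory; no allocated copy needed.
    text: []const u8,
    /// Byte offset into the source; line/col info could be stored instead.
    offset: usize,

    const Tag = enum { identifier, number, string, eof };
};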