Sketch of type-safer buffered IO

matklad · June 21, 2024, 10:32am

Zig’s Reader has this function:

/// Reads 1 byte from the stream or returns `error.EndOfStream`.
pub fn readByte(self: Self) anyerror!u8 {
    var result: [1]u8 = undefined;
    const amt_read = try self.read(result[0..]);
    if (amt_read < 1) return error.EndOfStream;
    return result[0];
}

Higher-level operations like “read a line” bottom out in calling this function in a loop (very approximate code):

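fn readLine(reader: AnyReader, buffer: []u8) ![]u8 {
    for (buffer, 0..) |*byte, index| {
        byte.* = try reader.readByte();
        // Return the line without the trailing newline.
        if (byte.* == '\n') return buffer[0..index];
    }
    // The buffer filled up before a newline was seen.
    return error.StreamTooLong;
}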

This code has three performance bugs:

In the worst case, it does one syscall per byte of input
It does one virtual call per byte of input
It doesn’t use SIMD and is not vectorizable — there’s simply no slice of memory we can run SIMD over here
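To make the third point concrete: once the bytes already sit in a buffer, the newline search can run over a whole slice instead of going byte by byte through the reader. A minimal sketch, assuming the data has already been read into memory; `findLine` is a hypothetical helper for illustration, not part of std:

```zig
const std = @import("std");

/// Hypothetical helper: returns the first line in `buffered` (without the
/// trailing '\n'), or null if the slice does not contain a newline yet.
fn findLine(buffered: []const u8) ?[]const u8 {
    // One search over a contiguous slice: no per-byte syscalls or virtual
    // calls, and the search routine is free to process many bytes at a time.
    const end = std.mem.indexOfScalar(u8, buffered, '\n') orelse return null;
    return buffered[0..end];
}
```

The point is only that a slice exists to search over; with `readByte` in a loop there is nothing comparable to hand to `std.mem.indexOfScalar`.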

Now, the first (and only the first) issue can be fixed by wrapping a reader into a buffered reader, but that still leaves a couple of performance rakes lying dangerously around:

fn uses_reader(reader: AnyReader) !void

This signature gives rise to at least three distinct possibilities:

1. The function isn’t using readByte-derived APIs, in which case it is fine to pass something like std.fs.File in directly.
2. The function does call something like readLine internally, so the caller must supply a buffered reader.
3. Out of caution, the function internally wraps a reader into a buffered reader, so the user must not pass a buffered reader, as that would lead to unnecessary double buffering.

If the caller’s and callee’s expectations mismatch, there’s a perf bug! It’s also not hard to imagine a situation where a library gets refactored from 2. to 3. to fix perf issues for one user, creating new perf issues for other users who did buffer already.

I think the right solution here is to move the byte-oriented API to a buffered reader, such that, if you want to call readLine, your function signature tells the caller that they need to supply a buffered reader. That’s basically how Rust’s Read, BufRead and BufReader are set up.
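A rough illustration of that split, as a sketch rather than the code from the gist: the type names `Reader` and `BufferedReader`, their fields, and `usesLines` are all hypothetical. The unbuffered type only offers bulk `read`, while `readLine` lives on the buffered type, so a function that needs lines has to say so in its signature.

```zig
const std = @import("std");

/// Hypothetical unbuffered source: bulk reads only, no byte-at-a-time helpers.
const Reader = struct {
    context: *anyopaque,
    readFn: *const fn (context: *anyopaque, buffer: []u8) anyerror!usize,

    pub fn read(self: Reader, buffer: []u8) anyerror!usize {
        return self.readFn(self.context, buffer);
    }
};

/// Hypothetical buffered wrapper: the only type with line-oriented operations.
const BufferedReader = struct {
    inner: Reader,
    buffer: []u8,
    start: usize = 0,
    end: usize = 0,

    /// Returns the next line without its '\n'. The slice points into the
    /// internal buffer and is only valid until the next call. For brevity,
    /// a final line without a trailing newline is reported as EndOfStream.
    pub fn readLine(self: *BufferedReader) anyerror![]const u8 {
        while (true) {
            const pending = self.buffer[self.start..self.end];
            if (std.mem.indexOfScalar(u8, pending, '\n')) |i| {
                self.start += i + 1;
                return pending[0..i];
            }
            // No newline yet: move pending bytes to the front of the buffer.
            if (self.start > 0) {
                std.mem.copyForwards(u8, self.buffer, pending);
                self.end -= self.start;
                self.start = 0;
            }
            if (self.end == self.buffer.len) return error.StreamTooLong;
            // Refill with one bulk read instead of one read per byte.
            const n = try self.inner.read(self.buffer[self.end..]);
            if (n == 0) return error.EndOfStream;
            self.end += n;
        }
    }
};

/// The signature itself tells the caller that a buffered reader is required.
fn usesLines(reader: *BufferedReader) anyerror!usize {
    var count: usize = 0;
    while (reader.readLine()) |_| {
        count += 1;
    } else |err| {
        if (err != error.EndOfStream) return err;
    }
    return count;
}
```

With this shape, passing a bare `Reader` to `usesLines` is a compile error rather than a silent performance problem, and a function that takes a plain `Reader` cannot reach for `readLine` by accident.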

I don’t think I am quite ready to submit a PR to the Zig repo with this (relatively large-scale) change, but I couldn’t help but sketch the API this morning! Here’s the result:

https://gist.github.com/matklad/cc25c31417e73c69f11d174dc2df71c6

Reader.zig

const Self = @This();
const mem = std.mem;
const eof = error.EndOfStream;
const std = @import("../std.zig");

const Reader = struct {
    context: *const anyopaque,
    readFn: *const fn (context: *const anyopaque, buffer: []u8) anyerror!usize,

    pub const Error = anyerror;

(The file is truncated here; the full version is in the gist linked above.)

This is the dynamically-dispatched part of the API. For the generic part, I think it basically boils down to rewriting the existing fn BufferedReader along the lines of fn GenericReader(