This line-by-line file reader is slow

I'm trying to read a 2 GB file and put each line as a value within a hash set, but this solution is very slow. I don't want any software-based bottlenecks whatsoever.

Each line can contain at most 62 characters, excluding the \n.

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const file = try std.fs.cwd().openFile(
        "list2.txt",
        .{},
    );
    defer file.close();

    var buf_reader = std.io.bufferedReader(file.reader());
    const reader = buf_reader.reader();

    var map = std.AutoHashMap(u32, []const u8).init(
        allocator,
    );

    defer map.deinit();

    var index: u32 = 0;
    while (true) {
        const line = reader.readUntilDelimiterOrEofAlloc(allocator, '\n', 62) catch {
            break;
        };

        if (line) |unwrapped_line| {
            if (unwrapped_line.len == 0) break;

            try map.put(index, unwrapped_line);
            index += 1;
        } else {
            break;
        }
    }

    var iter = map.iterator();
    while (iter.next()) |entry| {
        std.debug.print("{s}\n", .{entry.value_ptr.*});
    }
}

A large part of why this is slow is the readUntilDelimiterOrEofAlloc call:

pub fn readUntilDelimiterOrEofAlloc(
    self: Self,
    allocator: mem.Allocator,
    delimiter: u8,
    max_size: usize,
) anyerror!?[]u8 {
    var array_list = std.ArrayList(u8).init(allocator);
    defer array_list.deinit();
    self.streamUntilDelimiter(array_list.writer(), delimiter, max_size) catch |err| switch (err) {
        error.EndOfStream => if (array_list.items.len == 0) {
            return null;
        },
        else => |e| return e,
    };
    return try array_list.toOwnedSlice();
}

Each time through, it creates an empty array list that must resize multiple times, so it's worse than just a single allocation. If it were me, I would read the entire file into a single buffer and then use something like std.mem.splitScalar on the \n character to create slices of the underlying buffer. I would then store those slices in the hash map and only free the underlying file memory when you're done with the map itself. That way, you're getting views into the underlying buffer and you minimize your allocations.
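
For illustration, a rough sketch of that approach (untested; it assumes a reasonably recent Zig std with Dir.readFileAlloc and std.mem.splitScalar, and reuses the list2.txt name and index-to-line map from the original post):

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // One read, one allocation: the whole file lives in `contents`.
    const contents = try std.fs.cwd().readFileAlloc(allocator, "list2.txt", std.math.maxInt(usize));
    defer allocator.free(contents);

    var map = std.AutoHashMap(u32, []const u8).init(allocator);
    defer map.deinit();

    var index: u32 = 0;
    var it = std.mem.splitScalar(u8, contents, '\n');
    while (it.next()) |line| {
        if (line.len == 0) continue;
        // `line` is a slice into `contents`, so nothing is copied or freed per line.
        try map.put(index, line);
        index += 1;
    }

    std.debug.print("stored {d} lines\n", .{map.count()});
}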


Note that readUntilDelimiter is considered deprecated, though (Zig Documentation), and that streamUntilDelimiter, its replacement, lets you read into any writer. This means you can, for example, read into an ArrayList that you initialized with a good estimate of the min/max line size, or, even better, into a BoundedArray.
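
As a sketch of what that can look like with a reusable BoundedArray (untested; the 64-byte capacity is just derived from the 62-character limit mentioned in the original post):

const std = @import("std");

pub fn main() !void {
    const file = try std.fs.cwd().openFile("list2.txt", .{});
    defer file.close();

    var buf_reader = std.io.bufferedReader(file.reader());
    const reader = buf_reader.reader();

    const max_line = 64; // lines are at most 62 bytes, plus some slack
    var line_buf = try std.BoundedArray(u8, max_line).init(0);

    while (true) {
        line_buf.len = 0; // reuse the same fixed buffer for every line
        reader.streamUntilDelimiter(line_buf.writer(), '\n', max_line) catch |err| switch (err) {
            error.EndOfStream => break, // a final line without a trailing \n ends up here (ignored in this sketch)
            else => return err,
        };
        const line = line_buf.constSlice();
        _ = line; // process the line here (hash it, copy it, etc.)
    }
}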

I will say that if the file isn't too large, @AndrewCodeDev's solution should be far more performant, since you can use the split iterators, which make use of SIMD to search for the next item.

Edit: I just saw that the file is two gigs, which I consider to be quite large. It's up to you and your setup which solution is better for you, I guess.


This reminds me of The One Billion Row Challenge - Gunnar Morling, as the input is quite predictable.
I tried it too, but never got into any multithreading. I did, however, do this:

/// Breaks a delimiter-separated stream into fixed-size blocks so that as many
/// complete values as possible are stored in each block.
fn Chunkifier(
    comptime chunk_n: comptime_int,
    /// should be at least the maximum distance between delimiters; larger blocks are encouraged for performance gains
    comptime chunk_size: comptime_int,
) type {
    return struct {
        const Self = @This();

        pub const Error = error{
            /// data is being split where it shouldn't be and thus being corrupted
            BlockTear,
        };

        reader: std.io.AnyReader,
        delimiter: []const u8,

        bufs: [chunk_n][chunk_size]u8 = undefined,
        bufn: usize = 0,
        current: []const u8 = undefined,
        next: []const u8 = undefined,
        start: usize = 0,

        pub fn init(reader: std.io.AnyReader, delimiter: []const u8) !Self {
            var self = Self{ .reader = reader, .delimiter = delimiter };
            try self.loadNextChunk();
            return self;
        }

        pub fn nextChunk(self: *Self) !?struct { []const u8, []const u8 } {
            if (self.next.len == 0)
                return null;
            try self.loadNextChunk();

            const end = std.mem.indexOf(u8, self.next, self.delimiter) orelse idx: {
                if (self.next.len == chunk_size)
                    return error.BlockTear;
                break :idx self.next.len;
            };
            const ret = .{ self.current[self.start..], self.next[0..end] };
            self.start = end;
            return ret;
        }

        fn loadNextChunk(self: *Self) !void {
            self.current = self.next;
            const len = try self.reader.readAll(&self.bufs[self.bufn]);
            self.next = self.bufs[self.bufn][0..len];
            self.bufn = (self.bufn + 1) % chunk_n;
        }
    };
}

This allowed me to choose (at comptime) how much data I want loaded at once, without allocating, while still being able to make use of the SIMD functionality of the iterators. The only downside is that one word is (almost) always split between the two slices, which has to be rejoined to search properly.
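
A hypothetical way to drive it could look like this (assuming the Chunkifier above is in the same file and a Zig version that has std.io.AnyReader and tuple destructuring; the file name and chunk parameters are just placeholders):

const std = @import("std");

pub fn main() !void {
    const file = try std.fs.cwd().openFile("list2.txt", .{});
    defer file.close();
    const file_reader = file.reader();

    // Two 64 KiB blocks, split on newlines; both numbers are placeholders.
    var chunker = try Chunkifier(2, 64 * 1024).init(file_reader.any(), "\n");
    while (try chunker.nextChunk()) |pair| {
        const bulk, const fragment = pair;
        // `bulk` is most of one block; `fragment` is the piece of the next
        // block that completes the last value, so a split value spans both.
        _ = bulk;
        _ = fragment;
    }
}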


Are you building in a release mode? On my machine, your code in the release modes reads a 3.8 MiB file in ~500 ms.
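(For example, with a recent Zig that would be zig build-exe -O ReleaseFast main.zig, or -Doptimize=ReleaseFast if you're building through zig build.)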

Side note: to avoid leaking the map's string values, you'll need to modify your defer:

    defer {
        var iter = map.iterator();
        while (iter.next()) |entry| allocator.free(entry.value_ptr.*);
        map.deinit();
    }

If you are going to save all the lines, then @AndrewCodeDev's solution of storing the whole file in one big block would likely be best for a number of reasons (fewer kernel calls, contiguous data is good for cache locality, etc.).
I think you should also consider whether a HashMap is really needed. You seem to just be storing a mapping from the line number to the line; could an ArrayList be used instead?
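
A minimal sketch of that, combined with the whole-file read suggested earlier (again assuming std.mem.splitScalar and the list2.txt name from the original post):

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const contents = try std.fs.cwd().readFileAlloc(allocator, "list2.txt", std.math.maxInt(usize));
    defer allocator.free(contents);

    // The position in the list *is* the line number, so no hashing is needed.
    var lines = std.ArrayList([]const u8).init(allocator);
    defer lines.deinit();

    var it = std.mem.splitScalar(u8, contents, '\n');
    while (it.next()) |line| {
        try lines.append(line);
    }

    std.debug.print("{d} lines\n", .{lines.items.len});
}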


2 GB? This is screaming for a good old mmap 🙂
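
A rough, POSIX-only sketch of that idea (the exact location of the wrapper depends on the Zig version; newer ones expose it as std.posix.mmap, older ones as std.os.mmap):

const std = @import("std");

pub fn main() !void {
    const file = try std.fs.cwd().openFile("list2.txt", .{});
    defer file.close();

    const len: usize = @intCast((try file.stat()).size);
    const mapped = try std.posix.mmap(
        null,
        len,
        std.posix.PROT.READ,
        .{ .TYPE = .PRIVATE },
        file.handle,
        0,
    );
    defer std.posix.munmap(mapped);

    // `mapped` is backed directly by the file, so slicing it allocates nothing.
    var it = std.mem.splitScalar(u8, mapped, '\n');
    var count: usize = 0;
    while (it.next()) |line| {
        if (line.len != 0) count += 1;
    }
    std.debug.print("{d} lines\n", .{count});
}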
