With a bit of help from ChatGPT, I was able to make the following code work:
fn loadWords(allocator: Allocator, words_path: []const u8) !([][]u8) {
    var file = try fs.cwd().openFile(words_path, .{});
    defer file.close();
    var buf_reader = std.io.bufferedReader(file.reader());
    var reader = buf_reader.reader();
    var word_list = std.ArrayList([]u8).init(allocator);
    var buf = std.ArrayList(u8).init(allocator);
    while (true) {
        reader.streamUntilDelimiter(buf.writer(), '\n', null) catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        if (isAlpha(buf.items)) {
            try word_list.append(try allocator.dupe(u8, buf.items));
        }
        buf.clearRetainingCapacity();
    }
    return try word_list.toOwnedSlice();
}
I noticed that buf is an ArrayList. I wonder if perhaps I could change it to a fixed-length array to speed things up a little?
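For reference, here is a rough sketch of what that could look like (my assumptions: a Zig 0.13-era std, and a hypothetical MAX_LINE cap of 256 bytes; any longer line would make streamUntilDelimiter fail with error.NoSpaceLeft instead of growing):

```zig
const std = @import("std");

// Hypothetical maximum line length; this value is my assumption, not from
// the original code. Lines longer than this make streamUntilDelimiter
// return error.NoSpaceLeft through the fixed buffer's writer.
const MAX_LINE = 256;

fn forEachLine(reader: anytype) !void {
    var buf: [MAX_LINE]u8 = undefined;
    while (true) {
        var fbs = std.io.fixedBufferStream(&buf);
        reader.streamUntilDelimiter(fbs.writer(), '\n', null) catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        const line = fbs.getWritten(); // bytes of the current line
        _ = line; // process the line here
    }
}
```

The trade-off is that the fixed buffer removes the per-line reallocation but imposes a hard line-length limit.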
JPL
October 19, 2024, 7:29am
Hello. If you use this procedure to read different text files, it is better to leave it as is. It will improve very little anyway; Zig is very fast…
I understand that there is not much to gain in performance. The reason I asked this question is mostly to learn Zig. I searched for the same topic on Google but did not find any hits, so it feels like a post dealing with such tasks might be useful to anyone searching for it in the future.
First up, you are leaking memory: buf is never deinited. I suggest testing with a GeneralPurposeAllocator, which does leak checking in debug mode, to catch these problems.
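As a sketch of that suggestion (assuming Zig 0.13, where GeneralPurposeAllocator's deinit returns a std.heap.Check):

```zig
const std = @import("std");

pub fn main() !void {
    // In debug builds the GeneralPurposeAllocator tracks every allocation
    // and reports leaks on deinit, which would flag the un-deinited buf.
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer if (gpa.deinit() == .leak) @panic("memory leak detected");
    const allocator = gpa.allocator();

    // Leak on purpose to see the report printed to stderr:
    _ = try allocator.alloc(u8, 16);
}
```

You can swap this allocator in only for debug runs and keep the arena for release builds.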
And regarding actual performance: one problem I see is that you are potentially doing many small allocations. Depending on the file and the allocator this may not be a big deal, but it can add up. Your naming the function loadWords suggests to me that each line contains just one word, so the allocations will be fairly small. Of course, if you use an ArenaAllocator this won't be a big deal, but for other allocators like a GeneralPurposeAllocator many small allocations can be quite expensive.
Because of this I would generally suggest just writing everything into one big buffer and then using a split iterator to extract the actual lines:
pub fn readFile(allocator: Allocator, words_path: []const u8) ![]u8 {
    const file = try fs.cwd().openFile(words_path, .{});
    defer file.close();
    return file.readToEndAlloc(allocator, std.math.maxInt(usize));
}

// Then on use:
const buf = try readFile(...);
defer allocator.free(buf);

var splitIterator = std.mem.splitScalar(u8, buf, '\n');
while (splitIterator.next()) |line| {
    ...
}
With this the code is much simpler, and there is only one allocation you need to free.
Thanks for pointing it out. This is a single-run command-line program, and I read in the language reference that in that case it is OK not to free memory.
I am also using the arena allocator the language reference suggests in my main():
var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
defer arena.deinit();
const allocator = arena.allocator();
I will try the improvement you suggested!
I have updated my code according to the suggestions above:
fn readFile(allocator: Allocator, words_path: []const u8) ![]const u8 {
    const file = try fs.cwd().openFile(words_path, .{});
    defer file.close();
    return file.readToEndAlloc(allocator, std.math.maxInt(usize));
}

const Words = struct {
    word_list: [][]const u8,
    buf: []const u8,
};

fn loadWords(allocator: Allocator, words_path: []const u8) !Words {
    var file = try fs.cwd().openFile(words_path, .{});
    defer file.close();
    const buf = try readFile(allocator, words_path);
    var word_list = std.ArrayList([]const u8).init(allocator);
    var splitIterator = std.mem.splitScalar(u8, buf, '\n');
    while (splitIterator.next()) |word| {
        if (isAlpha(word)) {
            try word_list.append(word);
        }
    }
    const words = Words{
        .word_list = try word_list.toOwnedSlice(),
        .buf = buf,
    };
    return words;
}
You can find the entire code here. I hope it can help others learning Zig.
I think you are opening the file twice, if I'm not mistaken. Also, if you want the absolute most efficient way to read a file, it would probably be mmap, but readToEndAlloc is more than good enough, honestly.
You reimplemented std.fs.Dir.readFileAlloc, except worse.
Thanks. I switched to using readFileAlloc:
fn loadWords(allocator: Allocator, words_path: []const u8) !Words {
    const buf = try fs.cwd().readFileAlloc(allocator, words_path, std.math.maxInt(usize));
    var word_list = std.ArrayList([]const u8).init(allocator);
    var splitIterator = std.mem.splitScalar(u8, buf, '\n');
    while (splitIterator.next()) |word| {
        if (isAlpha(word)) {
            try word_list.append(word);
        }
    }
    const words = Words{
        .word_list = try word_list.toOwnedSlice(),
        .buf = buf,
    };
    return words;
}
I will give it a try, although it seems that Zig does not have a very convenient way to do it yet.
You passed maxInt(usize), but are you sure you want to read a file into memory if it is 16,384 PiB?
Okay, just to see if it was worth your time, I've made a little program testing the difference between using std.fs and using mmap directly, and honestly it's pretty close: mmap is still faster, but it's not as pronounced.
const std = @import("std");
const fs = std.fs;
const Allocator = std.mem.Allocator;

const cwd = std.fs.cwd();
const file_name = "test.txt";
const time = std.time;

pub fn Words(alignment: ?u29) type {
    return struct {
        const Self = @This();

        word_list: [][]const u8,
        buf: []align(alignment orelse 1) u8,

        pub fn init(word_list: [][]const u8, buf: []align(alignment orelse 1) u8) Self {
            return .{
                .word_list = word_list,
                .buf = buf,
            };
        }
    };
}

fn isAlpha(word: []const u8) bool {
    for (word) |char| {
        if ((char | 32) >= 'a' and (char | 32) <= 'z') continue else return false;
    }
    return true;
}

fn mapwords(allocator: Allocator) !Words(std.mem.page_size) {
    const file = try cwd.openFile(file_name, .{ .mode = .read_only });
    errdefer file.close();
    const file_stat = try file.stat();
    const file_size = file_stat.size;
    const file_mapping: ?[*]align(std.mem.page_size) u8 = null;
    const content = try std.posix.mmap(file_mapping, file_size, std.posix.PROT.READ, .{ .TYPE = .PRIVATE }, file.handle, 0);
    errdefer std.posix.munmap(content);
    var word_list = std.ArrayList([]const u8).init(allocator);
    errdefer word_list.deinit();
    var splitIterator = std.mem.splitScalar(u8, content, '\n');
    while (splitIterator.next()) |word| {
        if (isAlpha(word)) {
            try word_list.append(word);
        }
    }
    const words = Words(std.mem.page_size).init(try word_list.toOwnedSlice(), content);
    file.close();
    return words;
}

fn loadWords(allocator: Allocator) !Words(null) {
    const buf = try fs.cwd().readFileAlloc(allocator, file_name, std.math.maxInt(usize));
    var word_list = std.ArrayList([]const u8).init(allocator);
    var splitIterator = std.mem.splitScalar(u8, buf, '\n');
    while (splitIterator.next()) |word| {
        if (isAlpha(word)) {
            try word_list.append(word);
        }
    }
    const words = Words(null).init(try word_list.toOwnedSlice(), buf);
    return words;
}

pub fn warmup(allocator: Allocator) !void {
    for (0..10) |_| {
        const from_load = try loadWords(allocator);
        _ = from_load;
    }
    for (0..10) |_| {
        const from_map = try mapwords(allocator);
        defer std.posix.munmap(from_map.buf);
    }
}

fn mmapFirst(allocator: Allocator) !void {
    for (0..10) |_| {
        const from_load = try loadWords(allocator);
        _ = from_load;
    }
    for (0..10) |_| {
        const from_map = try mapwords(allocator);
        defer std.posix.munmap(from_map.buf);
    }
}

pub fn main() !void {
    const page_allocator = std.heap.page_allocator;
    var arena = std.heap.ArenaAllocator.init(page_allocator);
    errdefer arena.deinit();
    const allocator = arena.allocator();

    std.log.info("first : loadWords | second : mapWords", .{});
    try warmup(allocator);
    _ = arena.reset(.retain_capacity);
    {
        var timer = try time.Timer.start();
        for (0..10) |_| {
            const from_load = try loadWords(allocator);
            std.log.info("loadWords time = {d} us.", .{timer.lap() / time.ns_per_us});
            _ = from_load;
        }
        _ = arena.reset(.retain_capacity);
    }
    {
        var timer = try time.Timer.start();
        for (0..10) |_| {
            const from_map = try mapwords(allocator);
            std.log.info("mapwords time = {d} us.", .{timer.lap() / time.ns_per_us});
            defer std.posix.munmap(from_map.buf);
        }
    }
    std.log.info("first : mapWords | second : loadWords", .{});
    time.sleep(1000 * 1000 * 1000 * 5);
    {
        var timer = try time.Timer.start();
        for (0..10) |_| {
            const from_map = try mapwords(allocator);
            std.log.info("mapwords time = {d} us.", .{timer.lap() / time.ns_per_us});
            defer std.posix.munmap(from_map.buf);
        }
    }
    {
        var timer = try time.Timer.start();
        for (0..10) |_| {
            const from_load = try loadWords(allocator);
            std.log.info("loadWords time = {d} us.", .{timer.lap() / time.ns_per_us});
            _ = from_load;
        }
    }
    arena.deinit();
}
With a 20 MB file I get these results:
❯ ./temp
info: first : loadWords | second : mapWords
info: loadWords time = 25759 us.
info: loadWords time = 19943 us.
info: loadWords time = 20093 us.
info: loadWords time = 20031 us.
info: loadWords time = 20147 us.
info: loadWords time = 19920 us.
info: loadWords time = 20447 us.
info: loadWords time = 19817 us.
info: loadWords time = 19897 us.
info: loadWords time = 19740 us.
info: mapwords time = 8884 us.
info: mapwords time = 9305 us.
info: mapwords time = 9173 us.
info: mapwords time = 9202 us.
info: mapwords time = 9050 us.
info: mapwords time = 9156 us.
info: mapwords time = 9088 us.
info: mapwords time = 9106 us.
info: mapwords time = 9439 us.
info: mapwords time = 9339 us.
info: first : mapWords | second : loadWords
info: mapwords time = 9251 us.
info: mapwords time = 9586 us.
info: mapwords time = 9234 us.
info: mapwords time = 9392 us.
info: mapwords time = 9134 us.
info: mapwords time = 9402 us.
info: mapwords time = 9403 us.
info: mapwords time = 9309 us.
info: mapwords time = 9685 us.
info: mapwords time = 9364 us.
info: loadWords time = 11505 us.
info: loadWords time = 12323 us.
info: loadWords time = 11988 us.
info: loadWords time = 12153 us.
info: loadWords time = 12333 us.
info: loadWords time = 12673 us.
info: loadWords time = 12455 us.
info: loadWords time = 12495 us.
info: loadWords time = 12381 us.
info: loadWords time = 12358 us.
Thank you! I have incorporated your code here. Now the program can take a command-line parameter -m to use the memory map.
For a smaller file (about 1 MB), there does not seem to be any noticeable difference.
$ hyperfine -N --warmup 200 -m 100 "./rand-words -s 100"
Benchmark 1: ./rand-words -s 100
Time (mean ± σ): 4.6 ms ± 0.4 ms [User: 2.9 ms, System: 1.5 ms]
Range (min … max): 3.4 ms … 6.5 ms    683 runs
$ hyperfine -N --warmup 200 -m 100 "./rand-words -s 100 -m"
Benchmark 1: ./rand-words -s 100 -m
Time (mean ± σ): 4.4 ms ± 0.5 ms [User: 3.0 ms, System: 1.2 ms]
Range (min … max): 3.3 ms … 7.4 ms    559 runs
Yes, mmap is great when you need to load a large amount of data quickly. The OS directly maps the file's pages into your process's address space, which is generally more efficient than making multiple read syscalls. However, for very small files the overhead of setting up the memory mapping might outweigh the benefits; it's like using a tank in a fistfight, probably overkill for the task at hand. But it is very useful for larger files.
JPL
October 21, 2024, 4:05pm
Personally, I keep it simple.
Zig 0.13.0 or 0.14-dev:
const std = @import("std");

pub fn main() !void {
    const data = @embedFile("./src_file.txt");
    var flines = std.mem.tokenizeAny(u8, data, "\n");
    while (flines.next()) |line| {
        std.debug.print("{s}\n", .{line});
    }
}
Sze
October 21, 2024, 4:40pm
@embedFile only makes sense if the text file is small enough and isn't a dynamic input to the program.
JPL
October 21, 2024, 4:58pm
const std = @import("std");

pub const myallocator = std.heap.page_allocator;

pub fn main() !void {
    const parmdir: []const u8 = "../Zterm";
    const cDIR = std.fs.cwd().openDir(parmdir, .{}) catch |err| {
        std.debug.print(" dir >{}", .{err});
        return;
    };
    var my_file = cDIR.openFile("src_file.txt", .{}) catch |err| {
        std.debug.print("file >{}", .{err});
        return;
    };
    defer my_file.close();
    const file_size = try my_file.getEndPos();
    const buffer: []u8 = myallocator.alloc(u8, file_size) catch unreachable;
    _ = try my_file.read(buffer[0..buffer.len]);
    var flines = std.mem.tokenizeAny(u8, buffer, "\n");
    while (flines.next()) |line| {
        std.debug.print("{s}\n", .{line});
    }
}
Afterwards, you can make your directory or file name dynamic.
JPL
October 21, 2024, 5:04pm
Go see my project:
https://github.com/AS400JPLPC/zig_TermCurs
The mdlFile.zig file shows how to have a dynamic directory, as well as a dynamic file.
Well, no computer is ever going to read 16,384 PiB, but what should the limit be? If the computer has 8 GB of RAM, you might set the limit to 4 GB, or maybe even 8 GB and rely on virtual memory. If the computer has 32 GB, you'll set a higher limit. By setting the limit to the max, you're basically delegating the decision to the operating system. If it doesn't have enough memory to handle a certain file, you'll get an out-of-memory error. I don't see the problem with this approach. Why should the programmer put an artificial limit here, if the code will naturally fail with OOM when it can't handle it?
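For completeness, here is a sketch of the opposite choice, picking an explicit cap (the 1 GiB value is an arbitrary assumption on my part); with readFileAlloc the call then fails with error.FileTooBig rather than attempting a huge allocation:

```zig
const std = @import("std");

fn readWordsFile(allocator: std.mem.Allocator, path: []const u8) ![]u8 {
    // Arbitrary 1 GiB ceiling (my assumption); files larger than this make
    // readFileAlloc return error.FileTooBig instead of allocating.
    const max_bytes = 1 << 30;
    return std.fs.cwd().readFileAlloc(allocator, path, max_bytes);
}
```

Whether the explicit cap or maxInt(usize) is better comes down to whether you prefer a predictable error over relying on the allocator's OOM behavior.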