With a bit of help from ChatGPT, I was able to make the following code work:
fn loadWords(allocator: Allocator, words_path: []const u8) !([][]u8) {
    var file = try fs.cwd().openFile(words_path, .{});
    defer file.close();
    var buf_reader = std.io.bufferedReader(file.reader());
    var reader = buf_reader.reader();
    var word_list = std.ArrayList([]u8).init(allocator);
    var buf = std.ArrayList(u8).init(allocator);
    while (true) {
        reader.streamUntilDelimiter(buf.writer(), '\n', null) catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        if (isAlpha(buf.items)) {
            try word_list.append(try allocator.dupe(u8, buf.items));
        }
        buf.clearRetainingCapacity();
    }
    return try word_list.toOwnedSlice();
}
I noticed that buf is an ArrayList. I wonder if perhaps I could change it to a fixed-length array to speed things up a little?
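For reference, here is a rough sketch of what that could look like (my assumptions: a Zig 0.13-era std, and a hypothetical MAX_LINE cap of 256 bytes; any longer line would make streamUntilDelimiter fail with error.NoSpaceLeft instead of growing):

```zig
const std = @import("std");

// Hypothetical maximum line length; this value is my assumption, not from
// the original code. Lines longer than this make streamUntilDelimiter
// return error.NoSpaceLeft through the fixed buffer's writer.
const MAX_LINE = 256;

fn forEachLine(reader: anytype) !void {
    var buf: [MAX_LINE]u8 = undefined;
    while (true) {
        var fbs = std.io.fixedBufferStream(&buf);
        reader.streamUntilDelimiter(fbs.writer(), '\n', null) catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        const line = fbs.getWritten(); // bytes of the current line
        _ = line; // process the line here
    }
}
```

The trade-off is that the fixed buffer removes the per-line reallocation but imposes a hard line-length limit.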
JPL
October 19, 2024, 7:29am
Hello. If you use this procedure to read different text files, it is better to leave it as is. It will improve very little anyway; Zig is very fast…
I understand that there is not much to gain in performance. The reason I asked this question is mostly to learn Zig. I searched for the same topic on Google but did not find any hits, so it feels like a post dealing with such tasks might be useful to anyone searching for it in the future.
First up, you are leaking memory: buf is never deinited. I suggest testing with a GeneralPurposeAllocator, which does leak checking in debug mode, to catch these problems.
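As a sketch of that suggestion (assuming Zig 0.13, where GeneralPurposeAllocator's deinit returns a std.heap.Check):

```zig
const std = @import("std");

pub fn main() !void {
    // In debug builds the GeneralPurposeAllocator tracks every allocation
    // and reports leaks on deinit, which would flag the un-deinited buf.
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer if (gpa.deinit() == .leak) @panic("memory leak detected");
    const allocator = gpa.allocator();

    // Leak on purpose to see the report printed to stderr:
    _ = try allocator.alloc(u8, 16);
}
```

You can swap this allocator in only for debug runs and keep the arena for release builds.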
And regarding actual performance: one problem I see is that you are potentially doing many small allocations. Depending on the file and the allocator this may not be a big deal, but it can add up. Your naming the function loadWords suggests to me that each line contains just one word, so the allocations will be fairly small. Of course, if you use an ArenaAllocator this won't be a big deal, but for other allocators like a GeneralPurposeAllocator many small allocations can be quite expensive.
Because of this I would generally suggest just writing everything into one big buffer and then using a split iterator to extract the actual lines:
pub fn readFile(allocator: Allocator, words_path: []const u8) ![]u8 {
    const file = try fs.cwd().openFile(words_path, .{});
    defer file.close();
    return file.readToEndAlloc(allocator, std.math.maxInt(usize));
}

// Then on use:
const buf = try readFile(...);
defer allocator.free(buf);

var splitIterator = std.mem.splitScalar(u8, buf, '\n');
while (splitIterator.next()) |line| {
    ...
}
With this the code is much simpler, and there is only one allocation you need to free.
Thanks for pointing it out. This is a single-run command-line program, and I read in the language reference that in that case it is OK not to free memory.
I am also using the arena allocator the language reference suggests in my main():
var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
defer arena.deinit();
const allocator = arena.allocator();
I will try the improvement you suggested!
I have updated my code according to the suggestions above:
fn readFile(allocator: Allocator, words_path: []const u8) ![]const u8 {
    const file = try fs.cwd().openFile(words_path, .{});
    defer file.close();
    return file.readToEndAlloc(allocator, std.math.maxInt(usize));
}

const Words = struct {
    word_list: [][]const u8,
    buf: []const u8,
};

fn loadWords(allocator: Allocator, words_path: []const u8) !Words {
    var file = try fs.cwd().openFile(words_path, .{});
    defer file.close();
    const buf = try readFile(allocator, words_path);
    var word_list = std.ArrayList([]const u8).init(allocator);
    var splitIterator = std.mem.splitScalar(u8, buf, '\n');
    while (splitIterator.next()) |word| {
        if (isAlpha(word)) {
            try word_list.append(word);
        }
    }
    const words = Words{
        .word_list = try word_list.toOwnedSlice(),
        .buf = buf,
    };
    return words;
}
You can find the entire code here. I hope it can help others learning Zig.
I think you are opening the file twice, if I'm not mistaken. Also, if you want the absolute most efficient way to read a file, it would probably be mmap, but readToEndAlloc is more than good enough, honestly.
You reimplemented std.fs.Dir.readFileAlloc, except worse.
Thanks. I switched to using readFileAlloc:
fn loadWords(allocator: Allocator, words_path: []const u8) !Words {
    const buf = try fs.cwd().readFileAlloc(allocator, words_path, std.math.maxInt(usize));
    var word_list = std.ArrayList([]const u8).init(allocator);
    var splitIterator = std.mem.splitScalar(u8, buf, '\n');
    while (splitIterator.next()) |word| {
        if (isAlpha(word)) {
            try word_list.append(word);
        }
    }
    const words = Words{
        .word_list = try word_list.toOwnedSlice(),
        .buf = buf,
    };
    return words;
}
I will give it a try, although it seems that Zig does not have a very convenient way to do it yet.
You passed maxInt(usize), but are you sure you want to read a file into memory if it is 16,384 PiB?
Okay, just to see if it was worth your time, I've made a little program testing the difference between using std.fs and using mmap directly, and honestly it's pretty close: mmap is still faster, but it's not as pronounced.
const std = @import("std");
const fs = std.fs;
const Allocator = std.mem.Allocator;

const cwd = std.fs.cwd();
const file_name = "test.txt";
const time = std.time;

pub fn Words(alignment: ?u29) type {
    return struct {
        const Self = @This();

        word_list: [][]const u8,
        buf: []align(alignment orelse 1) u8,

        pub fn init(word_list: [][]const u8, buf: []align(alignment orelse 1) u8) Self {
            return .{
                .word_list = word_list,
                .buf = buf,
            };
        }
    };
}

fn isAlpha(word: []const u8) bool {
    for (word) |char| {
        if ((char | 32) >= 'a' and (char | 32) <= 'z') continue else return false;
    }
    return true;
}

fn mapwords(allocator: Allocator) !Words(std.mem.page_size) {
    const file = try cwd.openFile(file_name, .{ .mode = .read_only });
    errdefer file.close();
    const file_stat = try file.stat();
    const file_size = file_stat.size;
    const file_mapping: ?[*]align(std.mem.page_size) u8 = null;
    const content = try std.posix.mmap(file_mapping, file_size, std.posix.PROT.READ, .{ .TYPE = .PRIVATE }, file.handle, 0);
    errdefer std.posix.munmap(content);
    var word_list = std.ArrayList([]const u8).init(allocator);
    errdefer word_list.deinit();
    var splitIterator = std.mem.splitScalar(u8, content, '\n');
    while (splitIterator.next()) |word| {
        if (isAlpha(word)) {
            try word_list.append(word);
        }
    }
    const words = Words(std.mem.page_size).init(try word_list.toOwnedSlice(), content);
    file.close();
    return words;
}

fn loadWords(allocator: Allocator) !Words(null) {
    const buf = try fs.cwd().readFileAlloc(allocator, file_name, std.math.maxInt(usize));
    var word_list = std.ArrayList([]const u8).init(allocator);
    var splitIterator = std.mem.splitScalar(u8, buf, '\n');
    while (splitIterator.next()) |word| {
        if (isAlpha(word)) {
            try word_list.append(word);
        }
    }
    const words = Words(null).init(try word_list.toOwnedSlice(), buf);
    return words;
}

pub fn warmup(allocator: Allocator) !void {
    for (0..10) |_| {
        const from_load = try loadWords(allocator);
        _ = from_load;
    }
    for (0..10) |_| {
        const from_map = try mapwords(allocator);
        defer std.posix.munmap(from_map.buf);
    }
}

fn mmapFirst(allocator: Allocator) !void {
    for (0..10) |_| {
        const from_load = try loadWords(allocator);
        _ = from_load;
    }
    for (0..10) |_| {
        const from_map = try mapwords(allocator);
        defer std.posix.munmap(from_map.buf);
    }
}

pub fn main() !void {
    const page_allocator = std.heap.page_allocator;
    var arena = std.heap.ArenaAllocator.init(page_allocator);
    errdefer arena.deinit();
    const allocator = arena.allocator();

    std.log.info("first : loadWords | second : mapWords", .{});
    try warmup(allocator);
    _ = arena.reset(.retain_capacity);
    {
        var timer = try time.Timer.start();
        for (0..10) |_| {
            const from_load = try loadWords(allocator);
            std.log.info("loadWords time = {d} us.", .{timer.lap() / time.ns_per_us});
            _ = from_load;
        }
        _ = arena.reset(.retain_capacity);
    }
    {
        var timer = try time.Timer.start();
        for (0..10) |_| {
            const from_map = try mapwords(allocator);
            std.log.info("mapwords time = {d} us.", .{timer.lap() / time.ns_per_us});
            defer std.posix.munmap(from_map.buf);
        }
    }
    std.log.info("first : mapWords | second : loadWords", .{});
    time.sleep(1000 * 1000 * 1000 * 5);
    {
        var timer = try time.Timer.start();
        for (0..10) |_| {
            const from_map = try mapwords(allocator);
            std.log.info("mapwords time = {d} us.", .{timer.lap() / time.ns_per_us});
            defer std.posix.munmap(from_map.buf);
        }
    }
    {
        var timer = try time.Timer.start();
        for (0..10) |_| {
            const from_load = try loadWords(allocator);
            std.log.info("loadWords time = {d} us.", .{timer.lap() / time.ns_per_us});
            _ = from_load;
        }
    }
    arena.deinit();
}
With a 20 MB file I get these results:
❯ ./temp
info: first : loadWords | second : mapWords
info: loadWords time = 25759 us.
info: loadWords time = 19943 us.
info: loadWords time = 20093 us.
info: loadWords time = 20031 us.
info: loadWords time = 20147 us.
info: loadWords time = 19920 us.
info: loadWords time = 20447 us.
info: loadWords time = 19817 us.
info: loadWords time = 19897 us.
info: loadWords time = 19740 us.
info: mapwords time = 8884 us.
info: mapwords time = 9305 us.
info: mapwords time = 9173 us.
info: mapwords time = 9202 us.
info: mapwords time = 9050 us.
info: mapwords time = 9156 us.
info: mapwords time = 9088 us.
info: mapwords time = 9106 us.
info: mapwords time = 9439 us.
info: mapwords time = 9339 us.
info: first : mapWords | second : loadWords
info: mapwords time = 9251 us.
info: mapwords time = 9586 us.
info: mapwords time = 9234 us.
info: mapwords time = 9392 us.
info: mapwords time = 9134 us.
info: mapwords time = 9402 us.
info: mapwords time = 9403 us.
info: mapwords time = 9309 us.
info: mapwords time = 9685 us.
info: mapwords time = 9364 us.
info: loadWords time = 11505 us.
info: loadWords time = 12323 us.
info: loadWords time = 11988 us.
info: loadWords time = 12153 us.
info: loadWords time = 12333 us.
info: loadWords time = 12673 us.
info: loadWords time = 12455 us.
info: loadWords time = 12495 us.
info: loadWords time = 12381 us.
info: loadWords time = 12358 us.
Thank you! I have incorporated your code here. Now the program can take a command-line parameter -m to use the memory map.
For a smaller file (about 1 MB), there does not seem to be any noticeable difference.
$ hyperfine -N --warmup 200 -m 100 "./rand-words -s 100"
Benchmark 1: ./rand-words -s 100
Time (mean ± σ): 4.6 ms ± 0.4 ms [User: 2.9 ms, System: 1.5 ms]
Range (min … max): 3.4 ms … 6.5 ms    683 runs
$ hyperfine -N --warmup 200 -m 100 "./rand-words -s 100 -m"
Benchmark 1: ./rand-words -s 100 -m
Time (mean ± σ): 4.4 ms ± 0.5 ms [User: 3.0 ms, System: 1.2 ms]
Range (min … max): 3.3 ms … 7.4 ms    559 runs
Yes, mmap is great when you need to load a large amount of data quickly. The OS directly maps the file's pages into your process's address space, which is generally more efficient than making multiple read syscalls. However, for very small files the overhead of setting up the memory mapping might outweigh the benefits; it's like using a tank in a fistfight, probably overkill for the task at hand. But it is very useful for larger files.
JPL
October 21, 2024, 4:05pm
Personally, I keep it simple.
Zig 0.13.0 or 0.14-dev:
const std = @import("std");

pub fn main() !void {
    const data = @embedFile("./src_file.txt");
    var flines = std.mem.tokenizeAny(u8, data, "\n");
    while (flines.next()) |line| {
        std.debug.print("{s}\n", .{line});
    }
}
Sze
October 21, 2024, 4:40pm
@embedFile only makes sense if the text file is small enough and isn't a dynamic input to the program.
JPL
October 21, 2024, 4:58pm
const std = @import("std");

pub const myallocator = std.heap.page_allocator;

pub fn main() !void {
    const parmdir: []const u8 = "../Zterm";
    const cDIR = std.fs.cwd().openDir(parmdir, .{}) catch |err| {
        std.debug.print(" dir >{}", .{err});
        return;
    };
    var my_file = cDIR.openFile("src_file.txt", .{}) catch |err| {
        std.debug.print("file >{}", .{err});
        return;
    };
    defer my_file.close();
    const file_size = try my_file.getEndPos();
    const buffer: []u8 = myallocator.alloc(u8, file_size) catch unreachable;
    _ = try my_file.read(buffer[0..buffer.len]);
    var flines = std.mem.tokenizeAny(u8, buffer, "\n");
    while (flines.next()) |line| {
        std.debug.print("{s}\n", .{line});
    }
}
Afterwards, you can make your directory or file name dynamic.
JPL
October 21, 2024, 5:04pm
Go see my project:
https://github.com/AS400JPLPC/zig_TermCurs
The mdlFile.zig file shows how to have a dynamic directory, as well as a dynamic file.
Well, no computer is ever going to read 16,384 PiB, but what should the limit be? If the computer has 8 GB of RAM, you might set the limit to 4 GB, or maybe even 8 GB and rely on virtual memory. If the computer has 32 GB, you'll set a higher limit. By setting the limit to the max, you're basically delegating the decision to the operating system. If it doesn't have enough memory to handle a certain file, you'll get an out-of-memory error. I don't see the problem with this approach. Why should the programmer put an artificial limit here, if the code will naturally fail with OOM when it can't handle it?
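For completeness, here is a sketch of the opposite choice, picking an explicit cap (the 1 GiB value is an arbitrary assumption on my part); with readFileAlloc the call then fails with error.FileTooBig rather than attempting a huge allocation:

```zig
const std = @import("std");

fn readWordsFile(allocator: std.mem.Allocator, path: []const u8) ![]u8 {
    // Arbitrary 1 GiB ceiling (my assumption); files larger than this make
    // readFileAlloc return error.FileTooBig instead of allocating.
    const max_bytes = 1 << 30;
    return std.fs.cwd().readFileAlloc(allocator, path, max_bytes);
}
```

Whether the explicit cap or maxInt(usize) is better comes down to whether you prefer a predictable error over relying on the allocator's OOM behavior.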