Best allocator for massive files?

I am working on a FASTA file reader and trying to find the best memory allocation strategy for it. Currently I am using the GeneralPurposeAllocator, following the allocators guide, since I don’t know if I really need a specialized one, but I am allocating anywhere from 30,000 bytes up to 800 MB for massive files.
Example code: (has some hand written string functions)

const std = @import("std");

const Fasta = struct {
    id: ?[]u8,
    strand: ?[]u8,
    length: u64,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();
    var fasta: []Fasta = try allocator.alloc(Fasta, 1);
    // omitting code for brevity: id_line, first_id, new_strand, current_strand,
    // i, and file_line all come from the parsing loop that is not shown
    if (id_line) {
        if (first_id) {
            fasta[i].id = try allocator.alloc(u8, file_line.len);
            string_copy(fasta[i].id, &file_line, file_line.len);
        } else {
            fasta = try allocator.realloc(fasta, i + 1);
            fasta[i].id = try allocator.alloc(u8, file_line.len);
            string_copy(fasta[i].id, &file_line, file_line.len);
        }
    }
    if (new_strand) {
        fasta[i].length = file_line.len;
        fasta[i].strand = try allocator.alloc(u8, file_line.len);
        string_copy(fasta[i].strand, &file_line, file_line.len);
    }
    if (current_strand) {
        fasta[i].length += file_line.len;
        fasta[i].strand = try allocator.realloc(fasta[i].strand.?, fasta[i].length);
        concat(fasta[i].strand.?[fasta[i].length - file_line.len ..], &file_line);
    }
}

So I am frequently calling alloc and realloc. In my worst-case scenario, I allocate memory about 1,000,000 times. Is there a better allocator for this kind of heap usage in terms of speed? I am willing to accept that there isn’t (I’m avoiding c_allocator), but I don’t know enough about page_allocator to really know whether it could help.

1 Like

Before that, here’s a bit of context about the page_allocator: Over-Allocation with the Page Allocator

It looks like you are copying into a file or some kind of buffer - you may want to look into dupe instead of alloc on the allocator, since it allocates and copies in a single call.
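
A minimal sketch of the difference, with a hard-coded stand-in for the line buffer:

    const std = @import("std");

    pub fn main() !void {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        defer _ = gpa.deinit();
        const allocator = gpa.allocator();

        const file_line = ">seq1 example header"; // stand-in for a line read from the file

        // dupe allocates a new slice and copies the source into it in one call,
        // replacing the separate alloc + string_copy pair.
        const id = try allocator.dupe(u8, file_line);
        defer allocator.free(id);

        std.debug.print("{s}\n", .{id});
    }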

Honestly, hearing that you’re allocating memory a million times makes me think this is actually an X/Y problem. Maybe you can explain your circumstances a bit more so we can get into the details.

There are a lot of options, but it depends on the lifetime of each allocation and how you’re using them.

2 Likes

Using std.heap.ArenaAllocator may be useful for your case.
The allocator provided by std.heap.ArenaAllocator does not return memory to the OS when you free; it keeps the memory pooled so it can be reused.

    fn free(ctx: *anyopaque, buf: []u8, log2_buf_align: u8, ret_addr: usize) void {
        _ = log2_buf_align;
        _ = ret_addr;

        const self: *ArenaAllocator = @ptrCast(@alignCast(ctx));

        const cur_node = self.state.buffer_list.first orelse return;
        const cur_buf = @as([*]u8, @ptrCast(cur_node))[@sizeOf(BufNode)..cur_node.data];

        if (@intFromPtr(cur_buf.ptr) + self.state.end_index == @intFromPtr(buf.ptr) + buf.len) {
            self.state.end_index -= buf.len;
        }
    }

https://ziglang.org/documentation/master/std/#src/std/heap/arena_allocator.zig

The memory is only returned when ArenaAllocator’s deinit is called.
Being deliberate about when you call deinit means the arena keeps reusing its memory until then.

example:

var gpa = std.heap.GeneralPurposeAllocator(.{}){};
var arena = std.heap.ArenaAllocator.init(gpa.allocator());
defer arena.deinit();
const allocator = arena.allocator();

var fasta = try allocator.alloc(Fasta, 1);
defer allocator.free(fasta);

// (snip)

So the goal is to have a FASTA file read into memory that can then have operations done on it (e.g. generate a protein sequence, find restriction sites, primer candidate regions, and promoter regions). The given file format has 2 parts:

  1. an ID line that starts with ‘>’
  2. genetic code broken up into many 80-character lines

For part 2, I convert the lines into a single readable array that I can walk to find the answers.

The whole file will stay in memory until the user is done with it, at which point the entire memory block will be freed (was planning on implementing an Arena Allocator for this but wasn’t sure if a certain child allocator would be better than another).

I have been trying to think of clever ways to get the needed array size ahead of time by looking ahead in the file until I find the next ID line, allocating the exact size, and then going back and copying the data. But I haven’t fully figured out the seekTo function / file position jumping needed for that (I only thought of this idea about ten minutes ago).
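
For reference, a rough sketch of what that look-ahead could be, using getPos and seekTo on std.fs.File; the file name and the counting loop are illustrative, and it assumes the file position is sitting right after an ID line:

    const std = @import("std");

    pub fn main() !void {
        const file = try std.fs.cwd().openFile("example.fasta", .{});
        defer file.close();

        // Remember where this record's sequence data begins.
        const record_start = try file.getPos();

        // First pass: count the sequence bytes (newlines excluded) up to the next ID line or EOF.
        var buf: [4096]u8 = undefined;
        var needed: usize = 0;
        outer: while (true) {
            const n = try file.read(&buf);
            if (n == 0) break;
            for (buf[0..n]) |byte| switch (byte) {
                '>' => break :outer,
                '\n' => {},
                else => needed += 1,
            };
        }

        // Second pass: jump back and read the record into an exactly-sized allocation.
        try file.seekTo(record_start);
        std.debug.print("the next record needs {d} bytes\n", .{needed});
    }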

1 Like

Do you do any mutations per line that you read in? If not, you can potentially just slice up the underlying file buffer and have a view into each line.
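
A minimal sketch of that idea, assuming a Zig recent enough to have std.mem.tokenizeScalar; the file name and the 1 GiB read cap are just for illustration:

    const std = @import("std");

    pub fn main() !void {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        defer _ = gpa.deinit();
        const allocator = gpa.allocator();

        const file = try std.fs.cwd().openFile("example.fasta", .{});
        defer file.close();

        // Read the whole file once, then only keep views into that buffer.
        const contents = try file.readToEndAlloc(allocator, 1024 * 1024 * 1024);
        defer allocator.free(contents);

        var line_count: usize = 0;
        var lines = std.mem.tokenizeScalar(u8, contents, '\n');
        while (lines.next()) |line| {
            // `line` is a slice into `contents`; nothing is copied here.
            _ = line;
            line_count += 1;
        }
        std.debug.print("{d} lines\n", .{line_count});
    }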

I only concatenate the lines together. Beyond that, they are constant values used for reading and generating new data.

Okay, so one approach you may want to use here is something like a pre-sized ArrayList that you use as a writer (literally array_list.writer()). That way, you can write values directly into place and avoid allocations between concatenations. If you size your writer based on something smaller than the file size, you may only need to reallocate once or twice if you undershoot the size of the buffer (the default writer will call append… you can use a fixed writer if you don’t want that behavior).
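
Something like this sketch; the initial capacity and the literal lines are placeholders for whatever you read from the file:

    const std = @import("std");

    pub fn main() !void {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        defer _ = gpa.deinit();
        const allocator = gpa.allocator();

        // Pre-size the list with a guess; the writer only grows it if the guess is undershot.
        var sequence = try std.ArrayList(u8).initCapacity(allocator, 1024);
        defer sequence.deinit();

        const writer = sequence.writer();
        // Each line read from the file would be written straight into place,
        // concatenating without a separate alloc/realloc per line.
        try writer.writeAll("ACGTACGTACGT");
        try writer.writeAll("TTTTGGGGCCCC");

        std.debug.print("sequence length: {d}\n", .{sequence.items.len});
    }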

I’ll look into that as an option. I haven’t messed with ArrayLists at all so I don’t know their associated functions. I will sleep on this problem and get back to it tomorrow.

I think you should avoid the concatenation. If you design your access and iteration logic around the static line format, it should be possible to just ignore the newlines and ‘>’ characters; if there are other kinds of lines, you can ignore those as well.

I think it would make sense to use memory mapping to map the entire file into memory as a read-only mapping. (That way you don’t have to read in the entire file and can just let the OS handle the loading of physical pages.)
Then I would write code that scans over the entire memory region, creating an index for all the lines that contain wanted data.

In this case you would essentially create an ArrayList([]const u8) data_lines, which is a list of slices that start with the first byte after > and end before the newline (pointing into the memory mapped area).

That index essentially gives you a view of the data without the control characters, and you can store the index as a separate file and load it again (as long as the original file hasn’t changed).
(You would use the base pointer of the memory mapping to calculate the offset and length for every slice and store those instead of the slices themselves; upon reading the index back, you add each offset to the base pointer of the new memory mapping.)
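
Roughly, the mapping plus index could look like this sketch (assuming a Zig where the call lives at std.posix.mmap - older releases had it under std.os - and a POSIX target; the file name is illustrative):

    const std = @import("std");

    pub fn main() !void {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        defer _ = gpa.deinit();
        const allocator = gpa.allocator();

        const file = try std.fs.cwd().openFile("example.fasta", .{});
        defer file.close();
        const file_size: usize = @intCast((try file.stat()).size);

        // Map the whole file read-only; the OS pages it in on demand.
        const data = try std.posix.mmap(
            null,
            file_size,
            std.posix.PROT.READ,
            .{ .TYPE = .PRIVATE },
            file.handle,
            0,
        );
        defer std.posix.munmap(data);

        // Index the lines as views pointing straight into the mapping; nothing is copied.
        var data_lines = std.ArrayList([]const u8).init(allocator);
        defer data_lines.deinit();

        var it = std.mem.tokenizeScalar(u8, data, '\n');
        while (it.next()) |line| {
            // Drop the leading '>' on ID lines; keep sequence lines as-is.
            const wanted = if (line[0] == '>') line[1..] else line;
            try data_lines.append(wanted);
        }

        std.debug.print("indexed {d} lines\n", .{data_lines.items.len});
    }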

If that isn’t enough, then my next step would probably be to convert the data_lines, which seem to be text, into something denser like a binary format and write them out as a separate binary file. At that point it makes sense to do the concatenation by writing out just the data and none of the format characters. If the lines only contain ACGT characters, you could write out many enum(u2) {A, C, G, T} values (possibly accumulating them until you have a full byte).
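
A small sketch of that 2-bit packing; the choice to put the first base in the low bits of each byte is arbitrary here:

    const std = @import("std");

    const Base = enum(u2) { A, C, G, T };

    fn baseFromChar(c: u8) ?Base {
        return switch (c) {
            'A' => .A,
            'C' => .C,
            'G' => .G,
            'T' => .T,
            else => null, // newlines, '>' lines, anything outside the alphabet
        };
    }

    pub fn main() !void {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        defer _ = gpa.deinit();
        const allocator = gpa.allocator();

        const text = "ACGT\nACGT"; // stand-in for a couple of sequence lines
        var out = std.ArrayList(u8).init(allocator);
        defer out.deinit();

        var current: u8 = 0;
        var count: u8 = 0;
        for (text) |c| {
            const base = baseFromChar(c) orelse continue; // skip anything that is not a base
            const shift: u3 = @intCast(count * 2);
            current |= @as(u8, @intFromEnum(base)) << shift;
            count += 1;
            if (count == 4) { // four 2-bit bases make a full byte
                try out.append(current);
                current = 0;
                count = 0;
            }
        }
        if (count != 0) try out.append(current); // flush a partial final byte

        std.debug.print("packed into {d} bytes\n", .{out.items.len});
    }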

If the alphabet of the data is different or larger, you would have to adapt this to something similar.

So one of my questions is: what is the data, can it be represented more compactly in binary, and would a more compact representation help you?

Another thing to consider is that if you don’t really care about getting small sizes, then converting to a binary form isn’t really necessary; instead you could have a step that compresses/decompresses the original data file. (Compress it when it isn’t being used, or just leave it uncompressed if storage isn’t a bottleneck.)

If there is lots of redundancy in the text format it likely compresses very well, without needing to use manual techniques.

Here is an example for using mmap

1 Like

You could try a fixed buffer on the heap; it’s extremely fast:

    const heap = std.heap.page_allocator;
    const memory_buffer = try heap.alloc(u8, 800 * 1024 * 1024); // 800 MB memory
    defer heap.free(memory_buffer);
    var fba = std.heap.FixedBufferAllocator.init(memory_buffer);
    const allocator = fba.allocator();

2 Likes

Update with the solution. As a fix to what I previously had, I swapped the per-struct allocations for a single array to dump the file into, and swapped the buffer so the file is read directly into that array. That decreased the large-file read time from 30 minutes to 30 seconds. I then use an ArrayList to mark specific indexes, which I use to slice out the pieces for evaluation.

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    const file = try std.fs.cwd().openFile("input.fasta", .{}); // path is illustrative
    defer file.close();
    var buffered = std.io.bufferedReader(file.reader());
    const reader = buffered.reader();

    var array_list = std.ArrayList(u64).init(allocator);
    const file_size = (try file.stat()).size;
    const file_contents = try allocator.alloc(u8, file_size);

    // Lines are read directly into file_contents, packed back to back without newlines.
    var index: usize = 0;
    while (try reader.readUntilDelimiterOrEof(file_contents[index..], '\n')) |line| {
        if (line.len > 0 and line[0] == '>') {
            try array_list.append(index + 1); // start of the ID, just past '>'
            try array_list.append(index + line.len); // end of the ID
        }
        index += line.len;
    }
}
3 Likes

Maybe also look into mmap - you can use it to have the OS give you what looks like memory but is backed by the file.

2 Likes

I am going to look into that as well because I do want it to be even more efficient. But the previous solution was a quick gain for minimal code change to get it functioning at a reasonable speed.