RAM memory use and proper allocation of an array

Over the past six months, I have been developing a chess engine in the Zig programming language. It is already quite strong and has significant potential for further improvement; my goal is to reach an Elo strength of over 3000.

One of my objectives in creating this chess engine was to learn a new programming language. Initially, I started with C#, which I occasionally use for my job. However, I decided to learn something completely new. I experimented with both Rust and Zig, and ultimately chose Zig as it seemed more relevant to my work. I am an electrical engineer and I also work with embedded systems. Coming from a background in C, Zig felt more intuitive to me compared to Rust.
Until now, I have been able to learn from resources available on the internet without needing to ask specific questions. However, at this point, quite a few things are bothering me, so I will be asking a number of questions. The first one is related to allocators and arrays.

One of the key components in a chess engine is the hash table of positions, which is built during the search for the best move. This table stores a nearly unique hash key for each position, the best move in that position, the evaluation score, and some other data. During the search, you can arrive at the same position via different sequences of moves, and the search can take some time. It is therefore highly beneficial to store the best move and evaluation of each position: you can probe the hash table and avoid repeating searches in positions that have already been explored. This saves a significant amount of time, which can then be used to search deeper, improving the strength of the engine.
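
To illustrate the idea, here is a rough sketch of such a table entry and a probe; the field names, widths, and function signature are just an example, not my actual layout:

    const Entry = packed struct {
        key: u64, // (partial) hash key of the position
        move: u16, // best move found for this position
        score: i16, // evaluation returned by the search
        depth: u8, // depth the score was searched to
        age: u8, // used to prefer fresh entries when replacing
        pad: u16 = 0, // pad to 128 bits
    };

    fn probe(table: []const Entry, mask: u64, key: u64) ?Entry {
        const idx: usize = @intCast(key & mask); // power-of-two table, so masking gives the index
        const e = table[idx];
        return if (e.key == key) e else null; // a hit only if the stored key matches
    }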

One problem or bug I am encountering is related to how I reinitialize the hash table when I set its size. It seems that I am using twice the amount of RAM as the size of the hash table. While this does not affect the engine’s strength, it becomes an issue when I submit the engine for competitive play: all engines need to have a hash table of the same size, as a larger hash size increases the strength of the engine. The size of the hash table is correct, but the chess GUI shows a much larger RAM usage, which could imply that my hash table is actually bigger.
The code for setting the new size is as follows:

    pub fn init(self: *TranspositionTable, size_mb: u64) void {
        // Free the old table before building the new one.
        self.ttArray.deinit();

        var size: u64 = 1 << NB_BITS;
        const MB: u64 = 1 << 20;
        var nb_bits: u5 = 16;
        if (size_mb > 0) {
            // Find the largest power-of-two entry count that fits in size_mb megabytes.
            while (((@as(u64, 1) << nb_bits) * @sizeOf(scoreEntry) <= size_mb * MB / 2) and (nb_bits < 29)) : (nb_bits += 1) {}
            size = @as(u64, 1) << nb_bits;
        }

        std.debug.print("Hash size in MB: {}\n", .{size * @sizeOf(scoreEntry) / MB});
        std.debug.print("Hash size in bytes: {}\n", .{size * @sizeOf(scoreEntry)});

        var tt = TranspositionTable{
            .ttArray = std.ArrayList(u128).init(tt_allocator.allocator()),
            .size = size,
            .mask = size - 1,
            .age = 0,
        };

        // Allocate the whole table up front; an allocation failure is silently ignored here.
        tt.ttArray.ensureTotalCapacity(tt.size) catch {};
        tt.ttArray.expandToCapacity();

        self.* = tt;
        self.clear();
    }

The scoreEntry is a packed structure that is 128 bits in size. The TranspositionTable structure contains ttArray, which consumes the majority of the memory; the rest of the structure holds the size, mask, and age. I use std.heap.c_allocator for its speed.

At the start, I deinitialize the old ttArray, create a new one with a new size, and set the pointer to the new array. I suspect that the old array is not being “erased”, but instead continues to consume the original amount of RAM, while the newly initialized array uses additional memory.

For example, I start with a 128 MB hash table, then the GUI reinitializes it to 64 MB, and later I set the size to 256 MB. The total memory used ends up being 128 + 64 + 256 MB, which is almost twice the actual required size. My understanding of deinit is that it clears and frees the memory, effectively reducing the memory used to 0 MB; I then reserve the correct amount of new memory and use that. However, it seems my understanding is incorrect.

The question here is: where exactly is my understanding wrong, and can you help me with some useful examples?

The code for the engine is at https://github.com/jabolcni/Lambergar and the code for the hash table is at https://github.com/jabolcni/Lambergar/blob/main/src/tt.zig


Looking at your code, you seem to be using an ArenaAllocator. Arena allocators normally do not free any memory. This makes them really fast, at the cost of using extra memory when used in the wrong cases (like for data structures that get resized frequently).
Arenas are great for cases where all memory is freed at once (by calling arena.deinit() or arena.reset()) rather than through frequent individual frees throughout the lifetime of the arena (which is what you have here).
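
To illustrate, a rough sketch of the pattern arenas are designed for (the search function and sizes here are made up for the example):

    const std = @import("std");

    fn search() !void {
        // One arena per search: allocate freely while searching...
        var arena = std.heap.ArenaAllocator.init(std.heap.c_allocator);
        defer arena.deinit(); // ...and give everything back at once here.
        const a = arena.allocator();

        const scratch = try a.alloc(u64, 4096);
        _ = scratch;
        // Individual a.free(...) calls do not shrink the arena; only
        // deinit()/reset() return memory to the backing allocator.
    }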

Instead, I would recommend using a GeneralPurposeAllocator here.
The GeneralPurposeAllocator is also great for debugging, as it points out memory leaks and other common memory errors.


I do use an arena allocator, specifically pub var tt_allocator = std.heap.ArenaAllocator.init(std.heap.c_allocator);.

I chose the arena allocator because I could use c_allocator, which is as fast as possible, a crucial factor in chess engines. I didn’t find any examples of how to use c_allocator with a general-purpose allocator. I wasn’t aware that this allocator does not free any memory, which I presume is exactly what is happening.

During the search, I don’t actually free anything; I simply overwrite entries with new values. Freeing only happens during resizing, which typically occurs at the start of a match; during the actual chess match I do not free anything.

I replaced the line

    pub var tt_allocator = std.heap.ArenaAllocator.init(std.heap.c_allocator);

with a new initialization:

    pub var tt_allocator = std.heap.GeneralPurposeAllocator(.{}){};

This results in the expected RAM usage, exactly what I need. The potential issue might be lower speed and, consequently, a lower strength of the engine. I am now running matches between the old and new versions to see the difference. This will take some time, as the two versions need to play a lot of games to collect enough results to discern a difference in strength.

Can you try std.heap.c_allocator without the arena? It is much faster than the GPA and will conserve RAM. The GPA is for debugging, not release.


The GPA is intended for general purposes, not just for debugging.
It is true that the GPA is rather slow right now, but it is intended to become faster than the c_allocator in release modes: implement a fast general purpose allocator · Issue #12484 · ziglang/zig · GitHub


Yes, in the bright future the GPA will be on par with or faster than other allocators. But for a competitive chess engine today, I am afraid the GPA does not cut it. For example, ghostty uses c_allocator instead of the GPA precisely because of performance issues with the latter. The Allocator interface is one of Zig’s strengths; swapping one for another should not be a problem once the GPA delivers.


Just to be clear, you can use c_allocator directly:

    const allocator = std.heap.c_allocator;

    // example usage
    const slice = try allocator.alloc(u8, 1024);
    defer allocator.free(slice);

c_allocator is a general purpose allocator.


@jabolcni: Tangentially related to your question, but since you’re learning it might be worth mentioning:

First, you should probably not use ArrayList here. ArrayList is used to manage a slice of memory that can grow and/or shrink. If you’re not going to resize the list after it’s created (deiniting and reiniting doesn’t count!) then it’s a useless wrapper and it’s better to just allocate the slice directly with allocator.alloc(), just like @squeek502 suggested.
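
A rough sketch of what that could look like (the entry type and helper function here are made up for the example, not your actual code):

    const std = @import("std");

    const scoreEntry = u128; // stand-in for your packed 128-bit entry

    /// Reallocate the table as a plain slice. The caller owns `table` and must
    /// have allocated it (possibly with length 0) with the same allocator.
    fn resizeTable(allocator: std.mem.Allocator, table: *[]scoreEntry, new_len: usize) !void {
        allocator.free(table.*); // release the old backing memory
        table.* = &[_]scoreEntry{}; // avoid a dangling slice if the allocation below fails
        table.* = try allocator.alloc(scoreEntry, new_len);
        @memset(table.*, 0); // start from an empty table
    }

On a resize you just call this again with the new length, and nothing from the old allocation sticks around.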

Second, it’s typical to use usize instead of u64 for memory sizes. On 64-bit systems these types are the same, but using u64 explicitly needlessly prevents your code from compiling on 32-bit systems.

Finally, instead of running a while-loop to determine the size in bits, you could use the math.log2_int() function (there is also math.log2_int_ceil() if you need to round up). The performance benefit is negligible in this case, but it’s useful to get in the habit of using standard functions instead of re-inventing the wheel.
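
Roughly like this (a sketch; the clamping to your 16..29 bit range and the entry type are assumptions):

    const std = @import("std");

    const scoreEntry = u128; // stand-in for the real 128-bit entry

    /// Largest power-of-two entry count whose table fits in `size_mb` megabytes.
    fn tableSize(size_mb: usize) usize {
        const MB: usize = 1 << 20;
        const entries = @max(1, size_mb * MB / @sizeOf(scoreEntry));
        const nb_bits = std.math.clamp(std.math.log2_int(usize, entries), 16, 29);
        return @as(usize, 1) << nb_bits;
    }

Since log2_int() rounds down, the resulting table never exceeds the requested size (apart from the 2^16-entry minimum).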

The GPA is only slow in relative terms. You won’t notice the difference between the C allocator and Zig’s GPA unless your program does so many allocations that the allocator dominates the performance. It’s not clear that’s the case here.


I implemented both approaches: the GPA without c_allocator, and c_allocator without ArrayList.

It turns out that the GPA is actually not that much slower. It affects performance, but only marginally, so the observation from @maksverver is quite correct.

Using c_allocator without ArrayList is faster. Again, not significantly faster, but it is clearly the preferred solution.

I also switched to usize and math.log2_int(); the code is nicer now.

I would call this a solved problem. Thanks to everyone for the help.


Just to clarify – did you remove the arena allocator that was preventing the RAM from being released?


You could also try this as a c_allocator / GPA alternative.

Yes, I use c_allocator directly.