Heap-allocated struct faster than comptime-calculated arrays

I did a small experiment:
My chess program computes all attack data at comptime into constant arrays, so they are baked into the executable.
Then I made a clone that builds the same attack data at runtime into a heap-allocated struct.
To my surprise, the heap version is faster.
Why could that be?

Bench:

Total nodes: 18999768562 13.772s 1379.5293 Mnodes/s (1379529270) // comptimes
Total nodes: 18999768562 13.089s 1451.5786 Mnodes/s (1451578573) // heap
const Data = struct {
    file_magics: [64]MagicEntry,
    main_magics: [64]MagicEntry,
    anti_magics: [64]MagicEntry,
    pawn_attacks_white: [64]u64,
    pawn_attacks_black: [64]u64,
    knight_attacks: [64]u64,
    king_attacks: [64]u64,
    rank_attacks: [64 * 64]u64,
    file_attacks: [64 * 64]u64,
    diag_main_attacks: [64 * 64]u64,
    diag_anti_attacks: [64 * 64]u64,
};

const MagicEntry = struct {
    mask: u64,
    magic: u64,
};
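For reference, the heap variant presumably boils down to something like the following minimal sketch. `initAttackData` is a hypothetical stand-in for the real fill code (here it just zeroes the tables); `Data` and `MagicEntry` are as defined above.

```zig
const std = @import("std");

// Hypothetical stand-in for the real runtime fill code.
fn initAttackData(data: *Data) void {
    data.* = std.mem.zeroes(Data);
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    // One allocation holds every table; the pointer lives for the whole run.
    const data = try gpa.allocator().create(Data);
    defer gpa.allocator().destroy(data);

    initAttackData(data);
}
```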

Maybe because they are in a struct rather than separate global arrays, the compiler was able to rearrange the fields and get better locality of reference for your memory-access pattern.

Try doing deeper profiling and look at cache misses.

5 Likes

Adding to this, you can at least narrow it down to their being global by moving them onto the stack and seeing whether the improvement persists. Of course that might not be practical if they are very large.
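Concretely, the stack experiment could look like this sketch (`initAttackData` is a hypothetical stand-in for the real fill code). `Data` as defined above is about 133 KiB (3·64·16 + 4·64·8 + 4·4096·8 = 136 192 bytes), which fits comfortably in a typical 8 MiB default stack:

```zig
const std = @import("std");

// Hypothetical stand-in for the real runtime fill code.
fn initAttackData(data: *Data) void {
    data.* = std.mem.zeroes(Data);
}

pub fn main() void {
    // Stack variant: the tables live in main's stack frame
    // rather than in .rodata or on the heap.
    var data: Data = undefined;
    initAttackData(&data);
    // ... run the benchmark against &data ...
}
```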

The difference you show here could also come from differences in code alignment; see:

All in all I wouldn’t worry about such small differences.

1 Like

I know that one, yes. But I would have expected access through a heap pointer to be much slower than access to consts; maybe it just boils down to (almost) the same access pattern.

Why do you think they would be slower? Are you profiling just the memory access or the whole system? I'd expect a slowdown from the work to allocate and create the arrays, but after that, access shouldn't really be any different. Memory is the same whether it is a global const or on the heap.

He said

So, since they’re contiguous like this, you’re saying (right?) that there should be no realizable difference. Sure, a cache miss might be more likely when fetching the struct instance in the first place, but thereafter it’s the same… but if “these” were many pieces of data dispersed around memory, then there’d surely be a bit of difference, no? Memory (access) isn’t always “the same whether it is a global const or on the heap”… fair?

If the new struct were created in disparate pieces spread throughout the heap, then yes, it would be surprising to see faster access times than with a straight in-memory array. I’m just not sure what the entire difference between the two approaches is; that’s why I’m asking why you expected them to differ.

Yes, I realize heap access and const access are about the same.
I’ll probably change some things to heap; it makes the executable smaller too.

While pondering this issue I noticed something unexpected. Running the following code (optimize = ReleaseFast):

const std = @import("std");

const a: [5]u64 = undefined;
const b: [64 * 64 * 14]u64 = undefined;
const c: [2]u64 = undefined;

pub fn main() void {
    std.debug.print("a@{d}\n", .{@intFromPtr(&a)});
    std.debug.print("b@{d}\n", .{@intFromPtr(&b)});
    std.debug.print("c@{d}\n", .{@intFromPtr(&c)});
}

produces these results:

a@16782616
b@16782656
c@17241792

I would expect the compiler to move the large array to the end (or the beginning) so that the smaller values sit closer together. This does happen when a has only 4 elements:

a@17241464
b@16782616
c@17241784

Anyway, the intermingling of the large array with other constant values is what I suspect is behind the performance difference. In the example above, a is in the same page as the beginning of b, so accessing the former keeps the TLB entry covering the start of the latter from being evicted even when it isn’t needed.
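The addresses printed above bear this out. Assuming 4 KiB pages (a guess about the target platform), a and the start of b land in the same page while c is hundreds of pages away:

```zig
const std = @import("std");

pub fn main() void {
    const page_size: u64 = 4096; // assuming 4 KiB pages
    const a_addr: u64 = 16782616;
    const b_addr: u64 = 16782656;
    const c_addr: u64 = 17241792;

    // a and the start of b share a page; c does not.
    std.debug.print("a page {d}, b page {d}, c page {d}\n", .{
        a_addr / page_size, b_addr / page_size, c_addr / page_size,
    });
    // prints: a page 4097, b page 4097, c page 4209
}
```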

3 Likes

Just to cover an unusual scenario: you wouldn’t happen to have a machine with a mix of performance cores and efficiency cores?

I encountered inconsistent benchmark results when I got a new machine recently. The same benchmark had different timings from one run to another. It finally dawned on me that the runs were landing on different core types. I often started a long-running benchmark and switched windows to do something else. The benchmark in the background window got assigned to an efficiency core and ran slower.

If you have a machine with a mix of core types, be sure to run your benchmarks in a consistent environment, either always in the foreground or always in the background.

1 Like

“started a long running benchmark and switched windows to do something else”

Doing something else on the same machine is guaranteed to spoil your benchmarking results.

The point is to have an explanation for the anomaly so that a consistent benchmark environment can be set up. I can leave it running and do nothing else, but running it alone in the foreground versus the background still produces different outcomes.

I can certainly do other things while it’s running. The benchmark is single-threaded and the machine has many cores. As long as I run it in a consistent manner and don’t exhaust all the cores, it’s fine. Of course, for formal, publishable benchmarks I would leave it alone.

In case this helps: I’ve found that having a browser open (Firefox) with a bunch of tabs, even if I don’t use them, makes my single-threaded benchmarks run quite a bit slower and less consistently.