Performance Hit Relating to GPA and DOD

It’s my first time profiling any program and I’m using poop (Performance Optimizer Observation Platform). I’m just experimenting with Andrew’s data-oriented design.

There are three benchmarks here: base, dod_arena, and dod_gpa. In dod_gpa, I free memory as soon as possible. There’s a significant increase in wall_time and cache_references in dod_gpa. These are all built with ReleaseSafe.

General guidance would suffice because I’m struggling to find resources on how to interpret this. I can share the source code on GitHub if someone’s interested enough.

EDIT: Outliers omitted

Benchmark 1 (29 runs): ./base
  measurement          mean ± σ            min … max      delta
  wall_time           177ms ± 12.8ms     153ms …  208ms   0%
  peak_rss           34.3MB ±  513KB    33.4MB … 35.1MB   0%
  cpu_cycles          219M  ± 25.4M      171M  …  278M    0%
  instructions        386M  ±  193       386M  …  386M    0%
  cache_references    938K  ± 72.3K      809K  … 1.07M    0%
  cache_misses        179K  ± 33.3K      121K  …  268K    0%
  branch_misses      1.39M  ±  332K      745K  … 2.11M    0%
Benchmark 2 (138 runs): ./dod_arena
  measurement          mean ± σ            min … max      delta
  wall_time          36.3ms ± 4.78ms    28.5ms … 46.5ms   ⚡- 79.5% ±  1.5%
  peak_rss           4.05MB ± 2.01KB    4.05MB … 4.05MB   ⚡- 88.2% ±  0.2%
  cpu_cycles         43.6M  ± 9.25M     28.3M  … 62.8M    ⚡- 80.1% ±  2.5%
  instructions       74.2M  ± 4.11      74.2M  … 74.2M    ⚡- 80.8% ±  0.0%
  cache_references   71.0K  ± 24.3K     33.0K  …  150K    ⚡- 92.4% ±  1.6%
  cache_misses       7.25K  ±  851      6.10K  … 14.4K    ⚡- 95.9% ±  3.1%
  branch_misses       274K  ±  106K     88.8K  …  465K    ⚡- 80.2% ±  4.8%
Benchmark 3 (5 runs): ./dod_gpa
  measurement          mean ± σ            min … max      delta
  wall_time          1.03s  ± 58.3ms    1000ms … 1.14s    💩+483.4% ± 13.3%
  peak_rss            803KB ±    0       803KB …  803KB   ⚡- 97.7% ±  1.4%
  cpu_cycles          223M  ± 12.7M      215M  …  245M      +  1.8% ± 10.9%
  instructions        199M  ± 9.34       199M  …  199M    ⚡- 48.5% ±  0.0%
  cache_references   6.42M  ±  981K     5.55M  … 8.03M    💩+584.6% ± 37.3%
  cache_misses       61.8K  ± 27.2K     42.3K  …  108K    ⚡- 65.5% ± 18.0%
  branch_misses      1.22M  ± 65.7K     1.14M  … 1.31M      - 11.8% ± 22.2%

In the following table I have erased the outliers column and all the measurements except wall_time.

Benchmark 1 (29 runs): ./base
  measurement     mean ± σ          min … max      delta
  wall_time      177ms ± 12.8ms   153ms …  208ms     0%
Benchmark 2 (138 runs): ./dod_arena
  measurement     mean ± σ          min … max      delta
  wall_time     36.3ms ± 4.78ms  28.5ms … 46.5ms     ⚡- 79.5% ±  1.5%
Benchmark 3 (5 runs): ./dod_gpa
  measurement     mean ± σ          min … max      delta
  wall_time     1.03s  ± 58.3ms  1000ms … 1.14s      💩+483.4% ± 13.3%

Each benchmark is run repeatedly within the same fixed time slot. In that time slot the 1st benchmark executed 29 times (Benchmark 1 (29 runs): ./base), the second 138 times, and the third 5 times. Since the time slot is the same, the second benchmark is faster than the first (138 runs vs 29 runs) and the third is slower than the first (5 runs vs 29 runs).
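For reference, a comparison like the one above is produced by passing all the binaries to poop in a single invocation; the first command is the baseline that the delta column is measured against (the binary names here are just the ones from this thread):

poop ./base ./dod_arena ./dod_gpa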

The first measurement is wall_time (wall-clock time, i.e. how long the program ran). Each row shows the mean (average) and the standard deviation σ (how far the values spread around that average); for the 1st benchmark it is 177ms ± 12.8ms. Let’s ignore the deviation for now and list the averages:

Benchmark   Average wall_time
1st         177ms
2nd         36.3ms
3rd         1.03s

The second is the fastest with 36.3 ms and the third is the slowest with 1030 ms.

delta shows the difference in percent relative to the first benchmark.
The second benchmark is -79.5%, i.e. (36.3ms − 177ms) / 177ms. Negative means faster than the first; 177ms / 36.3ms ≈ 4.9, so it is about 5 times faster, aka :zap:.
The third benchmark is +483.4%, i.e. (1.03s − 177ms) / 177ms. Positive means slower than the first; 1.03s / 177ms ≈ 5.8, so it is almost 6 times slower, aka :poop:.

min..max lists the minimum and the maximum time for the runs.

memory measurements are:

  • peak_rss the peak memory usage of the process (rss=resident set size).
  • cache_references total number of cache accesses (EDIT: previously described as “number of references retrieved from cache”).
  • cache_misses number of memory references that were not in cache.

cpu measurements are:

  • cpu_cycles total number of CPU clock cycles used.
  • instructions total number of instructions executed.
  • branch_misses number of wrong branch predictions.
3 Likes

How does cache_references affect performance? It’s quite a hard concept to google for. I succumbed to ChatGPT-4o and asked it how to interpret the table. What I got from it was that constant alloc/free syscalls(?) have high overhead. Solutions included memory pooling or batch allocation.

Some of these metrics are kind of advanced; you probably only need wall_time most of the time.
But the others can give you a hint at potential causes for slowness.
For all of them lower is better.

  • wall_time is just the time that your program ran in total
  • peak_rss is basically the peak memory usage, so if you use the GPA to free memory as soon as possible, then it makes sense that it would be lower than using an arena.
  • a cpu cycle or clock cycle is the unit of time derived from your clock speed; this metric tells you how many cycles the CPU spends actually doing work, vs sitting idle and waiting for other things. If you divide this value by the clock speed of your processor (assuming it’s not dynamic) you get an estimate of how long the processor was busy, which you can compare against wall_time.
  • Instructions is the number of assembly instructions executed. Each instruction can take a different number of cycles; for example integer division takes around 17 cycles, whereas multiple integer additions can be done per cycle. So the relation between instructions and cycles can give you a hint for whether you are using slow instructions or stalling (see the rough per-benchmark calculation after this list).
  • cache_references and cache_misses: Every time the CPU wants to fetch some memory, it first looks at the nearest caches before going to the next cache level or finally main memory. cache_references is basically how often it tried to get a value from the cache, and cache misses is how often it failed to do so.
  • branch_misses: Your CPU tries to predict a branch (like an if statement) before it’s done calculating the values. Once it’s done calculating the values, it may have to roll back a few instructions if the prediction was wrong; this is called a branch miss.
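As a rough illustration of the instructions-vs-cycles relation, dividing instructions by cpu_cycles from the first set of tables gives an approximate instructions-per-cycle (IPC) figure; a low IPC usually means the CPU is stalled (e.g. waiting on memory or the allocator) rather than executing slow instructions:

base:      386M  / 219M  ≈ 1.8 IPC
dod_arena: 74.2M / 43.6M ≈ 1.7 IPC
dod_gpa:   199M  / 223M  ≈ 0.9 IPC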

As for your conclusion: The main problem here is just that the GPA is inefficient: implement a fast general purpose allocator · Issue #12484 · ziglang/zig · GitHub
You can try the c_allocator instead, which should have better performance. If possible, using a MemoryPool might also be a good idea.
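As a minimal sketch of both suggestions (Node is a hypothetical type standing in for whatever the parser allocates; the c_allocator requires linking libc):

const std = @import("std");

pub fn main() !void {
    // c_allocator: the same std.mem.Allocator interface, backed by malloc/free.
    const allocator = std.heap.c_allocator;
    const buf = try allocator.alloc(u8, 256);
    defer allocator.free(buf);

    // MemoryPool: cheap create/destroy when every object has the same type.
    const Node = struct { tag: u8, value: u32 };
    var pool = std.heap.MemoryPool(Node).init(std.heap.page_allocator);
    defer pool.deinit();

    const node = try pool.create();
    defer pool.destroy(node);
    node.* = .{ .tag = 1, .value = 42 };
}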

Generally it seems odd that memory allocation is even an issue; usually with DoD you should have few allocation/free calls in your program.

6 Likes

I don’t understand why high cache_references increased wall_time. I expected the opposite.

I’m writing a parser. I don’t set an initial capacity for the ArrayLists, so that explains the allocations. And in the case of dod_gpa, I essentially free all those allocations immediately. These benchmarks are each parsing 90,000 nodes. Anyway, I was just trying to see how small I can make my memory footprint.
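(Side note: if the node count is roughly known up front, pre-sizing the lists would avoid most of those incremental allocations. A minimal sketch, with a placeholder Node type and the 90,000 figure from above:)

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const Node = struct { tag: u8 };

    // One up-front allocation instead of repeated grow-and-copy cycles while appending.
    var nodes = try std.ArrayList(Node).initCapacity(allocator, 90_000);
    defer nodes.deinit();

    try nodes.append(.{ .tag = 0 });
}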

Lower cache_references means fewer memory accesses go through the cache; since any kind of memory access is slow (compared to working in registers), fewer references means better performance.
Higher cache_references means more memory traffic, which means worse performance.

Another metric is the miss rate: 100 * cache_misses / cache_references.
A bigger % means the cache is being used less effectively.

Benchmark   cache_references   cache_misses   % miss
1           938K               179K           19.1%
2           71.0K              7.25K          10.2%
3           6.42M              61.8K          ~1%

For Benchmark 3 the cache is used well (roughly a 1% miss rate), but the CPU accesses the cache about 7 times more often than in the 1st benchmark and about 90 times more often than in the 2nd.
Each further level of cache is slower. What we cannot see here is whether a reference hits the 1st level cache (fast) or the 2nd or 3rd level cache (much slower).

Note that there are many more kernel perf metrics not exposed by poop.
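If you want to dig deeper, the same hardware counters (and the per-cache-level ones poop does not show) can be read with Linux perf directly; exact event names vary a bit by kernel and CPU:

perf stat -e cycles,instructions,cache-references,cache-misses,branch-misses ./dod_gpa
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./dod_gpa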

I believe wall_time and peak_rss are the most useful metrics. The other metrics are mostly for performance detectives.

2 Likes

It seems that the constant allocating and freeing is the main overhead. In benchmark 2 you can see that if I reset the arena on every iteration, instead of accumulating all memory until program exit, it manages the same memory footprint as the fine-grained dod_gpa with way better performance.

In benchmark 4, the c_allocator performed much better, albeit using twice the memory of dod_gpa. So although allocation/free is the main overhead, a better general-purpose allocator gets you ~90% of the way there (on wall_time) with less memory.

Benchmark 1 (156 runs): ./arena_free_on_exit
  measurement          mean ± σ            min … max       delta
  wall_time          31.9ms ± 5.78ms    27.6ms … 47.5ms    0%
  peak_rss           4.05MB ± 2.02KB    4.05MB … 4.05MB    0%
  cpu_cycles         35.4M  ± 11.0M     27.5M  … 64.1M     0%
  instructions       74.2M  ± 4.36      74.2M  … 74.2M     0%
  cache_references   50.4K  ± 26.3K     28.1K  …  146K     0%
  cache_misses       6.78K  ±  736      5.87K  … 10.9K     0%
  branch_misses       175K  ±  122K     82.4K  …  464K     0%
Benchmark 2 (52 runs): ./arena_free_every_iteration
  measurement          mean ± σ            min … max       delta
  wall_time          97.1ms ± 9.91ms    82.3ms …  119ms    💩+204.2% ±  6.9%
  peak_rss            799KB ±    0       799KB …  799KB    ⚡- 80.3% ±  0.0%
  cpu_cycles         52.4M  ± 9.45M     38.0M  … 69.5M     💩+ 48.1% ±  9.4%
  instructions       80.8M  ± 4.48      80.8M  … 80.8M     💩+  9.0% ±  0.0%
  cache_references    401K  ±  190K      165K  …  910K     💩+695.3% ± 60.7%
  cache_misses       7.78K  ±  922      6.63K  … 10.8K     💩+ 14.8% ±  3.6%
  branch_misses       347K  ±  149K      115K  …  611K     💩+ 98.9% ± 23.2%
Benchmark 3 (6 runs): ./gpa
  measurement          mean ± σ            min … max       delta
  wall_time           970ms ± 6.86ms     957ms …  978ms    💩+2937.5% ± 14.9%
  peak_rss            799KB ±    0       799KB …  799KB    ⚡- 80.3% ±  0.0%
  cpu_cycles          203M  ± 3.77M      197M  …  207M     💩+474.9% ± 25.0%
  instructions        199M  ± 11.8       199M  …  199M     💩+167.7% ±  0.0%
  cache_references   4.92M  ±  335K     4.35M  … 5.21M     💩+9674.5% ± 104.7%
  cache_misses       33.6K  ± 2.80K     31.1K  … 38.5K     💩+396.1% ± 10.6%
  branch_misses      1.05M  ± 62.8K      940K  … 1.10M     💩+499.5% ± 56.2%
Benchmark 4 (134 runs): ./c_allocator
  measurement          mean ± σ            min … max       delta
  wall_time          37.4ms ± 7.45ms    29.6ms … 50.9ms    💩+ 17.1% ±  4.8%
  peak_rss           1.57MB ± 27.4KB    1.51MB … 1.61MB    ⚡- 61.2% ±  0.1%
  cpu_cycles         51.7M  ± 15.1M     36.0M  … 78.0M     💩+ 46.2% ±  8.5%
  instructions       91.5M  ± 6.54      91.5M  … 91.5M     💩+ 23.4% ±  0.0%
  cache_references   57.6K  ± 34.1K     23.0K  …  203K       + 14.3% ± 13.8%
  cache_misses       13.1K  ± 2.13K     9.93K  … 28.4K     💩+ 93.4% ±  5.3%
  branch_misses       350K  ±  167K      166K  …  620K     💩+100.3% ± 19.1%

Here is the snippet for freeing the arena on every iteration. I leave it up to the imagination how the fine-grained freeing happens within parse(...) when using the GPA/c_allocator.

const std = @import("std");
// `mocks` (the newline-separated test input) and `Parser` are defined elsewhere in the project.

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    const allocator = arena.allocator();
    // defer arena.deinit();

    var start: usize = 0;
    for (mocks, 0..) |ch, i| {
        if (ch == '\n') {
            const input = mocks[start..i];
            const ast = Parser.parse(allocator, input);
            const ast_str = ast.stringify(allocator, 0);
            _ = ast_str; // discard; only the allocation pattern matters here

            start = i + 1;
            _ = arena.reset(.free_all);
        }
    }
}

Okay, I understand now. The main question is answered but I have a follow-up: if memory is slow, what is implied to be the faster alternative? Is it CPU calculations/instructions?

Yes, sometimes it is better to recalculate something instead of keeping it in memory.

The best benchmark should be arena_free_every_iteration using the c_allocator as the arena’s child allocator.
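That combination is a one-line change to the snippet above (it requires linking libc):

var arena = std.heap.ArenaAllocator.init(std.heap.c_allocator);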

1 Like

Oh right, I’ll try that. Thank you! I won’t post the results anymore. I’ll just edit this comment to confirm.

EDIT: An arena with the c_allocator as child allocator, freed on every iteration, is the fastest and has the best wall_time to peak_rss ratio.

3 Likes

I haven’t tested it, but in principle would std.heap.page_allocator as the arena’s child allocator be faster than the c_allocator, assuming page-size increments fit your overall memory profile?

Using the page_allocator as the arena’s child allocator means a page (4KiB on most systems) for each arena allocation. If you allocate a mix of many small and large objects you might see increased memory usage. I expect the C allocator to fit multiple small allocations in one page, at the cost of some management overhead.

But Zig makes the arena’s child allocator easy to change :slight_smile: The best strategy for complex allocation patterns is to measure your actual application’s performance with each option.

Some info about that here: Pattern: `Lot` data structure for scope-lifetime nonresizable arrays of runtime-known length · Issue #17656 · ziglang/zig · GitHub

1 Like

I’m surprised to hear this. I would expect the arena to go to its child allocator only when it runs out of capacity. So allocating many small objects smaller than a page size is essentially free.

I’ll have a look at the source later to check this as I’m using arenas under the assumption they act like growable bump allocators.

1 Like

You are correct. I was wrong. I edited my answer for correctness.

The arena is a singly linked list of nodes (child-allocator allocations). The arena allocator tries to allocate from these nodes; when no space is left it tries to resize the current node, and finally it creates a new node.


The best feature of the arena allocator is that you can reset it without throwing away the nodes (the child-allocated memory). For example, in a loop (or a request-processing thread), instead of creating a new arena allocator instance each time, you can keep one instance and reset it for each iteration (or each request). That means almost no system calls after the first iteration, because the memory stays allocated. In this scenario it makes little difference which child allocator you use, and the page_allocator looks like a better fit.
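A minimal sketch of that pattern, reusing one arena across iterations (the per-iteration work is just a placeholder):

const std = @import("std");

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const allocator = arena.allocator();

    var i: usize = 0;
    while (i < 1000) : (i += 1) {
        // Allocate freely during the iteration...
        const scratch = try allocator.alloc(u8, 4096);
        _ = scratch;

        // ...then reset, keeping the already-allocated nodes for the next
        // iteration, so later iterations need (almost) no new system calls.
        _ = arena.reset(.retain_capacity);
    }
}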

3 Likes