It’s my first time profiling any program and I’m using poop. I’m just experimenting with Andrew’s data-oriented design.
There are three benchmarks here: base, dod_arena, and dod_gpa. In dod_gpa, I free memory as soon as possible. There’s a significant increase in wall_time and cache_references in dod_gpa. These are all built with ReleaseSafe.
General guidance would suffice because I’m struggling to find resources on how to interpret this. I can share the source code on GitHub if someone’s interested enough.
In the following table I have erased the outliers column and all the measurements except wall_time.
Benchmark 1 (29 runs): ./base
measurement mean ± σ min … max delta
wall_time 177ms ± 12.8ms 153ms … 208ms 0%
Benchmark 2 (138 runs): ./dod_arena
measurement mean ± σ min … max delta
wall_time 36.3ms ± 4.78ms 28.5ms … 46.5ms ⚡- 79.5% ± 1.5%
Benchmark 3 (5 runs): ./dod_gpa
measurement mean ± σ min … max delta
wall_time 1.03s ± 58.3ms 1000ms … 1.14s 💩+483.4% ± 13.3%
poop runs each benchmark repeatedly within the same total time budget. Within that budget the 1st benchmark executed 29 times (hence “Benchmark 1 (29 runs): ./base”), the second 138 times, and the third 5 times. Since the budget is the same for all three, the second benchmark is faster than the first (138 runs vs 29), and the third is slower (5 runs vs 29).
The first measurement is wall_time (wall-clock run time). It has a mean (the average) and a standard deviation (how far the values spread around that average); for the 1st benchmark it is 177ms ± 12.8ms. Let’s ignore the deviation for now and list the averages:
| Benchmark | Average wall_time |
| --- | --- |
| 1st | 177ms |
| 2nd | 36.3ms |
| 3rd | 1.03s |
The second is the fastest with 36.3 ms and the third is the slowest with 1030 ms.
delta shows the difference in percent between each benchmark and the first.
The second benchmark’s delta is -79.5%. Negative means faster than the first: 177ms / 36.3ms ≈ 4.9, so it is about 5 times faster, hence the ⚡.
The third benchmark’s delta is +483.4%. Positive means slower than the first: 1030ms / 177ms ≈ 5.8, so it is almost 6 times slower, hence the 💩.
min..max lists the minimum and the maximum time for the runs.
The memory measurements are:
peak_rss: the peak memory usage of the process (RSS = resident set size).
cache_references: the total number of cache accesses.
cache_misses: the number of memory references that were not in the cache.
The CPU measurements are:
cpu_cycles: the total number of CPU clock cycles spent.
instructions: the total number of instructions executed.
How does cache_references affect performance? It’s quite a hard concept to google for. I succumbed to ChatGPT-4o and asked it how to interpret the table. What I got from it was that constant alloc/free syscalls(?) have high overhead. Suggested solutions included memory pooling and batch allocation.
Some of these metrics are kind of advanced, you probably only need wall_time most of the time.
But the others can give you a hint at potential causes for slowness.
For all of them lower is better.
wall_time is just the total time your program ran.
peak_rss is basically the peak memory usage, so if you use the GPA to free memory as soon as possible, it makes sense that it would be lower than with an arena.
A CPU cycle or clock cycle is the unit of time derived from your clock speed; this metric tells you how many cycles the CPU spent actually doing work, versus sitting idle and waiting for other things. If you divide this value by the clock speed of your processor (assuming it’s not dynamic), you get the time the CPU was busy, and comparing that to wall_time gives an estimate of what fraction of the time the processor was busy.
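For example, with made-up numbers: 400M cycles on a 4 GHz core is 400×10⁶ / 4×10⁹ = 0.1 s of busy time; against a 177 ms wall_time that is roughly 0.1 / 0.177 ≈ 56% busy.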
instructions is the number of assembly instructions executed. Each instruction can take a different number of cycles: for example, integer division takes around 17 cycles, whereas several integer additions can complete in a single cycle. So the ratio between instructions and cycles can hint at whether you are using slow instructions.
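As a hypothetical example: 300M instructions retired over 100M cycles is an IPC (instructions per cycle) of 3, which is healthy; an IPC well below 1 suggests the CPU is stalling on slow instructions or waiting for memory.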
cache_references and cache_misses: Every time the CPU wants to fetch some memory, it first looks at the nearest caches before going to the next cache level or finally main memory. cache_references is basically how often it tried to get a value from the cache, and cache misses is how often it failed to do so.
branch_misses: your CPU tries to predict a branch (like an if statement) before it has finished calculating the values it depends on. Once the values are calculated, it may have to roll back a few instructions if the prediction was wrong; this is called a branch miss.
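As a small illustration (my own sketch, not from the benchmarks above): the branch below is cheap when the data is sorted, because the predictor learns the pattern, and expensive on random data. Running both variants under poop should show the difference in branch_misses.

```zig
const std = @import("std");

pub fn main() void {
    var data: [1 << 16]u8 = undefined;

    // Fill with a cheap xorshift so the example needs no RNG API.
    var state: u32 = 0x12345678;
    for (&data) |*x| {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        x.* = @truncate(state);
    }

    // Uncomment to make the branch below predictable and watch
    // branch_misses drop under poop:
    // std.mem.sort(u8, &data, {}, std.sort.asc(u8));

    var sum: u64 = 0;
    for (data) |x| {
        if (x >= 128) sum += x; // ~50% taken on random data: many mispredictions
    }
    std.debug.print("sum = {}\n", .{sum});
}
```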
I don’t understand why high cache_references increased wall_time. I expected the opposite.
I’m writing a parser. I don’t set an initial capacity for the ArrayLists, so that explains the allocations. And in the case of dod_gpa, I essentially free all those allocations immediately. These benchmarks each parse 90,000 nodes. Anyway, I was just trying to see how small I can make my memory footprint.
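A minimal sketch of what pre-reserving that capacity could look like, assuming the managed std.ArrayList API; Node and the loop are hypothetical stand-ins for the real parser:

```zig
const std = @import("std");

const Node = struct { tag: u8, data: u32 }; // hypothetical node type

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var nodes = std.ArrayList(Node).init(gpa.allocator());
    defer nodes.deinit();

    // One up-front allocation instead of repeated grow-and-copy steps
    // on the way to ~90,000 appends.
    try nodes.ensureTotalCapacity(90_000);

    var i: u32 = 0;
    while (i < 90_000) : (i += 1) {
        // Capacity is already reserved, so this never allocates.
        nodes.appendAssumeCapacity(.{ .tag = 0, .data = i });
    }
}
```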
Lower cache_references means less memory is being referenced; since any kind of memory access is slow, fewer accesses mean better performance.
Higher cache_references means more memory is referenced, which is worse for performance.
Another metric is the miss rate: 100 * cache_misses / cache_references.
A bigger percentage means the cache is being used less effectively.
| Benchmark | cache_references | cache_misses | % miss |
| --- | --- | --- | --- |
| 1 | 938K | 179K | 19.1% |
| 2 | 71.0K | 7.25K | 10.2% |
| 3 | 6420K | 61.8K | 1.0% |
For Benchmark 3 the cache is used well (the miss rate is only about 1%), but the CPU accesses the cache about 7 times more than in the 1st benchmark and about 90 times more than in the 2nd.
Each successive cache level is slower. What we cannot see here is whether a reference hit the 1st-level cache (fast) or the 2nd- or 3rd-level cache (much slower).
Note that there are many more kernel perf metrics that poop does not expose.
I believe wall_time and peak_rss are the most useful metrics. The other metrics are mostly for performance detectives.
It seems that the constant allocating and freeing is the main overhead. In benchmark 2, you can see how, if I free the arena on every iteration instead of accumulating all memory until program exit, it achieves the same memory footprint as the fine-grained dod_gpa with much better performance.
In benchmark 4, the c_allocator performed much better, albeit using twice the memory of dod_gpa. So although allocation/freeing is the main overhead, a better GPA gets you ~90% of the way there (in wall_time) on less memory.
Here is the snippet for freeing the arena on every iteration. I leave it to the imagination how the fine-grained freeing happens within parse(...) in the GPA/c_allocator case.
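A minimal sketch of the idea, with a hypothetical parse() standing in for the real one:

```zig
const std = @import("std");

// Hypothetical stand-in for the real parser: every allocation it makes
// is owned by the allocator passed in.
fn parse(allocator: std.mem.Allocator, input: []const u8) ![]u8 {
    return allocator.dupe(u8, input);
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var i: usize = 0;
    while (i < 90_000) : (i += 1) {
        // A fresh arena per iteration: everything parse() allocated is
        // released in one deinit() instead of node by node.
        var arena = std.heap.ArenaAllocator.init(gpa.allocator());
        defer arena.deinit();

        _ = try parse(arena.allocator(), "some input");
    }
}
```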
Okay, I understand now. The main question is answered, but I have a follow-up: if memory is slow, what is implied to be the faster alternative? Is it CPU calculations/instructions?
I haven’t tested it, but in principle, would std.heap.page_allocator as the arena’s child allocator be faster than the c_allocator, assuming page-size increments fit your overall memory profile?
Using the page_allocator as the arena’s child allocator means a whole page (4 KiB on most systems) for each arena allocation. If you allocate a mix of many small and large objects, you might see increased memory usage. I would expect the C allocator to pack multiple small allocations into one page, at the cost of some management overhead.
But Zig makes the arena’s child allocator easy to change. The best strategy for complex allocation patterns is to measure your actual application’s performance with each option.
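For instance, a sketch of how the child allocator can be swapped (the c_allocator variant requires linking libc):

```zig
const std = @import("std");

pub fn main() !void {
    // Swapping the arena’s child allocator is a one-line change; measure
    // each variant under poop to see which fits the workload.
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    // var arena = std.heap.ArenaAllocator.init(std.heap.c_allocator); // needs libc
    defer arena.deinit();

    const scratch = try arena.allocator().alloc(u8, 64);
    _ = scratch;
}
```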
I’m surprised to hear this. I would expect the arena to go to its child allocator only when it runs out of capacity, so allocating many small objects, each smaller than a page, would be essentially free.
I’ll have a look at the source later to check this, as I’m using arenas under the assumption that they act like growable bump allocators.
You are correct. I was wrong. I edited my answer for correctness.
The arena is a single linked list of nodes (child-allocator allocations). The arena allocator tries to allocate from these nodes; when no space is left it tries to resize them, and finally it creates new nodes.
The best feature of the arena allocator is that you can reset it without throwing away the nodes (the child-allocated memory). E.g. in a loop (or a request-processing thread), instead of creating a new arena allocator instance each time, you can keep a single instance and reset it on each iteration (or each request). That means almost no system calls after the first iteration, because the memory stays allocated. In this scenario it makes little difference which child allocator you use, and the page_allocator looks like a good fit.
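A sketch of that pattern (the request loop and sizes are hypothetical):

```zig
const std = @import("std");

pub fn main() !void {
    // One long-lived arena backed by the page_allocator, reset between
    // iterations instead of being re-created.
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();

    var request: usize = 0;
    while (request < 100) : (request += 1) {
        const scratch = try arena.allocator().alloc(u8, 16 * 1024);
        @memset(scratch, 0);

        // .retain_capacity keeps the already-mapped memory for the next
        // iteration; .free_all and .{ .retain_with_limit = n } are the
        // other reset modes.
        _ = arena.reset(.retain_capacity);
    }
}
```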