What is a benchmark?
A benchmark is an automated way to methodically collect measurements, and those results can then be processed further, manually or in more or less automated ways.
Benchmarking vs Profiling
The basic distinction between benchmarking and profiling is that benchmarking measures how: how fast a routine is, how much memory it allocates. Profiling is concerned with where: how much time within a process is spent in which functions and which lines, and the same for allocations, syscalls, and other resources.
Benchmarking and profiling overlap from the implementation perspective, because they both need to take measurements and eventually present results in a useful way. Broadly speaking, benchmarks will collect single metrics, such as best, median, and mean performance, whereas profilers will build up a table of data, useful to graph the use of resources in the program in a fine-grained way, such as with a flame graph.
Typically, benchmarks are more focused on being run in an automated manner for current and future versions of the software, while profilers are more focused on identifying specific areas of the code which are "hot" and will benefit the most from optimization.
Benchmarks are generally implemented as test programs with timing mechanisms, which can be run in an automated manner to collect performance data. This makes them repeatable, and also allows the tracking of changes over time, for example as the software is being patched or changed. A detected negative change in the performance of a benchmark is called a regression; when automated as part of a testing suite this is regression testing.
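As a minimal sketch of that shape (assuming a recent Zig standard library, roughly 0.12+; names may differ between versions), a benchmark can be as simple as timing one call of the function under test:

```zig
const std = @import("std");

// The function under test: sum a slice of integers.
fn sum(values: []const u64) u64 {
    var total: u64 = 0;
    for (values) |v| total += v;
    return total;
}

pub fn main() !void {
    var values: [4096]u64 = undefined;
    for (&values, 0..) |*v, i| v.* = i;

    var timer = try std.time.Timer.start(); // monotonic clock
    const result = sum(&values);
    const elapsed_ns = timer.read();

    // Printing the result also keeps the compiler from removing the work.
    std.debug.print("sum = {d}, took {d} ns\n", .{ result, elapsed_ns });
}
```

Real benchmarks run the measured code many times and do more with the results; see the sections below.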
Benchmarks allow you to:
- gain insight into how the analyzed element behaves
- compare and contrast the results for competing solutions of a problem
- track measurements as things are changed
- spot problems that cause abrupt changes in the measurements
- collect information in an automated and repeatable manner
- process the results, possibly triggering alerts when encountering unexpected results
- gain confidence that changes in the software have a positive effect
Different Kinds of Benchmarks
There are many different kinds of benchmarks, depending on the goals, from micro benchmarks that test tiny functions to application benchmarks that test specific applications, or even entire systems of multiple computers.
There is a multitude of things that can be analyzed: you can analyze small pieces or many pieces in combination, and you can measure many different things about them.
Writing good benchmarks can be very difficult and has to be done carefully, or else you might take measurements in a way that gives you wrong results. This could lead you into thinking something was an improvement when it wasn't.
Measurements
This is an (incomplete) list of properties you could measure (a small sketch measuring two of them follows after the list):
- time (wall clock, program time, kernel time)
- memory
- clock cycles
- cache misses
- branch misses
- latency
- throughput
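As an illustration of measuring two of these properties, time and memory, here is a sketch. It assumes std.time.Timer and the enable_memory_limit option of std.heap.GeneralPurposeAllocator (which tracks total_requested_bytes) as in Zig ~0.12/0.13; counters such as cache misses or branch misses are not available from the standard library and need tools like perf (see below).

```zig
const std = @import("std");

pub fn main() !void {
    // enable_memory_limit makes the allocator count requested bytes.
    var gpa = std.heap.GeneralPurposeAllocator(.{ .enable_memory_limit = true }){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var timer = try std.time.Timer.start();

    // Placeholder workload: build and fill a list.
    var list = std.ArrayList(u32).init(allocator);
    defer list.deinit();
    for (0..100_000) |i| try list.append(@intCast(i));

    const elapsed_ns = timer.read();
    std.debug.print("time: {d} ns, memory requested: {d} bytes\n", .{
        elapsed_ns, gpa.total_requested_bytes,
    });
}
```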
Accuracy
Some properties can be measured exactly, others only with varying degrees of accuracy. Time in particular can be difficult to measure accurately and may depend on OS/platform or architecture-specific support.
Modern CPUs have a variety of hardware counters that can be used to get accurate counts of specific events. Making use of those requires libraries or tools that help you with that, or diving into instruction set architecture (ISA) manuals and figuring out the specifics and how to use them.
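For example (assuming a typical Linux setup with perf installed, and `./my-benchmark` as a placeholder path to your benchmark binary), running `perf stat -e cycles,instructions,cache-misses,branch-misses ./my-benchmark` prints counts for those hardware events without any changes to the benchmark itself; perf is also listed under the tools below.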
Preparation
Modern computers can be hard to predict and measure: clock speeds vary via CPU frequency boost features and thermal throttling, many different processes run on one computer at the same time and fight over access to the CPU cores, CPU caches get invalidated by other processes, branch predictors make mis-predictions because they haven't adapted to the change in code or data yet, virtual memory mapping and randomized address layouts add variation, and so on.
So part of preparing a good environment for running a benchmark can be making sure the computer it runs on is actually in a state you want to test.
If you want to test how the system for a car performs while it is on fire, then putting a computer in some sort of oven could be a useful test setup. If you want to optimize a game, you probably don't care about the performance while the gaming rig is on literal fire, but you may care about the case where it is close to max temperature and the cooling system is struggling. If you are benchmarking networking speeds, you may want a lot of other programs to use the network at the same time.
To differentiate between these cases, it can be useful to capture information about the system: hardware specs, number of CPU cores, available memory, free disk space, CPU usage, number of running processes, temperature sensors. All of that could provide context that helps in explaining the results we are about to capture.
You could even go so far as to measure some of these things to establish a baseline for the activity of the system (before we have started any testing on it), and maybe even continue to track them while the benchmark is running.
So there can be things that are directly part of the benchmark setup (preparing test data in buffers, pre-warming the CPU cache if you want to test CPU performance for operations on cached data) and others that are more part of the testing-system setup, which may or may not be deliberate.
But ideally we would be able to tell the state of the system and store that information together with the results. Additionally, setting up specific system states could be interesting if we want to see whether they have an influence on the results.
Some of these may require a physical setup, while others can be done in software.
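On the software side, a benchmark can at least record some of this context itself. A minimal sketch (assuming recent Zig std names such as std.Thread.getCpuCount and the builtin module; temperatures, load, and similar would need OS-specific APIs or external tools):

```zig
const std = @import("std");
const builtin = @import("builtin");

pub fn main() !void {
    // Record build and machine context next to the benchmark results.
    std.debug.print("zig {s}, optimize={s}, arch={s}, os={s}, cpu model={s}, cores={d}\n", .{
        builtin.zig_version_string,
        @tagName(builtin.mode),
        @tagName(builtin.cpu.arch),
        @tagName(builtin.os.tag),
        builtin.cpu.model.name,
        try std.Thread.getCpuCount(),
    });
}
```

Recording the optimize mode alone is worthwhile: a Debug-build measurement compared against a ReleaseFast one is meaningless.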
What is important?
You need to decide which parts are important and useful and which aren't.
Which information is useful for finding insights, versus which is just wasted disk space because you will never use or analyze it?
Taking the measurements
Oftentimes benchmarking code will run more than just once, instead doing thousands of iterations for slower operations, or millions or billions for micro benchmarks that test very small, very specific things.
The benefit is that by having many repetitions, sampling something over and over again, we can actually get an idea of the distribution of the possible and expected results of the benchmark. If we only run it once, we can't know whether we just got lucky, but if we have many iterations we can tell whether certain measurements are consistently the same or vary widely.
Sometimes there are also things that are extremely difficult to measure; for those it may be necessary to have many repetitions to make the problem big enough to measure the difference between two solutions.
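A sketch of such a measurement loop (assuming recent Zig std APIs; the workload and iteration count are placeholders):

```zig
const std = @import("std");

fn workload() u64 {
    var total: u64 = 0;
    for (0..10_000) |i| total +%= i *% i;
    return total;
}

pub fn main() !void {
    var samples: [1000]u64 = undefined;
    var timer = try std.time.Timer.start();

    for (&samples) |*sample| {
        timer.reset();
        std.mem.doNotOptimizeAway(workload());
        sample.* = timer.read();
    }

    // A quick look at the spread; proper statistics come later.
    var min: u64 = std.math.maxInt(u64);
    var max: u64 = 0;
    for (samples) |s| {
        min = @min(min, s);
        max = @max(max, s);
    }
    std.debug.print("{d} samples: min={d} ns, max={d} ns\n", .{ samples.len, min, max });
}
```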
To get results which are independent of code and memory layouts, techniques like those used in Stabilizer[1] are needed.
Evaluating the measurements
Once you have many measurements you can apply statistical analysis to them.
Let me repeat that: you can do it (I don't want to, I haven't touched statistics since I finished school; I am only half joking).
This can tell you things like whether the measurements have a high likelihood of being correct, how likely it is that certain measurements happen, whether there are different clusters, whether something is likely a fluke, how many more times you would need to test to be reasonably sure (to what degree of confidence) that it actually was a fluke, and so on.
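As a starting point, even simple summary statistics computed over the collected samples can be informative. A hand-rolled sketch (assuming recent Zig std; for anything rigorous a proper statistical treatment or library is preferable):

```zig
const std = @import("std");

const Summary = struct { mean: f64, std_dev: f64, median: u64 };

// Computes basic summary statistics over timing samples (in nanoseconds).
// Sorts the slice in place to find the median.
fn summarize(samples: []u64) Summary {
    std.mem.sort(u64, samples, {}, std.sort.asc(u64));

    var sum: f64 = 0;
    for (samples) |s| sum += @as(f64, @floatFromInt(s));
    const mean = sum / @as(f64, @floatFromInt(samples.len));

    var variance: f64 = 0;
    for (samples) |s| {
        const d = @as(f64, @floatFromInt(s)) - mean;
        variance += d * d;
    }
    variance /= @as(f64, @floatFromInt(samples.len));

    return .{ .mean = mean, .std_dev = @sqrt(variance), .median = samples[samples.len / 2] };
}

pub fn main() void {
    var samples = [_]u64{ 120, 100, 110, 400, 105, 102, 98, 101 };
    const s = summarize(&samples);
    std.debug.print("mean={d:.1} ns, stddev={d:.1} ns, median={d} ns\n", .{ s.mean, s.std_dev, s.median });
}
```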
Visualizing distributions of results can be very useful, because people are pretty good at seeing trends or abnormalities in data that is represented well visually/spatially; however, this isn't sound statistics [2].
The whole field of data science is useful for coming up with ideas for things you could do to analyze and process the measurements.
Another thing that can be useful is to have a general idea (ballpark estimation) of what results you would expect on your hardware [3].
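For example (with rough, assumed numbers): a benchmark that sums 100 million 8-byte integers has to read roughly 800 MB; at an assumed memory bandwidth of around 20 GB/s that is on the order of 40 ms, so a measured time of several seconds would be a strong hint that something other than memory bandwidth is the bottleneck.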
Presenting results
Presenting the results with diagrams and charts makes it more likely that they can be understood quickly, without having to study the numbers in detail.
It can also make it easier to spot trends or slow long-term changes.
And many people (myself included) prefer looking at charts and pretty pictures over raw tables of data.
Collecting and archiving results
To enable comparisons between different benchmark runs, it is useful to save and archive the results. This allows us to see whether a month of working on the code had a positive impact, or it may show that one small change caused everything to become slower.
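A very simple way to do this (a sketch; the file name and columns are made up for illustration, and the file-writing API assumed here matches Zig ~0.13) is to append one line per run to a CSV file:

```zig
const std = @import("std");

// Appends one result line to a CSV file so runs can be compared later.
fn recordResult(name: []const u8, median_ns: u64) !void {
    const file = try std.fs.cwd().createFile("bench-results.csv", .{ .truncate = false });
    defer file.close();
    try file.seekFromEnd(0); // append to existing results
    try file.writer().print("{d},{s},{d}\n", .{ std.time.timestamp(), name, median_ns });
}

pub fn main() !void {
    try recordResult("sum_u64", 1234);
}
```

Adding a column for the commit hash or version of the code under test makes the archive much more useful for tracking changes over time.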
Program simulation and deterministic programs
If you can write your program in specific ways[4] that make it possible to test and simulate it, that may also help with benchmarking specific situations and connecting specific program executions with the benchmark results (for example, combining fuzz testing with benchmarking and storing the seed for the randomly generated inputs).
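For example (a sketch; std.Random is the name in recent Zig versions, older versions call it std.rand, and the hashing workload is a placeholder), generating the benchmark input from a stored seed makes a run reproducible:

```zig
const std = @import("std");

pub fn main() !void {
    const seed: u64 = 0xdeadbeef; // store this next to the results to reproduce the run
    var prng = std.Random.DefaultPrng.init(seed);
    const random = prng.random();

    var input: [1024]u8 = undefined;
    random.bytes(&input);

    var timer = try std.time.Timer.start();
    const hash = std.hash.Wyhash.hash(0, &input); // placeholder workload
    const elapsed_ns = timer.read();

    std.mem.doNotOptimizeAway(hash);
    std.debug.print("seed={x}, time={d} ns\n", .{ seed, elapsed_ns });
}
```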
Getting practical
This section is about getting practical with benchmarking in Zig specifically.
How to use std.mem.doNotOptimizeAway?
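In optimized builds the compiler may remove work whose result is never used, so a benchmark can end up measuring an empty loop. std.mem.doNotOptimizeAway tells the compiler to treat the given value as used. A minimal sketch (assuming a recent Zig std):

```zig
const std = @import("std");

fn expensive(n: u64) u64 {
    var total: u64 = 0;
    for (0..n) |i| total +%= i *% i;
    return total;
}

pub fn main() !void {
    var timer = try std.time.Timer.start();

    // Without this, a release build may drop the whole computation, since the
    // result is otherwise unused, and the benchmark would measure nothing.
    std.mem.doNotOptimizeAway(expensive(1_000_000));

    std.debug.print("took {d} ns\n", .{timer.read()});
}
```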
Examples
possibly linking to other topics, blogs, examples or videos
A very simple zig example
- a very simple example TODO
[what is good about it / what could be improved, rough description]
An example using the package manager and a library / extra tooling
- a more complex example TODO (possibly using a benchmarking library)
[rough description / notes about that, with links to details if necessary / available]
Comparison of existing benchmarking tools and libraries
TODO table of features?
Should the comparison be its own doc?
Projects:
- that might provide insights / could be studied in more detail
- could be used for benchmarking / as libraries
- should be compared for what features they provide
Zig Projects:
- GitHub - hendriknielaender/zBench: ⚡ zig benchmark
- integrates with Zig's progress API
- GitHub - andrewrk/poop: Performance Optimizer Observation Platform
- uses perf
- GitHub - karlseguin/benchmark.zig: Simple Benchmarking for Zig
- GitHub - kprotty/zig-adaptive-lock: Benchmarking a faster std.Mutex implementation for Zig
- seems to have some benchmarking code
- GitHub - Hejsil/zig-bench: Simple benchmarking library
- GitHub - dweiller/zubench: Micro benchmarking in zig
- GitHub - ziglang/gotta-go-fast: Performance Tracking for Zig
- takes multiple different measurements
- seems bitrotted?
- does this need volunteers to bring it up to date with current Zig?
- has pretty diagrams: Performance Tracking ⚡ Zig Programming Language
- can the community help in getting this revived?
- is there something making this difficult?
Other tools:
- perf
- hotspot (profiling using perf)
- ztracy (zig example for sampling and tracing profiler tracy)
- Stabilizer[1]
Wanted features
(which are missing from the comparison of tools)
(maybe it makes sense to do the feature comparison of tools first)
Collaborating on tools?
- links to topics / githubs / discussions
Research
This is about more theoretical things: research papers about implementing profilers, benchmarking, etc.
Problems
Interesting research / other related work
Do you know of good papers / blogs or other learning material related to this?
Further reading and references
[1] Performance Matters by Emery Berger: https://www.youtube.com/watch?v=r-TLSBdHe1A (see also: Stabilizer | Emery Berger)
[2] Sound Statistics: https://youtu.be/r-TLSBdHe1A?t=1026
[3] Napkin math: being able to do rough napkin math to figure out whether something should run faster/better, or whether it is somewhat close to the physical limits. See also: Simple Performance Counters, Handmade Hero Episode Guide, + next (TODO other links?)
[4] TigerStyle! (Or How To Design Safer Systems in Less Time) by Joran Dirk Greef: https://www.youtube.com/watch?v=w3WYdYyjek4