Benchmarking

What is a benchmark?

A benchmark is an automated way to methodically collect measurements, with the results possibly being processed further in more or less automated ways.

Benchmarking vs Profiling

The basic distinction between benchmarking and profiling is that benchmarking measures how: how fast a routine is, how much memory it allocates. Profiling is concerned with where: how much time within a process is spent in which functions and on which lines, and the same for allocations, syscalls, and other resources.

Benchmarking and profiling overlap from the implementation perspective, because they both need to take measurements and eventually present results in a useful way. Broadly speaking, benchmarks will collect single metrics, such as best, median, and mean performance, whereas profilers will build up a table of data, useful to graph the use of resources in the program in a fine-grained way, such as with a flame graph.

Typically, benchmarks are more focused on being run in an automated manner, for current and future versions of the software, while profilers are more focused on identifying specific areas of the code that are 'hot' and will benefit the most from optimizations.

Benchmarks are generally implemented as test programs with timing mechanisms, which can be run in an automated manner to collect performance data. This makes them repeatable and also allows tracking changes over time, for example as the software is patched or changed. A detected negative change in the performance of a benchmark is called a regression; when this check is automated as part of a testing suite, it is regression testing.

Benchmarks allow you to:

  • gain insight into how the analyzed element behaves
  • compare and contrast the results for competing solutions of a problem
  • track measurements as things are changed
  • spot problems that cause abrupt changes in the measurements
  • collect information in an automated and repeatable manner
  • process the results, possibly triggering or alerting, when encountering unexpected results
  • gain confidence that changes in the software have a positive effect

Different Kinds of Benchmarks

There are many different kinds of benchmarks, depending on the goals, from micro benchmarks that test tiny functions to application benchmarks that test specific applications, or even entire systems of multiple computers.

There is a multitude of things that can be analyzed: you can analyze small pieces, or many pieces in combination, and you can measure many different properties of them.

Writing good benchmarks can be very difficult and has to be done carefully; otherwise you might take measurements in a way that gives you wrong results, which could lead you to think something was an improvement when it wasn't.

Measurements

This is an (incomplete) list of properties you could measure (a small wall-clock timing sketch in Zig follows the list):

  • time (wall clock, program time, kernel time)
  • memory
  • clock cycles
  • cache misses
  • branch misses
  • latency
  • throughput
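
As a minimal sketch of the first item in the list, wall-clock time can be measured in Zig with std.time.Timer (a monotonic timer); the workload function here is just a placeholder:

```zig
const std = @import("std");

// Placeholder workload; in a real benchmark this would be the code under test.
fn workload() u64 {
    var sum: u64 = 0;
    var i: u64 = 0;
    while (i < 1_000_000) : (i += 1) sum +%= i;
    return sum;
}

pub fn main() !void {
    var timer = try std.time.Timer.start(); // monotonic clock; start() fails if unsupported
    const result = workload();
    const elapsed_ns = timer.read(); // nanoseconds elapsed since start()
    // Printing the result also ensures the compiler cannot discard the work.
    std.debug.print("result={d}, took {d} ns\n", .{ result, elapsed_ns });
}
```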

Accuracy

Some properties can be measured exactly, others only with varying degrees of accuracy. Time in particular can be difficult to measure accurately and may depend on OS, platform, or architecture-specific support.

Modern CPUs have a variety of hardware counters that can be used to get accurate counts of specific events. Making use of them requires libraries or tools that help you with that (for example perf on Linux), or diving into instruction set architecture (ISA) manuals and figuring out the specifics of how to use them.

Preparation

Modern computers can be hard to predict and measure: clock speeds vary via CPU frequency boost features and thermal throttling; many different processes run on one computer at the same time, fighting over access to the CPU cores; CPU caches get invalidated by other processes; branch predictors make mis-predictions because they haven't adapted to the change in code or data yet; virtual memory mapping and randomized address layouts add further variation; and so on.

So part of preparing a good environment for running a benchmark can be making sure the computer it runs on is actually in the state you want to test.

If you want to test how the computer system of a car performs while the car is on fire, then putting a computer in some sort of oven could be a useful test setup. If you want to optimize a game, you probably don't care about performance while the gaming rig is on literal fire, but you may care about the case where it is close to maximum temperature and the cooling system is struggling. If you are benchmarking network speeds, you may want a lot of other programs to use the network at the same time.

To differentiate between these cases, it can be useful to capture information about the system: hardware specs, number of CPU cores, available memory, free disk space, CPU usage, number of running processes, temperature sensor readings. All of that provides context that can help explain the results we are about to capture.
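
As a small, hedged sketch, part of this context can be captured from inside a Zig program; the function name is made up, and things like temperatures or per-process data usually need OS-specific tools:

```zig
const std = @import("std");
const builtin = @import("builtin");

// Print a few pieces of machine context to store next to benchmark results.
// What exactly is worth recording depends on the benchmark.
pub fn printContext() !void {
    const cpu_count = try std.Thread.getCpuCount();
    std.debug.print("os: {s}, arch: {s}, logical cpus: {d}\n", .{
        @tagName(builtin.os.tag),
        @tagName(builtin.cpu.arch),
        cpu_count,
    });
}
```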

You could even go so far as to measure some of these things to establish a baseline for the activity of the system (before we have started to do any testing on it) and maybe even continue to track them while the benchmark is running.

So there can be things that are directly part of the benchmark setup (preparing test data in buffers, pre-warming the CPU cache if you want to test CPU performance for operations on cached data), and others that are more a part of the testing-system setup, which may or may not be deliberate.

But ideally we would be able to tell the state of the system and store that information together with the results. Additionally, setting up specific system states could be interesting if we want to see whether that has an influence on the results.

Some of these setups may require physical arrangements, while others can be done in software.

What is important?

You need to decide what is important and useful and what isn't: which information helps you find insights, versus what is just wasted disk space because you will never use or analyze it.

Taking the measurements

Often, benchmarking code will run more than just once, instead making thousands of runs for slower operations, or millions or billions for micro benchmarks that test very small, very specific things.

The benefit is that by having many repetitions, sampling something over and over again, we can actually get an idea of the distribution of the possible and expected results of the benchmark. If we only run it once, we can't know whether we just got lucky, but if we have many iterations we can tell whether certain measurements are consistently the same or vary widely.

Sometimes there are also things that are extremely difficult to measure; for those it may be necessary to have many repetitions to make the effect big enough to be able to measure the difference between two different solutions.
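
A rough sketch of such a repetition loop in Zig, collecting one timing sample per run into an array (the sample count and workload are arbitrary placeholders):

```zig
const std = @import("std");

// The code under test; placeholder.
fn workload() u64 {
    var sum: u64 = 0;
    var i: u64 = 0;
    while (i < 100_000) : (i += 1) sum +%= i *% i;
    return sum;
}

pub fn main() !void {
    var samples: [1000]u64 = undefined; // one timing sample per run
    var timer = try std.time.Timer.start();

    for (&samples) |*sample| {
        timer.reset();
        const result = workload();
        sample.* = timer.read(); // elapsed nanoseconds for this run
        std.mem.doNotOptimizeAway(result); // keep the work from being optimized out
    }

    // `samples` now holds the measured distribution.
    std.debug.print("collected {d} samples\n", .{samples.len});
}
```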

To get results which are independent of code and memory layouts, techniques like those used in Stabilizer[1] are needed.

Evaluating the measurements

Once you have many measurements, you can apply statistical analysis to them.
Let me repeat that: you can do it. (I don't want to; I haven't touched statistics since I finished school. I am only half joking šŸ˜›)

This can tell you things like whether the measurements have a high likelihood of being correct, how likely it is that certain measurements happen, whether there are different clusters, whether something is likely a fluke, how many more times you would need to test to be reasonably sure (to what degree of confidence) that it actually was a fluke, etc.
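
As a hedged sketch, even very basic summary statistics over the collected samples already tell you a lot; here the samples are sorted to pick out the median (std.mem.sort and std.sort.asc are the names in recent Zig versions and may differ slightly in older ones):

```zig
const std = @import("std");

// Compute a few summary statistics over collected timing samples (in ns).
pub fn summarize(samples: []u64) void {
    std.debug.assert(samples.len > 0);
    std.mem.sort(u64, samples, {}, std.sort.asc(u64));

    var total: u64 = 0;
    for (samples) |s| total += s;

    const best = samples[0];
    const median = samples[samples.len / 2];
    const mean = total / samples.len;

    std.debug.print("best: {d} ns, median: {d} ns, mean: {d} ns, worst: {d} ns\n", .{
        best, median, mean, samples[samples.len - 1],
    });
}
```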

Visualizing distributions of results can be very useful, because people are pretty good at seeing trends or abnormalities in data that is represented well visually / spatially; however, this isn't sound statistics [2].

The whole field of data science is useful for coming up with ideas for ways to analyze and process the measurements.

Another thing that can be useful is to have a general idea (a ballpark estimate) of what results you would expect on your hardware [3].

Presenting results

Presenting the results with diagrams and charts makes it more likely that they can be understood by people quickly, without having to study the numbers in detail.

It can also make it easier to spot trends or slow long-term changes.

And many people (myself included) prefer looking at charts and pretty pictures over raw tables of data.

Collecting and archiving results

To enable comparisons between different benchmark runs, it is useful to save and archive the results. This allows us to see whether a month of working on the code had a positive impact, or it may show that one small change caused everything to become slower.
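
A minimal sketch of such archiving in Zig, appending one CSV row per run so runs can be compared later (the file name and columns are just an example):

```zig
const std = @import("std");

// Append one result row to a CSV file so benchmark runs can be compared over time.
// Columns here (unix timestamp, benchmark name, mean ns, best ns) are only an example.
pub fn appendResult(name: []const u8, mean_ns: u64, best_ns: u64) !void {
    const file = try std.fs.cwd().createFile("bench-results.csv", .{
        .truncate = false, // keep rows from previous runs
    });
    defer file.close();
    try file.seekFromEnd(0); // append after any existing content

    var buf: [256]u8 = undefined;
    const line = try std.fmt.bufPrint(&buf, "{d},{s},{d},{d}\n", .{
        std.time.timestamp(), name, mean_ns, best_ns,
    });
    try file.writeAll(line);
}
```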

Program simulation and deterministic programs

If you can write your program in specific ways[4] that make it possible to test and simulate it, that may also help with benchmarking specific situations and with connecting specific program executions to the benchmark results (for example, combining fuzz testing with benchmarking and storing the seed for the randomly generated inputs).
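
A small sketch of the seed idea in Zig: derive the benchmark input deterministically from a recorded seed, so the exact run can be reproduced later (std.Random.DefaultPrng is the name in recent Zig versions; older versions used std.rand.DefaultPrng):

```zig
const std = @import("std");

// Fill the benchmark input deterministically from a seed. Storing the seed
// together with the results makes the run reproducible.
pub fn fillInput(buf: []u8, seed: u64) void {
    var prng = std.Random.DefaultPrng.init(seed);
    prng.random().bytes(buf);
}
```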

Getting practical

This section is about getting practical with benchmarking in Zig specifically.

How to use std.mem.doNotOptimizeAway?
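
A common pitfall is that the optimizer deletes the code you wanted to measure because its result is never used; std.mem.doNotOptimizeAway marks a value as used so that cannot happen. A minimal sketch (the workload is a placeholder):

```zig
const std = @import("std");

fn expensiveComputation(n: u64) u64 {
    var acc: u64 = 0;
    var i: u64 = 0;
    while (i < n) : (i += 1) acc +%= i *% i;
    return acc;
}

pub fn main() !void {
    var timer = try std.time.Timer.start();

    const result = expensiveComputation(10_000_000);
    // Without this, the result is unused and the optimizer is free to delete
    // the whole computation, leaving you measuring nothing.
    std.mem.doNotOptimizeAway(result);

    std.debug.print("took {d} ns\n", .{timer.read()});
}
```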

Examples

possibly linking to other topics, blogs, examples or videos

A very simple zig example

  • a very simple example TODO
    [what is good about it / what could be improved, rough description]

An example using the package manager and a library / extra tooling

  • a more complex example TODO (possibly using a benchmarking library)
    [rough description / notes about that, with links to details if necessary / available]

Comparison of existing benchmarking tools and libraries

TODO table of features?
Should the comparison be its own doc?

Projects:

  • that might provide insights / could be studied in more detail
  • could be used for benchmarking / as libraries
  • should be compared for what features they provide

Zig Projects:

Other tools:

  • perf
  • hotspot (profiling using perf)
  • ztracy (Zig bindings for the sampling and tracing profiler Tracy)
  • Stabilizer[1]

Wanted features

(which are missing from the comparison of tools)
(maybe it makes sense to do the feature comparison of tools first)

Collaborating on tools?

  • links to topics / githubs / discussions

Research

This is about more theoretical things: research papers about implementing profilers, benchmarking, and so on.

Problems

Interesting research / other related work

Do you know of good papers, blogs, or other learning material related to this?

Further reading and references

[1] Performance Matters by Emery Berger
[2] Sound Statistics
[3] Napkin math
[4] TigerStyle!


  1. Performance Matters by Emery Berger
    https://www.youtube.com/watch?v=r-TLSBdHe1A
    Stabilizer (Emery Berger)

  2. Sound Statistics
    https://youtu.be/r-TLSBdHe1A?t=1026

  3. Being able to do rough napkin math, to figure out whether something should run faster / better, or whether it is somewhat close to the physical limits.
    Simple Performance Counters, Handmade Hero Episode Guide (TODO other links?)

  4. TigerStyle! (Or How To Design Safer Systems in Less Time) by Joran Dirk Greef
    https://www.youtube.com/watch?v=w3WYdYyjek4


What a great start! While skimming through the article I collected some topics to suggest for addition, only to realize they all are already contained.

Thanks for repeating the references at the end. Bookmarked for re-read.


@mnemnion I am unsure if I agree with the claim of single metric; I can imagine benchmarks that collect many different metrics at the same time.

When you look at Performance Tracking āš” Zig Programming Language, these benchmarks have many metrics.
Just because many benchmarks only consider a single metric doesn't seem like a reason to me to see this as a defining feature.

Rather, it makes me question why it isn't standard to capture more metrics in parallel, to potentially get a better view into what was really going on while the benchmark was running.
For example, it would be good if you could tell that your thread was sent to sleep by the OS, or stuff like that.

With the wall time and user time, you at least can get an indication of that by looking at the difference between the two.

I do think your description of how vs. where makes sense for how these have a different focus.

There might be a better way to phrase that; feel free to rephrase. I do think that the distinction I was drawing out there is fundamental to the difference between a benchmark and a profile, though.

They do! And there are a lot of individual benchmarks as well. I also mentioned mean, median, and best time; let's note that it says "single metrics", plural, not single metric, singular. I considered the word scalar, but figured it was overly technical for this kind of introduction.

Each of the metrics in that link is a single number. When you run a benchmark, you get one best time, one median, one mean, one envelope (say, 3x difference between best and worst), and so on. There might be hundreds of metrics, but it's one measure per metric that gets reported. Each benchmark gets run a bunch of times, but the purpose of that is to reduce random sources of variance. Many columns, one row.

A profile, on the other hand, collects samples, as many as it can, so rather than one number, it's on the order of one number per line, and per situation that line is executed in (so taking the rest of the call stack into account). What you end up with is data which can't be evaluated on its own; you need to graph it to use it. Each benchmark value is a meaningful number referring to a single benchmark, even if you're tracking a lot of numbers. One or several columns, many rows, different possible joins on the data.

Perhaps something like "for each aspect being measured by a benchmark, the benchmark will generally deliver a single measure of that aspect, after being run enough times to average out variance due to factors (cache warmups, branch prediction, various interactions with the host system) which aren't inherent to the code being measured".

A post was merged into an existing topic: Research about Benchmarking

I have decided to put my more brain-stormy comments into a separate topic, so that this topic can be about things that we already know concretely.

I think it avoids confusion if I separate my interest in what maybe could be done, from the more practical things documented here.


Nice idea to collect this information!

It may be worth mentioning hyperfine, a command-line benchmarking tool.


Interesting, looks like they are not bothering to query system timers but use rusage on Linux instead (src). Seems to call into OS functionality on Windows as well (src).

Sorry, but how it works is beyond my programming knowledge.
All I can say about the program is that:

  • it is very easy to use
  • can compare several programs with each other in a single run
  • and requires no intervention in the source code

Not as a disadvantage, but it measures an entire executable at a time, so the code being run requires specific customization if, for example, you only want to measure a part of an entire program.

For a quick benchmarking solution and an overview of a program, it can be very useful.