Just write the function! - alanza.xyz (measuring the performance impact of indirection)

alanza · December 9, 2024, 5:46pm

Inspired by the map idiom of Advent of Code (there are no spoilers in the linked post, though), I got curious about the performance impact of having the memory layout of my data match my human understanding of it. As I suspected, keeping the map data contiguous as a byte-slice and manually computing the coordinates corresponding to an offset is measurably faster than creating a 2-dimensional memory structure!

Calder-Ty · December 9, 2024, 6:24pm

Nice writeup. It would be interesting to run the comparison using poop, as it will report information on cache-misses, which would help to underscore the point.

alanza · December 9, 2024, 6:54pm

i agree! unfortunately poop is linux-only, and i am not inclined to reinstall linux on either of my machines at the moment

Calder-Ty · December 9, 2024, 7:17pm

Fair, I didn’t think about that

andrewrk · December 9, 2024, 7:52pm

I don’t think there’s a difference between a 2D map and a 1D map where you calculate your own coordinates. It’s the same thing to the computer.

I suspect you’re running into Pass by reference "optimization" copies the entire struct on the stack when taking its address. (this didn't happen in stage1) · Issue #16343 · ziglang/zig · GitHub which I do apologize for and hope to have fixed by the next release.

matklad · December 9, 2024, 9:01pm

I don’t think this is true for this example? [n][m]u8 is the same as [n*m]u8, but this example compares []u8 and [][]u8, which are different.

alanza · December 9, 2024, 9:05pm

that’s right—I deliberately wanted to compare []u8 to [][]u8, so I carefully created separate allocations for each line.

andrewrk · December 9, 2024, 10:54pm

Ah, my mistake! I apologize for not reading carefully enough - especially since I went ahead and made a comment.

mnemnion · December 10, 2024, 3:45pm

This is totally worth comparing, and it was a good writeup. Slices are a pointer and a length, so using [][]u8 means that getting that second coordinate involves finding the actual memory holding the line, and that indirection will cost you. I would expect to see calculating the single offsets directly to be more expensive, along with the costs of allocating and freeing striped copies of memory the program already had. It adds up!

I wanted to point out that, when the data is comptime-known, there’s a third option: a multidimensional array. In Zig, these are row-major, and look like [m][n]u8: The first dimension m is the number of rows, and n is the number of columns in each row. Row major works like a page of text, so that’s lucky for us in this case.

It’s possible to count the rows by counting newlines (all of this assumes a final newline at the end) and the columns (needs to include the newline) will be the length of the first line (with the newline!).

What’s cool about this is that you can @bitCast the string directly into the multidimensional array with no copying. Then you can use ordinary array[row][column] indexing to reach coordinates, and the compiler can help more because it has more opportunities to turn those indices into constants. When it needs a dynamic index, it will find it using calculations effectively identical to the one your post shows.

This is of course, less general and more fiddly, and only works when the dimensions are known at comptime. The approach you’ve sketched out can handle runtime data as well, as long as the shape is correct. But I thought it was worth mentioning: shape-casting embedded data is an important optimization when doing big grid-shaped calculations, most languages where this is common use column major, but other than the order of the indices the principle is the same.

alanza · December 10, 2024, 6:41pm

that’s what i attempted to measure in the blog post if you read it, you’ll see that i tried to account for the allocation differences by performing the memory access many many times. i found that contrary to your expectation, computing the single index from the double indices was faster

geemili · December 10, 2024, 7:33pm

When I was doing Advent of Code I noticed the same thing and created a Grid abstraction. It allowed for a 2 dimensional grid of arbitrary data type, and also had a separate stride, which allowed “slicing” into the grid:

const full_grid = ConstGrid(u8){
    .data =
        \\123
        \\456
        \\789
        \\
    ,
    .stride = 4, // don't forget about the newlines!
    .size = .{ 4, 3 },
};

const actual_data_grid = ConstGrid(u8){
    .data = full_grid.data,
    .stride = 4, // stride remains the same
    .size = .{ 3, 3 }, // x is one smaller to crop out the newlines
};

// This one slices to the 2x2 region at the bottom right:
//
// ```
// 56
// 89
// ```
const y = 1;
const x = 1;
const small_region_of_grid = ConstGrid(u8){
    // the raw data (without taking `size` into account) will be:
    //
    // ```zig
    // const data = .{
    //     '5', '6', '\n',
    //     '7', '8', '9', '\n',
    // };
    // ```
    //
    // Which is to say, don't try to iterate over `data` without
    // using `size` and `stride`!
    .data = full_grid.data[y * full_grid.stride + x..],
    .stride = 4, // we still need to move 4 elements to get between rows
    .size = .{ 2, 2 },
};

mnemnion · December 10, 2024, 9:07pm

I definitely did! That sentence is just poorly constructed. It should have ended up like this:

That makes it coherent with everything else I said, rather than, as you astutely observed, self-contradictory.