Any thoughts on separating size & stride?

lunacookies · October 15, 2024, 7:52am

For those who aren’t aware, Swift doesn’t round the sizes of types up to a multiple of the alignment, instead allowing other data to occupy that space. You still need to include the trailing padding when stepping between elements in an array, and so Swift also has the idea of “stride”, which is what most languages call size.

An example in case I’m not making sense

Maybe it’d be clearer with an example:

const Thing = struct {
    big: u16,
    small: u8,
};

With how Zig currently works, this is laid out in memory as

offset field
====== =====
     0 big
     1 big
     2 small
     3 _padding

@sizeOf(Thing) is 4 bytes and @alignOf(Thing) is 2 bytes. If you have an array of them your memory is laid out like this:

+---+---+---+---+---+---+---+---+---+---+---+---+
| B | B | S | _ | B | B | S | _ | B | B | S | _ |
+---+---+---+---+---+---+---+---+---+---+---+---+

This is necessary so each Thing instance has an alignment of 2.

Now, imagine we have this:

const BigThing = struct {
    t: Thing,
    extra: u8,
};

It is laid out in memory as

offset field
====== =====
     0 t.big
     1 t.big
     2 t.small
     3 t._padding
     4 extra
     5 _padding

We end up with extra padding at the end of the t field, just so that if we have an array of Thing each instance is aligned – but we don’t have an array of Things here!

Swift fixes this by making @sizeOf(Thing) equal to 3 and @strideOf(Thing) equal to 4, i.e. the size doesn’t include the padding at the end. This is the resulting memory layout of BigThing:

offset field
====== =====
     0 t.big
     1 t.big
     2 t.small
     3 extra

@sizeOf(BigThing) and @strideOf(BigThing) are equal to 4.

Have people working on Zig considered this approach? Intuitively it feels like the sort of thing that Zig would do.

As far as I can tell this is a pure win in terms of cache utilization, memory consumption, etc, with the only downside being unfamiliarity. I suspect that this is the sort of thing that’s easy to learn once and not forget, though.

I’d love to hear everyone’s thoughts!

pierrelgol · October 15, 2024, 8:01am

The idea is interesting, but I’m not convinced it would fit with the way things are done in Zig, it seems like extra complexity, and I’m unsure whether it’s worth it, compared to what Zig tries to push for, which is DOD, the “Zig” way of doing this would probably be to have multiple arrays of each fields (SOA), which would be laid out in memory as such

+---+---+---+---+---+---+
| B | B | B | B | B | B | 
+---+---+---+---+---+---+
+---+---+---+---+---+---+
| S | S | S | S | S | S | 
+---+---+---+---+---+---+

This kind of solves the same problem, but in a different, way you have to be explicit about it and thoughtful. I wouldn’t be against a system that does what you suggest, and I don’t think it’s incompatible with the Zig philosophy, but I’m also not convinced that DOD/SOA isn’t an already good answer to that issue.

lunacookies · October 15, 2024, 8:05am

I suppose you could use that same line of reasoning to argue against automatic field reordering: why add the extra complexity of not using C-compatible struct layout when really you should reduce padding through SoA?

IMO SoA isn’t a catch-all answer here, since it’s often the case that you access all the fields of a struct together, a situation where you ideally wouldn’t use SoA. These sorts of cases are where automatic field reordering and potentially size/stride come in handy.

Your point about increased language complexity is taken, though, that’s irrefutable.

pierrelgol · October 15, 2024, 8:21am

Yes fair enough, and I agree that’s a fair argument for it, since field reordering is already a thing, I’m just not sure on whether this would work well on all hardware, because at the root of all this, the reason why padding is inserted, if because some hardware might not be able to load something if it’s not aligned properly. I think Zig wants to be an alternative to C, and I don’t think field reordering is an issue for that ends, but I can see this kind of system introducing some additional level of “Debugging your language knowledge” instead of your own program. If on platform X your code has weird performance issues, or straight up doesn’t work, or maybe you programmed some stuff around the assumption that this would be available. I’m not saying this isnt feasible and that there is not solution to that problem, I’m just curious on how that would be handled, if implemented. Do you have insights on how Swift does that ?

lunacookies · October 15, 2024, 8:30am

Sorry, I think there’s been some sort of misunderstanding. Swift’s approach still keeps everything aligned, it’s just that padding at the very tail end of a struct is only included if you have many instances of that type in a sequence. Outside of that, other data can inhabit the space where the trailing padding would have been.

dimdin · October 15, 2024, 8:39am

I cannot find any relative discussion about using padding when nesting.
I think this deserves a proposal ticket on github even if it is implemented after zig 1.0

pierrelgol · October 15, 2024, 9:20am

Oh My bad, yes I’ve actually misread your little graphic, I thought it was reordering here:

But now I see, I think this is indeed a very good idea, using the extra padding, to store the extra in a nested case is really good, sorry for going off topic, with DOD. Thanks for bringing this up

mnemnion · October 15, 2024, 7:16pm

It’s possible to get this result in Zig as well:

const Thing = struct {
    big: u16 align(1),
    small: u8,
};

const BigThing = struct {
    t: Thing,
    extra: u8,
};

test "@sizeOf BigThing" {
    try expectEqual(4, @sizeOf(BigThing));
}

This is Zig favoring the explicit over the easy.

That said, I actually agree with you that handling size and stride separately would be good for the language. A stride of 4 for BigThing is actually over-aligned for the largest member by a factor of 2. But then again, Zig’s struct layout will align the u16 to its natural alignment whenever it can:


test "@offsetOf (Big)Thing" {
    // For my system this is 0:
    std.debug.print("{d}\n", .{@offsetOf(BigThing, "t")});
    // This is also 0:
    std.debug.print("{d}\n", .{@offsetOf(Thing, "big")});
}

So in this case everything ends up as one might like (on aarch64 at least).

chung-leong · October 16, 2024, 10:23am

I don’t really need any benefit over using packed struct.

const std = @import("std");

const Thing = packed struct {
    big: u16,
    small: u8,
};

const BigThing = packed struct {
    thing: Thing,
    extra: u8,
};

pub fn main() void {
    std.debug.print("{d}\n", .{@sizeOf(Thing)});
    std.debug.print("{d}\n", .{@sizeOf(BigThing)});
}

4
4

It’s slightly annoying that we have to remember sometimes to use @bitSizeOf(T) / 8, but such situations are rare.

geemili · October 16, 2024, 2:24pm

The concept of stride would be useful when dealing with multidimensional arrays. If you have a pointer, width, height, and stride, you can select a rectangular region.

Of course such a stride would need to be runtime known, so its not quite the same thing.

mnemnion · October 16, 2024, 2:53pm

The main benefit of not using packed struct is that there are a bunch of types which you can’t put in a packed struct. A Zig-normal struct using align(1) can take anything as a field and still get the optimal size distinction.

There’s another reason though, which may or may not be what you want: stride.

const Thing = struct {
    big: u16 align(1),
    small: u8,
};

test "stride of Thing" {
    try expectEqual(3, @sizeOf(Thing));
}

A packed struct has to round up to the nearest natural integer type, an align(1) struct does not. On modern host-OS-grade chips, the cost of unaligned access is minimal to nothing, so saving that extra byte in a slice/array might add up. On other chips you might want the padding to guarantee aligned access though.

All of this is why I basically agree that Zig should separate size and stride: without the align qualifier, the Thing struct should have @sizeOf 3 and @strideOf 4. If you make it into a field of another struct it should only take three bytes, an array should take four each. We would still have explicit alignment to get a struct which has stride and size of 3. That would help with implementing #19917, the first issue I filed on the Zig tracker actually.

IMHO the missing trick here is a struct type where a) all Zig data is allowed and b) the user chooses everything about the layout. There are some occasions where this is pretty important, and existing solutions aren’t entirely satisfactory. There’s an issue tracking that, and I suspect it’s something we’ll get before 1.0, because getting e.g. a guarantee that certain specific data will fit in the first cache line is a necessary condition for writing optimal software. When you need it, you need it.

buzmeg · October 17, 2024, 12:44am

“Packed size” is only a proxy metric on modern CPUs.

The proper metric is “access pattern efficiency”. The goal is to match the structure to the access pattern such that the access pattern pipelines the cache and execution units correctly.

And the best way to do that would be to allow the compiler maximum freedom to do AoS/SoA transformations.

For example, if you have an array of struct:

[] struct {
  vertex: @Vector(3, f32),
  color: @Vector(4, f32),
};

You may very well want to transform that to

struct {
  vertex: @Vector(3, [] f32),
  color: []@Vector(4, f32),
};

(Note that the array was propagated differently to the two terms)

Vertex may be something that you process all the X coordinates first while color may be something that you always set all 4 fields on. Even if you don’t touch color (especially actually!) you don’t want color polluting the cache while you are iterating over a bunch of vertices. There are lots of reasons why you want those to be able to be rearranged in various ways.

The overall point is that it would be nice to write your code as the original array of struct and yet tell the compiler “Please rearrange the storage but don’t make me rewrite my executing code.” Note that I’m not really advocating for the compiler to be “magical” about this as the compiler will likely get things wrong as often as it gets them right. I just want annotations that let me specify the rearrangement on the data declaration.

A lot of the SIMD programming stuff pisses and moans about how often you have to rearrange []@Vector(4, f32) into @Vector(4, []f32) in order to get good processing speedups. Being able to do that with a language directive on the data declaration without having to rewrite execution code would likely be a much bigger win than saving a couple of bytes on a struct.

And, please do note, a struct of arrays is almost always better packed than an array of structs if the struct has any holes.

jamii · October 17, 2024, 11:44pm

Not consistently though - your example happened to work, but if we tweak the types a bit I get:

const Thing = struct {
    small: u8,
    big: u16 align(1), // offset 1
};

const BigThing = struct {
    extra: u16,
    t: Thing, // offset 2
};

Swift will exploit the extra packing where it can, but will also preserve natural alignment.

jamii · October 17, 2024, 11:52pm

swift/docs/ABI/TypeLayout.rst at main · swiftlang/swift · GitHub documents the layout rules for swift.

mnemnion · October 18, 2024, 1:37am

Begging your pardon, but my example did not simply happen to work. It worked. That was on purpose.

const Thing2 = struct {
    small: u8 align(1),
    big: u16 align(1),
};

const BigThing2 = struct {
    extra: u16 align(1),
    t: Thing2,
};

test "stride of Thing2" {
    try expectEqual(3, @sizeOf(Thing2));
}

test "@sizeOf BigThing2" {
    try expectEqual(5, @sizeOf(BigThing2));
}

The effect of the align keyword is a bit underdocumented, true.

lunacookies · October 18, 2024, 2:54am

Won’t your implementation here result in Thing and BigThing having an alignment of 1 instead of 2?

Sorry, I think you might be misunderstanding what “stride” means (or maybe I’m misunderstanding you!). Stride is the distance in memory you need to skip to iterate through an array. So, with Swift-style memory layout the stride of BigThing is the size (@sizeOf(Thing) + @sizeOf(u8) == 4) rounded up to the alignment (2), i.e. still 4. I don’t think this is over-aligned?

mnemnion · October 18, 2024, 3:18am

No I do understand what stride means, but I suppose I didn’t make myself clear.

The largest single field of BigThing has an alignment of two, not four. The word “member” was ambiguous there, I was referring to the u16. So by using an alignment directive, you can fit Thing into three bytes. The u16 field is still naturally aligned, because four is twice two, but since we gave it align(1) Zig will align it on an odd offset if it has to.

Zig doesn’t separate size and stride (it’s worth considering), but if you want a struct with no wasted space, you can have that. Or you can take it as it naturally comes out, and have alignment of four in an array.

But you can’t have it both ways, because the size (bytes taken up by one) and the stride (distance between two in an array) are the same number.

I’m not convinced that the under-alignment of a stride three for an array of Thing is much of a performance issue, if any, for modern architectures, and it does save space.

The rule is that layout is undefined, and all fields, unless otherwise directed, must carry their natural alignment.

It may or many not be worth the additional conceptual complexity of treating those two values separately. But the ability to use alignment directives to always get a minimal pack makes the issue less important, at least to me.

Does that make more sense?

lunacookies · October 18, 2024, 4:16am

It does, thank you :)

I’ve been thinking about this a fair bit recently too. In my testing even unaligned SIMD accesses have zero performance impact (at least on the Firestorm µarch), so removing padding entirely when targeting non-embedded platforms might maybe possibly be plausible? Triggering UB when passing unaligned pointers to C code seems bad, though I doubt that C compilers exploit the fact that unaligned accesses are UB with any frequency, if at all.

On the other hand, both Intel’s and Apple’s CPU optimization guides specifically recommend keeping everything aligned, so ¯\_(ツ)_/¯