Confused by unions and enum

eddie3716 · December 29, 2023, 4:23pm

I’m working through ziglings and stumbled on this example:

const TripItem = union(enum) {
    place: *const Place,
    path: *const Path,

    // This is a little helper function to print the two different
    // types of item correctly.
    fn printMe(self: TripItem) void {
        switch (self) {
            // Oops! The hermit forgot how to capture the union values
            // in a switch statement. Please capture both values as
            // 'p' so the print statements work!
            .place => |p| print("{s}", .{p.name}),
            .path => |p| print("--{}->", .{p.dist}),
        }
    }
};

This code works, but I’m confused by two things.

1. When would you ever NOT declare a union(enum)? It seems like you would always want it, for the convenience of having the tag if you need it later. What use case is there for not having the tag set automatically?
2. In previous ziglings exercises, I had to explicitly declare an enum type and then assign that enum to the union. Following that convention, the code above would look like this:

const TripEnum = enum { place, path };
const TripItem = union(TripEnum) {
    place: *const Place,
    path: *const Path,

    // This is a little helper function to print the two different
    // types of item correctly.
    fn printMe(self: TripItem) void {
        switch (self) {
            // Oops! The hermit forgot how to capture the union values
            // in a switch statement. Please capture both values as
            // 'p' so the print statements work!
            .place => |p| print("{s}", .{p.name}),
            .path => |p| print("--{}->", .{p.dist}),
        }
    }
};

Both snippets compile and run, and both produce the same results. So it seems all you really need is enum, not the separate TripEnum I was led to believe was required.

What’s going on here and what is actually needed? Why does this work without declaring a TripEnum, and when does it make sense to declare an explicit enum type?

LucasSantos91 · December 29, 2023, 4:43pm

You’re paying a cost for this. Even if your union only has two variants, you’ll need at least a bool inside there. This bool, in turn, might increase the size of the type by up to a whole word, because of padding. If you can determine which variant is active by some other means, then you don’t want to pay this price. This can be done with a variable somewhere, or simply by the place in code. Sometimes you know that two variants’ lifetimes will never overlap, so you don’t need the tag to determine the active variant.

union(enum) is syntactic sugar. It will create an anonymous enum in the background. A lot of times you actually need the tag enum for other things, in which case you want to explicitly name it.
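As a rough illustration of both points, here is a minimal sketch. The type names (Inferred, Kind, Explicit, Bare) are made up for this example and are not from ziglings. It shows that the anonymous enum behind union(enum) can still be reached through std.meta.Tag, and it prints the sizes so you can see the tag's cost on your own target. No expected sizes are given because they depend on the target and build mode; in safe build modes a plain untagged union may carry a hidden safety tag too.

```zig
const std = @import("std");

// Hypothetical types, for illustration only.
const Inferred = union(enum) {
    int: u64,
    float: f64,
};

const Kind = enum { int, float };
const Explicit = union(Kind) {
    int: u64,
    float: f64,
};

// Untagged: the type itself does not record which field is active.
const Bare = union {
    int: u64,
    float: f64,
};

pub fn main() void {
    const a = Inferred{ .int = 7 };
    const b = Explicit{ .float = 1.5 };

    // The compiler-generated tag enum of union(enum) has no name of its own,
    // but std.meta.Tag recovers it and std.meta.activeTag reads it.
    const InferredTag = std.meta.Tag(Inferred);
    const tag_a: InferredTag = std.meta.activeTag(a);
    const tag_b: Kind = std.meta.activeTag(b);
    std.debug.print("a is .{s}, b is .{s}\n", .{ @tagName(tag_a), @tagName(tag_b) });

    // The tagged union must store the tag next to the payload, so it is
    // larger than the 8-byte payload alone.
    std.debug.print("payload={d} tagged={d} untagged={d}\n", .{
        @sizeOf(u64), @sizeOf(Inferred), @sizeOf(Bare),
    });
}
```

If other code needs to talk about the tag (store it, parse it, pass it around), naming it as Kind is usually more readable than writing std.meta.Tag(Inferred) everywhere.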

ianprime0509 · December 29, 2023, 4:54pm

Here are some concrete examples of both scenarios:

Untagged union: zig/lib/std/zig/Ast.zig at 27d4bf753467894836e960bced73740c95e61db8 · ziglang/zig · GitHub

Here, the extra field is an untagged union, where the expected_tag field of the union is active if and only if the tag is expected_token. As @LucasSantos91 mentioned, there would be a memory cost to redundantly storing a union tag in this case.

Also, in safe build modes (Debug and ReleaseSafe), a hidden union tag is added to untagged unions so that a runtime check can be inserted for using the wrong field, so you still get some safety when testing.

Tagged union with explicit tag type: zig/src/Package/Fetch/git.zig at 27d4bf753467894836e960bced73740c95e61db8 · ziglang/zig · GitHub

Here, the Type enum is also used on its own, and specific integer values are assigned to each field of Type to mirror the underlying data structure and help in parsing the data: zig/src/Package/Fetch/git.zig at 27d4bf753467894836e960bced73740c95e61db8 · ziglang/zig · GitHub
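To make the second scenario concrete without copying the linked file, here is a small sketch of the same pattern. Everything in it (RecordType, Record, typeFromByte, parse, and the one-byte wire format) is hypothetical and only meant to show why a named enum with explicit integer values is handy: the enum is used by itself to decode a byte before any union value exists.

```zig
const std = @import("std");

// Hypothetical wire format: the first byte of a record names its kind.
// The explicit integer values mirror the bytes found in the data.
const RecordType = enum(u8) {
    text = 1,
    number = 2,
};

// The union reuses the named enum as its tag type.
const Record = union(RecordType) {
    text: []const u8,
    number: u8,
};

const ParseError = error{ Truncated, UnknownType };

// Uses the enum on its own: maps a raw byte to a tag, or null if unknown.
fn typeFromByte(byte: u8) ?RecordType {
    return switch (byte) {
        @intFromEnum(RecordType.text) => RecordType.text,
        @intFromEnum(RecordType.number) => RecordType.number,
        else => null,
    };
}

fn parse(bytes: []const u8) ParseError!Record {
    if (bytes.len < 2) return error.Truncated;
    const kind = typeFromByte(bytes[0]) orelse return error.UnknownType;
    return switch (kind) {
        .text => Record{ .text = bytes[1..] },
        .number => Record{ .number = bytes[1] },
    };
}

pub fn main() void {
    const r = parse("\x01hi") catch return;
    switch (r) {
        .text => |t| std.debug.print("text: {s}\n", .{t}),
        .number => |n| std.debug.print("number: {d}\n", .{n}),
    }
}
```

With union(enum) the tag would have no name and no guaranteed integer values, so the decoding step in typeFromByte would have nothing stable to refer to.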

eddie3716 · December 29, 2023, 7:01pm

Untagged union: …link… Here, the extra field is an untagged union, where the expected_tag field of the union is active if and only if the tag is expected_token. As @LucasSantos91 mentioned, there would be a memory cost to redundantly storing a union tag in this case. Also, in safe build modes (Debug and ReleaseSafe), a hidden union tag is added to untagged unions so that a runtime check can be inserted for using the wrong field, so you still get some safety when testing.

Looking at that code, if I’m understanding this correctly, the extra union seems to be a clever trick to save a bit of memory. So if there aren’t any parsing errors, the extra is just a none: void, which I’m assuming means the extra union isn’t taking up any memory. Conversely, if extra was a union(Tag) or even just Tag, it would take up space. In that case you’d probably have to specify a Tag.NotApplicable enum value as the default case, so it’d be taking up memory in a situation where it wasn’t called for.

Tagged union with explicit tag type: …link… Here, the Type enum is also used on its own, and specific integer values are assigned to each field of Type to mirror the underlying data structure and help in parsing the data: …link…

This example makes sense, ‘read’ looks like a factory for EntryHeaders. EntryHeader is explicitly declared a union with the Type enum u8 because we know the first byte of data is going to be the EntryHeader Type.

Thanks for your help in understanding all of this, you too @LucasSantos91. I hope I got this right.

Validark · December 29, 2023, 9:27pm

There is ALWAYS a slot in the memory of the struct in question for the extra field, and it can fit whatever the biggest option is in the union. Yes, one of the options is void, meaning that the active value might be a 0-bit value, but the memory slot that can fit a Token.Tag is still there and still in the same location, regardless of whether the none: void is the active field or not.
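A quick way to convince yourself of this is to ask the compiler for the size. The sketch below uses made-up names (Payload stands in for something like Token.Tag, Extra for the union in question). Since @sizeOf is a property of the type, it cannot depend on which field happens to be active.

```zig
const std = @import("std");

// Hypothetical stand-in for the larger option in the union.
const Payload = u32;

const Extra = union {
    none: void,
    expected: Payload,
};

pub fn main() void {
    const empty = Extra{ .none = {} };
    const full = Extra{ .expected = 123 };

    // Both values have the same type, hence the same size: the slot for
    // the largest field is reserved even when the void field is active.
    std.debug.print("empty={d} full={d} payload={d}\n", .{
        @sizeOf(@TypeOf(empty)), @sizeOf(@TypeOf(full)), @sizeOf(Payload),
    });
}
```

The exact numbers vary with target and build mode, but the union is never smaller than its largest field.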

When you have a union, you typically also have a separate field (an enum) which tells you which field is active. So you basically have a struct like struct { kind: enum { ... }, data: union { ... } } where you treat the data differently based on kind, but data is always the same size, and kind also occupies memory, at least a byte.

Untagged unions are specifically for those cases where having a kind like I had in my example is redundant. Maybe I have struct { num_interesting_things: u32, data: union { ... } } and I can tell how to interpret data based on num_interesting_things, so it’s not necessary to store an extra kind field to tell me how to interpret data. In safe build modes (Debug and ReleaseSafe) the compiler will add that extra field no matter what and make sure you don’t accidentally interpret the data as the wrong type, but for a correctly written program it should be redundant.
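Here is one way that second struct could look in practice. This is only a sketch: the Data fields and the rule "1 means single, 2 means pair" are assumptions invented for the example. The point is that num_interesting_things already says how to read data, so a separate tag would store the same information twice.

```zig
const std = @import("std");

// Untagged union: the surrounding struct decides which field is active.
const Data = union {
    single: u32,
    pair: [2]u32,
};

const Item = struct {
    // Assumed convention for this example:
    // 1 => data.single is active, 2 => data.pair is active.
    num_interesting_things: u32,
    data: Data,

    fn sum(self: Item) u32 {
        return switch (self.num_interesting_things) {
            1 => self.data.single,
            2 => self.data.pair[0] + self.data.pair[1],
            else => 0,
        };
    }
};

pub fn main() void {
    const a = Item{ .num_interesting_things = 1, .data = .{ .single = 5 } };
    const b = Item{ .num_interesting_things = 2, .data = .{ .pair = .{ 3, 4 } } };
    std.debug.print("{d} {d}\n", .{ a.sum(), b.sum() }); // prints "5 7"
}
```

If the count and the active field ever disagree, a safe build will catch the wrong field access at runtime; in an unsafe release build the mismatch is undefined behavior, so the convention has to be kept by the code itself.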