Allowing .packed, .extern, .struct, .fn, etc. (no @"...")

chung-leong · April 3, 2024, 10:54pm

The enum std.builtin.Type.ContainerType previously had fields with upper-case camcelCase. It’s currently defined as:

enum { 
    auto,
    @"extern",
    @"packed",
}

Because extern and packed are keywords, the use of @"..." is necessary. To set the layout of a struct for @Type() to packed, you have to escape the enum name:

    return @Type({
        .layout = .@"packed",
        // ...
    });

I was thinking, what if we allow keywords to be interpreted as identifiers whenever they come immediately after a .? That would allow a less ugly syntax:

    return @Type({
        .layout = .packed,
        // ...
    });

And permit snake_case to be used for enum throughout the standard library. Currently, the naming convention allows uppercase camelCase when clashes with keywords occur. This sounds like a reasonable compromise but this exception is absolutely pervasive because we switch on @typeInfo() all the time.

Allowing .struct, .packed, etc. should require only trivial changes to the lexer. Basically, immediately after a period we don’t perform keyword look-up. The fact that this hasn’t happened yet makes me wonder if there’s some reason not to do so that I’m overlooking.

Sze · April 3, 2024, 11:49pm

I don’t understand completely what you mean with the topic title.
I think the topic title makes more sense as “allow unescaped keywords as field accessors” or something like that?

I guess one good reason to avoid special allowances for field access, could be so that you have consistency between field access and field declaration.

I think with field declarations the explicit syntax is wanted so that you can tell immediately whether something is an extern keyword and not a extern: field.
And I guess needing @ in one place but not the other might be bad in another way, for example unexpected for beginners / more to keep in mind.

Personally I still think it might make sense to allow it without @ if there is a dot in front.

chung-leong · April 4, 2024, 8:53am

JavaScript allows this. You can use keywords as property names but var var = 5; or function const() {} would fail. It won’t be an alien concept to most programmers coming to Zig that you can’t use certain names in certain contexts. Beginners are more likely to be confused by .layout = .auto vs .layout = .@"extern". They definitely won’t get the impression that Zig is a clean language.

Sze · April 4, 2024, 9:15am

If I understand you correctly, you aren’t raising an Objection towards “allowing unescaped keywords as field accessors”, because that would mean that you are against being able to use keywords directly in that situation.

You object to the status quo of not being allowed to simply type .extern.
Or said another way you suggest that it should be allowed without having to use @.

I am not super decided on one way or the other, but I lean towards agreeing with you.
“Suggestion to allow unescaped keywords as field accessors”
or maybe the title could be:
“Objection towards status quo: allow unescaped keywords as field accessors”

I think your use of the word “Objection” is the inverse of the point you are trying to make.

chung-leong · April 4, 2024, 10:37am

I was wondering what objections people might have. I’ve updated the title of this post to reflect that.

The thing is, this is an easy change to make. I imagine a discussion must have occurred at some point and a decision was made not to allow this. I don’t want to sound like an idiot who keep bringing up ideas that have been struck down already

nyc · April 13, 2024, 1:19pm

I always consider the dot part of the name so things like ‘a.b’ are a single entity to me. I know the lookup happens in stages (lookup a then b in namespace a), but in my mind that is just breaking apart the single name a.b and then doing the binding lookups.

I didn’t know this wasn’t possible actually, so I totally agree. .packed is not the same token as packed in my head already.

chung-leong · April 30, 2024, 12:33pm

I finally have time to look more into this. So yeah, this is a simple change to make. We just need to add a bool to Tokenizer (lib/std/zig/tokenizer.zig):

pub const Tokenizer = struct {
    buffer: [:0]const u8,
    index: usize,
    pending_invalid_token: ?Token,
    is_last_token_period: bool = false,

At the bottom of Tokenizer.next(), set the field to true if the token is a period:

        self.is_last_token_period = result.tag == .period;

Then put a if() { ... } around the keyword lookup:

                .identifier => switch (c) {
                    'a'...'z', 'A'...'Z', '_', '0'...'9' => {},
                    else => {
                        if (!self.is_last_token_period) {
                            if (Token.getKeyword(self.buffer[result.loc.start..self.index])) |tag| {
                                result.tag = tag;
                            }
                        }
                        break;
                    },
                },

As a result of the change, tokenization would actually be faster, since we’re no longer doing a lot of unnecessary lookups.

Here’s binary (Linux-GNU) of the stage 2 compiler with the change:
https://www.dropbox.com/scl/fi/5diotuxuerlcivkke9cse/zig2.xz?rlkey=xd7faam56fhzpe06gp6f7x6vj&st=4e968fou&dl=0

It compiles the following code with no complaint:

const std = @import("std");

const Union = union {
    @"struct": u32,
    @"fn": u32,
    @"union": u32,    
};

pub fn main() void {
    const a: Union = .{ .struct = 1 };    
    const b: Union = .{ .fn = 2 };
    const c = .{ .layout = .packed };
    std.debug.print("{d} {d}\n", .{a.struct, b.fn});
    std.debug.print("{any}\n", .{c});
}

1 2
struct{comptime layout: @TypeOf(.enum_literal) = .packed}{ .layout = .packed }

Note: I’m changing the title of this thread again to clarify

chung-leong · September 13, 2024, 9:13pm

I’ve just noticed that std.builtin.Type has it fields switched to snake_case. It would be really nice if the change I proposed here is implemented before the next release. Having @"struct" and @"fn" appearing through out the code base just doesn’t look nice. Code migration would be much, much easier too if we don’t have to worry about the need to escape some of the names.

dimdin · September 13, 2024, 10:25pm

Although it doesn’t bother me too much, I also prefer reading and writing .fn instead of .@"fn".

Pinging @andrewrk since no core team member responded here.
Do you have any strong opinion on this or should we open a github issue for discussion?

Extending the zig grammar to handle DOT KEYWORD like DOT IDENTIFIER
is another way to implement it.
I commented the additions to the grammar clauses.

PrimaryTypeExpr 
    <- BUILTINIDENTIFIER FnCallArguments
     / CHAR_LITERAL
     / ContainerDecl
     / DOT IDENTIFIER
#    / DOT KEYWORD
     / DOT InitList
     ...

FieldInit <- DOT IDENTIFIER EQUAL Expr
#          / DOT KEYWORD EQUAL Expr

pachde · September 18, 2024, 12:01am

@"..." has use cases beyond using keywords as names: identifiers containing spaces or mathematical symbols, for example that a simple . prefix cannot give you. The proposal isn’t necessary: it’s just cleaner-looking for the cases where the name happens to be tokenizable. it’s also a second way to do something you already have a mechanism for (only less flexible).

IMO making it easier to use keywords as names goes against clarity when reading the code. I’m happy that you’re forced to use an uglier syntax. You can also simply use a different name. A thesaurus is a good aid to naming things when the name you want to use is taken.

mnemnion · September 18, 2024, 1:36am

This is clearly that case, and what’s more, I think it’s great that Zig has an all-purpose syntax for turning an arbitrary string into an identifier.

But this argument could be made for all identifiers. I don’t think anyone would argue that every identifier should look like @"this" just because it would mean we have only one way to declare identifiers instead of two.

Well, no, you can’t simply use a different name for the enum values of @typeInfo, that isn’t an available option here.

That said, I’m inclined to say this isn’t a good idea. It’s easy to hack into a bespoke top-down parser, and it’s also easy to add to a Parsing Expression Grammar. But it’s a bad fit for other parsing techniques.

Lexers are traditionally stateless, they can produce the next token with no reference at all to the last one. Adding “one little boolean” to the tokenizer isn’t a trivial change: it makes the lexer stateful in a way it wasn’t before. The single-switch architecture is a deliberate choice which makes the lexer recognize a regular language.

There is at least two decisions in Zig which follow that logic: no multi-line comments, and the syntax for multiline strings. It’s explicitly mentioned in the documentation:

There are no multiline comments in Zig (e.g. like /* */ comments in C). This allows Zig to have the property that each line of code can be tokenized out of context.

While that isn’t a direct connection to what I’m saying here, I think it illustrates that adding context sensitivity to the lexer is not something which should be done casually. It has implications for tooling, and would change the nature of the language, which would no longer have the stateless property which the tokenizer is painstakingly designed to preserve.

It’s a bigger change than it looks like, is what I’m saying. Perhaps the ergonomics justify it but I’m inclined to say that they don’t.

dhw9406 · September 18, 2024, 4:36am

I’m not going to argue for or against the proposed syntax change (without having thought it through, I’m very slightly for it), but in terms of the implementation it seems pretty easy to accomplish within the constraints of the current stateless single-switch architecture. My first thought is in zig/lib/std/zig/tokenizer.zig to change :state .period (around line 920) to:

.period => {
    self.index += 1;
    switch (self.buffer[self.index]) {
        '.' => continue :state .period_2,
        '*' => continue :state .period_asterisk,
        'a'...'z', 'A'...'Z', '_', '0'...'9' => {
            result.tag = .identifier;
            continue :state .builtin;
        },
        else => result.tag = .period,
    }
},

and that’s it! I got the idea from looking at what the tokenizer does when it sees an @ which is to then see if it finds a " and if so to set the result.tag to be .identifier, but continue in :state .string_literal. Then I also noticed that :state .builtin is just like .identifier except without the keyword check, so I figured jumping there would do the trick.

And reading this back to myself, I’m realizing that this will eat .period tokens and also cause identifiers to (sometimes!) have leading ‘.’ characters added to them. Hmm… well, I guess a new token type could be added, say .period_identifier and when such a token was emitted the downstream system would know that it was actually two tokens (a period and an identifier).

That doesn’t seem quite so elegant, but it’s still probably not too much code/ complexity. I mean, we already have .period_2 .period_asterisk .ellipsis2 and .ellipsis3 so… I’d have to look at what the downstream systems are that consume the tokenizer’s output, but I don’t currently have zig compiling, so I can’t test this and I’m going to bed now. Still, it does seem to be a relatively simple change even within the constraints you were (rightly!) pointing out the current tokenizer satisfies.

chung-leong · September 18, 2024, 2:03pm

This is a bit like the red dot on a cashmere sweater. If you accidentally leave a small ink stain on your favorite sweater, you’d probably dismiss it as no big deal. The same red dot would be a total deal-breaker on the other hand, if you spot it on a brand new sweater sitting on the shelf.

mnemnion · September 18, 2024, 3:21pm

This won’t actually work well with the rest of the system:

const the_enum = .



          legal;

sets the_enum to .legal. This would be a weird way to deal with a token. A case can be made for forbidding this, and saying that .something is one token, with no whitespace permitted. But we’d basically have to do that, and it would affect fields and member functions in a weird way.

As I said, larger change than it looks like.

I’m interpreting this as an argument that newcomers to Zig will have a WTF reaction to the current rule.

I’m not convinced of this. “keywords may not be used in dot-extended rules” is a very common policy, there are exceptions but not many. This is in fact traceable to the lexer issues I was pointing to.

By contrast, “any string may be used as an identifier with a special syntax” is fresh, unusual, and powerful. Zig’s rule is not as attractive on the screen as, say, Go’s: Go spoils it with the whole capitalize-to-export thing, but Zig wouldn’t need to do that. But Go’s rule is less powerful, especially in that it can’t accomodate spaces. Zig’s rule is especially nice because we have tools for converting strings to enums, and the @"anything" syntax means that this can be done with any system where it’s useful, without having to add conversion logic to fit the enum name into a narrower scheme.

I also like that Zig’s system allows using any string at all for an identifier, but it encourages the conventional ASCII identifier set. @"blah blah 😂" has a bit of friction, and it should, the rest of Unicode is just not as easy to work with as the ASCII subset. To me this is more about how professional tools should work, which is what programming language are, then some sort of English-language chauvinism. I’m a known Unicode-respecter but I think a pinch of friction here sets the right balance, anyone who programs has figured out how to input ASCII characters and it’s the only common subset for which that’s true.

Basically I think that “struct is a keyword, therefore .struct is not allowed” would not be the surprising part. It would be weird to be surprised by that unless JavaScript was your only language. The surprising part would be “struct is a valid field or enum name, but you have to spell it .@"struct", and this works for any string”. Surprising, that is, in a good way.

I’m not hugely opposed to the change, but I am against it. It does look better, we’d have to give up things like struct . field which no one does anyway, but I don’t think “people are going to make fun of us for the stain on our sweater” is a good case for the change.

chung-leong · September 18, 2024, 8:10pm

People won’t stay long enough to learn what’s a keyword in Zig and what isn’t. Initial impression matters. Aesthetics matter. Many people working in our industry are obsessed about neatness. As soon as they see something like this:

switch(@typeInfo(T)) {
    .array => {
        // ...
    },
    .@"struct" => {
        // ...
    },
}

They’re going to bail on us.

After thinking about this for a while, I wonder if we shouldn’t just junk std.builtin.Type altogether. I mean, that’s the source of the name collision. If instead of relying on the enum from this tagged union to differentiate between different types, we could just have a bunch of type-specific builtins (@isArray(), @isStruct(), etc.), then the collision with keywords would be gone.

The switch construct is nice in situations where you handle every type. But these are pretty rare. It’s more common that you’d handle one or two types. And the massive if-else tower that would result when you do need to handle every type is no so bad. It’s going to be verbose, but absolutely clear.

So instead of returning a union, @typeInfo() would just return the struct associated with the type given. You’d still be able to do something like this:

if (@isStruct(T) or @isUnion(T)) {
    inline for (@typeInfo(T).fields) |field| {
        // ...
    }
}

That’d be equivalent what we have currently:

switch(@typeInfo(T)) {
    inline .Struct, .Union => |st| {
        inline for(st.fields) |field| {
             // ...
        }
    }
}

Which is not very good because we’re forced to use the same variable name for two different struct types.

mnemnion · September 18, 2024, 8:16pm

chung-leong:

People won’t stay long enough to learn what’s a keyword in Zig and what isn’t. Initial impression matters. Aesthetics matter. Many people working in our industry are obsessed about neatness. As soon as they see something like this:
switch(@typeInfo(T)) {
    .array => {
        // ...
    },
    .@"struct" => {
        // ...
    },
}
They’re going to bail on us.

Yes, you did make that case clearly the first time. I’m willing to take the other side of that bet.

You really don’t like those at-identifiers do you. I’m sorry but I think this is an actually bad idea.

Modifying the lexer to allow .struct is not, however, it’s just something I come down on the other side of, and find myself unmoved by the “what about the cool kids” part of the argument.

chung-leong · September 19, 2024, 11:07am

I’m just describing the logics of the real world, where minor faults lead to instadeaths. You can’t expect people to trust you with matters of importance after you’ve visibly failed at something rudimentary.

Calder-Ty · September 19, 2024, 1:59pm

All languages have keywords and reserved words. The fact that they have to be escaped to use them as identifiers is not a failure. Expecting otherwise is, in my opinion, a bit naive. People coming from other languages will be familiar with the concept of keywords, and having to escape them shouldn’t be a big surprise. such escapes are used in other languages (Rust uses the #“…” syntax IIRC).

I’m opposed to this change. In general, I agree with @pachde that you should avoid using keywords as identifiers, and this helps that way. Also, It creates a difference between point of definition and point of use. The text used to create the identifier and to use it are different. That could be even more jarring to new users.

I think you are overstating the reaction outside users will have. As an example with another language, I personally think it is ludicrous that Go decided to use Capitalization to declare public members of modules. It goes against the most common way to do this (using a pub keyword), and forces certain orthographic styles. My objection to this has not seemed to have caused Go to be a language discarded by the industry. Not only that, but the times i’ve used it Go, i’ve been able to adapt rapidly to the requirement. Yes I’m still annoyed by it, but every language will have it’s annoyances.

mocompute · September 20, 2024, 6:01am

Very much agree. For code generation, this makes the generating code much simpler, not having to ensure generated identifiers conform to a more restricted subset of the domain-specific language driving the generation (for example).

Not just theoretical in my case, I’m just finishing up a generator from an ecosystem that uses “.” in its identifiers.

mnemnion · September 20, 2024, 2:05pm

This is relevant for sure. Even if this change to the lexer and grammar were accepted, we’d still have to do things like this:

const @"type": type = type;

So the fact that would be valid to have a .type enum would be an inconsistency, even if it looks nicer (I do agree that it would make switching on @typeInfo bit bit cleaner looking, no question).

Plus I don’t see a way to apply it to enums but not to field access, and that gets weird, because it would look like this:

const WeirdDecl = struct {
    blah: usize

    pub fn @"type"(w: WeirdDecl) usize {
        return w.blah;
    }
};

// called like this:
weird_decl.type();

So the rule we have is both powerful and consistent. The formatter will actually un-stringify @"identifier into identifier, to make sure that definitions and uses are identical tokens. Fun fact, it will also turn @"\x66\x6f\x6f" into foo, significant effort has been expended there.

It’s impractical in the extreme to try and juke the grammar so that there’s a consistent “keyword position” and any use in “identifier position” is ok. I’m pretty sure that it can be done using PEGs, but I very much think we should not, it makes regex-based syntax highlighting impossible and that’s just one problem, there are more.

Point is that there are real user-facing consequences to a context-sensitive lexer, it isn’t just a matter of technical purity. There are contexts where it can be good: Oil shell, which is a very interesting project, has lexer modes and that makes sense for what Andy is doing. But it’s a bad fit for Zig, and for most programming languages.