The line between tokenization and parsing?

mnemnion · August 26, 2024, 3:56pm

The meta-answer to this category of question is: the lexer feeds the parser, and the parser feeds the compiler. The specifics of how this is handled are just the ones that the coders who have worked on that part of Zig decided would be most useful.

Parsing has a lot of undecidability. For instance, deciding whether two context-free grammars recognize the same ‘universe of strings’ (language, basically) is undecidable in general: no algorithm can settle it for every pair of grammars.

So there are an unbounded number of ways to get the job done. If the coder who made that specific decision stops by with an answer, then we’ll have an answer.

A reasonable guess when you see something like that, which looks overly specific, is that it allows for easier generation of a good error message. That is easily the hardest problem in parsing, so it’s the most likely candidate.
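To make that guess concrete, here's a minimal sketch in Python (not Zig, and not how Zig's tokenizer actually works). The token kinds `unterminated_string` and `invalid`, and the helpers `lex` and `first_error`, are all made up for illustration. The point is only that a suspiciously specific token kind lets the next stage say something more useful than "unexpected character".

```python
# Hypothetical toy lexer: every name and token kind here is an assumption
# made for illustration, not taken from any real compiler.
def lex(src):
    """Yield (kind, text, offset) tuples for identifiers and string literals."""
    i = 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch == '"':
            end = src.find('"', i + 1)
            if end == -1:
                # A dedicated kind instead of a generic "invalid" token.
                yield ("unterminated_string", src[i:], i)
                return
            yield ("string", src[i:end + 1], i)
            i = end + 1
        elif ch.isalpha():
            start = i
            while i < len(src) and src[i].isalnum():
                i += 1
            yield ("ident", src[start:i], start)
        else:
            yield ("invalid", ch, i)
            i += 1


def first_error(src):
    """Return a message for the first bad token, or None if every token is fine."""
    for kind, text, offset in lex(src):
        if kind == "unterminated_string":
            return f"offset {offset}: string literal is missing its closing quote"
        if kind == "invalid":
            return f"offset {offset}: unexpected character {text!r}"
    return None
```

With only a generic `invalid` kind, the consumer would have to re-inspect the source text to work out what went wrong. With the specific kind, the lexer has already done that work.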

I wanted to stress that there’s a truly unlimited number of these little decisions to make, even given the constraint that every valid parse produces the same parse tree. That creates a corresponding number of opportunities to ask “why this way, and not that way?”, and that can be a useful question to ask. But there isn’t an objective way to decide one implementation detail is better than another, and probably there never will be.
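Here's a small Python sketch of what "same parse tree, different line" can look like. It assumes a toy language of whitespace-separated integers with the sign written directly against the digits; the function names and tuple shapes are hypothetical. Variant A lets the lexer swallow the minus sign, variant B hands it to the parser as its own token.

```python
import re


def lex_a(src):
    """Variant A: the sign is part of the number token."""
    return [("int", text) for text in re.findall(r"-?\d+", src)]


def parse_a(src):
    # The parser has almost nothing left to do.
    return [("int", int(text)) for _, text in lex_a(src)]


def lex_b(src):
    """Variant B: the sign is a separate token."""
    return [
        ("minus", text) if text == "-" else ("digits", text)
        for text in re.findall(r"-|\d+", src)
    ]


def parse_b(src):
    # The parser is responsible for attaching a sign to the digits after it.
    tree, negate = [], False
    for kind, text in lex_b(src):
        if kind == "minus":
            negate = True
        else:
            tree.append(("int", -int(text) if negate else int(text)))
            negate = False
    return tree
```

For an input like `"-5 12 -7"`, both variants build the same list of `("int", value)` nodes. They only disagree on inputs outside the assumed language, such as `"- 5"`, which is exactly where the choice starts to matter for error reporting.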

Sometimes there are definitely better and worse tactics, but by no means always.