The line between tokenization and parsing?

timfayz · April 3, 2024, 9:12am

Hi all!

I’m in the process of writing a tokenizer and parser. Recently, I decided to take a look at std/zig/tokenizer.zig for the sake of learning and inspiration. I found some things written super clean and wise, and others I simply cannot understand (in a sense of “design choices” and whys). For example, here are some states the tokenizer has:

pub fn next(self: *Tokenizer) Token {
     while (true) : (self.index += 1) {
        ...
        // string related tokenization
        .string_literal => switch (c) {
            '\\' => {
                state = .string_literal_backslash;
            },
            '"' => {
                self.index += 1;
                break;
            },
            0 => {
                if (self.index == self.buffer.len) {
                    result.tag = .invalid;
                    break;
                } else {
                    self.checkLiteralCharacter();
                }
            },
            '\n' => {
                result.tag = .invalid;
                break;
            },
            else => self.checkLiteralCharacter(),
        },
        ... 

        // number related tokenization
        .int => switch (c) {
            '.' => state = .int_period,
            '_', 'a'...'d', 'f'...'o', 'q'...'z', 'A'...'D', 'F'...'O', 'Q'...'Z', '0'...'9' => {},
            'e', 'E', 'p', 'P' => state = .int_exponent,
            else => break,
        },
        .int_exponent => switch (c) {
            '-', '+' => {
                state = .float;
            },
            else => {
                self.index -= 1;
                state = .int;
            },
        },
        .int_period => switch (c) {
            '_', 'a'...'d', 'f'...'o', 'q'...'z', 'A'...'D', 'F'...'O', 'Q'...'Z', '0'...'9' => {
                state = .float;
            },
            'e', 'E', 'p', 'P' => state = .float_exponent,
            else => {
                self.index -= 1;
                break;
            },
        },
        .float => switch (c) {
            '_', 'a'...'d', 'f'...'o', 'q'...'z', 'A'...'D', 'F'...'O', 'Q'...'Z', '0'...'9' => {},
            'e', 'E', 'p', 'P' => state = .float_exponent,
            else => break,
        },
        .float_exponent => switch (c) {
            '-', '+' => state = .float,
            else => {
                self.index -= 1;
                state = .float;
            },
        },

Here, I couldn’t understand why Andrew (I assume, it is he who wrote it) decided, for example, in case of scanning string literals to parse them more carefully (ie. applying checkLiteralCharacter to check validity for every ASCII/UTF-8 byte it eats), and in case of scanning numbers, tokenizer accepts all kind of rubbish (ie. 1QZZQ or 0hello_world._bye_bye). For example, in my tokenizer I do not accept those combinations but only valid int/float literals (you can look at std/json/scanner.zig as an example because I do essentially the same).

So, the question is: why wouldn’t he validate floats/ints during the tokenization stage rather than somewhere down the road*? Especially considering the fact that tokens can be in .invalid state to propagate the info where something went wrong.

[*] I found std/zig/Parse.zig a bit overwhelming, so I can’t say where exactly the validity of number literals happens.

timfayz · April 3, 2024, 9:32am

I understand that tokenization should be kept as simple as possible. But at the same time, it still performs some level of parsing in one way or another. So maybe the essence of my concern is whether the tokenizer for the <= should return .less_equal as a single token, or treat this input as two separate .l_angle and .equal tokens, allowing downstream parsing to handle them accordingly.

dimdin · April 3, 2024, 10:06am

Everything that starts with 0…9 is considered a number.
Zig postpones the validation of numbers, but partially validates the number when scanning.
Digits 0-9, hex digits A-F,a-f and characters bBoOxXpPeE_-+. are valid in the number depending a lot of conditions.

timfayz · April 3, 2024, 10:15am

I understand and there are examples of it in docs:

const floating_point = 123.0E+77;
const another_float = 123.0;
const yet_another = 123.0e+77;
const hex_floating_point = 0x103.70p-5;
const another_hex_float = 0x103.70;
const yet_another_hex_float = 0x103.70P-5;
const lightspeed = 299_792_458.000_000;
const nanosecond = 0.000_000_001;
const more_hex = 0x1234_5678.9ABC_CDEFp-10;

However, there are some chars like Q or K that I can’t think of any condition that would make it part of a valid number.

squeek502 · April 3, 2024, 10:59am

One thing to think about is what you’d like your error messages to look like, and how you’d like to deal with invalid tokens generally. If the tokenizer rejects invalid number tokens after the first invalid character, then:

What do you do with the invalid number token up to the invalid character?
What do you do with the invalid character itself?
Do you continue tokenizing after the invalid number? If so, how do you break up the tokens in such a way that minimizes unintended knock-on effects?

Zig’s approach here allows the tokenizer to continue after invalid tokens without much risk of misinterpreting the code. For example:

const foo = 1QZZQ234;
const bar = 0xinvalid;

Zig’s error message for this looks like:

invalid-numbers.zig:1:14: error: invalid digit 'Q' for decimal base
const foo = 1QZZQ234;
             ^~~~~~~
invalid-numbers.zig:2:15: error: invalid digit 'i' for hex base
const bar = 0xinvalid;
              ^~~~~~~

If the tokenizer validated the number tokens, it’d take a bit more effort to (a) get similar error messages, and (b) avoid misinterpreting the file after an invalid number (i.e. what tokens would the number-validating-tokenizer emit for 1QZZQ234? would that lead to confusing error messages later on?).

(note: I’m not positive this is actually the reason for this Zig tokenizer behavior, but it’s a possible reason that makes sense to me)

timfayz · April 3, 2024, 12:39pm

I think I missed the very important detail you hinted on – tokenization doesn’t stop after the first invalid token! In my approach, I don’t do full tokenization in advance. Instead, I just look at the current one and 1 token ahead (at max). As soon as something goes wrong – I report. In Zig, it instead does full tokenization until the end of file and continue parsing even after parsing error happens (not sure, however, how zig cuts off the invalid part to reasonably continue parsing other expressions).

I think this should be the main reason behind such a behaviour. It really makes sense to eat up more than necessary and let the parser to collect, decide and report on several occasions.

Update: Thank you for this insight!

matklad · April 3, 2024, 1:23pm

Parse, don’t validate!

You don’t actually need to validate numbers, you want to convert a string to the appropriate numeric value.

If you lexer returns union(enum) { number: u64 }, you want the validation code to be a part of the lexer. If your lexer returns struct { tag: enum(u8) { ... }, offset_start: u32, offset_end: u32}, then you want the relevant validation to happen at the point where you convert that to a number further down compiler’s pipeline.

I would actually expect the same logic to apply to string literals, and was surprised to learn that there’s some validation in the lexer.

But this actually isn’t the case! checkLiteralCharacter doesn’t check for valid escape sequencies or some such.

Rather, it just enforces that the underlying text document, irrespective of it’s lexical structure:

is valid utf8
doesn’t have weird characters (ascii bell and some such)

So the lexer just checks that the input source is correctly encoded and slices its up into disjoint tokens of different kinds. The actual parsing/validation of tokens happens further down the line, I would guess in astgen

gonzo · April 3, 2024, 4:26pm

There is such a thing as scannerless parsing, so strictly speaking you can do everything in the parser. But if you have a lexer, it is customary for it to return a single token .less_equal for the two characters <=.

dimdin · April 3, 2024, 4:44pm

There is also the other extreme: “parserlesss scaning”.

In FORTH the line scanner separates the line by spaces, everything between spaces is a word. That’s all the scanning and there is no parsing.
In lisp the scanner produces the parse tree, because the source code is written in the parse tree.
e.g. (* (+ 1 2) 3) for (1 + 2) * 3

timfayz · April 3, 2024, 4:45pm

I didn’t say it checks for valid “escape sequencies”. I said it checks whether what comes after something that signifies text is a valid utf-8/ascii sequence (eg. no control chars for ascii or invalid byte sequences for utf-8).

Yes, that is clear now. We didn’t define what validation means but don’t you think that this IS a validation (at least some sort of it)?

That is why I was confused by this reaction. However now I see that validation means a bit more to you, ie. it includes checking the validity of escape sequences.

matklad · April 3, 2024, 6:45pm

Note that this is checked for every single token. It’s just that only strings and comments require this check to be explicit function code in the source code. All other tokens do this validation implicitly, by virtue of matching on exact set of allowed characters.

So zig lexer does equal amount of validation for numbers and strings.

timfayz · April 4, 2024, 7:33am

Yeah, it wasn’t obvious and was tricky to see but after some time following all the states and “paths”, you can see that the scanner leaves no space for improper byte/char sequences.

Also, why tokenizer uses null-terminated slice ([:0]const u8) as an input instead of just slice? What are the benefits or pitfalls doing otherwise?

matklad · April 4, 2024, 9:37am

In lexers and parsers, you always match on the next token/character. Bit there’s a possibility that there’s no next character, so the correct type to match on is ?u8 or ?Token.

But it almost always the case that “no next token” is handled exactly the same way as "a wrong next token”. Eg, while scanning a number you eat chars while they are numbers, stopping at either eof or first non-digit.

So it helps to introduce an explicit eof token/character, to avoid the extra code to unpack ?. Instead of null, use 0.

You could do that by adding a length check to the “get next character” routine, but it’s just faster and simpler when there is an eof character at the end already.

timfayz · April 4, 2024, 2:03pm

Could please elaborate a bit more here? Because I also speculated that it could be one of the reasons. However, I can’t see it clearly. Even though you eliminate the condition on every iteration, ie. while (self.cursor < self.input.len) {...}, you still have to introduce a new switch case 0 => ... and perform computation there instead. Yet, if you want to exclude null chars from the source, you still have to keep this case

kristoff · April 4, 2024, 2:08pm

You will still have to handle one way or another when the string ends “prematurely”.

I ended up going for the same solution for the tokenizer of my JSON replacement data language.

If you try implementing a tokenizer yourself you will be able to see firsthand why this a necessary concept that will have to be solved with either a sentinel byte, optionals, or something similar.

github.com

kristoff-it/ziggy/blob/53fd0a4ea876124956ec55e6a191a024eb0f3f76/src/ziggy/Tokenizer.zig#L154


      
              comment,
          };
          
          pub fn peek(self: *Tokenizer, code: [:0]const u8) Token {
              const idx = self.idx;
              const t = self.next(code);
              self.idx = idx;
              return t;
          }
          
          pub fn next(self: *Tokenizer, code: [:0]const u8) Token {
              var state: State = .start;
              var res: Token = .{
                  .tag = .invalid,
                  .loc = .{
                      .start = self.idx,
                      .end = undefined,
                  },
              };
          
              while (self.idx < code.len) : (self.idx += 1) {

matklad · April 4, 2024, 5:26pm

Yeah, that’s a good illustration of what I am talking about! you can see that there’s a duplication between the else branch on the while, and various cases inside the main loop. I believe a diff like this could get rid of the duplication:

diff --git a/src/ziggy/Tokenizer.zig b/src/ziggy/Tokenizer.zig
index 67695cc..41be783 100644
--- a/src/ziggy/Tokenizer.zig
+++ b/src/ziggy/Tokenizer.zig
@@ -161,8 +161,8 @@ pub fn next(self: *Tokenizer, code: [:0]const u8) Token {
         },
     };
 
-    while (self.idx < code.len) : (self.idx += 1) {
-        const c = code[self.idx];
+    while (self.idx <= code.len) : (self.idx += 1) {
+        const c = code[self.idx]; // might get trailing zero
         switch (state) {
             .start => switch (c) {
                 0 => {
@@ -323,20 +323,7 @@ pub fn next(self: *Tokenizer, code: [:0]const u8) Token {
                 else => {},
             },
         }
-    } else {
-        switch (state) {
-            .start => res.tag = .eof,
-            .comment => {
-                if (self.want_comments) {
-                    self.finishComment(&res, code);
-                } else res.tag = .eof;
-            },
-            .identifier => self.finishIdentifier(&res, code),
-            .number => self.finishNumber(&res, code),
-            else => res.tag = .invalid,
-        }
-        res.loc.end = self.idx;
-    }
+    } else unreachable;
 
     return res;
 }

what you write in the while else is (or at least should be) identical to what you have in your explicit logic for “a different token from the one I’ve expected”. Don’t vouch for this particular diff, but 0.8 sure that a diff like this would work, and would simplify the code somewhat.

cancername · April 6, 2024, 10:36pm

QuickBMS does this, and lord, it is cursed.

gonzo · April 7, 2024, 7:46am

If you are referring to scannerless parsing, I understand it is more powerful because you don’t have to tokenise your input, so it is applicable to non-traditional languages; Perl comes to mind – they allow some strange things in the source, which makes it impossible to tokenize consistently.

On the other hand, scannerless parsing is more expensive to process because you are using heavier machinery (a parser, which allows for nested structures) to do lexing, which can usually be done by non-recursive means, such as regexps.

I am not familiar with QuickBMS – it would be interesting to see an example of its cursedness.

VoilaNeighbor · April 7, 2024, 1:53pm

I believe a rule of thumb is to do more things as soon as you have enough information, so I an supportive of what the OP said: if validating the number is not context aware and can be done in a tokenizer, then it’s better to do it there to reduce the universe of states that we pass to next stages.

amesaine · August 26, 2024, 10:10am

What’s the purpose of a specific invalid token like invalid_periodasterisks? I’m imagining that .** creates relatively too many vague parsing scenarios. But I can’t concretize why this needs to be separated from a simple invalid tag or not be handled as a State instead.