Why is the digit separator not a state in the tokenizer?

amesaine · August 24, 2024, 7:33pm

Digits in number literals can be separated by _. Consecutive digit separators are an error. My question is why the tokenizer permits this kind of input.

lib/std/zig/tokenizer.zig: In the int state, _ is simply accepted without any check.

.int => switch (c) {
    '.' => state = .int_period,
    '_', 'a'...'d', 'f'...'o', 'q'...'z', 'A'...'D', 'F'...'O', 'Q'...'Z', '0'...'9' => {},
    'e', 'E', 'p', 'P' => state = .int_exponent,
    else => break,
},

Validation of the literal is then left to the parser(?). The check for consecutive digit separators, along with other validation errors, can be found in lib/std/zig/number_literal.zig.

Why doesn’t the tokenizer just handle valid states? I’m assuming this can’t be done for all aspects of tokenization, but why not here, for number literals specifically? For example, through a dedicated digit_separator state.
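
To make the split I'm describing concrete, here is a minimal sketch of a permissive scan followed by a separate strict validation step. The names scanInt, validateInt and LiteralError are hypothetical and are not taken from the standard library. It only handles decimal digits and assumes the scan starts on a digit, so it is much simpler than what tokenizer.zig and number_literal.zig actually do.

```zig
const std = @import("std");

// Hypothetical helper, not part of std: permissive scan.
// Consumes every byte that could belong to a decimal integer literal,
// including '_' in any position, and returns the consumed slice.
// Assumes the caller has already seen a leading digit.
fn scanInt(src: []const u8) []const u8 {
    var i: usize = 0;
    while (i < src.len) : (i += 1) {
        switch (src[i]) {
            '0'...'9', '_' => {},
            else => break, // first byte that cannot continue the literal
        }
    }
    return src[0..i];
}

// Hypothetical error set for this sketch only.
const LiteralError = error{ RepeatedUnderscore, TrailingUnderscore };

// Hypothetical helper, not part of std: strict validation of a scanned slice.
fn validateInt(lit: []const u8) LiteralError!void {
    var prev_underscore = false;
    for (lit) |c| {
        if (c == '_') {
            if (prev_underscore) return error.RepeatedUnderscore;
            prev_underscore = true;
        } else {
            prev_underscore = false;
        }
    }
    if (prev_underscore) return error.TrailingUnderscore;
}

test "permissive scan, strict validation" {
    // The scan accepts the malformed literal as one slice...
    const lit = scanInt("1__000 + 2");
    try std.testing.expectEqualStrings("1__000", lit);
    // ...and the separate validation step is what rejects it.
    try std.testing.expectError(error.RepeatedUnderscore, validateInt(lit));
    try validateInt(scanInt("1_000"));
}
```

With this shape, the malformed literal still comes out of the scan as a single slice, so the later step can report one specific error for the whole literal. A digit_separator state would move the validateInt logic into the scan loop instead.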

permutationlock · August 24, 2024, 8:31pm

This earlier thread discusses a similar question.