Hi all!
I’m in the process of writing a tokenizer and parser. Recently, I decided to take a look at std/zig/tokenizer.zig for the sake of learning and inspiration. Some things in it are written in a super clean and wise way, and others I simply cannot understand (in the sense of the design choices and the whys behind them). For example, here are some of the states the tokenizer has:
pub fn next(self: *Tokenizer) Token {
    while (true) : (self.index += 1) {
        ...
        // string related tokenization
        .string_literal => switch (c) {
            '\\' => {
                state = .string_literal_backslash;
            },
            '"' => {
                self.index += 1;
                break;
            },
            0 => {
                if (self.index == self.buffer.len) {
                    result.tag = .invalid;
                    break;
                } else {
                    self.checkLiteralCharacter();
                }
            },
            '\n' => {
                result.tag = .invalid;
                break;
            },
            else => self.checkLiteralCharacter(),
        },
        ...
        // number related tokenization
        .int => switch (c) {
            '.' => state = .int_period,
            '_', 'a'...'d', 'f'...'o', 'q'...'z', 'A'...'D', 'F'...'O', 'Q'...'Z', '0'...'9' => {},
            'e', 'E', 'p', 'P' => state = .int_exponent,
            else => break,
        },
        .int_exponent => switch (c) {
            '-', '+' => {
                state = .float;
            },
            else => {
                self.index -= 1;
                state = .int;
            },
        },
        .int_period => switch (c) {
            '_', 'a'...'d', 'f'...'o', 'q'...'z', 'A'...'D', 'F'...'O', 'Q'...'Z', '0'...'9' => {
                state = .float;
            },
            'e', 'E', 'p', 'P' => state = .float_exponent,
            else => {
                self.index -= 1;
                break;
            },
        },
        .float => switch (c) {
            '_', 'a'...'d', 'f'...'o', 'q'...'z', 'A'...'D', 'F'...'O', 'Q'...'Z', '0'...'9' => {},
            'e', 'E', 'p', 'P' => state = .float_exponent,
            else => break,
        },
        .float_exponent => switch (c) {
            '-', '+' => state = .float,
            else => {
                self.index -= 1;
                state = .float;
            },
        },
Here, I couldn’t understand why Andrew (I assume it was he who wrote it) decided to scan string literals rather carefully (i.e. applying checkLiteralCharacter to check the validity of every ASCII/UTF-8 byte it eats), while for numbers the tokenizer accepts all kinds of rubbish (e.g. 1QZZQ or 0hello_world._bye_bye), as the quick check below shows.
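To make the contrast concrete, here is a little test of both behaviours (a sketch, not anything authoritative: the test names are mine, and I’m assuming a reasonably recent std where int and float literals share the single .number_literal tag):

const std = @import("std");

test "strings are checked strictly, numbers loosely" {
    // A raw newline inside a string literal is rejected outright
    // (the '\n' arm above sets result.tag = .invalid)...
    var strict = std.zig.Tokenizer.init("\"abc\ndef\"");
    try std.testing.expectEqual(std.zig.Token.Tag.invalid, strict.next().tag);

    // ...while the .int state eats '_', letters, and digits alike,
    // so this whole string comes back as one number literal token.
    var loose = std.zig.Tokenizer.init("1QZZQ");
    const token = loose.next();
    try std.testing.expectEqual(std.zig.Token.Tag.number_literal, token.tag);
    try std.testing.expectEqualStrings("1QZZQ", loose.buffer[token.loc.start..token.loc.end]);
}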
For example, in my tokenizer I do not accept those combinations, only valid int/float literals (std/json/scanner.zig is a good reference point, because I do essentially the same thing), roughly along the lines of the sketch below.
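Something like this (a simplified, hypothetical sketch: decimal only, no exponents or hex, and scanDecimalInt is just a name I made up for illustration):

const std = @import("std");

/// Strict scanning: a letter glued to the digits is rejected on the
/// spot instead of being handed off to some later stage.
fn scanDecimalInt(src: []const u8, start: usize) error{InvalidNumber}!usize {
    std.debug.assert(start < src.len and std.ascii.isDigit(src[start]));
    var i = start + 1;
    while (i < src.len) : (i += 1) {
        switch (src[i]) {
            '0'...'9' => {},
            '_' => if (src[i - 1] == '_') return error.InvalidNumber,
            'a'...'z', 'A'...'Z' => return error.InvalidNumber, // e.g. 1QZZQ
            else => break, // any other byte ends the literal
        }
    }
    if (src[i - 1] == '_') return error.InvalidNumber; // no trailing '_'
    return i; // one past the last byte of the literal
}

test "scanDecimalInt" {
    try std.testing.expectEqual(@as(usize, 5), try scanDecimalInt("12_34;", 0));
    try std.testing.expectError(error.InvalidNumber, scanDecimalInt("1QZZQ", 0));
}

The point is just that a stray letter dies at the tokenization stage instead of surviving into later passes.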
So, the question is: why not validate floats/ints during the tokenization stage rather than somewhere down the road*? Especially considering that tokens can already carry the .invalid tag to propagate the information about where something went wrong.
[*] I found std/zig/Parse.zig a bit overwhelming, so I can’t say where exactly the validation of number literals happens.