300 MiB/s Zig lexer, fixing an edge case in the grammar

mnemnion · April 15, 2025, 3:11pm

In the absence of solid data, my stance is that a bottleneck due to this lexing quirk hasn’t been established.

I would probably advance that argument if such a thing were established, which as of now it has not been.

Mainly I wanted to establish some rhetorical distance between “possible optimization available” and “grammar bug”.

In the absence of attention paid to efficiency in parsing, you get C++. No one wants that. But I’m sure that switching Zig to s-expressions would make parsing faster, and almost no one wants that either.

mnemnion · April 15, 2025, 3:55pm

Y’know, it doesn’t actually have to:

.int_period => {
    switch (self.buffer[self.index + 1]) {
        '_', 'a'...'d', 'f'...'o', 'q'...'z', 'A'...'D', 'F'...'O', 'Q'...'Z', '0'...'9' => {
            self.index += 2;
            continue :state .float;
        },
        'e', 'E', 'p', 'P' => {
            self.index += 1;
            continue :state .float_exponent;
        },
        else => {},
    }
},

I don’t actually know if this is better, but it might be worth benchmarking, maybe? The effect is going to be very small if there is one.

In fact I’d want to check first just to see if optimized versions generate different machine code at all.

mnemnion · April 15, 2025, 4:05pm

Here it is as a patch against Zig master, if anyone is interested in playing around with it.

I don’t have the time to build zig today and run the tests so apologies if it’s simply broken in some way I can’t perceive.

diff --git a/lib/std/zig/tokenizer.zig b/lib/std/zig/tokenizer.zig
index 09c79d3562..7ea1003284 100644
--- a/lib/std/zig/tokenizer.zig
+++ b/lib/std/zig/tokenizer.zig
@@ -1068,18 +1068,19 @@ pub const Tokenizer = struct {
                 }
             },
             .int_period => {
-                self.index += 1;
-                switch (self.buffer[self.index]) {
+                switch (self.buffer[self.index + 1]) {
                     '_', 'a'...'d', 'f'...'o', 'q'...'z', 'A'...'D', 'F'...'O', 'Q'...'Z', '0'...'9' => {
-                        self.index += 1;
+                        self.index += 2;
                         continue :state .float;
                     },
                     'e', 'E', 'p', 'P' => {
+                        self.index += 1;
                         continue :state .float_exponent;
                     },
-                    else => self.index -= 1,
+                    else => {},
                 }
             },
+
             .float => switch (self.buffer[self.index]) {
                 '_', 'a'...'d', 'f'...'o', 'q'...'z', 'A'...'D', 'F'...'O', 'Q'...'Z', '0'...'9' => {
                     self.index += 1;

squeek502 · April 15, 2025, 5:24pm

Sorry, I didn’t mean to discourage you, just wanted to suggest that more apples-to-apples benchmarking is needed.

It’d be helpful if you could provide instructions on exactly how you benchmarked your implementation and how others can reproduce your results (you mentioned Sema.zig, but your BM_Lex function uses an embedded tracy.zig?). As of now, I’m not sure exactly how to run your benchmark or what benchmark you’re actually using.

It’d also be good if you could run the Zig benchmark code in my comment that you replied to and post your results (remember to compile with -OReleaseFast).