EDIT: my math was wrong here. I originally calculated std at 12 to 125 MiB/s, but it’s more like 400 to 700 MiB/s.
Interesting stuff. A couple of weeks ago I happened to create a benchmark to test a theory about the lexer. Here’s the discussion about it, and here’s the benchmark.
The benchmark takes anywhere from 40 to 400 ms depending on the machine and lexes about 50 MiB of data, which works out to roughly 12 MiB/s up to 125 MiB/s (a big range, I know). The newer Apple silicon machines outperform the rest by quite a bit. The benchmark lexes the same assortment of source code a thousand times to help take IO out of the equation, so your number of 30 MiB/s for the current implementation seems consistent.
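For reference, the throughput math is just data lexed divided by elapsed time: 50 MiB / 0.400 s ≈ 125 MiB/s at the slow end and 50 MiB / 0.040 s ≈ 1250 MiB/s at the fast end, so the re-measured 400 to 700 MiB/s in the edit above sits comfortably inside what these timings allow.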
I’d be very curious to see how your code performs in my benchmark. Would you be able to provide some Zig code to try it out? Here’s what the current code looks like for the two existing implementations:
switch (impl) {
    .std => for (0..loop_count) |_| {
        var tokenizer: std.zig.Tokenizer = .{
            .buffer = tokens,
            .index = 0,
        };
        while (true) {
            const token = tokenizer.next();
            if (token.tag == .eof) break;
            token_count += 1;
        }
    },
    .custom => for (0..loop_count) |_| {
        var tokenizer: custom.Tokenizer = .{
            .index = 0,
        };
        while (true) {
            const token = tokenizer.next(tokens);
            if (token.tag == .eof) break;
            token_count += 1;
        }
    },
    .rewrite => {
        // code that uses your lexer goes here (see the sketch below)
    },
}
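In case it helps, here’s a rough sketch of what I’d imagine the .rewrite prong looking like, assuming your lexer exposes an init-plus-next API similar to std.zig.Tokenizer. The rewrite namespace, the Lexer.init signature, and the token.tag/.eof names are all guesses on my part, so adjust to whatever your actual API is:

.rewrite => for (0..loop_count) |_| {
    // Hypothetical API: build a lexer over the same source buffer
    // the other two implementations use.
    var lexer = rewrite.Lexer.init(tokens);
    while (true) {
        const token = lexer.next();
        if (token.tag == .eof) break;
        token_count += 1;
    }
},

If your lexer needs an allocator or any one-time setup, it’d probably be best to hoist that out of the loop so we’re timing just the lexing, the same as the other two prongs.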