300 MiB/s Zig lexer, fixing an edge case in the grammar

I implemented a Zig lexer. Early benchmarks show a promising ~300 MiB/s, which seems to be ~10x faster than anything else I found.

There’s a catch, though: a bug in the grammar. Consider an input starting with 0. (zero followed by a dot): the lexer can’t tell whether it’s a float literal (0.0) or an integer literal followed by a .. token (0..3). That’s a problem because my implementation needs to decide when a token starts based only on the current state and the next character; it emits the token-start info tied to the current position.
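To make the ambiguity concrete, here’s a tiny snippet (with made-up names) where both meanings occur; each marked literal begins with the same two characters, 0.:

test "dot ambiguity" {
    const a: f32 = 0.5; // here, 0. turns out to begin a float literal
    const xs = [_]u8{ 1, 2, 3, 4 };
    const b = xs[0..3]; // here, the same 0. prefix is an integer literal followed by ..
    _ = a;
    _ = b;
}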

You can figure out later which case it is, but by then it’s too late: the lexer has already moved on, making the location info for .. / ... wrong.

Special-casing this would slow the lexer down a lot, since the implementation is table-driven.

The current implementation has to backtrack in this situation.

There’s a simple fix, though: replace .. and ... with a different operator that doesn’t start with a dot, for example :, which is what I used during testing.
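For illustration (the spelling is hypothetical; : is just what I used while testing): slice[0..3] would become slice[0:3], and for (0..10) |i| would become for (0:10) |i|.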

I’m curious how the community would feel about this. Changing the grammar without a strong reason is annoying, but here the benefit is clear. Personally, I really don’t care whether I write ... or :.

Note: a fast lexer is step one toward a fast formatter. Who wouldn’t want a 300 MiB/s formatter? :grin:

A simple transition plan could be to accept both tokens for a release cycle or two and have zig fmt auto-convert code to the new token.

Thoughts?

PS: I measured the current implementation at around 30 MiB/s, but it is doing parsing, IO, and allocation as well, so this is not a fair comparison at all. I do believe I can use similar techniques to build a parser on top of my lexer and keep a lot of the speed. That is the plan, anyway.

5 Likes

How does it compare to the current lexer? Is it really 10x faster? If so, do you have any insights as to why?

1 Like

The main speedup is that the entire lexer is a ~15-instruction loop, with a dependency chain 5 instructions long, that indexes through a baked lookup table instead of a big switch that leans heavily on the branch predictor. I wasn’t able to measure just the lexer, but the current lexer+parser runs at around 30 MiB/s, and that includes IO syscalls and such that my 300 MiB/s figure does not.
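To illustrate the shape of the idea, here’s a toy sketch (not my actual implementation) of a table-driven scanner that counts identifiers; the real lexer does the same kind of thing with a much larger state machine and token-emitting side tables:

const std = @import("std");

// Toy DFA: state 0 = outside an identifier, state 1 = inside one.
// Entry layout: bit 0 = "an identifier ended here", bit 1 = next state.
fn buildTable() [2][256]u8 {
    @setEvalBranchQuota(10_000);
    var t: [2][256]u8 = undefined;
    for (0..2) |s| {
        for (0..256) |c| {
            const is_ident = std.ascii.isAlphanumeric(@intCast(c)) or c == '_';
            const next: u8 = @intFromBool(is_ident);
            const ended: u8 = @intFromBool(s == 1 and !is_ident);
            t[s][c] = ended | (next << 1);
        }
    }
    return t;
}

const table = buildTable(); // baked at comptime

fn countIdentifiers(src: []const u8) usize {
    var state: u8 = 0;
    var count: usize = 0;
    for (src) |c| {
        const entry = table[state][c];
        count += entry & 1; // branchless: +1 only when a token ended
        state = entry >> 1; // the loop-carried dependency is just this
    }
    return count + state; // account for an identifier running to EOF
}

test "toy table-driven scan" {
    try std.testing.expectEqual(@as(usize, 3), countIdentifiers("foo bar_1 baz"));
}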

1 Like

EDIT: my math was wrong here. I calculated that std could do 12 to 125 MiB/s, but it’s more like 400 to 700 MiB/s.

Interesting stuff. A couple of weeks ago I happened to create a benchmark to test out a theory about the lexer. Here’s the discussion about it and here’s the benchmark.

The benchmark takes anywhere from 40 to 400 ms depending on the machine and lexes about 50 MiB of data, which means it’s lexing at anywhere from about 12 MiB/s up to 125 MiB/s (a big range, I know). The newer Apple silicon machines outperform the rest by quite a bit. It lexes the same assortment of source code a thousand times to help take IO out of the equation, so your number of 30 MiB/s for the current implementation seems consistent.

I’d be very curious to see how your code does in my benchmark. Would you be able to provide me with some Zig code to try out? Here’s what the current code looks like for both benchmarks:

    switch (impl) {
        .std => for (0..loop_count) |_| {
            var tokenizer: std.zig.Tokenizer = .{
                .buffer = tokens,
                .index = 0,
            };
            while (true) {
                const token = tokenizer.next();
                if (token.tag == .eof) break;
                token_count += 1;
            }
        },
        .custom => for (0..loop_count) |_| {
            var tokenizer: custom.Tokenizer = .{
                .index = 0,
            };
            while (true) {
                const token = tokenizer.next(tokens);
                if (token.tag == .eof) break;
                token_count += 1;
            }
        },
        .rewrite => {
            // code that uses your lexer goes here
        }
    }

That sounds really neat. My benchmark is really basic. I will send you the code when I return to my laptop in ~5 hours.

1 Like

I don’t think we should sacrifice readability for small performance improvements. And I personally find : less readable than .., since it adds less space (and more noise) between identifiers, making it harder to see at a glance where, or if, there is a range at all.

Couldn’t you just add an if to your implementation to handle that case? A branch that’s almost never taken is basically free thanks to branch prediction.

1 Like

Also, I do wonder if things can be improved from here. 300 MiB/s seems rather slow given the capabilities of modern hardware. I think it should be possible to use SIMD somehow: for example, to find the next newline in a comment, the next non-alphanumeric character in an identifier, or the next non-whitespace character.
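A minimal sketch of the newline case, assuming a 16-byte vector width (findNewline is a hypothetical helper, not code from either implementation):

const std = @import("std");

// Find the index of the next newline at or after pos, comparing
// 16 bytes per step; falls back to a scalar loop for the tail.
fn findNewline(src: []const u8, pos: usize) usize {
    const V = @Vector(16, u8);
    const needle: V = @splat('\n');
    var i = pos;
    while (i + 16 <= src.len) : (i += 16) {
        const chunk: V = src[i..][0..16].*;
        const mask: u16 = @bitCast(chunk == needle);
        if (mask != 0) return i + @ctz(mask);
    }
    while (i < src.len and src[i] != '\n') i += 1;
    return i;
}

test "simd newline scan" {
    try std.testing.expectEqual(@as(usize, 22), findNewline("// a comment this long\n", 0));
}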

Yes: GitHub - Validark/Accelerated-Zig-Parser: A high-throughput parser for the Zig programming language.

3 Likes

I unfortunately can’t. The lexer is a single hot loop (pardon my C++ on the Zig forum):

static constexpr auto
do_transition(State s, u8 input, SourceIndex i,
              std::vector<Token>::iterator &types_head,
              std::vector<SourceIndex>::iterator &start_head) -> State {
  auto entry = dfa[s][input];

  // Write unconditionally, advance conditionally: the stores always
  // happen, so the hot path contains no branches.
  *types_head = state_to_token[s];
  *start_head = i;

  types_head += !!(entry & 1); // bit 0: commit a token type
  start_head += !!(entry & 4); // bit 2: commit a token start

  return static_cast<State>(entry >> 3); // remaining bits: next state
}
// loop over input omitted 

Adding anything here, especially something that would recompute the state index, would slow the lexer down significantly, since the bottleneck is the data-dependency chain of instructions computing the index into the next state’s LUT.

My first implementation did use aggressive SIMD and achieved a disappointing 100 MiB/s, bottlenecked by CPU instruction throughput rather than by instruction latency, which is the current implementation’s bottleneck.

You can find my SIMD code on a different branch in the same repo.

Peak performance would probably come from a hybrid approach, using SIMD to skip over whitespace, identifiers, and comments. The rest of the lexical structures could be parsed with SIMD as well, but it doesn’t seem to be worth it.

Well, to me that sounds like you could absolutely afford an extra branch, as long as it doesn’t impact the data dependency chain in the likely case. And in the unlikely case it doesn’t really matter anyway. You could probably afford to be 100× slower on the unlikely path without impacting overall performance by much.
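A toy sketch of what I mean (hypothetical, not anyone’s actual implementation): the rare path is guarded by a branch that is almost never taken on typical source, so the predictor makes it nearly free:

const std = @import("std");

// Hot loop with one rare, highly predictable branch: the dot check
// stands in for the 0. disambiguation work in a real lexer.
fn countRangeOps(src: []const u8) usize {
    var range_ops: usize = 0;
    var i: usize = 0;
    while (i < src.len) : (i += 1) {
        if (src[i] == '.' and i + 1 < src.len and src[i + 1] == '.') {
            range_ops += 1; // slow path: resolve .. vs ... here
            i += 1;
        }
    }
    return range_ops;
}

test "rare slow path" {
    try std.testing.expectEqual(@as(usize, 1), countRangeOps("xs[0..3]"));
}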

1 Like

Here’s my attempt at reimplementing your benchmark in Zig, running only the tokenizer:

Benchmark code
const std = @import("std");

pub fn main() !void {
    const iterations = 100000;
    var timer = try std.time.Timer.start();
    for (0..iterations) |_| {
        var tokenizer = std.zig.Tokenizer.init(source);
        while (true) {
            const token = tokenizer.next();
            if (token.tag == .eof) break;
        }
    }
    const ns_taken = timer.read();
    const seconds_taken = @as(f64, @floatFromInt(ns_taken)) / std.time.ns_per_s;
    const bytes_processed = iterations * source.len;
    const mib_processed = @as(f64, @floatFromInt(bytes_processed)) / 1024 / 1024;
    const mib_per_s = mib_processed / seconds_taken;
    std.debug.print("{d:.2}MiB/s\n", .{mib_per_s});
}

const source =
    \\const std = @import("std");
    \\const builtin = @import("builtin");
    \\const build_options = @import("build_options");
    \\
    \\pub const enable = if (builtin.is_test) false else build_options.enable_tracy;
    \\pub const enable_allocation = enable and build_options.enable_tracy_allocation;
    \\pub const enable_callstack = enable and build_options.enable_tracy_callstack;
    \\pub const callstack_depth = if (enable_callstack and build_options.tracy_callstack_depth > 0) build_options.tracy_callstack_depth else 10;
    \\
    \\const ___tracy_c_zone_context = extern struct {
    \\    id: u32,
    \\    active: c_int,
    \\
    \\    pub inline fn end(self: @This()) void {
    \\        ___tracy_emit_zone_end(self);
    \\    }
    \\
    \\    pub inline fn addText(self: @This(), text: []const u8) void {
    \\        ___tracy_emit_zone_text(self, text.ptr, text.len);
    \\    }
    \\
    \\    pub inline fn setName(self: @This(), name: []const u8) void {
    \\        ___tracy_emit_zone_name(self, name.ptr, name.len);
    \\    }
    \\
    \\    pub inline fn setColor(self: @This(), color: u32) void {
    \\        ___tracy_emit_zone_color(self, color);
    \\    }
    \\
    \\    pub inline fn setValue(self: @This(), value: u64) void {
    \\        ___tracy_emit_zone_value(self, value);
    \\    }
    \\};
    \\
    \\pub const Ctx = if (enable) ___tracy_c_zone_context else struct {
    \\    pub inline fn end(self: @This()) void {
    \\        _ = self;
    \\    }
    \\
    \\    pub inline fn addText(self: @This(), text: []const u8) void {
    \\        _ = self;
    \\        _ = text;
    \\    }
    \\
    \\    pub inline fn setName(self: @This(), name: []const u8) void {
    \\        _ = self;
    \\        _ = name;
    \\    }
    \\
    \\    pub inline fn setColor(self: @This(), color: u32) void {
    \\        _ = self;
    \\        _ = color;
    \\    }
    \\
    \\    pub inline fn setValue(self: @This(), value: u64) void {
    \\        _ = self;
    \\        _ = value;
    \\    }
    \\};
    \\
    \\pub inline fn trace(comptime src: std.builtin.SourceLocation) Ctx {
    \\    if (!enable) return .{};
    \\
    \\    const global = struct {
    \\        const loc: ___tracy_source_location_data = .{
    \\            .name = null,
    \\            .function = src.fn_name.ptr,
    \\            .file = src.file.ptr,
    \\            .line = src.line,
    \\            .color = 0,
    \\        };
    \\    };
    \\
    \\    if (enable_callstack) {
    \\        return ___tracy_emit_zone_begin_callstack(&global.loc, callstack_depth, 1);
    \\    } else {
    \\        return ___tracy_emit_zone_begin(&global.loc, 1);
    \\    }
    \\}
    \\
    \\pub inline fn traceNamed(comptime src: std.builtin.SourceLocation, comptime name: [:0]const u8) Ctx {
    \\    if (!enable) return .{};
    \\
    \\    const global = struct {
    \\        const loc: ___tracy_source_location_data = .{
    \\            .name = name.ptr,
    \\            .function = src.fn_name.ptr,
    \\            .file = src.file.ptr,
    \\            .line = src.line,
    \\            .color = 0,
    \\        };
    \\    };
    \\
    \\    if (enable_callstack) {
    \\        return ___tracy_emit_zone_begin_callstack(&global.loc, callstack_depth, 1);
    \\    } else {
    \\        return ___tracy_emit_zone_begin(&global.loc, 1);
    \\    }
    \\}
    \\
    \\pub fn tracyAllocator(allocator: std.mem.Allocator) TracyAllocator(null) {
    \\    return TracyAllocator(null).init(allocator);
    \\}
    \\
    \\pub fn TracyAllocator(comptime name: ?[:0]const u8) type {
    \\    return struct {
    \\        parent_allocator: std.mem.Allocator,
    \\
    \\        const Self = @This();
    \\
    \\        pub fn init(parent_allocator: std.mem.Allocator) Self {
    \\            return .{
    \\                .parent_allocator = parent_allocator,
    \\            };
    \\        }
    \\
    \\        pub fn allocator(self: *Self) std.mem.Allocator {
    \\            return .{
    \\                .ptr = self,
    \\                .vtable = &.{
    \\                    .alloc = allocFn,
    \\                    .resize = resizeFn,
    \\                    .free = freeFn,
    \\                },
    \\            };
    \\        }
    \\
    \\        fn allocFn(ptr: *anyopaque, len: usize, ptr_align: u8, ret_addr: usize) ?[*]u8 {
    \\            const self: *Self = @ptrCast(@alignCast(ptr));
    \\            const result = self.parent_allocator.rawAlloc(len, ptr_align, ret_addr);
    \\            if (result) |data| {
    \\                if (len != 0) {
    \\                    if (name) |n| {
    \\                        allocNamed(data, len, n);
    \\                    } else {
    \\                        alloc(data, len);
    \\                    }
    \\                }
    \\            } else {
    \\                messageColor("allocation failed", 0xFF0000);
    \\            }
    \\            return result;
    \\        }
    \\
    \\        fn resizeFn(ptr: *anyopaque, buf: []u8, buf_align: u8, new_len: usize, ret_addr: usize) bool {
    \\            const self: *Self = @ptrCast(@alignCast(ptr));
    \\            if (self.parent_allocator.rawResize(buf, buf_align, new_len, ret_addr)) {
    \\                if (name) |n| {
    \\                    freeNamed(buf.ptr, n);
    \\                    allocNamed(buf.ptr, new_len, n);
    \\                } else {
    \\                    free(buf.ptr);
    \\                    alloc(buf.ptr, new_len);
    \\                }
    \\
    \\                return true;
    \\            }
    \\
    \\            // during normal operation the compiler hits this case thousands of times due to this
    \\            // emitting messages for it is both slow and causes clutter
    \\            return false;
    \\        }
    \\
    \\        fn freeFn(ptr: *anyopaque, buf: []u8, buf_align: u8, ret_addr: usize) void {
    \\            const self: *Self = @ptrCast(@alignCast(ptr));
    \\            self.parent_allocator.rawFree(buf, buf_align, ret_addr);
    \\            // this condition is to handle free being called on an empty slice that was never even allocated
    \\            // example case: `std.process.getSelfExeSharedLibPaths` can return `&[_][:0]u8{}`
    \\            if (buf.len != 0) {
    \\                if (name) |n| {
    \\                    freeNamed(buf.ptr, n);
    \\                } else {
    \\                    free(buf.ptr);
    \\                }
    \\            }
    \\        }
    \\    };
    \\}
    \\
    \\// This function only accepts comptime-known strings, see `messageCopy` for runtime strings
    \\pub inline fn message(comptime msg: [:0]const u8) void {
    \\    if (!enable) return;
    \\    ___tracy_emit_messageL(msg.ptr, if (enable_callstack) callstack_depth else 0);
    \\}
    \\
    \\// This function only accepts comptime-known strings, see `messageColorCopy` for runtime strings
    \\pub inline fn messageColor(comptime msg: [:0]const u8, color: u32) void {
    \\    if (!enable) return;
    \\    ___tracy_emit_messageLC(msg.ptr, color, if (enable_callstack) callstack_depth else 0);
    \\}
    \\
    \\pub inline fn messageCopy(msg: []const u8) void {
    \\    if (!enable) return;
    \\    ___tracy_emit_message(msg.ptr, msg.len, if (enable_callstack) callstack_depth else 0);
    \\}
    \\
    \\pub inline fn messageColorCopy(msg: [:0]const u8, color: u32) void {
    \\    if (!enable) return;
    \\    ___tracy_emit_messageC(msg.ptr, msg.len, color, if (enable_callstack) callstack_depth else 0);
    \\}
    \\
    \\pub inline fn frameMark() void {
    \\    if (!enable) return;
    \\    ___tracy_emit_frame_mark(null);
    \\}
    \\
    \\pub inline fn frameMarkNamed(comptime name: [:0]const u8) void {
    \\    if (!enable) return;
    \\    ___tracy_emit_frame_mark(name.ptr);
    \\}
    \\
    \\pub inline fn namedFrame(comptime name: [:0]const u8) Frame(name) {
    \\    frameMarkStart(name);
    \\    return .{};
    \\}
    \\
    \\pub fn Frame(comptime name: [:0]const u8) type {
    \\    return struct {
    \\        pub fn end(_: @This()) void {
    \\            frameMarkEnd(name);
    \\        }
    \\    };
    \\}
    \\
    \\inline fn frameMarkStart(comptime name: [:0]const u8) void {
    \\    if (!enable) return;
    \\    ___tracy_emit_frame_mark_start(name.ptr);
    \\}
    \\
    \\inline fn frameMarkEnd(comptime name: [:0]const u8) void {
    \\    if (!enable) return;
    \\    ___tracy_emit_frame_mark_end(name.ptr);
    \\}
    \\
    \\extern fn ___tracy_emit_frame_mark_start(name: [*:0]const u8) void;
    \\extern fn ___tracy_emit_frame_mark_end(name: [*:0]const u8) void;
    \\
    \\inline fn alloc(ptr: [*]u8, len: usize) void {
    \\    if (!enable) return;
    \\
    \\    if (enable_callstack) {
    \\        ___tracy_emit_memory_alloc_callstack(ptr, len, callstack_depth, 0);
    \\    } else {
    \\        ___tracy_emit_memory_alloc(ptr, len, 0);
    \\    }
    \\}
    \\
    \\inline fn allocNamed(ptr: [*]u8, len: usize, comptime name: [:0]const u8) void {
    \\    if (!enable) return;
    \\
    \\    if (enable_callstack) {
    \\        ___tracy_emit_memory_alloc_callstack_named(ptr, len, callstack_depth, 0, name.ptr);
    \\    } else {
    \\        ___tracy_emit_memory_alloc_named(ptr, len, 0, name.ptr);
    \\    }
    \\}
    \\
    \\inline fn free(ptr: [*]u8) void {
    \\    if (!enable) return;
    \\
    \\    if (enable_callstack) {
    \\        ___tracy_emit_memory_free_callstack(ptr, callstack_depth, 0);
    \\    } else {
    \\        ___tracy_emit_memory_free(ptr, 0);
    \\    }
    \\}
    \\
    \\inline fn freeNamed(ptr: [*]u8, comptime name: [:0]const u8) void {
    \\    if (!enable) return;
    \\
    \\    if (enable_callstack) {
    \\        ___tracy_emit_memory_free_callstack_named(ptr, callstack_depth, 0, name.ptr);
    \\    } else {
    \\        ___tracy_emit_memory_free_named(ptr, 0, name.ptr);
    \\    }
    \\}
    \\
    \\extern fn ___tracy_emit_zone_begin(
    \\    srcloc: *const ___tracy_source_location_data,
    \\    active: c_int,
    \\) ___tracy_c_zone_context;
    \\extern fn ___tracy_emit_zone_begin_callstack(
    \\    srcloc: *const ___tracy_source_location_data,
    \\    depth: c_int,
    \\    active: c_int,
    \\) ___tracy_c_zone_context;
    \\extern fn ___tracy_emit_zone_text(ctx: ___tracy_c_zone_context, txt: [*]const u8, size: usize) void;
    \\extern fn ___tracy_emit_zone_name(ctx: ___tracy_c_zone_context, txt: [*]const u8, size: usize) void;
    \\extern fn ___tracy_emit_zone_color(ctx: ___tracy_c_zone_context, color: u32) void;
    \\extern fn ___tracy_emit_zone_value(ctx: ___tracy_c_zone_context, value: u64) void;
    \\extern fn ___tracy_emit_zone_end(ctx: ___tracy_c_zone_context) void;
    \\extern fn ___tracy_emit_memory_alloc(ptr: *const anyopaque, size: usize, secure: c_int) void;
    \\extern fn ___tracy_emit_memory_alloc_callstack(ptr: *const anyopaque, size: usize, depth: c_int, secure: c_int) void;
    \\extern fn ___tracy_emit_memory_free(ptr: *const anyopaque, secure: c_int) void;
    \\extern fn ___tracy_emit_memory_free_callstack(ptr: *const anyopaque, depth: c_int, secure: c_int) void;
    \\extern fn ___tracy_emit_memory_alloc_named(ptr: *const anyopaque, size: usize, secure: c_int, name: [*:0]const u8) void;
    \\extern fn ___tracy_emit_memory_alloc_callstack_named(ptr: *const anyopaque, size: usize, depth: c_int, secure: c_int, name: [*:0]const u8) void;
    \\extern fn ___tracy_emit_memory_free_named(ptr: *const anyopaque, secure: c_int, name: [*:0]const u8) void;
    \\extern fn ___tracy_emit_memory_free_callstack_named(ptr: *const anyopaque, depth: c_int, secure: c_int, name: [*:0]const u8) void;
    \\extern fn ___tracy_emit_message(txt: [*]const u8, size: usize, callstack: c_int) void;
    \\extern fn ___tracy_emit_messageL(txt: [*:0]const u8, callstack: c_int) void;
    \\extern fn ___tracy_emit_messageC(txt: [*]const u8, size: usize, color: u32, callstack: c_int) void;
    \\extern fn ___tracy_emit_messageLC(txt: [*:0]const u8, color: u32, callstack: c_int) void;
    \\extern fn ___tracy_emit_frame_mark(name: ?[*:0]const u8) void;
    \\
    \\const ___tracy_source_location_data = extern struct {
    \\    name: ?[*:0]const u8,
    \\    function: [*:0]const u8,
    \\    file: [*:0]const u8,
    \\    line: u32,
    \\    color: u32,
    \\};
;

These are my results:

$ zig build-exe tokenizerbench.zig -OReleaseFast
$ ./tokenizerbench
406.75MiB/s
3 Likes

Ah, thanks Squeek, looks like I made a couple of mistakes in my math. Here are the actual numbers/breakdown for the std tokenizer from my benchmark.

The benchmark processes the same 218 KiB tokens.zig file 1000 times, which comes out to about 214 MiB processed in the elapsed time.

It does this in about 0.46 seconds; some platforms are a bit faster, more like 0.3 seconds. At the slower 0.46-second time, 214 / 0.46 gives about 465 MiB/s. A time of 0.3 seconds would be over 700 MiB/s.

It seems like this alternative implementation could actually be a performance regression. I’d like to see both variants tested on the same machine with the same benchmark to know for sure.

With the Zig tokenizer in C at d1172acadf, reading Zig 0.14.0’s src/Sema.zig (1.65 MB) with -march=icelake-client, GCC 15 (fdd95e1cf29137a19baed25f8c817d320dfe63e3) reads it in 21.1 ms (78.19 MB/s) while Clang 19.1.7 reads it in 7.47 ms (220.88 MB/s).

This is not a bug in the grammar.

That is a bug in your implementation. It is unable to lex Zig source code, which is your goal.

This is a downstream problem, not an upstream problem.

1 Like

True, but it would be reasonable to argue that imposing an additional grammar limitation such as this would be valuable, since it unlocks implementations like this.

Of course, it’s also reasonable to argue that it’s not worth a worse syntax for slightly more lexing perf.

8 Likes

Ellipsis is also a thing (0...), so we would need another one-byte char for that too? It is used in ranged switch prongs.

Ah, never mind, you already proposed changing that to :.

If you have the patience for some slower-paced content (about fast code), check out this series of talks from Utah Zig about using SIMD and data-level parallelism to make Zig parsing faster:

3 Likes

Note that “implementations like this” being faster is still theoretical. From my reading, the benchmark in the OP seems to be comparing their lexer-only implementation to zig fmt.

3 Likes

Wow, what a plot twist. I am really confused by these results; they make my current implementation a disappointment. Or perhaps your computer is 10x faster than mine, which would make me happy /j :grin:

I suppose it could be related to the fact that ~80% of source code (sampled from Sema.zig) is identifiers, whitespace, or comments, something I ignored while writing the lexer. This could mean that the branches are actually very predictable and not a problem at all.

Maybe a lexer that accelerates skipping over these using SIMD would finally be a speed improvement…

I have to say the results so far are very discouraging, so I’ve been lacking the motivation to invest more time into this.