Hello everyone, it’s Niles (@Validark).
Today I have finally open-sourced my new compaction-based tokenizer for the Zig programming language. This does not mean it is done, but rather that I want to work on it publicly from now on so people can see how I progress each week, now that I have started working on it regularly. Currently, it only works on x86-64 machines supporting AVX-512, and that platform will continue to be my focus for the next two or three weeks. After that I will optimize for RISC-V and ARM, and then make sure we have a good SWAR target for everyone else. (LLVM vector support is in its infancy for other platforms, so I don’t feel it is worth the effort for me to hand-optimize for them.)
It currently runs at a throughput of 1.4 GB/s (single-core) on my laptop (~2.75x faster than the current implementation). One can only imagine how blazingly fast it will be when I finish all the optimizations and we test it on AVX-512 hardware that doesn’t double-pump under the hood. (Double-pumping means that my CPU supports 64-byte vectors by executing them as two 32-byte operations in the micro-architecture. This cuts the throughput of almost all AVX-512 operations in half on my machine.)
Check out my article on the subject here:
31 Likes
Love to see progress on this, looking forward to the next talk about it.
7 Likes
Are there plans to export this to C?
What do you have in mind?
const Keywords = struct {
and more functions that match keywords.
You mean make a C tokenizer? That could be done.
1 Like
The classic approach to tokenizing C requires context injection, via the venerable lexer hack.
This has always been completely optional in that resolving ambiguities can be done after lexing is complete. Although “ultrafast SIMD C compacting lexer complete with lexer hack” also sounds like a fun project, for some values of fun.
I mention it mostly because it’s interesting, and to express that I’m very pleased Zig avoided doing this.
1 Like
The Zig tokenizers don’t differentiate between identifiers that are variables and identifiers that are types. I’m not sure why you’d do that for C unless you were trying to hook into a pre-existing implementation that made the assumption that the lexer did a little bit of parsing work.
1 Like
Yes, I believe I said that this is a nice property which Zig has: the parser doesn’t need to know that, because the grammar makes it unambiguous.
Someone might want to make the fastest ever lexer-hack-hacking C lexer for about the same reason anyone tries for the fastest ever version of something. It’s fine if you don’t, just a thought is all.
Perhaps something like recognizing typedefs as they occur, and building up Aho-Corasick to catch them when it matters? Spitballing here.
Classically, the value is that it means parsing can be context-free. C requires something to be context-sensitive, either the lexer or the parser.
From Wikipedia:
the lexer hack is a solution to parsing context-sensitive grammars such as C, where classifying a sequence of characters as a variable name or a type name requires contextual information
Perhaps I’m misunderstanding, but isn’t Zig context-sensitive too? You can’t tell whether an identifier is a type or a variable name unless you know the context. It’s actually completely ambiguous in Zig. E.g. const x = Foo(y);
Both x and y could be types, neither could be, or just one could be. This is simply not a problem addressed by Zig tokenizers. You could address it, of course, but I don’t think it would be a good idea.
I think the reason ancient C compilers were architected differently is because ancient computers were extremely RAM-starved. They couldn’t even read a whole file into RAM, so they had to stream small pieces of the source file(s) through the whole compiler pipeline. For them, it made the most sense to do it that way. In 2025, I don’t see the point in optimizing for such constraints unless you are trying to run your compiler on your thermometer.
I guess a more appropriate comparison would be const x: Foo(Bar) = ...
It isn’t actually, because the context sensitivity in C is about the parse tree, not the semantics. In Zig an identifier can be a type because types are values, but “type position” (illegal unless an identifier is a type) is always clear.
In C, you can see something like this:
a * b;
If it’s preceded by this:
typedef int a;
a * b;
Then it’s declaring a pointer to an a named b.
But if it looks like this:
int a = 2, b = 3;
a * b;
Then it’s multiplication. Zig does not have this problem: value resolution doesn’t affect the parse tree. It’s better that way.
5 Likes
That makes sense to me. So then what is the advantage of doing value resolution in the lexer instead of the parser?
1 Like
As advantages go, I think, based on most C compilers handling it with the parser, the answer is ‘not many’.
But it does let the parser use a context-free grammar. Basically the context dependence has to go somewhere. Sometimes they use pull parsing and feed type names back to the lexer.
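As a rough sketch of that feedback loop (not Validark’s code; the names register_typedef, is_typedef_name, and the token kinds are made up for illustration), a C lexer using the hack might look like this:

#include <string.h>

/* Hypothetical token kinds for this sketch. */
enum token_kind { TK_IDENTIFIER, TK_TYPE_NAME };

/* Toy table of typedef names seen so far; a real compiler would use
   a hash table with proper scope handling. The caller must keep the
   name strings alive. */
static const char *typedef_names[1024];
static int typedef_count = 0;

/* The parser calls this when it finishes parsing a typedef declaration. */
static void register_typedef(const char *name) {
    if (typedef_count < 1024)
        typedef_names[typedef_count++] = name;
}

static int is_typedef_name(const char *name) {
    for (int i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return 1;
    return 0;
}

/* The "hack": before emitting an identifier token, the lexer consults
   the parser-maintained table to decide which kind of token to emit. */
static enum token_kind classify_identifier(const char *name) {
    return is_typedef_name(name) ? TK_TYPE_NAME : TK_IDENTIFIER;
}

With that extra token kind, a * b; lexes as TYPE_NAME * IDENTIFIER ; after a typedef of a, and as IDENTIFIER * IDENTIFIER ; otherwise, so the parser’s grammar itself can stay context-free.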
The lexer hack is something I find interesting. Making a super-fast C lexer which doesn’t use it is also a valid use of one’s time.
Maybe I’ll put that on my list of side projects. It shouldn’t be too hard to port the Zig tokenizer to a C tokenizer. But personally, I believe “lexical analysis” should go in the parser. Well… maybe. Not for Zig though. I actually have some radically different ideas on how parsers should be done too. My hope is I’ll show the world how it’s done this year.
4 Likes
Here is a straight C port of the Zig tokenizer: zig0/tokenizer.c at main - zig0; I am testing it with gcc, clang, and tcc.
It passes all unit tests of zig 0.14. I did not make any effort to optimize it, since my goal is to have it as close to the Zig version as possible.
Feel free to copy/reuse. I will specify it’s under MIT license sometime soon.
3 Likes
I do have a similar project, inspired by your work. One function you can optimize on all three compilers is getKeyword(). Instead of a strncmp loop over every keyword, replace it with a switch on the keyword length, then memcmp the string. It made GCC 14 go from 22 ms to 10.5 ms. TCC 0.9.27 went from around 35 ms to 22 ms. Clang 19 stayed the same at around 7.5 ms. All timings are from reading Zig 0.14’s src/Sema.zig.
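In case it helps anyone reading along, here is a minimal sketch of that shape of getKeyword() in C; the name get_keyword, the tag enum, and the tiny keyword subset are made up for illustration, not taken from the ports above:

#include <stddef.h>
#include <string.h>

/* Illustrative token tags; a real port has one per Zig keyword. */
enum token_tag {
    TAG_IDENTIFIER,
    TAG_KEYWORD_FN,
    TAG_KEYWORD_CONST,
    TAG_KEYWORD_RETURN,
};

/* Dispatch on the length first so only a handful of candidates remain,
   then confirm each one with a single memcmp. Only three keywords are
   shown; a full version has a case for every keyword length. */
static enum token_tag get_keyword(const char *s, size_t len) {
    switch (len) {
    case 2:
        if (memcmp(s, "fn", 2) == 0) return TAG_KEYWORD_FN;
        break;
    case 5:
        if (memcmp(s, "const", 5) == 0) return TAG_KEYWORD_CONST;
        break;
    case 6:
        if (memcmp(s, "return", 6) == 0) return TAG_KEYWORD_RETURN;
        break;
    default:
        break;
    }
    return TAG_IDENTIFIER;
}

The length switch is what removes the per-keyword strncmp loop: the compiler can lower the switch to a jump table, and each case only ever runs one or two comparisons.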
3 Likes
Are you interested in seeing this tokenizer included in a speed test? I guess I am wondering what I should do with this information.
Sorry, I did not mean to hijack your thread.
I am not interested in speed comparisons. It was a heads-up link for anyone who may be looking to implement a Zig tokenizer or parser in C.