Hello everyone, it’s Niles (@Validark).
Today I have finally open-sourced my new compaction-based tokenizer for the Zig programming language. This does not mean it is done, but rather that I want to work on it publicly from now on so people can see how I progress each week, now that I have started working on it regularly. Currently, it only works on x86-64 machines supporting AVX-512, and that platform will continue to be my focus for the next two or three weeks. After that I will optimize for RISC-V and ARM, and then make sure we have a good SWAR target for everyone else. (LLVM vector support is in its infancy for other platforms, so I don’t feel it is worth the effort for me to hand-optimize for them.)
It currently runs at a throughput of 1.4 GB/s (single-core) on my laptop (~2.75x faster than the current implementation). One can only imagine how blazingly fast it will be when I finish all the optimizations and we test it on AVX-512 hardware that doesn’t double-pump under the hood. (Double-pumping means that my CPU supports 64-byte vectors by executing them as two 32-byte operations in the micro-architecture. This cuts the throughput of almost all AVX-512 operations in half on my machine.)
Check out my article on the subject here:
31 Likes
Love to see progress on this, looking forward to the next talk about it.
7 Likes
Are there plans to export this to C?
What do you have in mind?
const Keywords = struct {
and more functions that match keywords.
You mean make a C tokenizer? That could be done.
1 Like
The classic approach to tokenizing C requires context injection, via the venerable lexer hack.
This has always been completely optional in that resolving ambiguities can be done after lexing is complete. Although “ultrafast SIMD C compacting lexer complete with lexer hack” also sounds like a fun project, for some values of fun.
I mention it mostly because it’s interesting, and to express that I’m very pleased Zig avoided doing this.
1 Like
The Zig tokenizers don’t differentiate between identifiers that are variables and identifiers that are types. I’m not sure why you’d do that for C unless you were trying to hook into a pre-existing implementation that made the assumption that the lexer did a little bit of parsing work.
1 Like
Yes, I believe I said that this is a nice property which Zig has: the parser doesn’t need to know that, because the grammar makes it unambiguous.
Someone might want to make the fastest ever lexer-hack-hacking C lexer for about the same reason anyone tries for the fastest ever version of something. It’s fine if you don’t, just a thought is all.
Perhaps something like recognizing typedefs as they occur, and building up Aho-Corasick to catch them when it matters? Spitballing here.
Classically, the value is that it means parsing can be context-free. C requires something to be context-sensitive, either the lexer or the parser.
From Wikipedia:
the lexer hack is a solution to parsing context-sensitive grammars such as C, where classifying a sequence of characters as a variable name or a type name requires contextual information
Perhaps I’m misunderstanding, but isn’t Zig context-sensitive too? You can’t tell whether an identifier is a type or a variable name unless you know the context. It’s actually completely ambiguous in Zig. E.g. const x = Foo(y);
Both x and y could be types, neither could be, or just one could be. This is simply not a problem addressed by Zig tokenizers. You could address it, of course, but I don’t think it would be a good idea.
I think the reason ancient C compilers were architected differently is because ancient computers were extremely RAM-starved. They couldn’t even read a whole file into RAM, so they had to stream small pieces of the source file(s) through the whole compiler pipeline. For them, it made the most sense to do it that way. In 2025, I don’t see the point in optimizing for such constraints unless you are trying to run your compiler on your thermometer.
I guess a more appropriate comparison would be const x: Foo(Bar) = ...
It isn’t actually, because the context sensitivity in C is about the parse tree, not the semantics. In Zig an identifier can be a type because types are values, but “type position” (illegal unless an identifier is a type) is always clear.
In C, you can see something like this:
a * b;
If it’s preceded by this:
typedef int a;
a * b;
Then it’s declaring a pointer to an a named b.
But if it looks like this:
int a = 2, b = 3;
a * b;
Then it’s multiplication. Zig does not have this problem: value resolution doesn’t affect the parse tree. It’s better that way.
5 Likes
That makes sense to me. So then what is the advantage of doing value resolution in the lexer instead of the parser?
1 Like
As advantages go, I think, based on most C compilers handling it with the parser, the answer is ‘not many’.
But it does let the parser use a context-free grammar. Basically the context dependence has to go somewhere. Sometimes they use pull parsing and feed type names back to the lexer.
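As a rough sketch of that feedback loop (not Validark’s code; the names register_typedef, is_typedef_name, and the token kinds are made up for illustration), a C lexer using the hack might look like this:

#include <string.h>

/* Hypothetical token kinds for this sketch. */
enum token_kind { TK_IDENTIFIER, TK_TYPE_NAME };

/* Toy table of typedef names seen so far; a real compiler would use
   a hash table with proper scope handling. The caller must keep the
   name strings alive. */
static const char *typedef_names[1024];
static int typedef_count = 0;

/* The parser calls this when it finishes parsing a typedef declaration. */
static void register_typedef(const char *name) {
    if (typedef_count < 1024)
        typedef_names[typedef_count++] = name;
}

static int is_typedef_name(const char *name) {
    for (int i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return 1;
    return 0;
}

/* The "hack": before emitting an identifier token, the lexer consults
   the parser-maintained table to decide which kind of token to emit. */
static enum token_kind classify_identifier(const char *name) {
    return is_typedef_name(name) ? TK_TYPE_NAME : TK_IDENTIFIER;
}

With that extra token kind, a * b; lexes as TYPE_NAME * IDENTIFIER ; after a typedef of a, and as IDENTIFIER * IDENTIFIER ; otherwise, so the parser’s grammar itself can stay context-free.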
The lexer hack is something I find interesting. Making a super-fast C lexer which doesn’t use it is also a valid use of one’s time.
Maybe I’ll put that on my list of side projects. It shouldn’t be too hard to port the Zig tokenizer to a C tokenizer. But personally, I believe “lexical analysis” should go in the parser. Well… maybe. Not for Zig though. I actually have some radically different ideas on how parsers should be done too. My hope is I’ll show the world how it’s done this year.
4 Likes
Here is a straight C port of the Zig tokenizer: zig0/tokenizer.c at main - zig0; I am testing it with gcc, clang, and tcc.
It passes all unit tests of zig 0.14. I did not make any effort to optimize it, since my goal is to have it as close to the Zig version as possible.
Feel free to copy/reuse. I will specify it’s under MIT license sometime soon.
3 Likes
I do have a similar project, inspired by your work. One function you can optimize on all three compilers is getKeyword(). Instead of a strncmp loop over every keyword, replace it with a switch on the keyword length, then memcmp the string. It made GCC 14 go from 22 ms to 10.5 ms. TCC 0.9.27 went from around 35 ms to 22 ms. Clang 19 stayed the same at around 7.5 ms. All timings are from reading Zig 0.14’s src/Sema.zig.
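In case it helps anyone reading along, here is a minimal sketch of that shape of getKeyword() in C; the name get_keyword, the tag enum, and the tiny keyword subset are made up for illustration, not taken from the ports above:

#include <stddef.h>
#include <string.h>

/* Illustrative token tags; a real port has one per Zig keyword. */
enum token_tag {
    TAG_IDENTIFIER,
    TAG_KEYWORD_FN,
    TAG_KEYWORD_CONST,
    TAG_KEYWORD_RETURN,
};

/* Dispatch on the length first so only a handful of candidates remain,
   then confirm each one with a single memcmp. Only three keywords are
   shown; a full version has a case for every keyword length. */
static enum token_tag get_keyword(const char *s, size_t len) {
    switch (len) {
    case 2:
        if (memcmp(s, "fn", 2) == 0) return TAG_KEYWORD_FN;
        break;
    case 5:
        if (memcmp(s, "const", 5) == 0) return TAG_KEYWORD_CONST;
        break;
    case 6:
        if (memcmp(s, "return", 6) == 0) return TAG_KEYWORD_RETURN;
        break;
    default:
        break;
    }
    return TAG_IDENTIFIER;
}

The length switch is what removes the per-keyword strncmp loop: the compiler can lower the switch to a jump table, and each case only ever runs one or two comparisons.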
3 Likes
Are you interested in seeing this tokenizer included in a speed test? I guess I am wondering what I should do with this information.
Sorry, I did not mean to hijack your thread.
I am not interested in speed comparisons. It was a heads-up link for anyone who may be looking to implement a Zig tokenizer or parser in C.