New Compaction-based Zig Tokenizer

No, no reallocations necessary. You can just allocate a buffer with a length equal to the number of bytes in the source file, because you can never end up with more tokens than bytes. After writing all the tokens out, you can use the resize facility of the allocator interface to give back the unused memory.
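
As a rough sketch of that strategy (with a placeholder `Token` layout and a stand-in `writeTokens` pass, since the real tokenizer's types aren't shown here):

```zig
const std = @import("std");

// Placeholder token layout just for this sketch.
const Token = struct { tag: u8, start: u32 };

// Stand-in pass: emits one token at the start of each non-space run.
fn writeTokens(out: []Token, source: []const u8) usize {
    var n: usize = 0;
    for (source, 0..) |c, i| {
        if (c != ' ' and (i == 0 or source[i - 1] == ' ')) {
            out[n] = .{ .tag = 0, .start = @intCast(i) };
            n += 1;
        }
    }
    return n;
}

fn tokenize(gpa: std.mem.Allocator, source: []const u8) ![]Token {
    // One slot per input byte always suffices: every token is at least one byte.
    const buf = try gpa.alloc(Token, source.len);
    errdefer gpa.free(buf);

    const count = writeTokens(buf, source);

    // Give the unused tail back. `resize` shrinks in place when the allocator
    // supports it; otherwise fall back to an exact-sized copy so the returned
    // slice always matches its allocation.
    if (gpa.resize(buf, count)) return buf[0..count];
    const exact = try gpa.dupe(Token, buf[0..count]);
    gpa.free(buf);
    return exact;
}

test "tokenize over-allocates then shrinks" {
    const toks = try tokenize(std.testing.allocator, "const x = 1;");
    defer std.testing.allocator.free(toks);
    try std.testing.expectEqual(@as(usize, 4), toks.len);
}
```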

3 Likes

The project looks cool! Congrats on getting a huge performance boost on the tokenizer.

I had a pissing contest on HN a while back to produce the fewest ASM opcodes for detecting identifiers in chunks using SIMD. It used the same strategy, except I used 32-byte chunks because I wanted fewer instructions. Glad to see it actually works in a real tokenizer!
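
For anyone curious, something along these lines (a minimal sketch, not the exact code from that thread) classifies one 32-byte chunk into a bitstring using Zig vectors:

```zig
const std = @import("std");

/// Sketch: returns a bitmask where bit i is set when chunk[i] is an
/// identifier character ([A-Za-z0-9_]).
fn identifierMask(chunk: @Vector(32, u8)) u32 {
    const V = @Vector(32, u8);
    // Fold ASCII upper case onto lower case so one range check covers both.
    const folded = chunk | @as(V, @splat(0x20));
    const alpha = @as(u32, @bitCast(folded >= @as(V, @splat('a')))) &
        @as(u32, @bitCast(folded <= @as(V, @splat('z'))));
    const digit = @as(u32, @bitCast(chunk >= @as(V, @splat('0')))) &
        @as(u32, @bitCast(chunk <= @as(V, @splat('9'))));
    const underscore = @as(u32, @bitCast(chunk == @as(V, @splat('_'))));
    return alpha | digit | underscore;
}

test "identifierMask" {
    const line: [32]u8 = "const x1 = foo_bar + 2; // hi!  ".*;
    const mask = identifierMask(line);
    try std.testing.expect(mask & (1 << 0) != 0); // 'c' is an identifier char
    try std.testing.expect(mask & (1 << 5) == 0); // ' ' is not
}
```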

Well, next time you should pass the compiler flags properly so at least optimizations are enabled. That alone cuts your instruction count dramatically. :laughing:

You can cut the instruction count further on AVX-512 by using one of the vpermi2b/vpermt2b instructions to map the target characters to something readily convertible to a bitstring.

Latest talk:

2 Likes

Really appreciate the videos; they help me at least understand how SIMD can enable parallel processing of text. The visualizations of the bit strings are awesome.

Questions:

After writing a bunch of SIMD code (using vectors) and then targeting a CPU which does not have SIMD instructions, is it possible that the advanced SIMD tokenizer is slower than the “basic tokenizer”? My intuition says yes, because the non-SIMD lowering of vector code may be hundreds of instructions per operation. If so, would you want specialized code per CPU capability?

I expect SIMD code to effectively degenerate to manual loop unrolling on CPU targets that don't support SIMD instructions.

Against your intuition: if the CPU you have in mind is a superscalar processor with out-of-order execution, loop unrolling can still be beneficial.

If you want to be sure, you have to measure alternative implementations.
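
If you do specialize, one way to sketch it in Zig is to pick the chunk width from the compile target and fall back to a scalar loop. This assumes `std.simd.suggestVectorLength`, which recent standard libraries provide; the function below is just an illustration, not anything from the project:

```zig
const std = @import("std");

// Comptime-known: 0 on targets where the std lib reports no usable
// vector support for u8 lanes.
const vec_len = std.simd.suggestVectorLength(u8) orelse 0;

/// Sketch: count newlines with a vector chunk loop when the target
/// advertises one, otherwise with a plain scalar loop.
fn countNewlines(src: []const u8) usize {
    var total: usize = 0;
    var i: usize = 0;
    if (comptime vec_len > 0) {
        const V = @Vector(vec_len, u8);
        const Mask = std.meta.Int(.unsigned, vec_len);
        while (i + vec_len <= src.len) : (i += vec_len) {
            const chunk: V = src[i..][0..vec_len].*;
            const mask: Mask = @bitCast(chunk == @as(V, @splat('\n')));
            total += @popCount(mask);
        }
    }
    // Scalar path: the tail, or the whole input when vec_len is 0.
    while (i < src.len) : (i += 1) total += @intFromBool(src[i] == '\n');
    return total;
}

test "countNewlines" {
    try std.testing.expectEqual(@as(usize, 3), countNewlines("a\nbb\nccc\n"));
}
```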

1 Like

For the heat-seeking tokenizer, the answer was that the speedup I got on my RISC-V sifive_u74 was about the same as what I got on my desktop: 2.5x vs 2.75x.

The compacting tokenizer hasn’t been given SWAR implementations yet, but I’ll test it on hardware without vectors once I add those in.

I suspect that it will be okay, probably about as fast as a naïve tokenizer. This one does rely on a bigger and wider CPU to get its speed; if you don't have one of those, you're fighting an uphill battle.
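
For reference, the SWAR flavor of this kind of check usually comes down to the classic zero-byte trick on a machine word. A hedged sketch (hypothetical `swarHasByte`, not the tokenizer's actual code):

```zig
const std = @import("std");

/// Treat a u64 as eight byte lanes and ask "does any lane equal `needle`?"
/// using plain integer arithmetic, no vector registers.
fn swarHasByte(word: u64, needle: u8) bool {
    const ones: u64 = 0x0101_0101_0101_0101;
    const highs: u64 = 0x8080_8080_8080_8080;
    // XOR zeroes the lane that matches; the "has a zero byte" trick then
    // lights up a high bit iff some lane became zero.
    const x = word ^ (ones * needle);
    return ((x -% ones) & ~x & highs) != 0;
}

test "swarHasByte" {
    try std.testing.expect(swarHasByte(0x1111_1111_1111_3B11, ';')); // 0x3B == ';'
    try std.testing.expect(!swarHasByte(0x1111_1111_1111_1111, ';'));
}
```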

1 Like