The implementation of tokenizer.next() used by ast.parse() is single-threaded. It could potentially be made faster by using at least two threads. The second thread reads the source and records where runs of ‘a’…‘z’, ‘A’…‘Z’, ‘_’, ‘0’…‘9’ start and end. The first thread then uses this data to return token.start and token.end in a modified next(). This could be extended to many threads matching different character classes, with the threads synchronized so that next() still returns token.start and token.end.
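To make the proposal concrete, here is a minimal single-threaded sketch of what that second thread would compute; `Span`, `isIdentChar`, and `identifierSpans` are illustrative names, not std.zig APIs:

```zig
const Span = struct { start: usize, end: usize };

fn isIdentChar(c: u8) bool {
    return switch (c) {
        'a'...'z', 'A'...'Z', '0'...'9', '_' => true,
        else => false,
    };
}

/// Record where each run of identifier characters starts and ends,
/// writing into `out` and returning the filled portion. In the proposed
/// scheme, a second thread would produce this data for the tokenizer
/// thread to consume.
fn identifierSpans(source: []const u8, out: []Span) []Span {
    var count: usize = 0;
    var i: usize = 0;
    while (i < source.len and count < out.len) {
        if (isIdentChar(source[i])) {
            const start = i;
            while (i < source.len and isIdentChar(source[i])) : (i += 1) {}
            out[count] = .{ .start = start, .end = i };
            count += 1;
        } else {
            i += 1;
        }
    }
    return out[0..count];
}
```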
Zig tokenization in general is embarrassingly parallel (for non-degenerate inputs), because, unlike pretty much every other language, it can be lexed line by line (there are neither multiline tokens nor line-continuing backslashes).
So, if you have N threads and a large Zig file, you can split it into equal chunks, then locally adjust each chunk boundary to the nearest \n, and off the N threads go!
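A rough sketch of that boundary adjustment, assuming we nudge each split point forward to just past the next newline (`chunkBoundaries` is an illustrative name, not anything the compiler actually has):

```zig
const std = @import("std");

/// Split `source` into up to `n` chunks whose boundaries land just past a
/// '\n', so every chunk is a whole number of lines and can be tokenized
/// independently. Boundaries are written into `out` (length at least n + 1);
/// the returned slice holds the boundaries actually used.
fn chunkBoundaries(source: []const u8, n: usize, out: []usize) []usize {
    std.debug.assert(n >= 1 and out.len >= n + 1);
    out[0] = 0;
    var count: usize = 1;
    var k: usize = 1;
    while (k < n) : (k += 1) {
        var pos = source.len * k / n;
        // Locally adjust the naive split point to the next '\n'.
        while (pos < source.len and source[pos] != '\n') : (pos += 1) {}
        if (pos < source.len) pos += 1; // start the next chunk after the newline
        // Skip boundaries that would create an empty chunk.
        if (pos > out[count - 1] and pos < source.len) {
            out[count] = pos;
            count += 1;
        }
    }
    out[count] = source.len;
    return out[0 .. count + 1];
}
```

Thread `t` then tokenizes `source[bounds[t]..bounds[t + 1]]`; since no Zig token spans a newline, no token straddles a chunk boundary.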
But I don’t think this makes sense to implement in practice: Zig already parses files in parallel, and that should be parallel enough for most cases. Even if there are some outlier gigantic files, they can simply be put at the front of the work queue.
You’d probably be interested in this project by @Validark (not using threads, but SIMD):
A series of talks about it:
If you want two threads to tag-team a single file, you probably want them to be at least twice as fast, because otherwise the second thread could have just tokenized the next file in the same time. Unless you only have one file, of course. Or maybe you only care about one file, as in an incremental compile, or in your code editor, where you edit only one file at a time.
It’s possible that, outside of a cold compile of multiple files, that kind of scheme could be considered. But I believe you are optimizing the wrong thing, because you see tokenization as “unnecessarily” flipping back and forth between producing alphanumeric/underscore classification data and doing something with that data.
But those two parts are not even close to equal in cost. I can produce the classification data faster than I can write it out, and, separately, faster than I could stream it in.
Numberworld provides the following analysis: https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/
Now, this is assuming you wanted to fully saturate all cores. One of my machines can do a read-modify-write loop at ~11 GB/s on a single core (you can’t get bandwidth that high on all cores simultaneously). 11 GB/s / 5 GHz (screw my battery life, I want speed!) is 2.2 bytes per cycle, which means it reads in 64 bytes every ~29 cycles.
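Spelling that arithmetic out:

$$
\frac{11\ \text{GB/s}}{5\ \text{GHz}} = 2.2\ \text{bytes/cycle},
\qquad
\frac{64\ \text{bytes}}{2.2\ \text{bytes/cycle}} \approx 29\ \text{cycles per 64-byte block}.
$$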
I can do all the classification you can imagine in less than half of that time. vpermi2b+vpcmpeqb is 11 cycles of end-to-end latency (the vperm can map 64 7-bit values to 64 8-bit values, i.e. ASCII to whatever you want, which is beneficial when doing lots of classification, not just one kind). If we did it without the vperm, we could get it down to, I think, 9 cycles, at the cost of more instructions. If I wrote a loop to do this, the loop would be pipelined by the CPU, and I’d theoretically be able to get a steady-state throughput of 1 iteration per cycle! But of course, I can’t stream anywhere near enough data into the CPU for that level of speed: streaming would need to be at least 29 times faster, so something like 320 GB/s.
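For a concrete (if simplified) picture, here is a portable `@Vector(64, u8)` sketch along the lines of the compare-only variant (no table lookup); the names are mine, and whether these comparisons lower to single 64-byte AVX-512 mask-register compares depends on the build target:

```zig
const Block = @Vector(64, u8);

inline fn ge(block: Block, comptime c: u8) u64 {
    return @as(u64, @bitCast(block >= @as(Block, @splat(c))));
}

inline fn le(block: Block, comptime c: u8) u64 {
    return @as(u64, @bitCast(block <= @as(Block, @splat(c))));
}

inline fn eq(block: Block, comptime c: u8) u64 {
    return @as(u64, @bitCast(block == @as(Block, @splat(c))));
}

/// One bit per input byte, set where the byte is an identifier character
/// ('a'..'z', 'A'..'Z', '0'..'9', '_').
fn identifierMask(block: Block) u64 {
    const lower = ge(block, 'a') & le(block, 'z');
    const upper = ge(block, 'A') & le(block, 'Z');
    const digit = ge(block, '0') & le(block, '9');
    const under = eq(block, '_');
    return lower | upper | digit | under;
}
```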
So that thread will necessarily be underutilized unless you give it more work. As we said before, why not just stick that thread on the next file instead? Well, let’s assume for the sake of argument that there isn’t another file.
As @matklad said, we could cut the file in half and have the second thread work on the second half while the first thread works on the first half. Then the first thread waits to hear that the second is done, and we memcpy the data over. This should actually be beneficial on large enough files, and trivially so, because we know for certain that each thread is running at capacity: neither is working on a “cut down” workload.
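A sketch of that scheme, assuming a hypothetical `tokenizeChunk` helper that tokenizes one newline-aligned chunk into chunk-relative offsets (none of these names are std.zig APIs):

```zig
const std = @import("std");

const Token = struct { start: u32, end: u32 };

/// Placeholder: tokenize one newline-aligned chunk into offsets relative
/// to the start of that chunk. A real per-chunk tokenizer goes here.
fn tokenizeChunk(allocator: std.mem.Allocator, chunk: []const u8) anyerror![]Token {
    _ = allocator;
    _ = chunk;
    return error.Unimplemented;
}

fn worker(allocator: std.mem.Allocator, chunk: []const u8, out: *(anyerror![]Token)) void {
    out.* = tokenizeChunk(allocator, chunk);
}

pub fn tokenizeWithTwoThreads(allocator: std.mem.Allocator, source: []const u8) ![]Token {
    // Cut at the newline nearest the middle so no token straddles the split.
    var mid = source.len / 2;
    while (mid < source.len and source[mid] != '\n') : (mid += 1) {}
    if (mid < source.len) mid += 1;

    // The second thread takes the second half and writes its result back here.
    var second_result: anyerror![]Token = error.Unimplemented;
    const thread = try std.Thread.spawn(.{}, worker, .{ allocator, source[mid..], &second_result });

    // Meanwhile, this thread tokenizes the first half, then waits for the worker.
    const first_result = tokenizeChunk(allocator, source[0..mid]);
    thread.join();

    const first = first_result catch |err| {
        if (second_result) |toks| allocator.free(toks) else |_| {}
        return err;
    };
    defer allocator.free(first);
    const second = try second_result;
    defer allocator.free(second);

    // "memcpy the data over": concatenate, rebasing the second half's offsets.
    const all = try allocator.alloc(Token, first.len + second.len);
    @memcpy(all[0..first.len], first);
    for (second, all[first.len..]) |t, *dst| {
        dst.* = .{
            .start = t.start + @as(u32, @intCast(mid)),
            .end = t.end + @as(u32, @intCast(mid)),
        };
    }
    return all;
}
```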
Within a week or so I plan to release a new Zig Tokenizer engineered as I described in that third talk (https://m.youtube.com/watch?v=NM1FNB5nagk), although obviously with some improvements over my initial design. (When I gave that talk, I had actually spent only a short amount of time on the compression-based tokenizer, because I got sick for two of the three weeks I had set aside to focus on it and was pretty unproductive while sick. The last few weeks I’ve been putting time in on the side, and I’m almost done!!)