Minimal BPE tokenizer in Zig

git clone https://github.com/jaco-bro/tokenizer
cd tokenizer
zig build exe --release=fast

# Encode
zig build run -- --encode "hello world"

# Decode
zig build run -- --decode "{14990,1879}"

# Specify a model
zig build run -- --model "phi-4-4bit" --encode "hello world"
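
For anyone new to BPE, here is a toy Python sketch of what encoding and decoding do conceptually. It is not the code in this repository. The merge table, vocabulary, and function names are made up for illustration, so the IDs it produces will not match real model output such as the IDs in the decode example above.

```python
# Toy byte-pair encoding (BPE) sketch.
# MERGES and VOCAB are hypothetical; real models ship their own tables.

MERGES = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]  # highest priority first
VOCAB = {tok: i for i, tok in enumerate(
    ["h", "e", "l", "o", " ", "w", "r", "d", "he", "ll", "hell", "hello"])}
ID_TO_TOKEN = {i: tok for tok, i in VOCAB.items()}
RANK = {pair: r for r, pair in enumerate(MERGES)}


def encode(text):
    """Start from single characters and repeatedly merge the best-ranked adjacent pair."""
    tokens = list(text)
    while True:
        # Collect every adjacent pair that appears in the merge table.
        candidates = [(RANK[p], i)
                      for i, p in enumerate(zip(tokens, tokens[1:]))
                      if p in RANK]
        if not candidates:
            break
        _, i = min(candidates)  # lowest rank wins; ties go to the leftmost pair
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    # Raises KeyError for characters outside the toy vocabulary.
    return [VOCAB[t] for t in tokens]


def decode(ids):
    """Map IDs back to token strings and concatenate them."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


if __name__ == "__main__":
    ids = encode("hello world")
    print(ids)
    print(decode(ids))
```

With this toy table, "hello" collapses into a single token while "world" stays as individual characters, because no merge rules cover it.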

Feedback, issues, and pull requests are welcome on GitHub. Give it a try and let me know how it works for your projects!

Cool! Any plans to port to zig-0.14?

Thanks! I think the zig-0.14 port will have to wait for the pcre2 v10.46 release.