```sh
git clone https://github.com/jaco-bro/tokenizer
cd tokenizer
zig build exe --release=fast

# Encode
zig build run -- --encode "hello world"

# Decode
zig build run -- --decode "{14990,1879}"

# Specify a model
zig build run -- --model "phi-4-4bit" --encode "hello world"
```
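If you are scripting around the CLI, note that the `--decode` example above takes token IDs as a brace-wrapped, comma-separated list. Here is a minimal sketch of a shell helper that builds that string from separate arguments. The name `format_ids` is hypothetical and not part of the project, and the accepted input format is assumed only from the single `{14990,1879}` example.

```sh
# Hypothetical helper (not part of the tokenizer repo): join token IDs
# into the "{id,id,...}" form shown in the --decode example.
# The function body runs in a subshell, so the IFS change does not leak
# into the calling shell.
format_ids() (
    IFS=,
    # "$*" joins all arguments using the first character of IFS.
    printf '{%s}\n' "$*"
)
```

It could then be used as `zig build run -- --decode "$(format_ids 14990 1879)"`, which passes the same argument as the decode command above.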
Feedback, issues, and pull requests are welcome on GitHub. Give it a try and let me know how it works for your projects!
Cool! Any plans to port to zig-0.14?
Thanks! I think the zig-0.14 port will have to wait for the PCRE2 v10.46 release.
I fed the source code into NotebookLM to generate an overview of the project. It did a surprisingly good job of breaking down how the Zig build system, C interop (PCRE2), and Python bindings fit together.
Here is the generated video: https://youtu.be/4ECSMZdmilQ