segmentation (grapheme / word / sentence / line — UAX #14 and #29)
East Asian Width
Script + Script_Extensions, full UAX #9 bidi including reordering
Numeric_Type/Value
Blocks
Hangul (with algorithmic composition)
Derived Age (tags the Unicode version in which each code point was introduced)
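For readers unfamiliar with "algorithmic composition": Hangul syllables are composed by arithmetic on the jamo code points, so no lookup table is needed. Below is a minimal sketch in Python (not Zig, and not ezi-code's API) using the constants from section 3.12 of the Unicode Standard. The function names `compose_hangul` and `decompose_hangul` are made up for illustration.

```python
from typing import List, Optional

# Constants from the Unicode Standard, section 3.12 (Conjoining Jamo Behavior).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # 11172 precomposed syllables in total


def compose_hangul(a: int, b: int) -> Optional[int]:
    """Compose two code points arithmetically; None if they do not compose."""
    # Case 1: leading consonant + vowel gives an LV syllable.
    if L_BASE <= a < L_BASE + L_COUNT and V_BASE <= b < V_BASE + V_COUNT:
        l_index = a - L_BASE
        v_index = b - V_BASE
        return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
    # Case 2: LV syllable + trailing consonant gives an LVT syllable.
    s_index = a - S_BASE
    if 0 <= s_index < S_COUNT and s_index % T_COUNT == 0:
        if T_BASE < b < T_BASE + T_COUNT:
            return a + (b - T_BASE)
    return None


def decompose_hangul(s: int) -> List[int]:
    """Split a precomposed syllable into jamo; other code points pass through."""
    s_index = s - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [s]
    l = L_BASE + s_index // N_COUNT
    v = V_BASE + (s_index % N_COUNT) // T_COUNT
    t = T_BASE + s_index % T_COUNT
    # t == T_BASE means "no trailing consonant".
    return [l, v] if t == T_BASE else [l, v, t]
```

Composition and decomposition are exact inverses over the whole syllable block, which is why normalization can handle all 11172 syllables without storing them.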
Conformance: tested against GraphemeBreakTest.txt, WordBreakTest.txt, SentenceBreakTest.txt, LineBreakTest.txt, and NormalizationTest.txt under a build flag. Bidi has the rule-numbered adversarial suite for UAX #9.
The goal is to cover the complete UCD for text processing; CLDR is deliberately out of scope.
Supported Zig versions
Since the 0.17 release is right around the corner, ezi-code is developed against the master branch. It builds correctly on 0.17.0-dev.607+456b2ec07.
AI / LLM usage disclosure
All of the tests were generated by an LLM, as was more than half of the generator code and lookups.
Comparing surface area and APIs, ezi-code covers quite a lot of the UCD. Neither zg nor uucode currently supports multiple codecs (UTF-8/16/32); it is a small feature, but it sits at the core of the entire library. ezi-code has a complete UAX #9 bidi algorithm with full conformance tests. It has all three segmentation variants (line, word, sentence), which zg and uucode either do not support or support only partially. It also supports collation (DUCET), which neither library offers.
On performance: apart from lookups, the algorithms (segmentation, bidi, ...) are not yet performance-tuned. That is on the roadmap, along with plans to surgically add SIMD support in hot paths. Both zg and uucode are highly optimised for what they do; ezi-code is not there yet.
uucode has great build-time configuration for opting in to only what is required. ezi-code depends entirely on Zig’s native dead-code elimination, which is more than enough for most use cases.
ezi-code currently supports Unicode 17, like uucode, whilst zg supports Unicode 16.
The only area where ezi-code is really behind is maturity and real-world feedback.
A few things have landed. Most of these were driven by what ezi_gex needed, which turned out to be things ezi-code should have had anyway.
The main addition is enumerable range tables for Unicode properties. The per-code-point page tables are fast for individual lookups but cannot be iterated without walking all 1.1M code points, which is not useful when you need to resolve \p{Script=Greek} or \w into a sorted list of code-point ranges at comptime. The new range tables (properties.category_runs, properties.derived_runs, scripts.script_runs, and the rest) give exactly that: a compact, comptime-enumerable representation of each property. A regex engine’s HIR builder can resolve any Unicode class to sorted ranges without touching the page tables at all. I am still considering whether to expose both the two-level trie and the range-based tables, so that users can pick whichever suits their use case.
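To make the idea concrete, here is a rough sketch in Python (not Zig, and not ezi-code's actual generator or API). It shows how a per-code-point predicate can be flattened once into sorted inclusive ranges, and how a consumer such as a regex class builder could then union several range tables without any per-code-point lookups. `build_runs`, `merge_runs`, and `DIGIT_RUNS` are hypothetical names, and Python's `unicodedata` stands in for the real property data.

```python
import unicodedata
from typing import Callable, List, Tuple

Range = Tuple[int, int]  # inclusive (first, last) code points


def build_runs(has_property: Callable[[int], bool],
               limit: int = 0x110000) -> List[Range]:
    """Walk code points once and emit sorted, non-overlapping ranges."""
    runs: List[Range] = []
    start = None
    for cp in range(limit):
        if has_property(cp):
            if start is None:
                start = cp          # a new run begins here
        elif start is not None:
            runs.append((start, cp - 1))  # the run ended at the previous cp
            start = None
    if start is not None:
        runs.append((start, limit - 1))   # run extends to the end
    return runs


def merge_runs(*tables: List[Range]) -> List[Range]:
    """Union several range tables, coalescing overlapping or adjacent runs."""
    merged: List[Range] = []
    for lo, hi in sorted(r for table in tables for r in table):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def is_decimal_digit(cp: int) -> bool:
    return unicodedata.category(chr(cp)) == "Nd"


# Limited to U+0000..U+06FF only to keep the example quick to run.
DIGIT_RUNS = build_runs(is_decimal_digit, limit=0x0700)
```

The expensive walk happens once, at generation time. Consumers only ever see the short sorted list, which is what makes resolving a class at comptime practical.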
A side effect of the range tables: size-sensitive consumers can now avoid linking the per-code-point page tries entirely. isWord, isIdentifierStartByRanges, and isIdentifierContinueByRanges resolve through the range tables with an ASCII fast path and never pull in the two-level trie.
SIMD additions landed in encoding.utf8: asciiRunLength, countScalarsSimd, and a SIMD-accelerated lossy decode iterator. All of them use portable @Vector compares with no target intrinsics, and stride std.simd.suggestVectorLength(u8) bytes at a time with a scalar tail. UTF-8 validation now uses asciiRunLength to skip ASCII runs before feeding bytes to the Höhrmann DFA. Line break iteration is 25–37% faster and sentence iteration 10–15% faster thanks to targeted algorithmic improvements: the look-ahead computation now runs only when a look-ahead-dependent rule can actually fire.
Emoji properties got their own module, accessible under the ezi_code.unicode.emoji namespace.
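The "wide stride plus scalar tail" shape can be sketched as follows. This is Python rather than Zig, and it uses an 8-byte integer mask in place of @Vector compares, so it only approximates the structure of the real functions. `ascii_run_length` and `count_scalars` are stand-in names, and `count_scalars` assumes its input is already valid UTF-8.

```python
STRIDE = 8                      # stand-in for the suggested vector length
HIGH_BITS = 0x8080808080808080  # top bit of each of the 8 bytes


def ascii_run_length(data: bytes, start: int = 0) -> int:
    """Length of the ASCII-only run beginning at data[start]."""
    i = start
    n = len(data)
    # Wide loop: test 8 bytes at once; any high bit set means non-ASCII.
    while i + STRIDE <= n:
        chunk = int.from_bytes(data[i:i + STRIDE], "little")
        if chunk & HIGH_BITS:
            break
        i += STRIDE
    # Scalar tail: finishes the last partial chunk, and also pinpoints the
    # first non-ASCII byte inside a chunk the wide loop rejected.
    while i < n and data[i] < 0x80:
        i += 1
    return i - start


def count_scalars(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping ASCII runs in bulk."""
    count = 0
    i = 0
    n = len(data)
    while i < n:
        run = ascii_run_length(data, i)
        count += run            # every ASCII byte is one scalar
        i += run
        # Non-ASCII stretch: each byte that is not a continuation byte
        # (10xxxxxx) starts a new scalar.
        while i < n and data[i] >= 0x80:
            if data[i] & 0xC0 != 0x80:
                count += 1
            i += 1
    return count
```

A validator can use the same run-length helper the same way: skip the ASCII run, then hand only the remaining bytes to the DFA.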