Ezicode — a Unicode library for Zig

Released v0.1.0 of ezi-code, a Unicode library for Zig. No dependencies. UCD tables are generated into Zig source and committed. https://github.com/shaik-abdul-thouhid/ezi-code

Three layers:

  • encoding — UTF-8, UTF-16, UTF-32 codecs. Strict / unchecked / lossy decode flavours.
  • transcoding — cross-encoding converters and a chunked UTF-8 stream decoder.
  • unicode — properties and algorithms backed by the UCD.

The unicode layer covers:

  • General Category / Bidi Class / CCC / Derived Core Properties
  • casing (incl. Turkic)
  • normalization (NFC/NFD/NFKC/NFKD + streaming Normalizer + Quick_Check)
  • segmentation (grapheme / word / sentence / line — UAX #14 and #29)
  • East Asian Width
  • Script + Script_Extensions
  • full UAX #9 bidi, including reordering
  • Numeric_Type/Value
  • Blocks
  • Hangul (with algorithmic composition)
  • Derived Age (tags the Unicode version in which each code point was introduced)
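
For readers unfamiliar with the "algorithmic composition" mentioned for Hangul: precomposed syllables can be computed arithmetically from their jamo, with no table lookup. The following is a minimal sketch of that arithmetic using the constants from the Unicode Standard (section 3.12). The function name composeHangul is hypothetical and is not taken from ezi-code's API.

```zig
const std = @import("std");

// Constants from the Unicode Standard, section 3.12 (Conjoining Jamo Behavior).
const s_base: u21 = 0xAC00; // first precomposed syllable
const l_base: u21 = 0x1100; // first leading consonant
const v_base: u21 = 0x1161; // first vowel
const t_base: u21 = 0x11A7; // one before the first trailing consonant
const l_count: u21 = 19;
const v_count: u21 = 21;
const t_count: u21 = 28;
const s_count: u21 = l_count * v_count * t_count; // 11172

/// Hypothetical helper (not ezi-code's API): composes L+V into an LV
/// syllable, or LV+T into an LVT syllable. Returns null for any other pair.
pub fn composeHangul(a: u21, b: u21) ?u21 {
    // Leading consonant + vowel -> LV syllable.
    if (a >= l_base and a < l_base + l_count and b >= v_base and b < v_base + v_count) {
        const l_index = a - l_base;
        const v_index = b - v_base;
        return s_base + (l_index * v_count + v_index) * t_count;
    }
    // LV syllable (no trailing consonant yet) + trailing consonant -> LVT.
    if (a >= s_base and a < s_base + s_count and (a - s_base) % t_count == 0 and
        b > t_base and b < t_base + t_count)
    {
        return a + (b - t_base);
    }
    return null;
}

test "compose U+1112 U+1161 U+11AB step by step" {
    const lv = composeHangul(0x1112, 0x1161).?;
    try std.testing.expectEqual(@as(u21, 0xD558), lv);
    try std.testing.expectEqual(@as(?u21, 0xD55C), composeHangul(lv, 0x11AB));
}
```

Because every syllable is derived from three small indices, a normalizer only needs these few constants rather than 11,172 table entries.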

Conformance: tested against GraphemeBreakTest.txt, WordBreakTest.txt, SentenceBreakTest.txt, LineBreakTest.txt, and NormalizationTest.txt under a build flag. Bidi has a rule-numbered adversarial suite for UAX #9.

I tried to cover the complete UCD for end-to-end text processing. CLDR is deliberately out of scope.

Supported Zig versions

Zig 0.17 is right around the corner, so this is built against the master branch. It builds correctly on 0.17.0-dev.607+456b2ec07.

AI / LLM usage disclosure

All the tests are LLM-generated, and more than half of the generator code and lookup code is LLM-generated as well.

A new version v0.2.0 just dropped.

Changes in this version:

  • Added DUCET collation support under the collation module.
  • Expanded API coverage for encoding, transcoding, and Unicode functionality (normalization, width, segmentation).
  • Added BidiTest.txt and BidiCharacter.txt to the conformance testing pipeline.
  • Various bug fixes.

You can find the full changelog in the release notes.

Upcoming roadmap:

  • Table tuning for faster lookups
  • Expanded fuzzing pipeline
  • API cleanup and performance improvements

How does it compare to zg and uucode?

In terms of surface area and APIs, ezi-code covers quite a lot of the UCD. Neither zg nor uucode currently provides multi-codec support (UTF-8/16/32); it is a small feature, but it is the core of the whole library. ezi-code has the complete UAX #9 bidi algorithm with full conformance tests. It has all three segmentation variants (line, word, sentence), which zg and uucode either do not support or support only partially. It also has collation (DUCET) support, which neither library offers.

On performance: apart from lookups, the algorithms (segmentation, bidi, ...) are not performance-tuned yet. That is on the roadmap, and I plan to add SIMD support surgically in hot paths. Both zg and uucode are highly optimised for what they do; ezi-code is not there yet.

uucode has great build-time configuration for opting in to only what is required. ezi-code depends entirely on Zig's native dead-code elimination, which is more than enough for most use cases.

ezi-code currently supports Unicode 17, like uucode, whilst zg supports Unicode 16.

The only thing ezi-code is really behind on right now is maturity and real-world feedback :)

ezi-code v0.3.0

A few things landed. Most were driven by what ezi_gex needed, which turned out to be things ezi-code should have had anyway.

The main addition is enumerable range tables for Unicode properties. The per-code-point page tables are fast for individual lookups but cannot be iterated without walking all 1.1M code points, which is not useful when you need to resolve \p{Script=Greek} or \w into a sorted list of code-point ranges at comptime. The new range tables (properties.category_runs, properties.derived_runs, scripts.script_runs, and the rest) give exactly that: a compact, comptime-enumerable representation of each property. A regex engine's HIR builder can resolve any Unicode class to sorted ranges without touching the page tables at all. I am still considering whether to expose both the two-level trie and the range tables, so users can pick whichever suits their use case.
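
To make the idea concrete, here is a minimal sketch of what a range-table representation allows: enumeration at comptime, plus membership tests by binary search. Everything in it is hypothetical. The Range type, the hand-written ascii_word_runs table (just the ASCII subset of \w), and the function names are illustrative and do not reflect the actual layout of properties.category_runs or the other tables.

```zig
const std = @import("std");

/// Inclusive run of code points. Runs in a table are sorted and non-overlapping.
const Range = struct { lo: u21, hi: u21 };

/// Toy table written by hand for illustration: the ASCII subset of \w.
/// A real table would be generated from the UCD.
const ascii_word_runs = [_]Range{
    .{ .lo = '0', .hi = '9' },
    .{ .lo = 'A', .hi = 'Z' },
    .{ .lo = '_', .hi = '_' },
    .{ .lo = 'a', .hi = 'z' },
};

/// Enumerates the runs and sums their sizes. Callable at comptime.
fn countCovered(ranges: []const Range) usize {
    var total: usize = 0;
    for (ranges) |r| total += @as(usize, r.hi - r.lo) + 1;
    return total;
}

/// Membership test by binary search over the sorted runs.
fn containsCodePoint(ranges: []const Range, cp: u21) bool {
    var lo: usize = 0;
    var hi: usize = ranges.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        const r = ranges[mid];
        if (cp < r.lo) {
            hi = mid;
        } else if (cp > r.hi) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}

// Container-level initializers are evaluated at compile time, so the
// whole table is walked by the compiler here.
const ascii_word_count = countCovered(&ascii_word_runs);

test "range table sketch" {
    try std.testing.expectEqual(@as(usize, 63), ascii_word_count);
    try std.testing.expect(containsCodePoint(&ascii_word_runs, '_'));
    try std.testing.expect(!containsCodePoint(&ascii_word_runs, '-'));
}
```

A page table answers "what is the property of this code point" quickly, while a sorted run list answers "which code points have this property" directly. The second question is the one a regex compiler asks.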

A side effect of the range tables: size-sensitive consumers can now avoid linking the per-code-point page tries entirely. isWord, isIdentifierStartByRanges, and isIdentifierContinueByRanges resolve through the range tables with an ASCII fast path and never pull in the two-level trie.

SIMD additions landed in encoding.utf8: asciiRunLength, countScalarsSimd, and a SIMD-accelerated lossy decode iterator. All use portable @Vector compares with no target intrinsics, striding std.simd.suggestVectorLength(u8) bytes at a time with a scalar tail. UTF-8 validation now uses asciiRunLength to skip ASCII runs before feeding bytes to the Höhrmann DFA. Line break iteration is 25–37% faster and sentence iteration 10–15% faster from targeted algorithmic improvements: the look-ahead computation now only runs when a look-ahead-dependent rule can actually fire.
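
As a rough illustration of the portable @Vector approach described above, the sketch below scans for the leading run of ASCII bytes one vector at a time and finishes with a scalar tail. It is not ezi-code's implementation. The name asciiRunLengthSketch and the 16-lane fallback are assumptions made for the example.

```zig
const std = @import("std");

/// Hypothetical sketch, not ezi-code's implementation: returns the length
/// of the leading run of ASCII bytes (values below 0x80) in `bytes`.
pub fn asciiRunLengthSketch(bytes: []const u8) usize {
    // Assumed fallback of 16 lanes when the target suggests no vector length.
    const lanes = comptime std.simd.suggestVectorLength(u8) orelse 16;
    const Chunk = @Vector(lanes, u8);
    const high_bit: Chunk = @splat(0x80);

    var i: usize = 0;
    while (i + lanes <= bytes.len) : (i += lanes) {
        const chunk: Chunk = bytes[i..][0..lanes].*;
        // OR-reduce the high bits: non-zero means this chunk has a non-ASCII byte.
        if (@reduce(.Or, chunk & high_bit) != 0) break;
    }
    // Scalar tail: handles the leftover bytes and pinpoints the first
    // non-ASCII byte inside a chunk that failed the vector check.
    while (i < bytes.len and bytes[i] < 0x80) : (i += 1) {}
    return i;
}

test "ascii run sketch" {
    try std.testing.expectEqual(@as(usize, 5), asciiRunLengthSketch("hello"));
    try std.testing.expectEqual(@as(usize, 3), asciiRunLengthSketch("abc\xC3\xA9def"));
}
```

A validator built this way would call the scan first and hand only the remainder, starting at the returned offset, to the byte-at-a-time state machine.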

Emoji properties got their own module, accessible under the ezi_code.unicode.emoji namespace.

zig fetch --save git+https://github.com/shaik-abdul-thouhid/ezi-code.git#v0.3.0

Changelog: CHANGELOG.md on the main branch of shaik-abdul-thouhid/ezi-code on GitHub