ezi_gex v0.3.0
Architecture: ezi-gex/docs/architecture.md at main · shaik-abdul-thouhid/ezi-gex · GitHub
Usage guide: ezi-gex/docs/usage-guide.md at main · shaik-abdul-thouhid/ezi-gex · GitHub
I am shamelessly creating a new thread, because 0.3.0’s changes and updates brings in many new features to the library and kind of derailed a lot, the previous thread will make it feel out of place.
Before anything else, why ezi_gex when there are other great community libraries.
I built ezi_gex because Zig didn’t have a regex engine with real Unicode — \p{Script=Greek}, \p{ID_Start}, \X, case folding that turns ß into ss — and I already had ezi_code sitting there with all the tables. So the purpose is narrow: a regex engine that’s Unicode-correct, linear-time and safe to point at untrusted input (Thompson/RE2 lineage — no backtracking blowup, ever), working at comptime and runtime through the same API, over memory you own.
Like any other zig library/project, my mantra: No hidden allocations, one immutable program you can share, one scratch per thread (a convention that works well for most of the cases… If you want a truly thread-safe regex engine, then this is not for you).
The libraries core value proposition is for letting users write their own algorithm/backends/engines and easily plug and use it. The components are as standalone as possible. You can hand write tuned algorithm for your use case and duck type and use it without having to too much care of handling flags, unicode code points, etc.
But the design bet underneath all of it is the part I care about most: matching isn’t hardwired. It lives behind a small comptime duck-typed contract, and a generic Engine(Backend) layer implements every user-facing operation — find, captures, findAll, count, split, replaceAll, the iterators — exactly once, on top of two primitives any backend provides (locate a match; fill a slots array). The seven backends the library ships (auto, edfa, dfa, pikevm, backtrack, literal, bytepike) are all just implementations of that contract. So is anything you write.
A backend doing literal substring search, with the whole operation surface for free:
const std = @import("std");
const gex = @import("ezi_gex");
pub const MyBackend = struct {
pub const caps = Caps{ .captures = true, .stateless = true };
pub const Program = struct { needle: []const u8 };
pub const Scratch = struct {}; // stateless — nothing carried between searches
pub fn search(p: *const Program, _: *Scratch, input: []const u8, o: SearchOptions) ?Match {
const i = std.mem.indexOfPos(u8, input, o.start, p.needle) orelse return null;
return .{ .start = i, .end = i + p.needle.len };
}
pub fn isMatch(p: *const Program, s: *Scratch, input: []const u8, o: SearchOptions) bool {
return search(p, s, input, o) != null;
}
pub fn searchCaptures(p: *const Program, s: *Scratch, input: []const u8, slots: []?usize, o: SearchOptions) ?Match {
const m = search(p, s, input, o) orelse return null;
if (slots.len >= 2) { slots[0] = m.start; slots[1] = m.end; }
return m;
}
// + buildAlloc (Hir → Program) and freeProgram; add buildComptime to run it at comptime too
};
comptime gex.verifyBackend(MyBackend); // compile error if the contract isn't met
var re = try gex.compileRuntimeWith(MyBackend, gpa, "abc", &diag, .{});
// re.find / re.captures / re.findAll / re.split / re.replaceAll all work now — you wrote none of them
verifyBackend checks the shape at compile time, and because the backend is a comptime parameter the whole thing monomorphizes and inlines — no vtable, no runtime cost for the indirection. The full walkthrough (including buildAlloc and the comptime path) is in usage guide §8; the contract itself is in architecture.md.
what’s in v0.3.0
0.2.0 ended on “a lazy DFA engine next…” 0.3.0 is that step — except the DFA that ended up as auto’s default span engine is the eager one, fully determinized at build, not the lazy cache I’d planned. The lazy DFA still exists and it’s the fallback engine now, for patterns the eager one won’t fit in its state budget. So headline is just that: auto no longer matches by NFA simulation only… The default path is a frozen DFA table and a state = trans[state][class] loop. The Thompson/Pike VM is still in there — it fills captures and it’s the correctness backstop everything is conformance-checked against — but it isn’t what scans your input anymore.
The eager DFA. It determinizes the whole byte automaton (the substrate from 0.2.0) at build and freezes the result into an immutable states × byte_classes table. Matching is then a bare table walk: one ByteMap lookup per input byte, no decode, no live-thread bookkeeping. On class scans — \w+, [A-Za-z]+, \p{L}+ — that lands at Rust regex parity, ~1.1–1.3× (Spent few nights to reach here, previously the parity was 8-10x on average with my naive implementation). 0.2.0 was NFA-simulation throughput (“approaching Go’s regexp”); the same patterns are now 4–20× faster than my own code-point NFA depending on the shape, and at parity with Rust on the scans. (please take these ‘x’ factor parity numbers with a grain of salt, the difference may vary based on the machine, pattern and input. All these numbers were measured and computed on my machine for ‘ascii’, ‘multilingual’ and ‘pathological’ inputs. Though for literal matching rust-regex is currently really fast because of its specialized simd path).
Because it’s eager and not a runtime cache, it also builds at comptime. A comptime-compiled pattern freezes its DFA straight into ro_data and matches by table walk — no determinizer, no compiler code in the binary. And if both the pattern and the input are comptime-known, the whole match folds to a constant; nothing engine-related survives into the binary at all.
const gex = @import("ezi_gex");
// pattern baked into ro_data at build; match runs at runtime
const re = comptime gex.compileComptime("\\d{3}-\\d{4}", .{});
// (a) match at runtime over the frozen table
var buf: [@TypeOf(re).Scratch.bufferLen(&re.program)]@TypeOf(re).Scratch.Buf = undefined;
var sc = try @TypeOf(re).Scratch.initBuffer(&buf, &re.program); // no allocator
_ = re.find(&sc, "call 555-1234").?.slice("call 555-1234"); // "555-1234"
// (b) match entirely at comptime — folds to a compile-time constant
const ok = comptime re.isMatchComptime("call 555-1234"); // true, baked in
As you can see the weird @TypeOf sugar used in the snippet, this weird syntax is the result of my design decision. Abstracting behind a method or a function call will leak the implementation, so just left it as it is. Hopefully someone can give suggestion on how to smoothen this edge.
That second form is the one I keep reaching for now — validating embedded constants, version strings, config literals at build time. A bad value becomes a @compileError instead of a runtime surprise, and the engine doesn’t ship.
find is O(n) on every pattern now, and it’s a build-time decision, not a per-search probe. computeProne looks at the determinized program once and picks one of three arms which are only sane strategies I could think of, these cover lots of edge-cases:
- consuming loop is itself accepting (
\w+,\d+,[A-Za-z]+) → plain anchored restart, one greedy table walk per match. - a non-accepting cycle reachable from a start (
\w+@\w+on a long word run — the pre-@run is a cycle that never accepts) → anchored restart is Θ(n²), so it takes a reverse-DFA two-pass instead: a forward pass finds the end, a frozen reverse DFA finds the leftmost start. Two linear passes. (The Cox/RE2 reverse trick.) - a trailing
$(\d+$,\w+$,\w+@\w+$) was the one I had to go and kill on its own. The end is pinned toinput.len, so there’s nothing to scan forward for — one reverse pass from the end finds the start. No anchored restart, which would otherwise be Θ(n²) on the same begin-but-don’t-complete shapes. A$that isn’t a genuine trailing anchor (a$|b, optional(a$)?, multiline(?m)$) is detected and declined to the NFA rather than silently mishandled.
Tier-1 literal prefilter is wired. The literal backend now scans with memchr / Boyer–Moore–Horspool skip tables instead of an eql per position (~20× on memchr-friendly needles, never slower). And auto consumes the HIR Analysis (usage guide §7) on NFA patterns: a leading-literal memchr jumps to candidate starts, the rarest required byte drives a fast-reject (no @ in the input → no \w+@\w+ match, bail at once), and a min-length gate drops inputs too short to hold a match. Every fact is a one-sided bound, so the prefilter never drops a real match. Toggle with strategy.prefilter.
What it looks like
Runtime compile, caller-owned scratch, no surprises:
var diag: gex.Diagnostic = .{};
var re = try gex.compileRuntime(gpa, "\\w+", &diag, .{});
defer re.deinit();
var sc = try @TypeOf(re).Scratch.init(gpa, &re.program); // one per thread
defer sc.deinit(gpa);
_ = re.isMatch(&sc, "abc123"); // bool
_ = re.find(&sc, "x abc123 y"); // ?Match → "abc123"
Named captures resolve into a caller-owned slots array:
var re = try gex.compileRuntime(gpa, "(?<user>\\w+)@(?<host>\\w+)", &diag, .{});
const slots = try gpa.alloc(?usize, re.slotCount());
if (re.captures(&sc, slots, "ping bob@example")) |c| {
_ = c.namedSlice("user").?; // "bob"
_ = c.namedSlice("host").?; // "example"
}
Unicode is the reason the project exists — all of this goes through ezi_code and resolves to sorted codepoint ranges at build, zero table lookups at match time:
_ = try gex.compileRuntime(gpa, "\\p{Script=Greek}+", &diag, .{});
_ = try gex.compileRuntime(gpa, "\\p{ID_Start}\\p{ID_Continue}*", &diag, .{});
_ = try gex.compileRuntime(gpa, "(?i)\\p{L}+", &diag, .{}); // case-insensitive Unicode letters; (?i)ß matches ss
You can pin a backend instead of letting auto choose — same API on the returned type:
var re = try gex.compileRuntimeWith(gex.backends.edfa, gpa, "\\w+", &diag, .{}); // force the eager DFA
// or .pikevm / .backtrack / .literal / .dfa / .auto
findAll/count/split/replaceAll (with $1/${name} templates) all sit on the same operation layer — see usage guide §3. And the matching strategy is still behind a comptime duck-typed contract — a custom backend in ~40 lines inherits the entire operation layer for free (usage guide §8).
Smaller things that landed
- Literal alternation isn’t quadratic anymore —
foo|bar|baz|quxscans with a singleindexOfAnypass instead of one scan per branch. - The eager DFA builds only the tables it’ll use:
utransand the reverse table are constructed only for prone / trailing-$patterns, so a plain\w+stores just its forwardtrans(~141 KB) instead of all three (~1 MB). - ~300 tests. Conformance pins every backend’s spans and captures to the Pike VM and fuzzes the strategy knobs, so the DFA can’t silently disagree with the reference.
Honest perf state
Class scans are at Rust parity. Where I still lose, and lose badly, is anything that wants a SIMD literal prefilter — literal alternations, interior literals, case-insensitive literal prefixes are 10–30× behind Rust (and in one test it was ~990x ahead
…) because there’s no Teddy / memmem yet. (?m)^ line anchors stay on the NFA and are several times behind. Those are the next release; the analysis facts to drive them are already computed on every pattern, just not wired to a scanner.
Caveats
- Binary size is the eager DFA’s cost. A Unicode-class DFA is a large dense table (
\wis a few hundred states), so size grows with Unicode-heavy patterns. Hopcroft minimization + a sparse encoding aren’t done (planned for upcoming releases). Comptime patterns also bake their class ranges (~6.3 KB per distinct\w; identical classes are interned). If size matters, pin a cheaper backend or usecompileRuntime(no ceiling). The trade-off is documented in architecture.md §3. - Captures never use the DFA. They always run on the Pike VM, anchored at the DFA’s span. Capture-heavy workloads get the DFA’s fast span location but Pike-VM-speed extraction.
\Xgrapheme clusters route to the backtracker (the Pike VM steps one codepoint and can’t consume a variable-width cluster). The backtracker’s visited set grows during a match, so the zero-alloc-per-match property has an asterisk on that one backend; the Pike VM and a buffer-backed scratch allocate nothing while matching.- Line anchors and
\baren’t in the DFA yet.(?m)^,(?m)$,\bkeep those patterns on the NFA. - The lazy DFA fallback is runtime-only, and its transition cache grows on demand (bounded). It’s the path for patterns whose eager DFA overflows the state budget. Makes no sense to add to comptime, instead it would perform worse instead because of lots of memory restrictions.
- Comptime matching is bounded by the eval-branch quota. Big/pathological patterns can blow it or grow
ro_data— prefercompileRuntimefor those. - No backreferences, no lookaround, ever. full stop. Thought of implementing this and turned out to be massive pain not worth taking.
- One scratch per thread. The compiled regex is immutable and shareable; the
Scratchis the mutable per-search state and is not safe to share across threads (usage guide §9). - Pre-1.0. Public API may still change; everything stable is annotated
@stable-since.
Upcoming verssions
Really need to work on prefilters. This is something that was on the todo list for past one version. Also need to heavily optimise on the bloat that eager-dfa adds to unicode range patterns. And Api stabilisation.
Supported Zig versions
Zig 0.17.0-dev (master). Will not compile on 0.16 stable — it uses APIs that changed recently, and tracks master because ezi_code does.
** Feedback and suggestions welcome, as always.
AI / LLM usage disclosure
This project is AI-copiloted. The architecture, the algorithms, and every design decision — the backend contract, the three-arm strategy, compilation pipeline (Ast, Hir lowering, byte lowering, backend contract, redos safety, matching strategies, etc), are mine.