Regex.zig: a native Zig regex engine in the RE2 family

quangd42 · March 26, 2026, 2:14pm

I’ve been working on regex.zig, a native regular expression engine for Zig in the RE2 family.

The main goal is an eventual production-grade Zig regex package with linear-time matching semantics, rather than a limited-scope engine or a wrapper around another library.

Current status:

Pike VM backend
literals, concatenation, alternation
capturing and non-capturing groups
repetition operators including lazy forms
Perl classes, bracket classes, POSIX classes
assertions and boundaries: ^, $, \A, \z, \b, \B
global flags through compile options: i, m, s, U
leftmost-first search semantics
ASCII-only for now

The repo is here:

Small example:

const std = @import("std");
const Regex = @import("regex");

pub fn main() !void {
    const gpa = std.heap.page_allocator;

    var re = try Regex.compile(gpa, "(\\d\\d)/(\\d\\d)/(\\d\\d\\d\\d)", .{});
    defer re.deinit();

    if (re.find("date=03/18/2026")) |m| {
        std.debug.print("match at [{}, {})\n", .{ m.start, m.end });
    }
}

Public API surface (with comments):


pub fn compile(gpa: Allocator, pattern: []const u8, options: Options) !Regex
pub fn match(re: *Regex, haystack: []const u8) bool
pub fn find(re: *Regex, haystack: []const u8) ?Match
pub fn findCaptures(re: *Regex, haystack: []const u8, buffer: []?Match) ?Captures
pub fn findCapturesAlloc(re: *Regex, gpa: Allocator, haystack: []const u8) !?Captures

A few things that might be different in this package compared to other Zig regex work:

It is an RE2-family engine, which means matching runs in worst-case O(m * n) time (m being the pattern size and n the haystack length); the downside is that not every PCRE-style feature will be implemented.
The test setup is fairly serious already. Supported and unsupported behavior is tracked in an explicit capability matrix, which is used during development as a capability gate, and will also be used to test other backends later on. This is inspired by the rust-lang/regex setup.
The available Pike VM backend has the typical optimizations of a Pike VM backend: query-cost split between match/find/findCaptures, sparse-set thread dedup, reused thread lists; plus a literal-prefix fast path for unanchored search.

There are many things in the pipeline, including:

fuller syntax coverage
API refinement, better docs
more backends beyond Pike VM
longer term, Unicode support

So it is far from finished, but I want to post it here to get some opinions from the community:

The current public API set is mostly inspired by rust-regex. I wonder what would feel right for Zig?
findCaptures and findCapturesAlloc try to follow the Zig-style memory ownership model, but I find it a bit clunky. I wonder if there is any obvious improvement here?
Would a future explicit input struct for bounds/anchoring be preferable to adding more top-level methods?
Part of the reason for this repo is that I want to push SoA-style layouts where they actually make sense in Zig. In particular, I’ve thought about using something closer to the Zig compiler’s ExtraData style for variable node payloads in the parser instead of scattered slices. I held off because regex patterns are often small, and maybe it will be worse performance wise to have so much machinery? Of course to know for sure I’ll have to measure it, but I’d still like to hear how others think about that tradeoff.

If anyone wants to look at the repo and comment on API shape, package ergonomics, or internal representation choices, that would be very welcome!

cryptocode · March 26, 2026, 2:38pm

Yeah, the allocating version at least needs docs, since it frees the buffer itself if there are no capture matches and otherwise leaves it to the caller to free. That seems a bit brittle.

Since capturesLen is part of the API, you could also just delete findCapturesAlloc and then return error.BufferTooSmall from findCaptures if the passed buffer is shorter than group_count.

You already assert this fact elsewhere, but if you make it an error I think you have a nice API where the caller can decide if they want to pass a stack variable or if they need to heap allocate when calling findCaptures
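
A minimal sketch of that shape, assuming stand-in types: FakeRegex, its group_count field, and the placeholder match logic are hypothetical and only illustrate the error path, not how regex.zig is actually implemented.

```zig
const std = @import("std");

// Hypothetical stand-ins for illustration; not the actual regex.zig types.
const Match = struct { start: usize, end: usize };

const Captures = struct {
    items: []?Match,
};

const FakeRegex = struct {
    /// Number of capture slots, counting group 0 (the whole match).
    group_count: usize,

    fn capturesLen(re: *const FakeRegex) usize {
        return re.group_count;
    }

    /// Returns error.BufferTooSmall instead of asserting, so the caller
    /// decides between a stack array and a heap allocation.
    fn findCaptures(
        re: *const FakeRegex,
        haystack: []const u8,
        buffer: []?Match,
    ) error{BufferTooSmall}!?Captures {
        if (buffer.len < re.group_count) return error.BufferTooSmall;
        if (haystack.len == 0) return null; // placeholder for "no match"

        const items = buffer[0..re.group_count];
        @memset(items, null);
        // Placeholder result: pretend group 0 spans the whole haystack.
        items[0] = Match{ .start = 0, .end = haystack.len };
        return Captures{ .items = items };
    }
};
```

A caller that knows the pattern can pass a fixed-size stack array; one that does not can call capturesLen first and allocate exactly that many slots.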

gurgeous · March 26, 2026, 6:55pm

Excited to check this out, thank you for creating/posting. I was immediately struck by lack of regex when starting on my own zig project. Might be interesting to compare against the regex.zig I created - zig-atoms/regex.zig at main · gurgeous/zig-atoms · GitHub. My version is intended to be a stopgap while we await dedicated regex efforts in zig. Good timing!

A few quick suggestions:

I would love to see many more usage examples in the README.
Maybe a slightly different presentation of supported features would help - see the comment at the top of my thing, for example. Helpful for scanning.
Maybe avoid calling it “regex” and pick something that’s easier for google/llms. The “zig regex” phrase is already crowded with abandoned projects, and your version looks like a more serious effort.

For API design:

I wonder if compile should be re. If you want people to compile often, keep it short! Also make it easier for regex to be case_insensitive, same reason.
Consider isMatch instead of match for clarity. This would also allow you to use match for the other method names, which pairs up nicely with Match or MatchData (which is used by some other languages)
Move the important types into Regex.zig so I can see them at a glance
Merge find and findCaptures? That also requires combining Match and Captures, which would be another nice change IMO.
Probably capturesLen should be private?

Feature requests… These are easy to add and create a ton of value. Esp coming from other languages:

scan to return all matches.
replace for find and replace on a string
grep to filter an array of strings
grep_v, the inverse of grep
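
To make scan concrete, here is a rough sketch of an all-matches iterator layered on a single find call. Span, ScanIterator, and findDigits are hypothetical names; findDigits is a hand-written stand-in for re.find so the example runs on its own.

```zig
const std = @import("std");

const Span = struct { start: usize, end: usize };

/// Stand-in for `re.find`: locates the first run of ASCII digits.
fn findDigits(haystack: []const u8) ?Span {
    var i: usize = 0;
    while (i < haystack.len and !std.ascii.isDigit(haystack[i])) i += 1;
    if (i == haystack.len) return null;
    var j = i;
    while (j < haystack.len and std.ascii.isDigit(haystack[j])) j += 1;
    return Span{ .start = i, .end = j };
}

const ScanIterator = struct {
    haystack: []const u8,
    pos: usize = 0,

    /// Returns the next non-overlapping match, with offsets relative
    /// to the full haystack, or null when there are no more matches.
    fn next(it: *ScanIterator) ?Span {
        if (it.pos > it.haystack.len) return null;
        const m = findDigits(it.haystack[it.pos..]) orelse return null;
        const span = Span{ .start = it.pos + m.start, .end = it.pos + m.end };
        // Step past empty matches so the iterator always makes progress.
        it.pos = if (span.end == span.start) span.end + 1 else span.end;
        return span;
    }
};
```

Searching a sub-slice like this would give wrong answers for ^ and \b, so a real engine would more likely take a start offset into the full haystack. replace and grep can be built on the same loop.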

One other suggestion - my regex.zig comes with zillions of tests, around 50% implementation and 50% tests. Might be useful to quickly port and run in your repo.

Happy to help with any of the above if that would be useful.

quangd42 · March 26, 2026, 8:16pm

Thanks for the suggestions!

Yes, I definitely meant to add a doc comment on the returned type Captures saying that it is for the caller to free, but I forgot. Is that what you mean?

That makes a lot of sense, I’ll have that fixed. Still I think findCapturesAlloc is good to have, as it allows users to skip the ceremony that is the body of findCapturesAlloc itself. Theoretically, one could use the std.heap.FixedBufferAllocator and use it to stack allocate the result. Would you agree?

quangd42 · March 26, 2026, 9:11pm

I see that yours was posted just a few hours before mine, crazy timing!

Those are helpful suggestions for documentation!

Outside of demonstrating the core APIs, what would you be looking for?
Totally agreed on presentation of supported features. I’ll address that soon.
On the name, I know what you mean. I thought about it quite a bit when deciding to post this. I don’t know if naming it something else would help - I just hope that if this is useful and gets more usage then it will start to become more dominant in search results!

find and findCaptures have different workloads. If you don’t need to iterate through all the captures in the match, find is much faster. It is explained briefly here in the README. This is why Captures carries more information than Match. From your implementation, I think my Match ↔ your Match, and my Captures ↔ your MatchData.
capturesLen is to enable caller to size their buffer, so they have a choice between heap and stack allocation.

Thanks for these! Some of them are on my radar, will keep them in mind!

Will look at your tests! I also need to look into getting set up to receive PRs. And thanks for your reply!

cztomsik · March 27, 2026, 6:47am

Nice! I have one mini-impl too: tokamak/src/regex.zig at main · cztomsik/tokamak · GitHub

BTW: There was a nice article about RE# I didn’t know about (different approach, heavier in mem usage if I understand it correctly, but easier to implement backrefs and lookarounds)

It also reminded me of this approach from the PyPy guy (apparently)
I have that in my TODO but so far I didn’t have time to check it out more in depth.
There’s a comment down there about limited backrefs

One question: I don’t understand your comment in sparseset.zig about not inserting into the list. That’s exactly what I do, so why is that a bad thing? I know I was once trying to understand the Rust regex crate and it seemed to be doing way more, but I couldn’t understand why.

github.com/cztomsik/tokamak, src/regex.zig (main), excerpt:

          fn pikevm(code: []const Op, clist: *Sparse, nlist: *Sparse, text: []const u8) bool {
              var sp: usize = 0;
              clist.add(0);
          
              while (true) : (sp += 1) {
                  var i: u32 = 0;
          
                  // NOTE: It is safe to insert during iteration, and this is also how we can avoid recursion.
                  //       It's also a bit similar to what we did in the previous bitset-based impl
                  //       https://github.com/cztomsik/tokamak/blob/7d313d0b4f54192480cfc0684d4fe1731327ff03/src/regex.zig#L497
                  while (i < clist.len) : (i += 1) {
                      const pc = clist.dense[i];
                      const op = code[pc];
          
                      switch (op) {
                          // Anchors
                          .begin => {
                              if (sp == 0) clist.add(pc + 1);
                          },
                          .end => {
                              if (sp == text.len) clist.add(pc + 1);

Cheers!

quangd42 · March 27, 2026, 11:38am

Thanks for the resources! I’m happy to see someone digging through the internals!

I know the feeling!

I see that captures is TODO in your implementation! I think I was doing the same thing you do before implementing captures.
The gist is that it’s a runtime optimization for captures. In Pike’s implementation a Thread is:

struct Thread
{
	Inst *pc;
	char *saved[20];  /* This is the capture data */
};

I’m sure you read this because the article I’m referring to is also linked in your code. So every time a thread is inserted into nlist, the current thread’s capture data needs to be copied to that thread’s capture data, to preserve its capture history. Only matcher states (non-epsilon) will modify capture data, so it’s inefficient to mindlessly copy capture data through all the epsilons, and you can see that there could be as many if not more epsilons than matchers. I think this is roughly what the Rust regex crate does as well, if I remember correctly.
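
A stripped-down sketch of that idea, not the code from PikeVm.zig: State, ThreadList, and the fixed sizes here are hypothetical, and capture-save states are left out. Every visited state is deduplicated, but only matcher states are stored and pay for a slot copy.

```zig
const std = @import("std");

// Hypothetical, simplified program: each state is either an epsilon
// transition to another state or a byte matcher.
const State = union(enum) {
    epsilon: u32, // jump to this state without consuming input
    matcher: u8, // consume this byte
};

const slot_count = 2;

const ThreadList = struct {
    seen: [8]bool = [_]bool{false} ** 8, // dedup for every visited state
    ids: [8]u32 = undefined, // matcher states only, in priority order
    slots: [8][slot_count]usize = undefined, // one row per stored matcher
    len: usize = 0,
    copies: usize = 0, // counts slot copies, for illustration only

    fn add(
        l: *ThreadList,
        prog: []const State,
        id: u32,
        slots: *const [slot_count]usize,
    ) void {
        if (l.seen[id]) return;
        l.seen[id] = true;
        switch (prog[id]) {
            // Epsilon states are followed but never stored: no copy.
            .epsilon => |next| l.add(prog, next, slots),
            // Matcher states are stored along with their capture slots.
            .matcher => {
                l.ids[l.len] = id;
                l.slots[l.len] = slots.*;
                l.len += 1;
                l.copies += 1;
            },
        }
    }
};
```

With a chain of two epsilons leading to one matcher, this performs one slot copy where a list that stores every state would perform three.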

github.com/quangd42/regex.zig, src/engine/PikeVm.zig (575e552e0), excerpt:

          const Count = u32;
          
          fn init(gpa: Allocator, state: Count, matcher: Count, slot: Count) !ThreadList {
              return .{
                  .set = try .init(gpa, state, matcher),
                  .slots = try initSlots(gpa, matcher * slot),
                  .slot_count = slot,
              };
          }
          
          fn add(l: *ThreadList, comptime mode: Mode, id: StateId, slots: []const Offset) void {
              if (!l.set.add(id)) return;
              switch (mode) {
                  .none => {},
                  .bounds => @memcpy(l.slotsFor(id)[0..2], slots[0..2]),
                  .full => @memcpy(l.slotsFor(id), slots),
              }
          }
          
          /// This function must only be called when iterating over the result of `slice()`,
          /// i.e. `id` is assumed to be a member of the set.

Hope that helps!

cztomsik · March 28, 2026, 12:44pm

Ah, that makes sense. I wanted to implement captures later because it felt like doing a lot of work when we are in a match, just to advance those captures. I’m not sure if I can come up with something better but I wanted to give it a try whenever I will have a free week for thinking.

BTW: Here is also one interesting implementation in C (based on re1 but with significant improvements) but I didn’t have enough time to dig in

neurocyte · March 30, 2026, 12:07pm

Very promising library! Thanks for working on this. The zig ecosystem desperately needs a robust regex solution.

neurocyte · March 31, 2026, 7:46pm

I’m interested in adding regex search to flow using your library, but I kind of need unicode support. Unicode support is listed as “longer term”. Are we talking weeks, months or years? I can live without proper unicode in the short term as it probably doesn’t mean a whole lot for most programming usecases, but eventually someone is going to complain that they can’t match emojis or something.

quangd42 · March 31, 2026, 9:14pm

Hey! I actually had your editor in mind when developing this library, glad you found it!

“longer term” just means that I had it at the bottom of the list, but I’d be happy to reprioritize for a real user. I haven’t really looked into what it takes yet, but perhaps a couple of months at most. Is there any (low hanging fruit) syntax feature you would want?

Btw, just looked at your project again after a while, it has grown quite impressive!

pachde · April 1, 2026, 1:15am

For regex impl’s, other than processing syntax for unicode properties or escapes, at the engine level you’re matching byte patterns, and with care at the parsing level, the fact that you’re processing some encoding of text doesn’t matter.
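
A small sketch of the literal case, with hypothetical helper names findBytes and findCodepoint: the code point is lowered to its UTF-8 bytes once, and the search itself only ever compares bytes. Classes and ranges over non-ASCII code points need more work than this.

```zig
const std = @import("std");

/// Naive byte-level substring search; stands in for a byte-oriented engine.
fn findBytes(haystack: []const u8, needle: []const u8) ?usize {
    if (needle.len > haystack.len) return null;
    var i: usize = 0;
    while (i + needle.len <= haystack.len) : (i += 1) {
        if (std.mem.eql(u8, haystack[i..][0..needle.len], needle)) return i;
    }
    return null;
}

/// "Parse time" step: lower a code point to its UTF-8 bytes, then let
/// the byte-level search do the rest. Returns a byte offset.
fn findCodepoint(haystack: []const u8, cp: u21) !?usize {
    var buf: [4]u8 = undefined;
    const n = try std.unicode.utf8Encode(cp, &buf);
    return findBytes(haystack, buf[0..n]);
}
```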

awesomo4000 · April 1, 2026, 6:23pm

I went down the rabbit hole on that RE# implementation - here’s a start on a zig version:

disclaimer: ai used

quangd42 · April 15, 2026, 9:57pm

A few updates!

From the feedback here, I cleaned up the public search/capture API a bit:
- findCaptures now returns an error if the passed in buffer is not large enough.
- findCapturesAlloc variant is removed.
Updated the supported-syntax.md doc to be more scannable.
inline flags are now supported for i, m, s, U. That includes:
- global forms like (?imsU)
- scoped forms like (?i:...)
- toggled forms like (?i-m:...)
named captures are now supported, written as (?<name>...) or (?P<name>...)
- named capture metadata is available from the compiled regex with captureIndex(name) and captureNames()
- Captures.name(name) looks up the Match by name.

Small example:

const std = @import("std");
const Regex = @import("regex");

pub fn main() !void {
    const gpa = std.heap.page_allocator;

    var re = try Regex.compile(
        gpa,
        "(?i)(?<month>\\d\\d)/(?<day>\\d\\d)/(?P<year>\\d\\d\\d\\d)",
        .{},
    );
    defer re.deinit();

    var buf: [8]?Regex.Match = undefined;
    const haystack = "Date=03/18/2026";
    const caps = try re.findCaptures(haystack, &buf) orelse unreachable;

    std.debug.print(
        "month index={}, year={s}\n",
        .{ re.captureIndex("month").?, caps.name("year").?.bytes(haystack) },
    );
}

Would love to hear some more feedback!

gurgeous · April 16, 2026, 12:25am

Excellent! Can you give us a sense of build size (with ReleaseSmall)? I would definitely consider adding to my project if it doesn’t plump up the binary too badly. Small binaries is one of my favorite zig features. I am addicted to regex across many languages and it felt strange to create adhoc string scanners. Weird flex but I’ve personally written thousands of regex for use in production systems…

I know I mentioned this offhand last time, but I’m happy to pitch in a bit. If nothing else, I can burn some credits asking the various LLMs to look for code smells and missing tests. I probably need to update my kcov stuff to work with 0.16 as well. If you aren’t using kcov yet, that’s a nice way to look for test gaps.

guotie · April 16, 2026, 2:07am

what about the performance

quangd42 · April 16, 2026, 3:50pm

When I build the demo main.zig with ReleaseSmall, the binary size is 111kb. This is of course without unicode support, which I think may increase the binary size quite a bit if you use it.

I’m curious, what is your project and how does it use regex? I would love to know which regex features are more useful!

I appreciate the offer! I want to work on the features myself, because I want to get the reps in and experiment with how the language helps to solve different problems. I lose out on all that if I have LLMs work on it. However, if you do find code smells or code quality issues in general feel free to open a PR! Same for tests, however in my experience LLMs can generate many many tests that are overly specific and get outdated quickly, so please be careful there too. I’m just trying to be honest, hopefully that makes sense.

Admittedly I haven’t spent much time thinking about fuzzing. I was hoping that by the time this project is somewhat stable, Zig’s builtin fuzzing would also be usable, but we’ll see.

quangd42 · April 16, 2026, 3:57pm

Algorithm wise, you might want to read about it here: Regular Expression Matching Can Be Simple And Fast

Implementation wise, I haven’t done any benchmarking yet, because the feature set is incomplete, and I’m aware of many optimizations to be done after all (desired) features are in. But I’m hoping that by the end it will be at least competitive with rust-regex, which is one of the references for this project.

Does that answer your question?

quangd42 · April 16, 2026, 4:19pm

@cztomsik @gurgeous I hope it’s ok to tag you guys, but I want to pick your brains a bit if that’s ok, because you have expertise and showed some interest in this.

Currently the two public APIs I have are:

pub fn find(re: *Regex, haystack: []const u8) ?Match
pub fn findCaptures(re: *Regex, haystack: []const u8, buffer: []?Match) !?Captures

pub const Match = struct { start: usize, end: usize };
pub const Captures = struct {
    items: []?Match,
    // more internals
};

I know that I need to implement a findAll* set of APIs that returns all matches (with or without captures data), which might look like this:

pub fn findAll(re: *Regex, haystack: []const u8) Iterator(Match)
pub fn findAllCaptures(re: *Regex, haystack: []const u8, buffer: []?Match) !Iterator(Captures)

I’m trying to think of a way to maybe have this shape:

pub fn find(re: *Regex, haystack: []const u8, opts: Options) ReturnType

where Options would control whether on each match, return only match span or all capturing spans. I wonder if ReturnType can be one type Match that can serve as a Span or Captures cleanly? (Not a union, because then caller has to switch on it). However, findCaptures currently takes caller-supplied buffer, and find does not.

Either way, with one ReturnType findAll call can just return Iterator(ReturnType).

At the end of the day I can just make 4 functions, but I want to try to see if there is a better way. Any recommendation is appreciated!
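
One possible shape for that single return type, as a rough sketch only: Span, the groups field, and group() are hypothetical names, and this says nothing about where the group storage comes from.

```zig
const std = @import("std");

const Span = struct { start: usize, end: usize };

/// A Match always carries the overall span. `groups` is an optional
/// view of per-group spans (index 0 mirrors `span`) and stays empty
/// when captures were not requested.
const Match = struct {
    span: Span,
    groups: []const ?Span = &.{},

    /// Group 0 is always available; other groups are null when they
    /// did not participate or when captures were not requested.
    fn group(m: Match, index: usize) ?Span {
        if (index == 0) return m.span;
        if (index >= m.groups.len) return null;
        return m.groups[index];
    }
};
```

The caller never switches on a union, but find would still need somewhere to put the groups, for example an optional buffer field in Options.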

gurgeous · April 16, 2026, 5:30pm

My project is tennis, a cli which takes csv/json/sqlite as input and prints out nice color-coded tables. There are many features which I would normally do with regex, but instead I had to hand write parsing code. One example - if input is coming from stdin, how do you detect json vs csv? With regex that would be a one-liner like /^\s*\[\s*\{\s*"/. Without regex, well, I think I ended up removing whitespace from the first 16 bytes and then doing a string compare.
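
For comparison, a hand-written check equivalent to that pattern might look roughly like this. It is a sketch, not the code in tennis: the function names are made up, and the whitespace set is assumed to match RE2's \s.

```zig
const std = @import("std");

/// Whitespace set assumed to mirror RE2's `\s`: tab, newline,
/// form feed, carriage return, space.
fn isSpace(c: u8) bool {
    return switch (c) {
        '\t', '\n', 0x0c, '\r', ' ' => true,
        else => false,
    };
}

/// Returns the index of the first non-whitespace byte at or after `start`.
fn skipSpace(input: []const u8, start: usize) usize {
    var i = start;
    while (i < input.len and isSpace(input[i])) i += 1;
    return i;
}

/// Hand-written equivalent of ^\s*\[\s*\{\s*" at the start of input.
fn looksLikeJsonArrayOfObjects(input: []const u8) bool {
    var i = skipSpace(input, 0);
    for ("[{\"") |expected| {
        if (i >= input.len or input[i] != expected) return false;
        i = skipSpace(input, i + 1);
    }
    return true;
}
```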

I also have clunky code for detecting ints or floats in columns. I need my regex!

quangd42 wrote: "I appreciate the offer! I want to work on the features myself, because I want to get the reps in and experiment with how the language helps to solve different problems. I lose out on all that if I have LLMs work on it. However, if you do find code smells or code quality issues in general feel free to open a PR! Same for tests, however in my experience LLMs can generate many many tests that are overly specific and get outdated quickly, so please be careful there too. I’m just trying to be honest, hopefully that makes sense."

No worries! Personally I am happy to let the LLMs write and maintain the tests. You gotta have great test coverage if you want LLMs to do anything useful with the codebase. For actual features I prefer to get my hands dirty and mostly use the LLMs for code review.