Regex.zig: a native Zig regex engine in the RE2 family

I’ve been working on regex.zig, a native regular expression engine for Zig in the RE2 family.

The main goal is an eventual production-grade Zig regex package with linear-time matching semantics, rather than a limited-scope engine or a wrapper around another library.

Current status:

  • Pike VM backend
  • literals, concatenation, alternation
  • capturing and non-capturing groups
  • repetition operators including lazy forms
  • Perl classes, bracket classes, POSIX classes
  • assertions and boundaries: ^, $, \A, \z, \b, \B
  • global flags through compile options: i, m, s, U
  • leftmost-first search semantics
  • ASCII-only for now

The repo is here:

Small example:

const std = @import("std");
const Regex = @import("regex");

pub fn main() !void {
    const gpa = std.heap.page_allocator;

    var re = try Regex.compile(gpa, "(\\d\\d)/(\\d\\d)/(\\d\\d\\d\\d)", .{});
    defer re.deinit();

    if (re.find("date=03/18/2026")) |m| {
        std.debug.print("match at [{}, {})\n", .{ m.start, m.end });
    }
}

Public API surface (with comments):


pub fn compile(gpa: Allocator, pattern: []const u8, options: Options) !Regex
pub fn match(re: *Regex, haystack: []const u8) bool
pub fn find(re: *Regex, haystack: []const u8) ?Match
pub fn findCaptures(re: *Regex, haystack: []const u8, buffer: []?Match) ?Captures
pub fn findCapturesAlloc(re: *Regex, gpa: Allocator, haystack: []const u8) !?Captures

A few things that might be different in this package compared to other Zig regex work:

  • It is an RE2-family engine, which means worst-case O(m * n) matching time (m being the pattern size and n the haystack length). The downside is that not every PCRE-style feature will be implemented; backreferences, for example, do not fit this model.
  • The test setup is fairly serious already. Supported and unsupported behavior is tracked in an explicit capability matrix, which is used during development as a capability gate, and will also be used to test other backends later on. This is inspired by the rust-lang/regex setup.
  • The available Pike VM backend has the typical Pike VM optimizations: query-cost split between match/find/findCaptures, sparse-set thread dedup, and reused thread lists; plus a literal-prefix fast path for unanchored search.
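
To make "sparse-set thread dedup" concrete, here is a minimal sketch of the general data structure. It is not the code from the repo: the SparseSet name, its fields, and the fixed comptime capacity are assumptions made for illustration. The idea is that each instruction index can be added to a thread list at most once per input position, and the list can be cleared in O(1) between positions.

```zig
const std = @import("std");

/// Hypothetical sparse set over ids in [0, capacity); not regex.zig's actual type.
fn SparseSet(comptime capacity: usize) type {
    return struct {
        const Self = @This();

        // dense[0..len] holds the inserted ids in insertion order.
        dense: [capacity]u32 = undefined,
        // sparse[id] is the index of id inside dense, if id is present.
        sparse: [capacity]u32 = [_]u32{0} ** capacity,
        len: u32 = 0,

        pub fn contains(self: *const Self, id: u32) bool {
            const i = self.sparse[id];
            return i < self.len and self.dense[i] == id;
        }

        /// Returns true if id was newly added, false if it was a duplicate.
        pub fn insert(self: *Self, id: u32) bool {
            if (self.contains(id)) return false;
            self.dense[self.len] = id;
            self.sparse[id] = self.len;
            self.len += 1;
            return true;
        }

        /// O(1): stale sparse entries are rejected by the checks in contains.
        pub fn clear(self: *Self) void {
            self.len = 0;
        }
    };
}

test "duplicate thread ids are rejected" {
    var set: SparseSet(8) = .{};
    try std.testing.expect(set.insert(3));
    try std.testing.expect(!set.insert(3));
}
```

Insertion order is preserved in dense, which is what keeps thread priority (and so leftmost-first semantics) intact when the list is walked.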

There are many things in the pipeline, including:

  • fuller syntax coverage
  • API refinement, better docs
  • more backends beyond Pike VM
  • longer term, Unicode support

So it is far from finished, but I want to post it here to get some opinions from the community:

  • The current public API set is mostly inspired by rust-regex. I wonder what would feel right for Zig?
  • findCaptures and findCapturesAlloc try to follow the Zig-style memory ownership model, but I find them a bit clunky. I wonder if there is any obvious improvement here?
  • Would a future explicit input struct for bounds/anchoring be preferable to adding more top-level methods?
  • Part of the reason for this repo is that I want to push SoA-style layouts where they actually make sense in Zig. In particular, I’ve thought about using something closer to the Zig compiler’s ExtraData style for variable node payloads in the parser instead of scattered slices. I held off because regex patterns are often small, and maybe all that machinery would make performance worse? Of course, to know for sure I’ll have to measure it, but I’d still like to hear how others think about that tradeoff.

If anyone wants to look at the repo and comment on API shape, package ergonomics, or internal representation choices, that would be very welcome!

18 Likes

Yeah, the allocating version at least needs docs, since it frees the buffer itself if there are no capture matches but otherwise leaves freeing to the caller. That seems a bit brittle.

Since capturesLen is part of the API, you could also just delete findCapturesAlloc and then return error.BufferTooSmall in findCaptures if the buffer is shorter than group_count.

You already assert this fact elsewhere, but if you make it an error I think you have a nice API where the caller can decide whether to pass a stack variable or to heap allocate when calling findCaptures.
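
A rough sketch of what that shape could look like. Everything here is a stand-in: StubRegex, its group_count field, and the return type are assumptions, and the "engine" is a stub that reports the whole haystack as group 0. Only the error-instead-of-assert pattern is the point.

```zig
const std = @import("std");

const Match = struct { start: usize, end: usize };

/// Hypothetical stand-in for a compiled regex; the matching logic is a stub.
const StubRegex = struct {
    /// Number of capture slots, including group 0; assumed to be at least 1.
    group_count: usize,

    pub fn capturesLen(self: StubRegex) usize {
        return self.group_count;
    }

    pub fn findCaptures(
        self: StubRegex,
        haystack: []const u8,
        buffer: []?Match,
    ) error{BufferTooSmall}!?[]?Match {
        // Report an undersized buffer as an error instead of asserting.
        if (buffer.len < self.group_count) return error.BufferTooSmall;
        if (haystack.len == 0) return null; // stub: empty input never matches
        const slots = buffer[0..self.group_count];
        @memset(slots, null);
        slots[0] = Match{ .start = 0, .end = haystack.len };
        return slots;
    }
};

test "caller picks stack storage" {
    const re = StubRegex{ .group_count = 3 };
    var buf: [3]?Match = undefined;
    const caps = (try re.findCaptures("abc", &buf)).?;
    try std.testing.expectEqual(@as(usize, 3), caps[0].?.end);
}
```

A caller that only knows the group count at runtime would call capturesLen first and heap allocate a buffer of that length; the function itself never allocates or frees.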

1 Like

Excited to check this out, thank you for creating/posting. I was immediately struck by lack of regex when starting on my own zig project. Might be interesting to compare against the regex.zig I created - zig-atoms/regex.zig at main · gurgeous/zig-atoms · GitHub. My version is intended to be a stopgap while we await dedicated regex efforts in zig. Good timing!

A few quick suggestions:

  • I would love to see many more usage examples in the README.
  • Maybe a slightly different presentation of supported features would help - see the comment at the top of my thing, for example. Helpful for scanning.
  • Maybe avoid calling it “regex” and pick something that’s easier for google/llms. The “zig regex” phrase is already crowded with abandoned projects, and your version looks like a more serious effort.

For API design:

  • I wonder if compile should be re. If you want people to compile often, keep it short! Also make it easier for regex to be case_insensitive, same reason.
  • Consider isMatch instead of match for clarity. This would also allow you to use match for the other method names, which pairs up nicely with Match or MatchData (which is used by some other languages)
  • Move the important types into Regex.zig so I can see them at a glance
  • Merge find and findCaptures? This would also require combining Match and Captures, which would be another nice change IMO.
  • Probably capturesLen should be private?

Feature requests… These are easy to add and create a ton of value. Esp coming from other languages:

  • scan to return all matches.
  • replace for find and replace on a string
  • grep to filter an array of strings
  • grep_v, the inverse of grep

One other suggestion - my regex.zig comes with zillions of tests, around 50% implementation and 50% tests. Might be useful to quickly port and run in your repo.

Happy to help with any of the above if that would be useful.

1 Like

Thanks for the suggestions!

Yes, I definitely meant to add a doc comment on the returned type Captures saying that it is for the caller to free, but I forgot. Is that what you mean?

That makes a lot of sense, I’ll have that fixed. Still, I think findCapturesAlloc is good to have, as it lets users skip the ceremony that is the body of findCapturesAlloc itself. In principle, one could pass it a std.heap.FixedBufferAllocator backed by a stack array to get a stack-allocated result. Would you agree?
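
A small sketch of the FixedBufferAllocator idea. findCapturesAllocStub is a hypothetical stand-in with a stubbed match (the real findCapturesAlloc takes a Regex and returns Captures); std.heap.FixedBufferAllocator and its init/allocator calls are the actual standard library API.

```zig
const std = @import("std");

const Match = struct { start: usize, end: usize };

/// Hypothetical stand-in for findCapturesAlloc: allocates one slot per group
/// and reports the whole haystack as group 0. group_count is assumed >= 1.
fn findCapturesAllocStub(
    gpa: std.mem.Allocator,
    group_count: usize,
    haystack: []const u8,
) !?[]?Match {
    if (haystack.len == 0) return null; // stub: nothing allocated on no match
    const slots = try gpa.alloc(?Match, group_count);
    @memset(slots, null);
    slots[0] = Match{ .start = 0, .end = haystack.len };
    return slots;
}

test "allocating API backed by a stack buffer" {
    // The "heap" here is a plain array on the stack.
    var backing: [256]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&backing);

    const caps = (try findCapturesAllocStub(fba.allocator(), 3, "abc")).?;
    try std.testing.expectEqual(@as(usize, 3), caps.len);
}
```

If the backing array is too small, the call fails with error.OutOfMemory, so the caller still has to size the buffer sensibly.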

1 Like

I see that yours was posted just a few hours before mine, crazy timing!

Those are helpful suggestions for documentation!

  • Outside of demonstrating the core APIs, what would you be looking for?
  • Totally agreed on presentation of supported features. I’ll address that soon.
  • On the name, I know what you mean. I thought about it quite a bit when deciding to post this. I don’t know if naming it something else would help - I just hope that if this is useful and gets more usage then it will start to become more dominant in search results!
  • find and findCaptures have different workloads. If you don’t need to iterate through all the captures in the match, find is much faster. It is explained briefly here in the README. This is why Captures carries more information than Match. From your implementation, I think my Match ↔ your Match, and my Captures ↔ your MatchData.
  • capturesLen is there so callers can size their buffer, which gives them a choice between heap and stack allocation.

Thanks for these! Some of them are on my radar, will keep them in mind!

Will look at your tests! I also need to look into getting set up to receive PRs. And thanks for your reply!

1 Like

Nice! I have one mini-impl too :smiley: tokamak/src/regex.zig at main · cztomsik/tokamak · GitHub

BTW: There was a nice article about RE# I didn’t know about (different approach, heavier in mem usage if I understand it correctly, but easier to implement backrefs and lookarounds)

It also reminded me of this approach from the PyPy guy (apparently)
I have that in my TODO but so far I didn’t have time to check it out more in depth.
There’s a comment down there about limited backrefs

One question: I don’t understand your comment in sparseset.zig about not inserting into the list. That’s exactly what I do, so why is that a bad thing? I know I was once trying to understand the Rust regex crate and it seemed to be doing way more, but I couldn’t understand why.

Cheers!

2 Likes

Thanks for the resources! I’m happy to see someone digging through the internals!

I know the feeling! :smiley:

I see that captures is TODO in your implementation! I think I was doing the same thing you do before implementing captures.
The gist is that it’s a runtime optimization for captures. In Pike’s implementation a Thread is:

struct Thread
{
	Inst *pc;
	char *saved[20];  /* This is the capture data */
};

I’m sure you read this because the article I’m referring to is also linked in your code. So every time a thread is inserted into nlist, the current thread’s capture data needs to be copied to that thread’s capture data, to preserve its capture history. Only matcher states (non-epsilon) will modify capture data, so it’s inefficient to blindly copy capture data through all the epsilons, and there can be as many epsilons as matchers, if not more. I think this is roughly what the Rust regex crate does as well, if I remember correctly.
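
To illustrate the idea, here is one common way to avoid the per-epsilon copy, as a sketch only: the Inst, Slots, and Closure names are made up for this example, and this is not necessarily how regex.zig or the Rust crate lays things out. Capture slots live in a single scratch array while the epsilon closure is being followed, and are copied into a per-state table only when a consuming (or match) state is reached.

```zig
const std = @import("std");

/// Hypothetical instruction set for the sketch.
const Inst = union(enum) {
    char: u8, // consumes one byte
    split: struct { x: usize, y: usize }, // epsilon: try x first, then y
    save: usize, // epsilon: record the position in a capture slot
    match,
};

const Slots = [2]?usize;

const Closure = struct {
    prog: []const Inst,
    visited: []bool, // one flag per instruction, reset per input position
    table: []Slots, // capture data, stored only for non-epsilon states
    scratch: Slots = [_]?usize{ null, null },
    copies: usize = 0, // counts slot copies, to make the saving visible

    fn add(self: *Closure, pc: usize, pos: usize) void {
        if (self.visited[pc]) return;
        self.visited[pc] = true;
        switch (self.prog[pc]) {
            .split => |s| {
                self.add(s.x, pos);
                self.add(s.y, pos);
            },
            .save => |slot| {
                // Update the scratch slot, explore, then restore it.
                const old = self.scratch[slot];
                self.scratch[slot] = pos;
                self.add(pc + 1, pos);
                self.scratch[slot] = old;
            },
            .char, .match => {
                // The only place capture data is copied.
                self.table[pc] = self.scratch;
                self.copies += 1;
            },
        }
    }
};

test "slots are copied only for consuming states" {
    const prog = [_]Inst{
        .{ .save = 0 },
        .{ .split = .{ .x = 2, .y = 3 } },
        .{ .char = 'a' },
        .{ .char = 'b' },
    };
    var visited = [_]bool{false} ** prog.len;
    var table: [prog.len]Slots = undefined;
    var c = Closure{ .prog = &prog, .visited = &visited, .table = &table };
    c.add(0, 5);
    // Four states were visited, but only the two char states got a copy.
    try std.testing.expectEqual(@as(usize, 2), c.copies);
}
```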

Hope that helps!

2 Likes

Ah, that makes sense. I wanted to implement captures later because it felt like doing a lot of work when we are in a match, just to advance those captures. I’m not sure if I can come up with something better but I wanted to give it a try whenever I will have a free week for thinking.

BTW: Here is also one interesting implementation in C (based on re1 but with significant improvements) but I didn’t have enough time to dig in :slight_smile:

1 Like

Very promising library! Thanks for working on this. The zig ecosystem desperately needs a robust regex solution.

2 Likes

I’m interested in adding regex search to flow using your library, but I kind of need unicode support. Unicode support is listed as “longer term”. Are we talking weeks, months or years? I can live without proper unicode in the short term as it probably doesn’t mean a whole lot for most programming usecases, but eventually someone is going to complain that they can’t match emojis or something. :sweat_smile:

4 Likes

Hey! I actually had your editor in mind when developing this library, glad you found it!

“longer term” just means that I had it at the bottom of the list, but I’d be happy to reprioritize for a real user. I haven’t really looked into what it takes yet, but perhaps a couple of months at most. Is there any (low hanging fruit) syntax feature you would want?

Btw, just looked at your project again after a while, it has grown quite impressive!

3 Likes

For regex impls, other than processing syntax for Unicode properties or escapes, at the engine level you’re matching byte patterns. With care at the parsing level, the fact that you’re processing some encoding of text doesn’t matter.
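
As a concrete illustration of that point (a sketch with made-up names, not tied to any of the libraries in this thread): "any Unicode scalar value" can be expressed as a fixed set of alternatives, each a short sequence of byte ranges, so a byte-oriented engine can match UTF-8 text without decoding it.

```zig
const std = @import("std");

const ByteRange = struct { lo: u8, hi: u8 };

fn r(lo: u8, hi: u8) ByteRange {
    return .{ .lo = lo, .hi = hi };
}

// UTF-8 continuation byte.
const cont = r(0x80, 0xBF);

// Well-formed UTF-8 as alternatives of byte-range sequences. Overlong
// encodings and surrogates (U+D800..U+DFFF) are excluded by the ranges.
const any_scalar = [_][]const ByteRange{
    &[_]ByteRange{r(0x00, 0x7F)},
    &[_]ByteRange{ r(0xC2, 0xDF), cont },
    &[_]ByteRange{ r(0xE0, 0xE0), r(0xA0, 0xBF), cont },
    &[_]ByteRange{ r(0xE1, 0xEC), cont, cont },
    &[_]ByteRange{ r(0xED, 0xED), r(0x80, 0x9F), cont },
    &[_]ByteRange{ r(0xEE, 0xEF), cont, cont },
    &[_]ByteRange{ r(0xF0, 0xF0), r(0x90, 0xBF), cont, cont },
    &[_]ByteRange{ r(0xF1, 0xF3), cont, cont, cont },
    &[_]ByteRange{ r(0xF4, 0xF4), r(0x80, 0x8F), cont, cont },
};

/// Returns the byte length of the scalar value encoded at the start of
/// bytes, or null if no alternative matches.
fn matchScalar(bytes: []const u8) ?usize {
    outer: for (any_scalar) |seq| {
        if (bytes.len < seq.len) continue;
        for (seq, 0..) |range, i| {
            if (bytes[i] < range.lo or bytes[i] > range.hi) continue :outer;
        }
        return seq.len;
    }
    return null;
}

test "an emoji is just four bytes in four ranges" {
    // U+1F605, encoded as F0 9F 98 85.
    try std.testing.expectEqual(@as(?usize, 4), matchScalar("\xf0\x9f\x98\x85"));
}
```

A regex compiler would emit these alternatives as ordinary byte-range instructions, so the matching loop itself stays encoding-agnostic.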

1 Like

I went down the rabbit hole on that RE# implementation - here’s a start on a zig version:

disclaimer: ai used

3 Likes