Minimum Viable Zig Regex

mnemnion · July 16, 2024, 12:16am

mvzr: The Minimum Viable Zig Regex Library

Finding myself in need of a regular expressions library for a Zig project, and needing it to build regex at runtime, not just comptime, I ended up speedrunning a little library for just that purpose.

This is that library. It’s a simple bytecode-based Commander Pike-style VM. Less than 1000 lines of load-bearing code, no dependencies other than std.

Features

Zero Allocation
Possessive qualifiers: *, +, ?
Lazy qualifiers: *?, +?, ??
Alternation: foo|bar|baz
Grouping foo|(bar|baz)+|quux
Sets: [abc], [^abc], [a-z], [^a=z]
Built-in character groups (ASCII): \w, \W, \s, \S, \d, \D
Escape sequences: \t, \n, \r, \xXX hex format
Begin and end ^ and $

Limitations and Quirks

Only 64 operations per regex
Only 8 character sets per regex (ASCII only)
- These values could probably be made comptime-configurable, but I didn’t
No Unicode support to speak of
No fancy modifiers (you want case-insensitive, great, lowercase your string)
. matches anything (split into lines first if you don’t like this)
Backtracks (sorry. For this to work without backtracking, we need async back)
Compiler does some best-effort validation but I haven’t really pounded on it
No min/max {X,}, {X,Y}. Might get around to it (escape your curlies just in case)

As long as you color within the lines, it should be fine.

Bugs

I wrote this thing in two days flat (check the log). There are bugs. Point them out and I might even fix them.

Feel free to kick the tires and try it out, let me know how it goes!

AndrewCodeDev · July 16, 2024, 1:07am

Two days - not all heroes are born wearing capes…

Leonardo-Dicaprio-Cheers

I’d be happy to move some of our testing suite from fluent into your runtime library. We have a bunch of edge-case tests for regex. You’re also free to strip-mine that for examples… it really helps to have.

Either way, great work! I’ll give it a spin and maybe PR some testing code!

AndrewCodeDev · July 16, 2024, 1:10am

Also, backtracking is fairly simple in my experience. If you get the body of your work solved, it’s not terrible to do (I’d have to read your code to see what it would take to add that). The easiest thing to do is just peek ahead and see if the next match would work. If it does, you set that as the current max match when you’re in a quantifier that can consume multiple characters.

I suppose that’s like “best next potential max” where you can backtrack by at least one. If you want to do the entire tree, you can guarantee it, but it’s more expensive.

mnemnion · July 16, 2024, 1:46am

Might work, I’m sure I’ll putter away at this code more over time. It’s a bit in tension with the divide-and-conquer approach that makes alt and grouping easy. The bytecode approach is fairly different from building up a DFA/NFA in terms of how the regex program is structured.

Async would make a zero-backtrack version of this almost trivial, not to mention elegant: alts could split, and call qualifiers with a suspend point, any alt which isn’t keeping up gets dropped, and that would work recursively also. Getting lockstep behavior out of the VM for alts is eh, that’s a lot of state to keep around, and it gets worse for recursive groups which are one of the alts.

This library doesn’t have “docile” qualifiers, just possessive and lazy, so it’s less prone to catastrophic backtracking. One can build some quadratic behavior pretty easily with a lazy qualifier and a greedy qualifier of the same pattern, but with a $ that doesn’t match the string.

So yeah, there are no doubt ways to spruce it up a bit. Siempre.

AndrewCodeDev · July 16, 2024, 1:52am

Yeah, I saw an implementation that splits on different threads - interesting but not comptime friendly. I guess it’s all about your goals, right?

One thing I thought you might consider is a technique like standard options (think of the logging utilities) for the size of the regex operations (you currently have it set to 64). If you don’t detect that the user has provided an option, you default to a specific size. Have you considered trying that?

JPL · July 16, 2024, 7:41am

hello,
I’m particularly interested in a non-comptime regex for a form code generator

I’m going to test it, but I would like it to work in UTF8
I provided a test on github

mnemnion · July 16, 2024, 11:41am

mvzr compiles at comptime just fine, fwiw. It would probably match at comptime as well, although I haven’t tried it. Literal full-fat OS threads would be, well, a very different program.

This library (the hint is in the name) is aiming squarely at bang-for-buck. I’ll probably cover a few more features, improve compiler validation a bit, and fix any bugs, but one of the goals is to have something small, zero dependency, with practical and pedagogical value.

I might change the qualifier behavior to be greedy rather than possessive, because that’s what people are used to. Should be tractable to add that without blowing the complexity budget.

And if and when we get frames back, I’ll for sure want to modify the code to use them to get a lockstep VM. Meanwhile, it’ll backtrack. Shoganai.

That’s a possibility, yes. Another one is to have the type specialize on the number of operations and sets as comptime parameters. Probably want to provide Regex as an omakase zero-config type, and pick some other name (idk, SizedRegex maybe) for one which is configured with different limits.

The downside of doing it as a build option is that that then becomes the stack size of every regex, so if you need one big one, then all of them are big now. The upside is that it’s a trivial change, but I don’t think adding comptime configurability will prove to be that difficult.

UTF-8 multi-byte characters work fine right now, but sets of UTF-8 characters won’t be supported. Bit ironic, given that I just released a library for UTF-8 character sets, but that design isn’t compatible with mvzr. In some cases, you can fake it, with a regex like "(é|è|ç|[a-z])+", just be sure that your needle and your haystack are normalized the same way.

Time permitting, this won’t be the last pattern-matching library I release in Zig. And if someone wants to tackle “production ready regex” for Zig, I’d enjoy chipping in on that. Serious offer, it’s something the ecosystem should have, and it would be a fun project. Just, not one I intend to spearhead.

JPL · July 16, 2024, 5:05pm

This is what I do with the pcre2 lib and it works well and even with one character (for the test)

mnemnion · July 16, 2024, 7:04pm

Fair warning, you’d need to do (á)*, not just á*, as well. I might be able to handle that at the compiler level at some point.

mnemnion · July 17, 2024, 2:31am

v0.0.4 of mvzr adds the accustomed behavior for qualifiers .*, .+, .?. The possessive versions are renamed .*+, .++, and .?+, as is the fashion.

Also added are numbered repetitions, {M}, {M,} and {M,N}, up to one u8 for each (and we do check to see if the range makes sense). Be aware that up to N - M stack frames are taken up with backtrack points when the last of these is used.

The interface has also been normalized such that everything returns a Match struct, with a slice of the match, and the bookends which created that slice. Oh and there’s a boolean version of the match as well.

mnemnion · July 18, 2024, 12:33am

This turned out to be almost startlingly easy to do.

JPL · July 19, 2024, 7:11am

I did tests the borrowing is useless, and functional thank you

mnemnion · July 19, 2024, 9:03pm

mvzr is now at v0.1.0. The list of features is now the following:

Zero allocation, comptime and runtime compiling and matching
X operations per regex (defaults to 64)
Y unique character sets per regex (defaults to 8)
Greedy qualifiers: *, +, ?
Lazy qualifiers: *?, +?, ??
Possessive/eager qualifiers: *+, ++, ?+
Alternation: foo|bar|baz
Grouping foo|(bar|baz)+|quux
Built-in character groups (ASCII): \w, \W, \s, \S, \d, \D
Sets: [abc], [^abc], [a-z], [^a-z], [\w+-], [\x04-\x1b]
Escape sequences: \t, \n, \r, \xXX hex format
- Same set as Zig: if you need the weird C ones, use \x format
Begin and end ^ and $
Word boundaries \b, \B
{M}, {M,}, {M,N}, {,N}

Which is basically the set I intend to support.

Time to switch from working on the library to working with it…

jiacai2050 · September 6, 2024, 9:39am

It’s a simple bytecode-based Commander Pike-style VM

I wonder what Pike-style means…

mnemnion · September 6, 2024, 2:28pm

Here’s a good explanation of the history, and implementation choices. Rob “Commander” Pike is an important figure in Unix and Plan 9 history, as well as the creator of the Go language.

mvzr is decidedly a bytecode VM, rather than compiling a DFA/NFA, for example. In its specifics, it’s actually most similar to the LPeg style of parsing VM, itself based on Donald Knuth’s parsing machine of 1971. Both of those use a ‘user’ stack, but since regular expressions can’t describe recursive grammars, that wasn’t necessary for mvzr.

The main constraint guiding the implementation is zero allocation: I was willing to trade some performance (particularly for ‘adversarial’ regexes, because tolerating adversarial grammars is a non-goal of the library), and some features (grouping) to work within that constraint.

I may be able to squeeze a few more affordances into the design, but not many. The goal was to write a Zig-native regular expression engine, which uses only stack space, compiles and matches at both comptime and runtime, and is a simple drop-in library with a lot of bang-for-buck. If we ever get async back, I could use that to add lockstepping (see the first link) with few other changes to the design.

I’m interested in writing something quite a bit more sophisticated, and have been laying groundwork for that starting with the first Zig repo I ever created. But regular expressions are a strict subset of what I’m interested in there, and it operates in a very different design space.

If someone wants to tackle a ‘production-grade’ regex engine in Zig, I think that would be a great project and would be happy to pitch in. It isn’t something I want to spearhead however.

mnemnion · November 10, 2024, 5:41pm

Version 0.2.3 is out, this is a bugfix release.

yataro · November 11, 2024, 6:50am

Thank you! After updating to 0.2.3 I got my tests broken. After taking a close look I discovered that I’ve had a logic error somewhere that depended on regex bug and it was not directly addressed but fixed in 0.2.3:

const std = @import("std");
const mvzr = @import("mvzr");

const regex = mvzr.compile("ab*$").?;

test {
    try std.testing.expect(!regex.isMatch("abc"));
}

This test will pass on 0.2.3 but fail on 0.2.2. mvzr works really well for me. The only thing I miss is capture groups, but I managed to workaround that with some crude slicing .

EDIT: Found a bug in new version:

const std = @import("std");
const mvzr = @import("mvzr");

const regex = mvzr.compile("a*$").?;

test {
    try std.testing.expect(regex.isMatch("bbbbb"));
}

slonik-az · November 11, 2024, 2:25pm

I would kill for capture groups, certainly would sacrifice a non-allocation purity if one can optionally enable allocations on per reg-expr basis.

yataro · November 11, 2024, 3:22pm

I think it is possible to infer count of capture groups from regex on comptime and use generic return type for match, no allocations introduced. For runtime I guess something like SizedMatch will do the trick.
I only do simple regexes, so I don’t really need all the power of capture groups. The minimum in mvzr probably means that there will be no full-featured implementation. I would love to see MAXvzr though

mnemnion · November 11, 2024, 5:52pm

Not a bug, it turns out. a* matches nothing, and $ matches the end of the string, so it just tries all the b matches and then matches $. Regex101 shows a match.