Regex: Comptime-Generated Stateless Parsing Trees and Iterators

AndrewCodeDev · April 28, 2024, 10:10am

Hello again, I’m back again with more updates

I just did a big update to fluent implementing comptime regular expressions and our first iterator that uses them - MatchIterator.

Edit: More regex utilities have been added since this first post such as split and fluent string algorithms.

Regular expressions follow the PCRE syntax. This update has a full regex parser and finite-state-automaton generator that builds parsing trees at comptime. All parsing trees are stateless and have @sizeOf(T) == 0.

To use Regex, just look for the fluent.match function and provide your expression string and source string: fluent.match("[abc]\\d+", str)

It’s an on going process, but it’s at a point where people can start using them and be on the lookout for more updates! So far, here’s what’s been implemented:

Special Characters:

\d - digits
\D - no-digits
\w - alphanumeric
\W - no-alphanumeric
\s - whitespace
\S - no-whitespace
 . - any character
^ - starts with
$ - ends with

Quantifiers:

+ - one or more
* - any quantity
? - none or one
{n} - exactly n
{m,n} - between m and n (inclusive)

Operators:

| - alternation (or clause)
() - capture group
[] - character set
[^] - negated character set
[a-z] - character spans
(?=) - positive look ahead
(?!) - negative look ahead

Examples


    { // match special characters (typical) - one or more
        var itr = match("\\d+", "123a456");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "123");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "456");
        try std.testing.expect(itr.next() == null);
    }
    { // match special characters (typical) - exact
        var itr = match("\\d{3}", "123456");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "123");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "456");
        try std.testing.expect(itr.next() == null);
    }
    { // match special characters (typical) - between
        var itr = match("\\d{3,4}", "123456");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "1234");
        try std.testing.expect(itr.next() == null);
    }
    { // match special characters (inverse)
        var itr = match("\\D+", "123a456");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "a");
        try std.testing.expect(itr.next() == null);
    }
    { // pipe-or clauses
        var itr = match("abc|def", "_abc_def_");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "abc");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "def");
        try std.testing.expect(itr.next() == null);
    }
    {
        var itr = match("(a+bc)+", "_aaabc_abcabc_bc_abc_");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "aaabc");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "abcabc");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "abc");
        try std.testing.expect(itr.next() == null);
    }
    { // character sets (typical)
        var itr = match("[a1]+", "_a112_21aa112_a_1_x_2");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "a11");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "1aa11");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "a");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "1");
        try std.testing.expect(itr.next() == null);
    }
    { // character sets (negated)
        var itr = match("[^a1]+", "_a112_21aa112_a_1_x_2");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "_");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "2_2");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "2_");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "_");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "_x_2");
        try std.testing.expect(itr.next() == null);
    }
    { // character sets (negated)
        var itr = match("[^\\d]+", "_a112_21aa112_a_1_x_2");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "_a");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "_");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "aa");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "_a_");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "_x_");
        try std.testing.expect(itr.next() == null);
    }
    { // character sets (compound)
        var itr = match("[abc]\\d+", "_ab112_c987b123_d16_");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "b112");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "c987");
        try std.testing.expectEqualSlices(u8, itr.next() orelse unreachable, "b123");
        try std.testing.expect(itr.next() == null);
    }

Edit: split and match iterators have been added and string backend no supports regex.

There’s also many optimizations that are planned too. One example is branch-fusion where we combine multiple branches and/or remove any redundant branches. We also need to do performance testing and work with the generated assembly to get it exactly like we want. All in good time…

Thanks

nyc · April 28, 2024, 10:18am

This is so useful, I think it should be a seperate library.

AndrewCodeDev · April 28, 2024, 10:21am

Thanks - I definitely thought about that. The thing is, it deals with ranges and iterators and we’re going to fill out the backends quite a bit on other fluent-objects so I decided to group them together.

They work really well together and you can always opt out of the fluent interface (same with iterators) and just use the regex engine directly. I’ll always make sure that you don’t have to jump through hoops to just use the pieces that you want

Plus, I’m moving towards making a more “completely functional” string object that does most of what I want under one banner and that combines regex and the fluent interface. Thanks for the suggestion though, I’m not ruling it out!

AndrewCodeDev · April 30, 2024, 4:57am

Quick update here -

Inclusive character character spans are now supported:

Any valid span of characters is supported requires: (a < b)
example: [a-zA-Z] matches a through z or A through Z.
example: [0-5] will match digits 0 through 5

Comptime analysis for regex strings is also online. This checks for invalid strings like: a+a, where the second a will never match to anything as all prior a’s were consumed by a+. These now result in compiler errors that tell you what string sub-string was invalid.

~~There are still more checks that need to be put in (like .*a for the same reasons).~~

Edit: This was fixed with the recent change for back-tracking. These strings compile fine now.

I also need to put in checks for redundant characters in sets such as [aaaa].

I want to encourage anyone that finds an issue or can think of an invalid regex string, please post it here or make an issue on the github so I can correct those. Your help is greatly appreciated.

Also, big thanks to @dude_the_builder for reviewing the regex matching and discovering issues with invalid quantifier sequences. You are, in fact, the dude.

Thanks everyone!

gonzo · April 30, 2024, 11:46am

Idea / suggestion: is there such a thing as a large set of test cases for regexes that you could… er… “borrow” from? Maybe from libpcre?

AndrewCodeDev · April 30, 2024, 12:04pm

Good question/idea - Pierre is in charge of setting up our testing suite. I’m currently working on getting the remaining regex features implemented (hoping to be done by the end of this week) so we may have to mine some places for more regex testing examples.

I’ll do a little thinking outloud here…

We are making certain strings compiler errors that other regex engines enable by altering your regex under the hood. There are a few of those that need to be enabled (such as \".*\" which matches everything between two quotes… technically .* matches quotes so I need to implement give-back). Meanwhile, strings like a+a are a hard stop for us while other engines reinterpret those or step back the matches until they find a match.

I like your idea, and we’ll need to dig for examples that are not hard compile errors for us.

dude_the_builder · April 30, 2024, 12:16pm

Here’s Go’s regex tests. Quite exhaustive. https://github.com/golang/go/blob/master/src/regexp/all_test.go

dude_the_builder · April 30, 2024, 12:20pm

Backtracking in this case could be avoided if you detect these cases (atom_x → .* → atom_x) and turn it into "[^"]" (atom_x → not atom_x → atom_x). But you need negated character classes first for that. :^)

AndrewCodeDev · April 30, 2024, 12:24pm

The good news is we do have negated character classes :^)

That’s an interesting take though - definitely more optimal than walking back through the string. I’d like to find a syntactical transform to handle that… something like…

a.*b -> a[^b]b

Any ideas on handling something like:

[0-9]+8

In this case, we could match something like 1873928 but not 1873927

dude_the_builder · April 30, 2024, 12:43pm

In this case, since there’s a requirement to match at least one digit as the first character, I think you safely turn this into [0-9][^8]*8? Don’t you get the feeling in the back of your mind that there’s some hidden pattern here? But I’m not sure. Can’t nail it down.

AndrewCodeDev · April 30, 2024, 12:49pm

I get that feeling for most things I work on, lol.

I think that string would produce something different for a case like “08888888888” because in that case, I’m getting:

[0-9][^8]*8 -> 08 88 88 88 88

instead of:

[0-9]+8 -> 0888888888

I think a safe solution is to implement backtracking and as more optimal solutions appear, we can implement them. I like the .* transform though, so I can work on getting an optimizer put in to start catching those cases and get a generic backtracker installed in the RegexAND branch because that’s the only one that struggles with these matching cases.

dimdin · April 30, 2024, 12:58pm

google re2 test suite: re2/re2/testing at main · google/re2 · GitHub

dude_the_builder · April 30, 2024, 1:04pm

~~I think you’re getting those matches because your engine is “non-greedy”, matching the least amount of characters each time. If you test it on~~ regex101.com (really great resource for this) ~~you see the result~~

08 88 88 88

Note the last 8 doesn’t match because there’s nothing after it. ~~In this case, the engine is in greedy mode, matching as much as it can in one go.~~

EDIT: I didn’t notice it is actually matching 4 times just like in your results. So never mind.

AndrewCodeDev · April 30, 2024, 1:09pm

Gotcha, I was using https://regexr.com/ and they had it set to “ungreedy” by default and that also an issue here to be considered. I was testing it on their framework.

Either way, still a good thing to be aware of.

Addendum… instead of doing [^x], since I control the tokenizer, I can just set that token to negated and keep it as ~x* and the tree-builder will pick that up. So we don’t need to go through any string transformations. That’s convenient.

Thus .*x -> ~x*x

JPL · April 30, 2024, 6:03pm

hello, I have a test suite for Regex,
I will test

will this conform to PCRE2-Posix standard

AndrewCodeDev · April 30, 2024, 6:07pm

Thank you!

We are about a week out from being ready for a full-fledged test. I still need to implement backtracking. Either way, I appreciate your time and I’m interested in your results!

Can you post your results as an issue to the github page?

I’m trying to match the functionality of the CTRE library: GitHub - hanickadot/compile-time-regular-expressions: Compile Time Regular Expression in C++

And as such, we will not be supporting the following (from that link):

callouts
comments
conditional patterns
control characters (\cX)
match point reset (\K)
named characters
octal numbers
options / modes
subroutines
unicode grapheme cluster (\X)

We also don’t have negative/positive lookahead implemented but I’d like that in a future release - I need to add that to the unimplemented section.

I’ll post a more complete list of what we are supporting currently.

AndrewCodeDev · May 3, 2024, 4:29pm

Update time.

First, thank you to @JPL for writing several great tests for us to work against. They’ve been very helpful and I’m quite pleased with the results. Same to @pierrelgol for building another fleet of tests to make life easier when we refactor. Good work!

So what’s new…

SPLIT

There’s a new iterator in town - split iterator. Like the name suggests, it works like a typical splitting algorithm but uses regex to split the string.

Split iterators are easy to use and work like a standard split:

var itr = Fluent.split("[\\v]+", string);

FLUENT STRINGS

Strings are being updated to work with regex and scalar options instead of sequence or any. This gives us some powerful functions like:

trim - remove anything from left, right, or both using regex or scalar
contains - check for patterns or a scalar in a string
find - get the starting location of a pattern or scalar

I’m still debating on how I want right-side count to work. We can split and then do a reductive count or keep retrimming the right-side and count backwards (that seems slow though). We’re still working out some of the details about the API and maybe cutting back on functions that don’t have particularly compelling use cases.

Tokenize is also getting reviewed to decide how we want that to work. The standard tokenize acts sort of like a greedy-split iterator that consumes delimiters. That said, we already have that with regex by default and split handles most of those cases. Match also already provides facilities to directly extract the strings you want, so I’m open to suggestions here. It may make sense to cut back on the number of count functions we have…? Again, feel free to pitch in ideas here about what you actually want.

More Regex

^$ tokens - check if a string begins with or ends with a match.

These can be composed with other checks to dynamically swap out the requirement to enable some really cool patterns like the following which matches different based on the location of a string: (\W|^)pattern(\W|$) will check for a pattern either at the beginning, end, or not directly preceded by word characters.

We are still planning on implementing positive and negative look-ahead, but that’ll hopefully be in the next update.

As always, thanks!

AndrewCodeDev · May 8, 2024, 2:27pm

Update - positive and negative look ahead have been added.

You can see if something is directly followed by or starts on some pattern. For instance, this matches word characters followed by spaces:

"\\w+(?=\\s)" // every word-character followed by a space
"\\w+(?!\\s)" // every word-character NOT followed by a space

Also, I want to point out that you can run fluent-regex at comptime as well as runtime. In this toy example, we’ll decide the type of something based on a string match:

fn foo(string: []const u8) bool {
    comptime {
        return Fluent.init(string).contains(.regex, "\\w+");
    }
}
// later...
const x: if (foo("a")) usize else u8 = 1;

Even the Regex-Iterators can be used at comptime, too (match and split). I encourage people to give it a shot and tell me how it goes

nyc · May 9, 2024, 12:40am

I will never forgive Larry Wall for what he did to regular expressions. He truly did ruin what I loved. I loved my formal languages class and my first job was writing a regex transducer engine for one of the most advanced search engines at the time (not goog) that had to simultaneously match 100k expressions and tell you which pattern and where the match occurred (on every single query). I got very good at NFA and DFA tricks.

how do you make a fast regex engine - you don’t support half that nasty stuff Perl did that made then CFGs and not regular.

Looking much more advanced since last time i saw it.

AndrewCodeDev · May 9, 2024, 1:35am

Gettin’ there. And yeah, PCRE has more than I care to incorporate at this moment. I’m starting to feel happy with the feature set though. I just need to put in word-boundary characters and I’ll have the feature list I originally aimed for.