Hey folks, I’m about to start researching Unicode and regex for “dt”, the language I’ve been working on. (ziggit topic Dt: duct tape for your unix pipes (new CLI tool in Zig))
I haven’t gotten as far as science projects quite yet, but I’m planning to start by looking into zigstr by @dude_the_builder and pcre2-unicode and see if they have similar ideas on where characters begin and end.
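For anyone wondering why "where characters begin and end" is even a question: the answer depends on whether you count bytes, code points, or user-perceived characters. Here is a rough Python sketch of the difference (not dt or Zig code, just an illustration). The `char_counts` helper is a hypothetical name, and its cluster count is only an approximation based on combining classes, not the real UAX #29 grapheme segmentation that a library would implement.

```python
import unicodedata


def char_counts(s: str) -> dict:
    """Count the same string three ways (hypothetical helper, illustration only)."""
    return {
        # Number of bytes in the UTF-8 encoding.
        "utf8_bytes": len(s.encode("utf-8")),
        # Number of Unicode code points (what Python's len() counts).
        "code_points": len(s),
        # ASSUMPTION: approximate user-perceived characters by counting only
        # code points with canonical combining class 0. Real grapheme
        # segmentation (UAX #29) has many more rules (ZWJ, Hangul, etc.).
        "approx_clusters": sum(1 for c in s if unicodedata.combining(c) == 0),
    }


decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

print(char_counts(decomposed))
print(char_counts(composed))
```

Both strings render as the same "é", yet they differ in byte and code point counts, so two libraries only agree on boundaries if they agree on which of these units they mean.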
Zigstr is a UTF-8 string type with lots of methods inspired by other languages, but if it’s Unicode functions that you want, I recommend Ziglyph instead (which Zigstr makes heavy use of). I also made a toy regex engine that parses the regex at comptime using a Pratt parser and compiles it to a Ken Thompson-style regex-matching VM. It’s The Extremely Opinionated Regex Engine (theoree). It’s not heavily tested and surely has a gazillion bugs, but maybe there’s some code in there to help inspire you.
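In case the "Thompson-style VM" idea is unfamiliar, here is a minimal Python sketch of it. To be clear, this is not theoree's code and not Zig: the instruction names (`char`, `split`, `jmp`, `match`), the `full_match` function, and the hand-assembled program for `a+b` are all assumptions made for illustration, and the parsing step is skipped entirely. The point it shows is that every possible thread advances in lockstep over the input, so there is no backtracking.

```python
# Instructions are tuples: ("char", c), ("split", x, y), ("jmp", x), ("match",)


def add_thread(prog, pc, threads, seen):
    """Add pc to the thread list, following jmp/split eagerly.

    After this, the list only holds pcs of "char" or "match" instructions.
    The seen set keeps each pc from being added twice in one step.
    """
    if pc in seen:
        return
    seen.add(pc)
    op = prog[pc]
    if op[0] == "jmp":
        add_thread(prog, op[1], threads, seen)
    elif op[0] == "split":
        add_thread(prog, op[1], threads, seen)
        add_thread(prog, op[2], threads, seen)
    else:
        threads.append(pc)


def full_match(prog, text):
    """Return True if the whole text matches the program."""
    current = []
    add_thread(prog, 0, current, set())
    for ch in text:
        nxt, seen = [], set()
        for pc in current:
            op = prog[pc]
            # Only threads sitting on a matching "char" survive this step.
            if op[0] == "char" and op[1] == ch:
                add_thread(prog, pc + 1, nxt, seen)
        current = nxt
    # Accept only if some thread reached "match" exactly at end of input.
    return any(prog[pc][0] == "match" for pc in current)


# Hand-assembled program for the regex a+b (what a compiler would emit).
A_PLUS_B = [
    ("char", "a"),    # 0: consume one 'a'
    ("split", 0, 2),  # 1: either loop back for another 'a' or move on
    ("char", "b"),    # 2: consume 'b'
    ("match",),       # 3: success
]

print(full_match(A_PLUS_B, "aaab"))
print(full_match(A_PLUS_B, "b"))
```

Since each step visits every instruction at most once, the running time stays proportional to program size times input length, which is the main appeal of this design over a backtracking matcher.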