Feature request: regular expressions

Are there any roadmap plans to add a regex module to the Zig standard library?

I’m doing some data cleanup with Zig, and I’m really missing built-in regex support. I’ve seen a few Zig regex libraries floating around, but nothing that looks complete / tested. I eventually settled on a PCRE wrapper, and that’s been OK - but I wish it was more Ziglike.

I think that regex support would be a perfect candidate to include in the stdlib. Regex engines are tricky for an individual user to implement, and having one available could enable a lot of interesting runtime (and comptime) fun.

7 Likes

I absolutely agree that regex capability is eventually a must-have. However, I do not believe that standardizing it is necessarily the best approach. There are several reasons for this and I’ll make a case for just a few:

  1. User contributed libraries are historically quite good in this space as they mature. One of the better things that came out of C++ was the CTRE library. It was a really good use of compile time strings - really fast too. In Zig, we’d be able to do something similar but much better because of inline loops and comptime variables.

  2. Should it be compile time or runtime focused? Regex patterns could be processed at compile time to generate a finite-state automaton that acts as the comparator itself, so no pattern parsing happens at runtime. I would lean towards compile time but that has to be figured out.

  3. What standard of regex? There are several and you’d have to settle on one… we probably should not make our own standard. So which one should the Zig standard library support? Not everyone agrees, after all.
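
To make point 2 concrete, here is a minimal sketch of what "the automaton acts as the comparator" can look like. It is written in Python purely for illustration, and the transition table for the pattern `ab*c` is hand-written (the names `TRANSITIONS`, `ACCEPTING` and `dfa_match` are made up for this example). In a comptime design the table would presumably be generated from the pattern string by the compiler; this sketch only shows the matching side.

```python
# Hypothetical, hand-compiled DFA for the fully anchored pattern `ab*c`.
# States: 0 = start, 1 = seen 'a' followed by zero or more 'b', 2 = accepting.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}
ACCEPTING = {2}


def dfa_match(text: str) -> bool:
    """Return True if the whole of `text` matches `ab*c`."""
    state = 0
    for ch in text:
        # A missing table entry means there is no valid transition: reject.
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING


print(dfa_match("abbbc"))
print(dfa_match("ab"))
```

Matching is then a single pass over the input with one table lookup per character, which is where the speed of compile-time approaches tends to come from.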

Ultimately, there are a lot of choices here given that regex is itself a cluster of languages. Because of that, it’s not clear what is meant by supporting “regex”: we’d still have to determine which dialect to support. Essentially, that means that the Zig Software Foundation would have to make a choice for us and then support that decision indefinitely. It’s a really good way to build up a lot of legacy dependency.

Anyhow, those are just my thoughts - I would be perfectly happy to see regex make its way into the language, but contributing to libraries imo is the best way to do that.

3 Likes

As a counterpoint, stdlib regexes worked out poorly for C++, and out-of-std regexes of Rust were a major success.

4 Likes

One of the things that’s really drawn people to Python is its “batteries included” style of stdlib. One of the things it provides is a really good regex experience right out-of-the-box.

From my perspective, it feels like Zig is aiming for a stdlib that’s a little more targeted, but something that still gives developers a lot of common building blocks. Especially ones that would otherwise require outside dependencies.

Regular expressions are right in that spot, I think. They’re very, very useful for lots of systems programming, which often has a lot of “transforming text from X to Y” type functions in it.

For me, the jump from “no dependencies” to “at least one dependency” is a big change for a program, and regexes are common enough that stdlib support would really cut down on a developer’s dependency burden. Over a software project’s lifespan, external dependencies are one of the most expensive elements.

Re: comptime vs runtime, ideally we’d have one library that works identically for both. It doesn’t have to be the absolute fastest, highest-performing regex library in the world. We could aim for “good enough”, and optimize for consistency and user-experience.

Re: regex standard, you’re right that regexes have a bunch of dialects/standards. We could just pick one. Some common choices that would all be reasonable:

  • POSIX extended regex (aka sed -r / grep -E)
  • PCRE regex (widely used in lots of tools)
  • ECMAScript (JavaScript, and the default grammar of std::regex since C++11)
  • Python regex (for developer familiarity)
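
As a rough illustration of why the choice of dialect matters, the sketch below uses Python's `re` module to show two points where these families disagree. The POSIX behaviour is only emulated (the helper `leftmost_longest` is made up for this example), since Python itself uses Perl-style semantics.

```python
import re

# 1. Alternation. Perl-style engines (Python, PCRE, ECMAScript) try the
#    alternatives left to right and keep the first one that succeeds.
first_wins = re.match(r"a|ab", "ab").group()


# POSIX instead asks for the longest match at the leftmost position.
# Hypothetical emulation: try each alternative at the start of the text
# and keep the longest result.
def leftmost_longest(alternatives, text):
    found = []
    for alt in alternatives:
        m = re.match(alt, text)
        if m:
            found.append(m.group())
    return max(found, key=len) if found else None


longest_wins = leftmost_longest(["a", "ab"], "ab")

# 2. Named groups. Python spells them (?P<name>...); PCRE and modern
#    ECMAScript also accept (?<name>...), which Python's re rejects
#    (true of the CPython releases I am aware of).
year = re.match(r"(?P<year>\d{4})", "2024-01-01").group("year")
try:
    re.compile(r"(?<year>\d{4})")
    angle_syntax_ok = True
except re.error:
    angle_syntax_ok = False

print(first_wins, longest_wins, year, angle_syntax_ok)
```

The same pattern text can therefore match differently, or fail to compile at all, depending on which dialect a stdlib commits to.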

1 Like

The stdlib hasn’t yet been the focus of development so we don’t know what the final philosophy will be. At the moment the stdlib is developed in an organic fashion, mostly based on what’s needed for the compiler itself.

Once we make a decision on what should stay, there will be a moment where a bunch of stuff will be removed from the stdlib.

Just pointing this out so people know what’s probably going to happen.

6 Likes

As a counterpoint, stdlib regexes worked out poorly for C++, and out-of-std regexes of Rust were a major success.

Both true. C++ regexes are pretty bad from a size perspective, because of how much code they pull into your binary with each use. The API and feature-set are fine though, and would hit that “good enough” sweet-spot if the implementation wasn’t so code-hungry.

Rust’s regex library is pretty good. It’s also a first-party library from the core team, I think (based on being in the rust-lang GitHub org). Rust’s ecosystem also kind of forces you to shop for external dependencies and download them with Cargo.

Don’t get me wrong, Cargo is a great tool. But there are a -lot- of lifecycle issues when you think about the web of dependencies/sub-dependencies that can get pulled in. Too far down the hole, and we find ourselves vulnerable to:

  • log4j-style attacks that affect big chunks of the ecosystem
  • leftpad-style disruptions to the software supply-chain
  • taking design tips from the npm ecosystem

A healthy ecosystem is important, and so is a good package manager. But it’s also nice not to need them for something as common as regexes. So - just my two cents - when it comes time to standardize the stdlib, I would love an included regex implementation, big time.

2 Likes

I’m happy to respectfully agree-to-disagree. If Zig had a standardized regex, I wouldn’t complain but I am still in favor of keeping it outside the standard library.

Ultimately, I don’t think our approaches are opposed, however. If a regex feature gets included into the standard library, I am willing to bet it would be based on something that started in the community first and got proposed at some point (kind of like how ranges got into C++).

If you haven’t already, I strongly suggest you become a financial supporter of the Zig Software Foundation. They have to work full time to keep this ship running, so getting features into the standard library and supporting them is highly dependent on having maintainers and a core team that can address this stuff. Just my two cents, but no judgement either way.

4 Likes

I’ve written two regex and transducer libraries. Boring af and the best ones just use a ton of little tricks.
My advice: copy BurntSushi’s (or pay him to do a Zig version).

2 Likes