Pzre - pragmatic zig regex engine

Hello,
I’ve been working on a regex engine since december with a similar philosophy to re2.
This project initially started because I needed a regex analysis library, but I somehow derailed quite badly and ended up making a matching engine instead

Here are a couple of design philosophies that shaped this project:

  1. Sensible defaults; the default automata should be very small and fast without requiring it to cover all possible use cases. There should be a hierarchy of engines that are used as needed. The default engine shouldn’t even support capture groups
  2. Zig’s comptime should be fully used while optimizing, e.g. minimize integer sizes, infer optimal context lengths, etc
  3. All automata and search algorithms should be highly resilient to untrusted input
  4. The entire API should work the same whether calls are being done at comptime or runtime.
  5. Interesting machine topologies should be included that scale well as the number of compiled machines increase
  6. Compiled machines should be fully immutable and the core matching api fully allocator free

The engine is currently quite usable. I spent quite a bit of time implementing the core of the engine for v0.1.0. I tested it quite thoroughly and thought about as many edge cases as I could, but there are probably still bugs somewhere.

Check it out pzre

I’d be happy to hear your thoughts or receive any feedback

Supported Zig versions

0.16.0

AI / LLM usage disclosure

Very minimal, some tests are generated. I dont use AI autocompletion or code editors

11 Likes

Looks really good. I was looking for a zig regex engine and this seems really well designed.

1 Like

Thanks! I hope you find it useful. There is a bunch of crucial stuff coming I’ve already designed but simply haven’t had the time to write.

I was just thinking if the context allocation could be avoided somehow. Didn’t go into the code yet, just see it as problematic in some cases.

Yeah the default context is a dynamic one that on init allocates its buffers to match the machine it was defined for. When compiling you can either use .fixed or .comptime_fixed contexts that are simple arrays that never perform allocations, so its safe to pass undefined gpa to initContext in that case. There is a wrapper method initContextFixed that does exactly that.

1 Like

I go over context usage in src/showcase.zig. I’d also appreciate any feedback on whether something is confusing or not explained properly

1 Like

Really cool! Excited to try this.

One question/maybe feedback: I notice you that for the context lifecycle you use initContext and destroy. Any reason you didn’t pick deinit to match the std conventions?

1 Like

There was no reason. Thanks for pointing it out. I fixed it

1 Like

Oh! That makes it perfect. I’ll definitely need to dig deeper. This seems the best designed Zig regex library I’ve seen.

1 Like

Version 0.2.0 release (zig version 0.16.0)

This version marks a complete refactor of the compilation pipeline and the API objects that compilation returns. The goal was to move closer to a many-architecture approach, where the system maintains a universe of architectures and is able to dynamically pick the best machine for a given pattern.

The matching API was unchanged and no new architectures were added. In addition to the system being now more correct, from a usage perspective the main thing that changed was the introduction of proper API-types.

The original compile-pipeline was bad and confusing. It ended as such because I originally had issues designing the types so that ZLS could still resolve them and provide autocompletions. I went in circles and ended up with a bad implementation. The old pipeline was deleted completely and designed from the ground-up with a multi-architecture approach which also works for arch-targeting compilation. Additionally a lot of the code got shuffled around because previously a lot of the logic was hard-coded to the minimal_nfa architecture. There is now an architecture abstraction that all implemented architectures have to implement.

I also redesigned the testing suite. The harness is now much more comprehensive and it computes a cartesian product of as many relevant compilation paths and configurations I could think of. The system should be a lot more robust now.

Additionally, the README was rewritten in order to more accurately explain the philosophy of the project. Also there is now a usage reference document which will be kept in sync with the currently implemented features

The API-type design faced various challenges. A major philosophy of the system is to be explicit in what is and is not included in the final binary. The system should not link executable code of architectures the user does not expect to ever use. Additionally, it should be obvious to the user what is and isnt being included.

The design I came up with is to define a proper primitive Regex type that implements the API for a specific architecture. It can be either defined explicitly, and then built, or defined partially and let the system fill in the blanks.

const arch = ArchResolved{
  .minimal_nfa = .{
    .context = .{ .fixed = 64 },
    .offset_bp = .i8,
  },
};

const Re = regex.Regex(arch, .{});

var re = try Re.compile(.{}, gpa, "^abc"); // explicit
defer re.deinit(gpa);

try expectEqual(Re, @TypeOf(re));
const arch = Arch{
  .minimal_nfa = .{ .context = .compact_fixed },
};

// partial -> deduces ArchResolved
const re = comptime regex.compileComptime(arch, .{}, "^abc");

const Expected = regex.Regex(.{
  .minimal_nfa = .{
    .context = .{.fixed = 3}, // deduces required context length of 3
    .offset_bp = .i8,         // deduces smallest offset integer type
  },
}, .{});
try expectEqual(Expected, @TypeOf(re));

This design is useful for users who wish to avoid any kind of bloat in their binary. The problem is that different architectures return different Regex types, so its impossible to refer to them under a unified type name in function signatures etc.

For this purpose, I designed a type-erased AnyRegex which is implemented as a tagged union. Similarly to Regex it is also explicit in the sense that its defined by the set of architectures it can draw from.

const Re = anyregex.AnyRegex(&.{
  .{ .minimal_nfa = .{ .offset_bp = .i8,  .context = .{ .dynamic = .u16 } } },
  .{ .minimal_nfa = .{ .offset_bp = .i16, .context = .{ .dynamic = .u16 } } },
}, .{});
 
var re = try Re.compile(.{}, gpa, "[A-Z][a-z_]+");
defer re.deinit(gpa);

The design is deliberate, as the major problem of the type-erasure approach is the bloat it can cause when all of the generic code from each architecture is linked. With this, the type-erased regex can be defined for the types of patterns you expect, minimizing bloat.

I also included some general-purpose definitions

var fixed = try anyregex.FixedRegex(128).compile(.{}, gpa, "[A-Z][a-z_]+");
defer fixed.deinit(gpa);
 
var dynamic = try anyregex.DynamicRegex.compile(.{}, gpa, "[A-Z][a-z_]+");
defer dynamic.deinit(gpa);

The biggest challenge was to comprehend how the Context type had to be designed for AnyRegex. The most important aspect of the Context is that it needs to be shareable between all instantiations of a given AnyRegex. This means that the context type needs to support all included architectures, essentially turning it into a ManyContext that accesses its fields dynamically as required. I’m unsure whether this is the correct design, there are probably some issues that will come up when the next major architecture is being implemented.

I’d be curious to hear your thoughts