Version 0.2.0 release (zig version 0.16.0)
This version marks a complete refactor of the compilation pipeline and the API objects that compilation returns. The goal was to move closer to a many-architecture approach, where the system maintains a universe of architectures and is able to dynamically pick the best machine for a given pattern.
The matching API was unchanged and no new architectures were added. In addition to the system being now more correct, from a usage perspective the main thing that changed was the introduction of proper API-types.
The original compile-pipeline was bad and confusing. It ended as such because I originally had issues designing the types so that ZLS could still resolve them and provide autocompletions. I went in circles and ended up with a bad implementation. The old pipeline was deleted completely and designed from the ground-up with a multi-architecture approach which also works for arch-targeting compilation. Additionally a lot of the code got shuffled around because previously a lot of the logic was hard-coded to the minimal_nfa architecture. There is now an architecture abstraction that all implemented architectures have to implement.
I also redesigned the testing suite. The harness is now much more comprehensive and it computes a cartesian product of as many relevant compilation paths and configurations I could think of. The system should be a lot more robust now.
Additionally, the README was rewritten in order to more accurately explain the philosophy of the project. Also there is now a usage reference document which will be kept in sync with the currently implemented features
The API-type design faced various challenges. A major philosophy of the system is to be explicit in what is and is not included in the final binary. The system should not link executable code of architectures the user does not expect to ever use. Additionally, it should be obvious to the user what is and isnt being included.
The design I came up with is to define a proper primitive Regex type that implements the API for a specific architecture. It can be either defined explicitly, and then built, or defined partially and let the system fill in the blanks.
const arch = ArchResolved{
.minimal_nfa = .{
.context = .{ .fixed = 64 },
.offset_bp = .i8,
},
};
const Re = regex.Regex(arch, .{});
var re = try Re.compile(.{}, gpa, "^abc"); // explicit
defer re.deinit(gpa);
try expectEqual(Re, @TypeOf(re));
const arch = Arch{
.minimal_nfa = .{ .context = .compact_fixed },
};
// partial -> deduces ArchResolved
const re = comptime regex.compileComptime(arch, .{}, "^abc");
const Expected = regex.Regex(.{
.minimal_nfa = .{
.context = .{.fixed = 3}, // deduces required context length of 3
.offset_bp = .i8, // deduces smallest offset integer type
},
}, .{});
try expectEqual(Expected, @TypeOf(re));
This design is useful for users who wish to avoid any kind of bloat in their binary. The problem is that different architectures return different Regex types, so its impossible to refer to them under a unified type name in function signatures etc.
For this purpose, I designed a type-erased AnyRegex which is implemented as a tagged union. Similarly to Regex it is also explicit in the sense that its defined by the set of architectures it can draw from.
const Re = anyregex.AnyRegex(&.{
.{ .minimal_nfa = .{ .offset_bp = .i8, .context = .{ .dynamic = .u16 } } },
.{ .minimal_nfa = .{ .offset_bp = .i16, .context = .{ .dynamic = .u16 } } },
}, .{});
var re = try Re.compile(.{}, gpa, "[A-Z][a-z_]+");
defer re.deinit(gpa);
The design is deliberate, as the major problem of the type-erasure approach is the bloat it can cause when all of the generic code from each architecture is linked. With this, the type-erased regex can be defined for the types of patterns you expect, minimizing bloat.
I also included some general-purpose definitions
var fixed = try anyregex.FixedRegex(128).compile(.{}, gpa, "[A-Z][a-z_]+");
defer fixed.deinit(gpa);
var dynamic = try anyregex.DynamicRegex.compile(.{}, gpa, "[A-Z][a-z_]+");
defer dynamic.deinit(gpa);
The biggest challenge was to comprehend how the Context type had to be designed for AnyRegex. The most important aspect of the Context is that it needs to be shareable between all instantiations of a given AnyRegex. This means that the context type needs to support all included architectures, essentially turning it into a ManyContext that accesses its fields dynamically as required. I’m unsure whether this is the correct design, there are probably some issues that will come up when the next major architecture is being implemented.
I’d be curious to hear your thoughts