The Zitron LALR(1) Parser Generator

It gives me great pleasure to announce the release of Zitron. This is a parser generator based upon a port of Lemon, the tool used in SQLite to parse SQL queries, written by the same author, DRH, a legend if ever there was one.

Zitron is quite extensively documented, and ready for action. Additional bells and whistles may well be added, but to indicate its state of readiness: included is a complete Zig port of Lemon, shown to produce byte-identical output when given several grammar files, including that of SQLite itself. Those are in C, naturally; Zitron has been adopted from this base to emit Zig, with some other affordances.

Look forward to seeing what you make of it, and perhaps with it.

25 Likes

This artifice, much like the EU, is guaranteed to please no one.

LOL. And yacc shaving (I didn’t realize it was spelled that way) is art. Zero shame. Would be curious about existing and future “changes” hinted, though - perhaps some lean the work away from shaving and toward an alternate motive?

2 Likes

It generally isn’t :slight_smile:

Where the future is concerned, I would rather show then tell.

As to differences from Lemon, first of all, it emits Zig, which has consequences. Lemon parsers rely extensively on macro preprocessing, an option I did not have. I found a project golemon, which is itself pretty remarkable: the author ported Lemon to Go, made it emit Go code, then ported Pikchr to emit Go code using golemon, all so that he could embed Pikchr diagrams easily in his Hugo-based blog.

That was helpful in establishing a strategy to defeat the macro monster. What Zig does have is comptime, so macros used for conditional inclusion were just translated to comptime if statements, and textual inclusion macros were replaced with a HashMap which finds bits of the template and replaces them with the needful.

There were further changes needed to make it possible to use try inside user code, which I consider a requirement. defer to the rescue!

None of that is user-visible, however. In terms of what is, I completely reworked options parsing to behave in a more familiar manner, and rigged it so that command-line options can also be supplied as build options, since Zig projects will presumably want to use Zitron from the Zig build system, a procedure I have documented.

There are a number of special directives which are Zig-specific, C-specific build flags removed, and the like.

I also added a ‘ditto rule’ as syntax shorthand:

expr(A) ::= expr(B) PLUS expr(C). { ... }
  ``        expr(B) TIMES expr(C). { ... }
  ``        LPAREN expr(E) RPAREN. { ... }
  ``    ::= VALUE(A).

As an alternative to repeating expr(A) for each production (which is also permitted). Aesthetically I would suggest either including the optional ::=, or not, rather than both as in the example. That’s just to show that both are possible.

The named actions, and %impl directives which use them, are also novel. I’m pretty happy with how they turned out :slight_smile:

1 Like

Holy smokes; way-too-cool. Why is the density of witty transformations like “yak shaving” and “Heisenburg” (heisenbug, in another thread) so high in ziggit? (rhetorical)

Lovely, lovely project. I have been deeply in love for the longest time with lex, yacc, LALR parsing, GLR parsing and lemon. Thank you for putting in the time to do this.

One thing I have always wondered (and never have made time myself to do it) is why nobody has attempted to implement something like lemon but, instead of generating source code to be compiled, simply output the semantic information in a more universal format (JSON or JSON-like), then have a separate utility that can process that information and generate code for many different languages, or even use it at run-time, without generating any code, to create the parser – I think Raku does something like this.

Anyway. Thank you again. I will dive into the codebase out of pure joy.

2 Likes

Thank you for your kind words!

It’s been on my mind. Lemon (and therefore, Zitron) generates an SQL dump of some of the internal parser state. I don’t know the reason, but the flag to make it happen is present when SQLite (well, make) calls Lemon.

As it stands currently, this is just the symbols and rules. I have in mind to enhance the output with the rest of the parser’s internal state, the State, Config, and Action structures.

There’s also the .out file, which has most of what you’re looking for, it’s just formatted in a human-readable way for diagnostic purposes. Making it machine-ingestible would be tractable, SQLite seems like a good intermediate form to me since it can be queried and so on, and exporting from SQLite tables into whatever one pleases is a well-trodden road. A search for ReportOutput will turn that up.

One of the notions in adding the “named action” feature, which separates the grammar definitions from the code, was that it should make it easier to reuse the grammar itself in more contexts. Moreso for generating more than one kind of Zig program, but also as templates for what another parser has to do to get the same effect. We’ll see how that turns out.

2 Likes

Porting Lemon to Zig is what I had shelved in bucket list for some time. I’m really glad to see this implemented :grin:

I used Lemon in combination with re2c a lot when I needed to implement a simple language, and when I started to use Zig instead of C and C++, I really missed this piece of software. I also had an idea of implementing parser generator with Zig comptime, using struct fields to define grammar, but didn’t put much effort in to this (maybe someday). I will definitely take a look at Zitron next time I need to implement language parser.

1 Like

There was a very cool project like that just recently. Not exactly the same idea, it uses inlining to specialize a VM definition of a PEG into a parser state machine at comptime. I was impressed.

I have a vague idea of how to write a Zitron template which would do the same basic thing with LALR tables, but I haven’t tried it yet. For one thing I want enough grammars and test data such that I could benchmark them against each other and see if it helps at all.

1 Like

This is an awesome project! Having a yacc like parser generator in zig really extends the fundamental zig tool kits.

I skim through the doc a bit. I really like how you use aliases instead of $$/$1/$2/etc to reference rule terms in the action block. One question, do the aliases have zig types? E.g. when I do a ‘A = B + C’ in an action block, do they have to have the same int type? Also can I do ‘A = B.field1 + C.field2’?

Another question, can this build an AST directly? I vaguely remember yacc has a separate library for building AST.

Also how do you pass a context and an allocator to each action block? Is each action block technically a zig function or just a case block in a big switch statement?

Also does it support some form of inherited attribute, allowing passing attributes from top down to lower rules. I know lalr is basically bottom up and the action blocks are doing synthesized attributes. Just curious if attribute like scope can be passed down during parsing.

Hopefully not too many questions. This is exciting.

That’s DRH’s good idea, it’s one of the reasons I went with Lemon as a base.

It’s pure textual substitution, the types are dictated through the use of %type and %token_type directives. It does work to do B.field, A.?.val, and that sort of thing. I even made it avoid expanding aliases in comments and strings, which Lemon does not do. One can always pick an alias which won’t collide, but. still.

To illustrate, if you have a directive like this:

%type foo { []const u8 }

Then a rule like

foo(C) ::= thing(B) TOKEN other_thing.  @foo_thing(C; B)

Will expect C, in the %impl of @foo_thing, to be a []const u8. The rules have types, the aliases are just identifiers.

It triggers code when a rule reduces. Building an AST is a fairly straightforward application of that, but the code to do that has to be written, there’s no automatic version of it.

They become fields on Parser, or whatever else you want to call it with a %name directive.

One can certainly use an %extra_context to do things like this, yes. Although ‘top down’ is a bit challenging because, well, LALR is bottom-up. There are no on-shift actions, if you really dig into what’s going on with a LALR parser, especially rule coalescence and action compression, you’ll see why: by the time it’s all digested, the parser doesn’t have a firm grasp on what rule it’s reducing until it reduces.

In short, there’s no formal attribute grammar mechanism going on here, as neat as that would be. You’d have to fake it.

1 Like

Zitron now has a functional Tree-sitter grammar, such that Zig code will look like Zig.

4 Likes

Thanks for the thoughtful answers. Zitron looks great. I’ll reach for it when I have the need of building a parser.

1 Like

The tree-sitter grammar seems to be missing a license.

1 Like

Coming right up

Edit: Sorted. I generally start repos with a script, but they’re language-specific and I don’t have one for… everything Treesitter is, so it slipped through the cracks.

While I’m here: there’s a new release of Zitron, which makes the %impl grammar a little nicer, and which also won’t crash under certain malformed inputs. If anyone happens to be using Zitron already (yay!) this is a low-priority update, which does not affect already-correct programs, or the output.

And we’re up to 1.3, this is a codegen bug and updating is recommended.

The case where a rule-type has a %destructor defined, but no code action which captures it in at least one reduction of that rule type, was not handled correctly, resulting in Zig code which didn’t compile. This has been rectified.

Release v0.1.4 flips the sense of a Boolean, with the result that error recovery is normally attempted.

Prior to this, a syntax error which is caused by the unexpected end of input would enter an infinite loop. This is no longer the case.

As a general note: Zitron is based on mature production-grade technology, but is itself a zero-point-one, and is still very much in the shakeout phase with respect to the code which translated lemon.zig into zitron.zig. I’m in the process of writing a few demonstration parsers to cover the main surface, but there’s just a lot of logic in there and the process will take time. As you can see, I’m still hitting ‘obvious’ edge cases like syntax error + end of input.

So, stay tuned! Or if you’re feeling adventurous, recognize that using Zitron in its present state means participating in its development.

Edit: Which brings us to v0.1.5.

1 Like

v0.1.6 fixes a condition where “impl” code blocks were not being detected soon enough to prevent certain rules from being conflated during table compression.

The result for certain grammars would be a parser which correctly recognizes the strings, but for which some user code would be written into the parser, yet unable to be triggered, because the state it corresponds to was merged with another.

This is no longer the case.

After this update I won’t post more v0.1.n patch releases, which aren’t all that interesting in the grand scheme of things.

Anyone who wants such updates can subscribe only to releases, it’s Watch :eye: → Custom → :white_check_mark: Releases.

I expect to be back with news of somewhat greater general interest, in the relatively near future.