General Lexer Questions

Probably too general of a question, for which I apologize, but I am hoping to write a lexer in Zig. In thinking of how it would be tackled, say you have a graph:
A → B
A → C
B → D
B → E
C → D
C → E

where A, B, C, D, and E are not characters, but representations of nodes in the graph (which could themselves contain a graph within).

First would be to consume characters until some delimiter and compare against “A”, right? If that doesn’t match, fail. But, if it does, we then need to see if the next “token” matches “B” or “C”.

So then you check “B”. If it matches, good; then you move on to checking whether “D” matches. If it does, great; if it doesn’t, check “E”; otherwise it fails.

Here is where my first question arises: before any section of the graph with parallel nodes, is one meant to “save” the position to jump back to? I.e., in the aforementioned example, say traveling A → B → D and A → B → E didn’t match. One would then have to restart from A → C, right?

My second question is: can it be taken as an assumption that any well-defined grammar with these parallel nodes ensures that those parallel nodes are unique? What I mean is, is it impossible for a string to satisfy multiple routes?

If the two things above are true: I want to minimize stored state, as I don’t care about the knowledge gained throughout the processing of a statement, just that it abides by the grammar.

I would need my current position. I fear I need an unbounded stack to store all the “stop points” along the graph to go back to. I think that’s it? I wish there were a way to not require the whole “history”.


It’s not too general. We all have to learn from somewhere :grinning_face_with_smiling_eyes:. I think, and maybe I’m wrong, that your question is not exactly about lexing but also a bit about parsing. So I will try to expand a bit on that.

If you implement the graph literally and try paths one by one, then yes, you need a way to backtrack. This is essentially a depth-first search on your grammar graph. However, this is rarely how production lexers work because backtracking is expensive and can require significant state.
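For concreteness, here is a hedged Python sketch of that literal depth-first approach on the graph from the question. The dict encoding, and the idea of matching node names directly against tokens, are assumptions made purely for illustration:

```python
# Hypothetical encoding of the example graph A -> {B, C}, B -> {D, E}, C -> {D, E}.
GRAPH = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["D", "E"],
    "D": [],
    "E": [],
}

def matches(tokens, node="A"):
    """Return True if `tokens` spells a complete path through GRAPH from `node`."""
    if not tokens or tokens[0] != node:
        return False
    rest = tokens[1:]
    if not rest:                       # consumed everything: accept only at a leaf
        return not GRAPH[node]
    # Try each successor in turn; on failure we implicitly "jump back" here.
    # The recursion stack is exactly the unbounded stack of stop points.
    return any(matches(rest, nxt) for nxt in GRAPH[node])

print(matches(["A", "C", "E"]))   # True
print(matches(["A", "D"]))        # False
```

The recursion makes the backtracking implicit: every pending `any(...)` frame is one of the “stop points” you were worried about having to store.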

The “trick” is to make this nondeterminism deterministic (formally, converting the NFA to a DFA) by basically saving the history of each state into itself: each deterministic state represents the set of nondeterministic states you could currently be in. Then you only need to save the current state and your position in the input, keeping memory constant.
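A minimal sketch of that trick on the toy graph from the question: instead of trying one path at a time, advance the whole *set* of nodes you could currently be at, in lockstep. Memory is then bounded by the number of grammar nodes, not the input length. The encoding here is an illustrative assumption, not how a real lexer generator stores its tables:

```python
# Same hypothetical graph as in the question.
GRAPH = {"A": ["B", "C"], "B": ["D", "E"], "C": ["D", "E"], "D": [], "E": []}

def accepts(tokens):
    current = {"A"}                              # all places we could be at once
    for i, tok in enumerate(tokens):
        current = {n for n in current if n == tok}   # keep nodes the token matches
        if not current:
            return False                         # no candidate survived: reject
        if i + 1 < len(tokens):
            current = {s for n in current for s in GRAPH[n]}  # step to successors
    return any(not GRAPH[n] for n in current)    # accept only if a leaf survived

print(accepts(["A", "B", "E"]))   # True, and we never had to backtrack to try C
```

Each distinct value of `current` corresponds to one DFA state; the subset construction just precomputes all of them ahead of time.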

Formally, you’re asking about ambiguous grammars. A kind of famous one that is easy to grasp is the dangling else problem. Generally, programming language grammars try (or at least should try) to be unambiguous, meaning that code can only mean one thing. But if you do have an ambiguous one, you need a rule to break ties, like choosing the longest match or operator precedence.


I think it would help for you to look a bit into automata theory and the Chomsky hierarchy. In general, as I think is quite intuitive, the more complex a grammar/language is, the harder it is to deal with:

  • regular languages can easily be recognized by a finite state automaton. They don’t require any memory of the past beyond the current state. But they also can’t handle things like nested parentheses.
  • context-free languages require a stack for the history (pushdown automata). This is the tier most programming languages are at, and they are also quite simple to implement. For instance, T(a) in Zig is always a function call, nothing else.
  • context-sensitive languages are harder because the meaning of a symbol depends on the symbols around it. Here T(a) in C++ can mean a bunch of things: a function call, a cast.
  • recursively enumerable languages are not really feasible for parsing work.
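To make the difference between the first two tiers concrete: nested parentheses are the classic construct a finite automaton cannot recognize (it has a fixed number of states, so it cannot count arbitrary depth), while a pushdown automaton handles them trivially. A small Python sketch; for balanced parentheses the stack degenerates into a depth counter, since there is only one kind of stack symbol:

```python
def balanced(s):
    """Check that every ')' closes a previously opened '('."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1            # push
        elif ch == ")":
            depth -= 1            # pop
            if depth < 0:         # closed more than we opened
                return False
    return depth == 0             # accept only with an empty stack

print(balanced("((()))"))   # True
print(balanced("(()"))      # False
```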

If you have more questions just ask. I hope this helps you somewhat.


I second @pzittlau’s suggestion to delve into a bit of theory, which will allow you to understand why things are usually implemented the way they are.

Regarding ambiguity, keep in mind that it is closely related to how much look-ahead you are willing to allow your parser. Some grammars are unambiguous with just knowledge of the current token (so, zero look-ahead), some require one token of look-ahead, and some may require more. This is why you also have parsers that end up producing all possible parses for a given string – some strings are ambiguous for a given grammar, no matter what.
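One way to picture zero vs. one token of look-ahead: in an LL(1)-style parser, peeking at a single token is enough to commit to a production, with no backtracking ever needed. A hedged Python sketch with entirely made-up rule and token names:

```python
# Hypothetical FIRST table: which token begins each production.
FIRST = {
    "if": "if_stmt",
    "while": "while_stmt",
    "return": "return_stmt",
}

def select_rule(next_token):
    """Peek at one token of look-ahead and commit to a production."""
    rule = FIRST.get(next_token)
    if rule is None:
        raise SyntaxError(f"no production starts with {next_token!r}")
    return rule

print(select_rule("while"))   # while_stmt
```

If two productions could start with the same token, this table would have a collision, and you would need either more look-ahead or a tie-breaking rule.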

You could start reading about this topic here, skipping all the non-computer science stuff. Good luck!


From what you describe, your question is about parsing, not about lexing.

The lexer reads text (or binary data, but usually text) and creates a stream of tokens like keywords, symbols, identifiers, number and string literals.

The parser reads this token stream and tries to “understand” it, by transforming it into an AST, byte code, machine code, or an intermediate representation.

From a practical perspective, the keywords and symbols, together with the grammar, help the parser to quickly detect which constructs are possible at a given location.

For many programming languages, the grammar is developed carefully such that the parser requires zero or only one token lookahead.

This restriction of the grammar (no need to look ahead far) also helps us humans to understand the text.

If your grammar is ambiguous, then implementing a parser is more difficult/slow, and it also is more difficult for humans to read.


@pzittlau Yeah, it might be about both. My context is writing a validator, i.e. showing that a string corresponds to a given grammar and, if it diverges, in what way.

I see, so DFS would be a naive implementation.

I don’t really understand what you mean by saving history into itself, but will look into it.

Regarding ambiguous grammar, if it’s guaranteed to not be ambiguous, I assume that means there is always just a single right path through the grammar such that you also avoid DFS.

By those definitions of languages, and some Googling, it seems SQL is not context-free, so presumably it is context-sensitive.

@gonzo I think part of what I’m trying to wrap my head around are two different types of look-ahead. I.e., “character” look-ahead (e.g. say we had two keywords, SAME and SOME; I need at least one character of look-ahead to disambiguate these two keywords), vs. token look-ahead, where to know which branch I must travel I have to know what the next token is.

@hvbargen Actually, by what you define it as, I think it is more so the lexer, since, I would hope, to implement validation I don’t have to store the whole token stream in an AST and can just transiently keep a subset as I confirm the string matches the grammar.

One question I do have is whether there’s a guaranteed delimiter in most grammars. E.g., in SQL, it seems that whitespace or a special set of delimiters is used to delineate all tokens. If true, I could read until the next break and check whether that’s a token. If not, maybe I have to read character by character.

There are plenty of grammars where there is no definitive sync point.

Again, you’re mostly talking about parsing, not lexing.

Lexing is simply the division of the input stream into tokens. “SAME” and “SOME” would be tokens of some type: reserved words, keywords, or user identifiers. Whitespace is usually a token delimiter but there are languages where it isn’t necessarily.

Most grammars, if you’re looking at something like BNF, generally do define the lexing rules. It is possible to write an analyzer that doesn’t separate out lexing, but it’s most common to separate tokenizing from the rest of processing the language. It just makes life much easier to separate these concerns.
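As an illustration of how little the lexer itself has to worry about cases like SAME vs. SOME: a whitespace-delimited lexer can simply scan a whole word (maximal munch) and classify it afterwards, rather than disambiguating character by character. A small Python sketch, with a made-up keyword set:

```python
# Hypothetical keyword set, purely for illustration.
KEYWORDS = {"SAME", "SOME", "SELECT"}

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1                    # whitespace delimits but emits no token
            continue
        start = i
        while i < len(text) and not text[i].isspace():
            i += 1                    # maximal munch: consume the whole word
        word = text[start:i]
        kind = "keyword" if word in KEYWORDS else "identifier"
        tokens.append((kind, word))
    return tokens

print(tokenize("SOME foo SAME"))
# [('keyword', 'SOME'), ('identifier', 'foo'), ('keyword', 'SAME')]
```

Note that no look-ahead decision between SAME and SOME ever happens: the whole lexeme is read first, and classification is a dictionary lookup afterwards.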

I recommend reading Robert Nystrom’s excellent book “Crafting Interpreters”. It is free of scientific babble and straight to the point, very pleasant to read.

You can buy it in printed form (I did) or read it online.


Traditionally, you would separate this into a lexing phase, which would recognize SAME and SOME as different tokens, and a parsing phase, which would derive the structure (AST) that these tokens fall under. You can always do “scannerless parsing”, where you skip the lexing phase altogether; this is much more flexible, but it is also much more expensive, since it imposes the overhead of parsing (determining the AST) at a character level, rather than at a token level.

You really should read a bit about theory for this, even if you end up implementing things your own way and not using any “compiler compiler” tools. The theory helps a lot in understanding why different choices are made, and what the limits are for each technology.
