General Lexer Questions

Probably too general of a question, for which I apologize, but I am hoping to write a lexer in Zig. In thinking of how it would be tackled, say you have a graph:
A → B

A → C

B → D

B → E

C → D

C → E

where A, B, C, D, and E are not characters, but representations of nodes in the graph (which, themselves could contain a graph within).

First would be to consume characters until some delimiter and compare against “A”, right? If that doesn’t match, fail. But, if it does, we then need to see if the next “token” matches “B” or “C”.

So then you check “B”: if it matches, good, and you move on to checking “D”. If “D” matches, great; if it doesn’t, you check “E”; if neither matches, that path fails.

Here is where my first question arises: before any section of the graph with parallel nodes, is one meant to “save” the position to jump back to? I.e., in the aforementioned example, if traveling A → B → D and A → B → E didn’t match, one would have to restart from A → C, right?

My second question is: can it be taken as an assumption that any well-defined grammar with these parallel nodes ensures those parallel nodes are unique? What I mean is: is it impossible for a string to satisfy multiple routes?

If the two things above are true: I want to minimize stored state, as I don’t care about the knowledge gained throughout the processing of a statement, just that it abides by the grammar.

I would need my current position, and I fear I need an unbounded stack to store all the “stop points” along the graph to go back to. I think that’s it? I wish there were a way to not require the whole “history”.
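To make the idea concrete, here is a toy sketch of that backtracking approach with an explicit stack of stop points (Python just for brevity; the graph, names, and token comparison are only the example above, not a real lexer):

```python
# Toy backtracking matcher for the example graph. Each stack entry is a
# "stop point": (position in the token stream, node to try next).
GRAPH = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["D", "E"],
    "D": [],
    "E": [],
}

def matches(tokens):
    stack = [(0, "A")]
    while stack:
        pos, node = stack.pop()
        if pos >= len(tokens) or tokens[pos] != node:
            continue  # this path failed; resume from a saved stop point
        if not GRAPH[node]:  # leaf node: succeed only if all input is consumed
            if pos + 1 == len(tokens):
                return True
            continue
        for nxt in GRAPH[node]:
            stack.append((pos + 1, nxt))
    return False
```

The stack here is exactly the unbounded “history” I’m worried about: in the worst case it grows with the number of untried alternatives.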


It’s not too general. We all have to learn from somewhere :grinning_face_with_smiling_eyes:. I think, and maybe I’m wrong, that your question is not only about lexing but also a bit about parsing. So I will try to expand a bit on that.

If you implement the graph literally and try paths one by one, then yes, you need a way to backtrack. This is essentially a depth-first search on your grammar graph. However, this is rarely how production lexers work because backtracking is expensive and can require significant state.

The “trick” is to make this nondeterminism deterministic (formally, converting the NFA to a DFA) by folding the relevant history into the states themselves. Then you only need to save the current state and your position in the input, keeping memory constant.
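To illustrate, here is what the deterministic version could look like for your example graph. The recognizer holds only the current state and its position in the input, with no backtracking stack (Python sketch; the state names and the hand-built table are invented for illustration):

```python
# The example graph compiled into a deterministic transition table.
# Since B and C lead to the same successors {D, E}, they can share a state.
DFA = {
    ("start", "A"): "sawA",
    ("sawA", "B"): "sawBC",
    ("sawA", "C"): "sawBC",
    ("sawBC", "D"): "accept",
    ("sawBC", "E"): "accept",
}

def accepts(tokens):
    state = "start"
    for tok in tokens:
        state = DFA.get((state, tok))
        if state is None:
            return False  # no transition: reject immediately, no backtracking
    return state == "accept"
```

Notice the recognizer never rewinds the input, which is why no stack of stop points is needed.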

Formally, you’re asking about ambiguous grammars. A famous one that is easy to grasp is the dangling else problem. Programming language grammars generally try (or at least should try) to be unambiguous, meaning that a piece of code can only mean one thing. But if you do have an ambiguous one, you need a rule to break the tie, like choosing the longest match or using operator precedence.
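A toy sketch of the longest-match rule (the operator list is made up for illustration): “>=” could be read as “>” followed by “=”, but trying candidates longest-first resolves the ambiguity:

```python
# Longest-match ("maximal munch") tie-breaking: the candidate list is
# ordered longest first, so ">=" wins over ">" at the same position.
OPERATORS = [">=", "<=", "==", ">", "<", "="]

def next_op(text, pos):
    for op in OPERATORS:
        if text.startswith(op, pos):
            return op
    return None
```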


I think it would help for you to look a bit into automata theory and the chomsky hierarchy. In general, as I think is quite intuitive, the more complex a grammer/language is the harder it is to deal with it:

  • regular languages can easily be recognized by a finite state automaton. They don’t require any memory of the past beyond the current state, but they also can’t handle things like nested parentheses.
  • context-free languages require a stack for the history (a pushdown automaton). This is the tier most programming languages are at, and they are also quite simple to implement. For instance, T(a) in Zig is always a function call, nothing else.
  • context-sensitive languages are harder because the meaning of a symbol depends on the symbols around it. Here, T(a) in C++ can mean a bunch of things: a function call, a cast, and so on.
  • recursively enumerable languages are not really feasible for parsing work.
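As an illustration of the context-free tier, here is a minimal pushdown-style check for nested parentheses, something no finite automaton can handle because the nesting depth is unbounded (Python sketch, not production code):

```python
# Pushdown-style recognizer for balanced parentheses: the stack records
# the open parens seen so far; a finite automaton has no such memory.
def balanced(text):
    depth = []
    for ch in text:
        if ch == "(":
            depth.append(ch)
        elif ch == ")":
            if not depth:
                return False  # closing paren with nothing open
            depth.pop()
    return not depth  # balanced only if everything opened was closed
```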

If you have more questions just ask. I hope this helps you somewhat.


I second @pzittlau’s suggestion to delve into a bit of theory, which will allow you to understand why things are usually implemented the way they are.

Regarding ambiguity, keep in mind that it is closely related to how much lookahead you are willing to allow your parser. Some grammars are unambiguous with just knowledge of the current token (so, zero lookahead), some require one token of lookahead, and some may require more. This is why you also have parsers that end up producing all possible parses for a given string – some strings are ambiguous for a given grammar, no matter what.

You could start reading about this topic here, skipping all the non-computer science stuff. Good luck!


From what you describe, your question is about parsing, not about lexing.

The lexer reads text (or binary data, but usually text) and creates a stream of tokens like keywords, symbols, identifiers, and number and string literals.

The parser reads this token stream and tries to “understand” it by transforming it into an AST, byte code, machine code, or an intermediate representation.

From a practical perspective, the keywords and symbols, together with the grammar, help the parser to quickly detect which constructs are possible at a given location.

For many programming languages, the grammar is developed carefully such that the parser requires zero or only one token lookahead.
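A toy sketch of what one-token lookahead means in practice (token names and branches are invented for illustration): the parser peeks at the next token and immediately knows which construct applies, without backtracking:

```python
# One-token lookahead: a single peek picks the grammar branch.
def parse_statement(tokens, pos):
    look = tokens[pos] if pos < len(tokens) else None
    if look == "if":
        return "if-statement"     # would descend into the if-rule here
    if look == "while":
        return "while-statement"  # would descend into the while-rule here
    return "expression-statement"
```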

This restriction on grammars (no need to look far ahead) also helps us humans understand the text.

If your grammar is ambiguous, then implementing a parser is more difficult and slower, and the language is also more difficult for humans to read.