Interpreter in Go - 3
A lexer takes the source code, a sequence of characters, and groups them into tokens. E.g., it makes the first decision on how to process the strings 100-10, -100-10, and -100--100 into groups. I’m going to call this grouping “tokenization” even though I may be misusing the term.
Tokenizing source code is hard. How should -100--100 be tokenized? Should it be a literal -100 followed by the minus token, followed by another -100?
Fun fact: Chinese readers natively deal with ambiguous lexing all the time. For example, 美國加州大學 (“University of California, USA”) can be grouped as (美國, 加, 州大學) or (美國, 加州, 大學) or (美國, 加州大學). We somehow figure it out by looking at a lot of context.
So how should -100--100 be tokenized? Both Nystrom’s book and Ball’s book make the decision to treat the minus sign as an individual token. This simplifies the lexer. A rule of thumb might be, “if a character’s meaning is context dependent, make it into its own token.”
A random commenter on Stack Overflow says:
As a matter of practical engineering, usually the lexer doesn't handle signed numbers.
This means the parser can handle the minus sign, either as a subtract between two operands (the second of which may be a number), or as the negative of the operand (which may be a number). That also makes it easy to handle odd expressions like "a- -7". So all you need is the "MINUS" token.
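To make this concrete, here is a minimal sketch of a lexer that follows this rule. The token names and structure are my own simplification, not the exact code from either book:

package main

import (
	"fmt"
	"unicode"
)

type TokenType string

const (
	MINUS TokenType = "MINUS"
	INT   TokenType = "INT"
	EOF   TokenType = "EOF"
)

type Token struct {
	Type    TokenType
	Literal string
}

// Lex groups characters into tokens. A '-' is always emitted as
// its own MINUS token; the lexer never decides whether it means
// negation or subtraction.
func Lex(src string) []Token {
	var tokens []Token
	for i := 0; i < len(src); {
		ch := src[i]
		switch {
		case ch == '-':
			tokens = append(tokens, Token{MINUS, "-"})
			i++
		case unicode.IsDigit(rune(ch)):
			start := i
			for i < len(src) && unicode.IsDigit(rune(src[i])) {
				i++
			}
			tokens = append(tokens, Token{INT, src[start:i]})
		default:
			i++ // skip whitespace and anything else we don't handle
		}
	}
	return append(tokens, Token{EOF, ""})
}

func main() {
	for _, tok := range Lex("-100--100") {
		fmt.Printf("%s %q\n", tok.Type, tok.Literal)
	}
	// MINUS "-", INT "100", MINUS "-", MINUS "-", INT "100", EOF ""
}

Note that the lexer punts entirely on the ambiguity: it just emits MINUS and moves on, leaving the interpretation to the parser.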
A parser turns a sequence of tokens into a tree. In our example, -100--100 might be represented as the following tree:

              - (minus token: subtraction)
             /                           \
    - (negation operator)        - (negation operator)
            |                            |
           100                          100

or it might be mis-parsed into an invalid tree, with the parser giving up when it hits a minus token it doesn’t expect:

              - (minus token: subtraction)
             /                           \
    - (negation operator)        (syntax error: unexpected token!)
            |
           100
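As a reference point, here is a rough sketch of what those tree nodes could look like in Go. These are my own simplified types, not the actual AST definitions from either book:

package ast

// Expr is any node in the expression tree.
type Expr interface{ expr() }

// IntegerLiteral is a leaf node, e.g. 100.
type IntegerLiteral struct {
	Value int64
}

// PrefixExpr is a unary operator applied to one operand,
// e.g. the negation in -100.
type PrefixExpr struct {
	Operator string // "-"
	Right    Expr
}

// InfixExpr is a binary operator between two operands,
// e.g. the subtraction in a - b.
type InfixExpr struct {
	Left     Expr
	Operator string // "-"
	Right    Expr
}

func (IntegerLiteral) expr() {}
func (PrefixExpr) expr()     {}
func (InfixExpr) expr()      {}

With these types, the valid tree above is:

InfixExpr{
	Left:     PrefixExpr{Operator: "-", Right: IntegerLiteral{Value: 100}},
	Operator: "-",
	Right:    PrefixExpr{Operator: "-", Right: IntegerLiteral{Value: 100}},
}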
Another aside: Ball and Nystrom have very different teaching styles. Ball’s sample code starts by ignoring the minus token to get the reader off the ground running, then asks the reader to change the code later. Nystrom, on the other hand, tends to start with the end result. It’s pretty cool to follow both teaching styles as you go along.
Anyway, the parser should do more of the heavy lifting when a particular token has context-sensitive meaning. In this case, the parser can look at whether the minus sign is in prefix or infix position and apply different logic in order to interpret the source code as:
(-1 * 100) subtract (-1 * 100)
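Here is a stripped-down sketch of that prefix/infix dispatch, in the spirit of the Pratt parsing approach both books end up using. There are no precedence levels since minus is our only operator, and all the names are mine:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

type parser struct {
	tokens []string
	pos    int
}

func (p *parser) peek() string {
	if p.pos < len(p.tokens) {
		return p.tokens[p.pos]
	}
	return ""
}

func (p *parser) next() string {
	tok := p.peek()
	p.pos++
	return tok
}

// parseExpr parses an operand in prefix position first, then
// loops as long as it sees an infix minus. The same "-" token
// gets different logic depending on where it appears.
func (p *parser) parseExpr() string {
	left := p.parsePrefix()
	for p.peek() == "-" {
		p.next() // consume the infix minus (subtraction)
		right := p.parsePrefix()
		left = fmt.Sprintf("(%s - %s)", left, right)
	}
	return left
}

// parsePrefix handles what can START an expression: a prefix
// minus (negation) or an integer literal.
func (p *parser) parsePrefix() string {
	tok := p.next()
	if tok == "-" {
		return fmt.Sprintf("(-%s)", p.parsePrefix())
	}
	if _, err := strconv.Atoi(tok); err == nil {
		return tok
	}
	return fmt.Sprintf("<syntax error: unexpected token %q>", tok)
}

func main() {
	// -100--100, already split into tokens by the lexer.
	p := &parser{tokens: strings.Split("- 100 - - 100", " ")}
	fmt.Println(p.parseExpr()) // ((-100) - (-100))
}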
Lexing and parsing are each a single pass, so we go through our code in two passes. Perhaps it is possible to combine them into one pass, but it seems exceedingly complex. Similarly, perhaps there could be another stage between lexing and parsing that makes a compiler easier to work with, but I am not smart enough to think of such an algorithm. Sometimes, conventional wisdom is good enough.