The main motivation behind this post comes from the fact that, with the implementation of the lambda calculus we wrote in the last post, we'd have to write something like this

let term = Term::Abs(
    b'x',
    Box::new(Term::Abs(
        b'y',
        Box::new(Term::App(
            Box::new(Term::Var(b'x')),
            Box::new(Term::Var(b'y')),
        )),
    )),
);

just to define the term (λx. (λy. (x y))). Not only is this hard to read and cumbersome to write, it also forces anyone using our implementation to have Rust installed, instead of being able to run a standalone executable.

One step in this direction is to write a parser for our small language, so we can take raw text and transform it into an adequate term. Rust has a lot of parsing libraries, such as nom, pest, combine and lalrpop, each one with its pros and cons. However, I think it's better to write our own parser from scratch so we can take a look at how parsing actually works. In case you're interested, the parsing pattern we are going to use here is loosely based on the very first predictive parser from the second chapter of the Dragon Book.

Specifying a grammar

Usually, the first step is to decide what is the grammar of the language we want to parse. We are very lucky because we defined the grammar for the lambda calculus in the last post:

    t :=                <terms>
         x              <variables>
       | λx. t          <abstractions>
       | t t            <applications>

Even then, this is not the concrete grammar for our language because it does not include what x actually is. In practice, we also use parentheses to mark where abstractions and applications end. To build a concrete grammar we need to add these details:

    t :=                <terms>
         x              <variables>
       | (\x. t)        <abstractions>
       | (t t)          <applications>

    x := 'a' | 'b' | 'c' | ...

We introduced a small change to abstractions: instead of using the traditional λ, which cannot be typed easily, we will borrow Haskell's notation for anonymous functions and use \ instead. The fact that this is our concrete grammar also implies that we won't allow the programmer to write extra spaces, as in ( x y) or (\ x . y) and so on.

Error management

Don't you hate it when the compiler you're using gives you an unhelpful error? We will try our best by pointing to the error location in the original input and adding a small description of our error when parsing fails.

pub struct ParseError {
    pub loc: usize,
    pub kind: ParseErrorKind,
}

impl ParseError {
    pub fn new(loc: usize, kind: ParseErrorKind) -> Self {
        ParseError { loc, kind }
    }
}

pub enum ParseErrorKind {}

Every ParseError has a location and the specific information about the error is stored in ParseErrorKind, which right now is an empty enum. We will add the variants as we need them.

The idea is that every fallible parsing operation here should return Result<T, ParseError>. Thus, we will add a type alias for that

pub type ParseResult<T> = Result<T, ParseError>;

Tokenization

Instead of working directly with the bytes or characters of the source code, we will transform the source code into tokens

(\x. y) ~> [LBracket, Lambda, Char('x'), Dot, Space, Char('y'), RBracket]

In our language all tokens are generated from a single character. However, in more complex languages it is normal to generate tokens from longer sequences of characters. As an example, let's suppose for a moment that we want longer names for our variables because single letters aren't enough:

(\foo. bar) ~> [LBracket, Lambda, Name("foo"), Dot, Space, Name("bar"), RBracket]

Tokens are a great tool for parsing because they free us from working with the raw input directly. In particular, if someone writes a program like (\x. ?) we can catch the invalid ? during the tokenization process. This means that we don't have to worry about that when doing the actual parsing.

Let's define our tokens

#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Token {
    loc: usize,
    kind: TokenKind,
}

impl Token {
    fn kind(&self) -> &TokenKind {
        &self.kind
    }
}

#[derive(Debug, PartialEq, Eq, Clone)]
pub enum TokenKind {
    LBracket,
    RBracket,
    Lambda,
    Char(u8),
    Dot,
    Space,
    EOF,
}

impl TokenKind {
    fn into_token(self, loc: usize) -> Token {
        Token { loc, kind: self }
    }
}

We are storing the location at which every token started so we can easily report when a token was not expected. We also have a special TokenKind for when we reach the end of the file. This removes the hassle of checking every time whether we ran out of tokens: reaching the end of the source code unexpectedly is handled the same way as reading an unexpected token.

Now we need to transform the source code into a stream of tokens. For this we will write a Tokenizer

pub struct Tokenizer<'a> {
    input: &'a [u8],
    loc: usize,
}

impl<'a> Tokenizer<'a> {
    pub fn new(input: &'a [u8]) -> Self {
        Tokenizer { input, loc: 0 }
    }
}

Our tokenizer stores a slice of bytes with the source code and the location of the beginning of the next token to be produced. It should also be able to tokenize one byte at a time

impl<'a> Tokenizer<'a> {
    ...
    fn tokenize(&self, byte: u8) -> ParseResult<Token> {
        use TokenKind::*;

        let kind = match byte {
            b'(' => LBracket,
            b')' => RBracket,
            b'\\' => Lambda,
            b'.' => Dot,
            b' ' => Space,
            byte if byte.is_ascii_lowercase() => Char(byte),
            _ => return Err(ParseError::new(self.loc, ParseErrorKind::InvalidByte(byte))),
        };
        Ok(kind.into_token(self.loc))
    }
}

Tokenizer::tokenize takes a single byte and transforms it into a token if the byte can be mapped to a proper TokenKind. If that is not the case, it returns an error.

And here is when we add our first possible parsing error. We need to report that an invalid byte was provided during tokenization. Let's add a new variant to ParseErrorKind for that:

pub enum ParseErrorKind {
    InvalidByte(u8),
}

Now let's add a method to our tokenizer capable of traversing the source code and producing tokens simultaneously

impl<'a> Tokenizer<'a> {
    ...
    pub fn next_token(&mut self) -> ParseResult<Token> {
        if let Some(&byte) = self.input.get(self.loc) {
            let token = self.tokenize(byte)?;
            self.loc += 1;
            Ok(token)
        } else {
            Ok(TokenKind::EOF.into_token(self.loc))
        }
    }
}

Note that we don't update the location when we reach the end of the file, so every call after that keeps returning the EOF token. If we try to tokenize something like (\x. y), the tokenizer will return the following tokens when we call next_token repeatedly:

[LBracket, Lambda, Char(b'x'), Dot, Space, Char(b'y'), RBracket]

However, if we try to tokenize (\x. ?), next_token will return an error with kind InvalidByte(b'?').
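If you want to double check this behavior, here is a small test sketch. It is not part of the parser itself, and it assumes everything we have written so far lives in the same module:

#[cfg(test)]
mod tokenizer_tests {
    use super::*;

    #[test]
    fn tokenizes_a_simple_term() {
        let mut tokenizer = Tokenizer::new(b"(x");
        // The first two calls produce the expected tokens...
        assert_eq!(tokenizer.next_token().ok(), Some(TokenKind::LBracket.into_token(0)));
        assert_eq!(tokenizer.next_token().ok(), Some(TokenKind::Char(b'x').into_token(1)));
        // ...and once the input is exhausted we get EOF over and over, always at the same location.
        assert_eq!(tokenizer.next_token().ok(), Some(TokenKind::EOF.into_token(2)));
        assert_eq!(tokenizer.next_token().ok(), Some(TokenKind::EOF.into_token(2)));
    }

    #[test]
    fn rejects_an_invalid_byte() {
        // `?` is not part of our grammar, so tokenization fails right away.
        let mut tokenizer = Tokenizer::new(b"?");
        assert!(tokenizer.next_token().is_err());
    }
}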

Now we are ready to write the actual parser.

Implementing a parser

Let's rethink our parsing logic in terms of tokens. We have the following grammar

    t :=                <terms>
         x              <variables>
       | (\x. t)        <abstractions>
       | (t t)          <applications>

    x := 'a' | 'b' | 'c' | ...

From it we can infer that any possible term must be a variable, or start with a (. That means that when parsing a term, the first possible token can only be a TokenKind::Char or a TokenKind::LBracket.

If it is a TokenKind::Char, we are done and can return a Term::Var with it. If it is a TokenKind::LBracket, we are parsing either an abstraction or an application.

However, it is not clear which one unless we take a look at the next token. If it is a TokenKind::Lambda we are parsing an abstraction; otherwise we are parsing an application. This means that we need to be able to peek at the next token before actually consuming it. This operation is usually known as lookahead.

To be able to do a lookahead before calling Tokenizer::next_token, we need to store the token produced by the previous call to Tokenizer::next_token

struct Parser<'a> {
    lookahead: Token,
    tokens: Tokenizer<'a>,
}

So when we need to do a lookahead we can check lookahead and decide what to do next. However, we must also be able to actually move ahead and read the next token

impl<'a> Parser<'a> {
    fn consume_lookahead(&mut self) -> ParseResult<Token> {
        let mut token = self.tokens.next_token()?;
        std::mem::swap(&mut token, &mut self.lookahead);
        Ok(token)
    }
}

To do that, we call Tokenizer::next_token to get the next token, and then we must decide what to do with the old lookahead value. That value is still useful when we don't want to look ahead but just read the next token. So we swap the old lookahead value with the token we just read and return the old lookahead value.
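Here is another small test sketch (again, not part of the parser, and assuming everything lives in the same module) showing the swap in action:

#[cfg(test)]
mod lookahead_tests {
    use super::*;

    #[test]
    fn consuming_swaps_the_lookahead() {
        let mut tokens = Tokenizer::new(b"(x");
        // Prime the lookahead with the first token, just like the parser's entry point will do later.
        let mut parser = Parser {
            lookahead: tokens.next_token().ok().unwrap(),
            tokens,
        };
        assert_eq!(parser.lookahead.kind(), &TokenKind::LBracket);
        // Consuming returns the old lookahead and stores the next token in its place.
        let consumed = parser.consume_lookahead().ok().unwrap();
        assert_eq!(consumed.kind(), &TokenKind::LBracket);
        assert_eq!(parser.lookahead.kind(), &TokenKind::Char(b'x'));
    }
}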

Now we are ready to start parsing terms

impl<'a> Parser<'a> {
    ...
    fn parse_term(&mut self) -> ParseResult<Term> {
        let token = self.consume_lookahead()?;
        match *token.kind() {
            TokenKind::Char(byte) => Ok(Term::Var(byte)),
            TokenKind::LBracket => todo!(), // We are parsing an abstraction or an application
            _ => todo!(), // We weren't expecting this token
        }
    }
}

If the token we have is a TokenKind::Char, we are in the easy part because we can return a Term::Var with its contents and call it a day. If it is a TokenKind::LBracket, we have to decide whether to parse the subsequent tokens as an abstraction or as an application. Here is where our lookahead is useful

impl<'a> Parser<'a> {
    ...
    fn parse_term(&mut self) -> ParseResult<Term> {
        let token = self.consume_lookahead()?;
        match *token.kind() {
            TokenKind::Char(byte) => Ok(Term::Var(byte)),
            TokenKind::LBracket => {
                if let TokenKind::Lambda = self.lookahead.kind() {
                    self.parse_abs()
                } else {
                    self.parse_app()
                }
            }
            _ => todo!(), // We weren't expecting this token
        }
    }
    /// Parse an abstraction
    fn parse_abs(&mut self) -> ParseResult<Term> {
        todo!()
    }
    /// Parse an application
    fn parse_app(&mut self) -> ParseResult<Term> {
        todo!()
    }
}

If the lookahead is a TokenKind::Lambda, we proceed to parse an abstraction; otherwise we try to parse an application. If we didn't have the lookahead, we'd have to figure out how to pass the starting token to the other parsing functions.

However, what happens if we read neither a TokenKind::LBracket nor a TokenKind::Char in the first place? Well, that means that the source code has an error in it, because all terms must start with a lowercase letter or a (. So we should add a new kind of error

pub enum ParseErrorKind {
    InvalidByte(u8),
    UnexpectedToken(TokenKind),
}

And that new kind of error is only created when we read a token we didn't expect, so it would be useful to be able to transform a token into an error if we weren't expecting it

impl Token {
    ...
    pub fn into_unexpected(self) -> ParseError {
        ParseError::new(self.loc, ParseErrorKind::UnexpectedToken(self.kind))
    }
}

With all this we can go back to work on our parse_term method

impl<'a> Parser<'a> {
    ...
    fn parse_term(&mut self) -> ParseResult<Term> {
        let token = self.consume_lookahead()?;
        match *token.kind() {
            TokenKind::Char(byte) => Ok(Term::Var(byte)),
            TokenKind::LBracket => {
                if let TokenKind::Lambda = self.lookahead.kind() {
                    self.parse_abs()
                } else {
                    self.parse_app()
                }
            }
            _ => Err(token.into_unexpected())
        }
    }
}

However, there is a small detail. If we are parsing a term between brackets, we need to check that there is a closing bracket at the end. And we actually need to do this kind of check with several tokens: when parsing an abstraction we don't need any information from TokenKind::Lambda, TokenKind::Dot and TokenKind::Space, we only need to check that they are there and move on. So it is a good idea to write a method that consumes a token and checks whether it is of a certain TokenKind

impl<'a> Parser<'a> {
    ...
    fn expect_token(&mut self, expected: TokenKind) -> ParseResult<()> {
        let token = self.consume_lookahead()?;
        if expected == *token.kind() {
            Ok(())
        } else {
            Err(token.into_unexpected())
        }
    }
}

And then we can finish our parse_term method

impl<'a> Parser<'a> {
    ...
    fn parse_term(&mut self) -> ParseResult<Term> {
        let token = self.consume_lookahead()?;
        match *token.kind() {
            TokenKind::Char(byte) => Ok(Term::Var(byte)),
            TokenKind::LBracket => {
                let term = if let TokenKind::Lambda = self.lookahead.kind() {
                    self.parse_abs()?
                } else {
                    self.parse_app()?
                };
                self.expect_token(TokenKind::RBracket)?;
                Ok(term)
            }
            _ => Err(token.into_unexpected())
        }
    }
}

Now we have to write the parse_abs method. First we need to check for the TokenKind::Lambda

impl<'a> Parser<'a> {
    ...
    fn parse_abs(&mut self) -> ParseResult<Term> {
        self.expect_token(TokenKind::Lambda)?;
    }
}

Then we need to read the argument of the abstraction, which should be a TokenKind::Char

impl<'a> Parser<'a> {
    ...
    fn parse_abs(&mut self) -> ParseResult<Term> {
        self.expect_token(TokenKind::Lambda)?;

        let token = self.consume_lookahead()?;
        let arg = if let TokenKind::Char(byte) = *token.kind() {
            byte
        } else {
            return Err(token.into_unexpected());
        };
    }
}

After that, we need to check for the dot and the space

impl<'a> Parser<'a> {
    ...
    fn parse_abs(&mut self) -> ParseResult<Term> {
        self.expect_token(TokenKind::Lambda)?;

        let token = self.consume_lookahead()?;
        let arg = if let TokenKind::Char(byte) = *token.kind() {
            byte
        } else {
            return Err(token.into_unexpected());
        };

        self.expect_token(TokenKind::Dot)?;
        self.expect_token(TokenKind::Space)?;
    }
}

Finally, we need to parse the body of the abstraction, which can be an arbitrary term, so we call parse_term for that

impl<'a> Parser<'a> {
    ...
    fn parse_abs(&mut self) -> ParseResult<Term> {
        self.expect_token(TokenKind::Lambda)?;

        let token = self.consume_lookahead()?;
        let arg = if let TokenKind::Char(byte) = *token.kind() {
            byte
        } else {
            return Err(token.into_unexpected());
        };

        self.expect_token(TokenKind::Dot)?;
        self.expect_token(TokenKind::Space)?;
        let body = self.parse_term()?;

        Ok(Term::Abs(arg, Box::new(body)))
    }
}

And we are done with abstractions. We still have to write parse_app, but that should be pretty straightforward

impl<'a> Parser<'a> {
    ...
    fn parse_app(&mut self) -> ParseResult<Term> {
        let t1 = self.parse_term()?;
        self.expect_token(TokenKind::Space)?;
        let t2 = self.parse_term()?;

        Ok(Term::App(Box::new(t1), Box::new(t2)))
    }
}

And that's it. We just need a way to run our parser over the tokenizer's tokens and return a term

impl<'a> Parser<'a> {
    fn run(mut tokens: Tokenizer<'a>) -> ParseResult<Term> {
        let mut parser = Parser {
            lookahead: tokens.next_token()?,
            tokens,
        };
        let term = parser.parse_term()?;
        parser.expect_token(TokenKind::EOF)?;
        Ok(term)
    }
}

We also check that we reach the end of the file so we don't accept programs like (\x. x))))))).
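To convince ourselves that the whole pipeline works, here is one more test sketch. It only checks the shape of the resulting term, so it does not assume any extra derives on the Term type from the last post (only that Term is in scope in this module):

#[cfg(test)]
mod parser_tests {
    use super::*;

    #[test]
    fn parses_an_application() {
        let tokens = Tokenizer::new(b"(x y)");
        let term = Parser::run(tokens).ok().unwrap();
        // `(x y)` should become an application of the variable `x` to the variable `y`.
        match term {
            Term::App(t1, t2) => {
                assert!(matches!(*t1, Term::Var(b'x')));
                assert!(matches!(*t2, Term::Var(b'y')));
            }
            _ => panic!("expected an application"),
        }
    }

    #[test]
    fn rejects_trailing_input() {
        // The final EOF check catches the extra closing bracket.
        let tokens = Tokenizer::new(b"(x y))");
        assert!(Parser::run(tokens).is_err());
    }
}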

Displaying parsing errors

With the information we have stored in the ParseError, we should be able to display something like this

Parsing error at location 5
(\x. !x)
     ^
     Invalid byte '!'

To do that, we will implement std::fmt::Display for both ParseError and ParseErrorKind.

use std::fmt;

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Parsing error at location {}", self.loc)
    }
}

impl fmt::Display for ParseErrorKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        use ParseErrorKind::*;
        match self {
            InvalidByte(byte) => write!(f, "Invalid byte {:?}", *byte as char),
            UnexpectedToken(token) => write!(f, "Unexpected token '{:?}'", token),
        }
    }
}

Now we will add a parse function that calls Parser::run and returns the Term if parsing was successful; otherwise it prints the error

pub fn parse(input: &str) -> Option<Term> {
    let tokens = Tokenizer::new(input.as_bytes());
    match Parser::run(tokens) {
        Ok(term) => Some(term),
        Err(error) => {
            eprintln!("{}", error);
            eprintln!("{}", input);
            eprintln!("{:offset$}{}", " ", "^", offset = error.loc);
            eprintln!("{:offset$}{}", " ", error.kind, offset = error.loc);
            None
        }
    }
}

And that's it, we are ready to test the whole thing.

Testing our parser

fn main() {
    let inputs = [
        r"((\x. (x y)) (\z. z))",
        r"((\x. (\y. (x y))) (y z))",
        r"((\x. (y x)) x)",
        r"(\x. (y. z))",
        r"(\x. !x)",
        r"((\x. (\y. (x y))) (y z)))",
        r"((\x. (\y. (x y))) (y z)",
    ];

    for input in &inputs {
        println!("\nInput: {}", input);
        if let Some(mut term) = parse(input) {
            println!("Parsed term: {}", term);
            term.reduce();
            println!("After reduction: {}", term);
        }
    }
}

The first three terms should parse without trouble

Input: ((\x. (x y)) (\z. z))
Parsed term: ((λx. (x y)) (λz. z))
After reduction: ((λz. z) y)

Input: ((\x. (\y. (x y))) (y z))
Parsed term: ((λx. (λy. (x y))) (y z))
After reduction: ((λx. (λy. (x y))) (y z))

Input: ((\x. (y x)) x)
Parsed term: ((λx. (y x)) x)
After reduction: (y x)

Then (\x. (y. z)) has either a missing \ or an extra .

Input: (\x. (y. z))
Parsing error at location 7
(\x. (y. z))
       ^
       Unexpected token 'Dot'

The next one (\x. !x) has a ! character that cannot be tokenized

Input: (\x. !x)
Parsing error at location 5
(\x. !x)
     ^
     Invalid byte '!'

Now we get to the harder ones: ((\x. (\y. (x y))) (y z))) has an extra bracket at the end

Input: ((\x. (\y. (x y))) (y z)))
Parsing error at location 25
((\x. (\y. (x y))) (y z)))
                         ^
                         Unexpected token 'RBracket'

Finally, ((\x. (\y. (x y))) (y z) is missing a bracket at the end

Input: ((\x. (\y. (x y))) (y z)
Parsing error at location 24
((\x. (\y. (x y))) (y z)
                        ^
                        Unexpected token 'EOF'

Ok, everything is working as intended!

What's next?

Now we can load any program from a file in a more compact syntax that closely resembles the usual syntax for the lambda calculus. We could improve the parsing errors to show which token we were expecting instead of just the one we got. We could even store the position of the first token of each kind of term and improve the errors to highlight the term we were trying to parse.
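As a rough sketch of that first idea (this is not part of the final code, just an illustration), expect_token already knows both the token it found and the token it wanted, so it could report both through a richer error variant:

// Sketch only: this would replace the definitions we already have.
pub enum ParseErrorKind {
    InvalidByte(u8),
    UnexpectedToken(TokenKind),
    ExpectedToken { found: TokenKind, expected: TokenKind },
}

impl<'a> Parser<'a> {
    fn expect_token(&mut self, expected: TokenKind) -> ParseResult<()> {
        let token = self.consume_lookahead()?;
        if expected == *token.kind() {
            Ok(())
        } else {
            // Report both the token we found and the one we were expecting.
            Err(ParseError::new(
                token.loc,
                ParseErrorKind::ExpectedToken {
                    found: token.kind,
                    expected,
                },
            ))
        }
    }
}

The Display implementation would need a matching arm, but the rest of the parser would stay exactly the same.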

However, I'd rather spend more time improving the language itself. It would be cooler to fix the problem we had last time with the free variables when doing beta reduction.

See you next time!

PS: You can read the complete code here.