Skip to main content

Recursive Descent and Parser Combinators

It is particularly easy to turn an LL(1) grammar into an efficient parser using the technique of recursive descent parsing. For each non-terminal in the grammar, we write a function that recognizes strings produced from that non-terminal. If there are multiple productions for the non-terminal, we use the next available character to decide which one to use. To parse the right-hand side of the chosen production rule, we have to recognize a sequence of terminals and non-terminals in order. To recognize a terminal, we just check that the current character from the input matches the expected symbol. To recognize a non-terminal, we call the associated function for that non-terminal.

Therefore, our parser will be a set of mutually recursive functions, one for each non-terminal. To parse a word in the language, we call the function corresponding to the starting non-terminal; if that function returns without error, then we have successfully matched a word. In addition to recognizing a string of characters, it is common for each recursive descent parsing function to return a data structure (the parse tree, or a close relative known as an abstract syntax tree) representing the input that was parsed.

Here is code for a recursive descent parser in Java, corresponding to the following grammar (expressed here in Backus-Naur form; it is very similar to the example G2G_2 discussed in the parsing section):

Expr ::= Term [ ( +   ) Term ]Term ::= Factor [ (   / ) Factor ]Factor ::= ident  num  “(” Expr “)”\begin{aligned} \langle\textit{Expr}\rangle\ &::=\ \langle\textit{Term}\rangle\ [\ (\ +\ |\ -\ )\ \langle\textit{Term}\rangle\ ]\ldots\\ \langle\textit{Term}\rangle\ &::=\ \langle\textit{Factor}\rangle\ [\ (\ *\ |\ /\ )\ \langle\textit{Factor}\rangle\ ]\ldots\\ \langle\textit{Factor}\rangle\ &::=\ \textrm{ident}\ |\ \textrm{num}\ |\ \textrm{``(''}\ \langle\textit{Expr}\rangle\ \textrm{``)''} \end{aligned}
/**
* Represents an expression node in an abstract syntax tree.
*/
public interface Expr {
// instance methods appropriate to the application should be declared here
/**
* Parse an expression (sum/difference of one or more terms).
*
* @param input
* @return
*/
public static Expr parse(Input input) {
Expr e = parseTerm(input);
while (input.peek() == '+' || input.peek() == '-') {
BinOp op = BinOp.parse(input);
Expr e2 = Expr.parseTerm(input);
e = new BinOpExpr(e, op, e2);
}
return e;
}
/**
* Parse a term (product/quotient of one or more factors).
*
* @param input
* @return
*/
public static Expr parseTerm(Input input) {
Expr e = parseFactor(input);
while (input.peek() == '*' || input.peek() == '/') {
BinOp op = BinOp.parse(input);
Expr e2 = Expr.parseFactor(input);
e = new BinOpExpr(e, op, e2);
}
return e;
}
/**
* Parse a factor (identifier, number, or parenthesized expression). Throws a
* RuntimeException if a factor is not available.
*
* @param input
* @return
*/
public static Expr parseFactor(Input input) {
if (Character.isLetter(input.peek())) {
String id = input.readIdent();
return new IdentExpr(id);
} else if (Character.isDigit(input.peek())) {
int n = input.readInt();
return new NumExpr(n);
} else if (input.peek() == '(') {
input.skip();
Expr e = parse(input);
input.match(')');
return e;
} else {
throw new RuntimeException("expected a factor");
}
}
}
public class BinOpExpr implements Expr {
private Expr left, right;
private BinOp op;
public BinOpExpr(Expr left, BinOp op, Expr right) {
this.left = left;
this.op = op;
this.right = right;
}
public String toString() {
return "BinOp(" + left + ", " + op + ", " + right + ")";
}
}
public class IdentExpr implements Expr {
private String id;
public IdentExpr(String id) {
this.id = id;
}
public String toString() {
return "Ident(" + id + ")";
}
}
public class NumExpr implements Expr {
private int n;
public NumExpr(int n) {
this.n = n;
}
public String toString() {
return "Num(" + n + ")";
}
}
/**
* Represents the binary operators available in the abstract syntax for
* expressions.
*/
public enum BinOp {
PLUS, MINUS, TIMES, DIVIDE;
/**
* Parse a binary operator from the given Input. Should only be called when the
* current character may start an operator.
*
* @param input
* @return
*/
static BinOp parse(Input input) {
switch (input.peek()) {
case '+':
input.skip();
return PLUS;
case '-':
input.skip();
return MINUS;
case '*':
input.skip();
return TIMES;
case '/':
input.skip();
return DIVIDE;
default:
return null; // shouldn't happen
}
}
}
/**
* Wrapper around a Reader that provides useful abstractions for recursive
* descent parsing.
*/
public class Input {
private java.io.Reader source;
private char next;
private boolean atEnd;
public Input(java.io.Reader source) {
this.source = source;
skip();
}
/**
* @return current available character
*/
public char peek() {
return next;
}
/**
* @return true if no more characters available
*/
public boolean atEnd() {
return atEnd;
}
/**
* Read the next available character, skipping over whitespace
*/
public void skip() {
readNext();
skipWhitespace();
}
/**
* If the current character is c, skip to the next. Throw a RuntimeException if
* the character does not match.
*
* @param c
*/
public void match(char c) {
if (next == c) {
skip();
} else {
throw new RuntimeException("expected " + c + " but found " + next);
}
}
/**
* Read an identifier (letter followed by zero or more letters or digits). This
* should only be called when the current character is a letter.
*
* @return the identifier
*/
public String readIdent() {
StringBuilder builder = new StringBuilder();
builder.append(next);
readNext();
while (!atEnd && Character.isLetterOrDigit(next)) {
builder.append(next);
readNext();
}
skipWhitespace();
return builder.toString();
}
/**
* Read an integer (digit followed by zero or more additional digits). This
* should only be called when the current character is a digit.
*
* @return the number
*/
public int readInt() {
int result = next - '0';
readNext();
while (!atEnd && Character.isDigit(next)) {
result = result * 10 + next - '0';
readNext();
}
skipWhitespace();
return result;
}
private void readNext() {
try {
int c = source.read();
if (c != -1) {
next = (char) c;
atEnd = false;
} else {
next = '\0';
atEnd = true;
}
} catch (java.io.IOException e) {
next = '\0';
atEnd = true;
}
}
private void skipWhitespace() {
while (!atEnd && Character.isWhitespace(next)) {
readNext();
}
}
}
public class Demo {
public static void main(String[] args) {
String sample = " 3*abc + (x1 - x0) * r2d2/42 \n";
Input input = new Input(new StringReader(sample));
Expr e = Expr.parse(input);
if (input.atEnd()) {
System.out.println("Found " + e);
} else {
System.out.println("unscanned input after parsing " + e);
}
}
}

Parser Combinators

Instead of giving a direct translation of the Java version into ReasonML, it is common in functional languages to use what are known as parser combinators to write recursive descent parsers. A parser is viewed as a function from input to the pair of a result plus the remaining input (since in a functional language we do not want to use side-effects to update the "current character" available from an input source). A parser combinator is then a function that can combine one or more of these parsing functions into a composite parser.

For example, given parsers p1 and p2, the combinator <|> produces the parser p1 <|> p2 which attempts to parse according to p1; if it fails, then it attempts to use p2 instead. This corresponds to the | (choice) operator in BNF (and also in regular expressions). Some of the other combinators used below are <*>, which corresponds to sequencing one parser after another, and rep, which repeats a parser zero or more times (like the Kleene star).

Here is code for parser combinators in ReasonML, based on bs-little-parser:

Here is the parser for arithmetic expressions, corresponding to the Java example above. Note how the definitions of expr, term, and factor are very close to the original BNF: