Recursive Descent and Parser Combinators
It is particularly easy to turn an LL(1) grammar into an efficient parser using the technique of recursive descent parsing. For each non-terminal in the grammar, we write a function that recognizes strings produced from that non-terminal. If there are multiple productions for the non-terminal, we use the next available character to decide which one to use. To parse the right-hand side of the chosen production rule, we have to recognize a sequence of terminals and non-terminals in order. To recognize a terminal, we just check that the current character from the input matches the expected symbol. To recognize a non-terminal, we call the associated function for that non-terminal.
Therefore, our parser will be a set of mutually recursive functions, one for each non-terminal. To parse a word in the language, we call the function corresponding to the starting non-terminal; if that function returns without error, then we have successfully matched a word. In addition to recognizing a string of characters, it is common for each recursive descent parsing function to return a data structure (the parse tree, or a close relative known as an abstract syntax tree) representing the input that was parsed.
Here is code for a recursive descent parser in Java, corresponding to the following grammar (expressed here in Backus-Naur form; it is very similar to the example discussed in the parsing section):
/*** Represents an expression node in an abstract syntax tree.*/public interface Expr {// instance methods appropriate to the application should be declared here/*** Parse an expression (sum/difference of one or more terms).** @param input* @return*/public static Expr parse(Input input) {Expr e = parseTerm(input);while (input.peek() == '+' || input.peek() == '-') {BinOp op = BinOp.parse(input);Expr e2 = Expr.parseTerm(input);e = new BinOpExpr(e, op, e2);}return e;}/*** Parse a term (product/quotient of one or more factors).** @param input* @return*/public static Expr parseTerm(Input input) {Expr e = parseFactor(input);while (input.peek() == '*' || input.peek() == '/') {BinOp op = BinOp.parse(input);Expr e2 = Expr.parseFactor(input);e = new BinOpExpr(e, op, e2);}return e;}/*** Parse a factor (identifier, number, or parenthesized expression). Throws a* RuntimeException if a factor is not available.** @param input* @return*/public static Expr parseFactor(Input input) {if (Character.isLetter(input.peek())) {String id = input.readIdent();return new IdentExpr(id);} else if (Character.isDigit(input.peek())) {int n = input.readInt();return new NumExpr(n);} else if (input.peek() == '(') {input.skip();Expr e = parse(input);input.match(')');return e;} else {throw new RuntimeException("expected a factor");}}}public class BinOpExpr implements Expr {private Expr left, right;private BinOp op;public BinOpExpr(Expr left, BinOp op, Expr right) {this.left = left;this.op = op;this.right = right;}public String toString() {return "BinOp(" + left + ", " + op + ", " + right + ")";}}public class IdentExpr implements Expr {private String id;public IdentExpr(String id) {this.id = id;}public String toString() {return "Ident(" + id + ")";}}public class NumExpr implements Expr {private int n;public NumExpr(int n) {this.n = n;}public String toString() {return "Num(" + n + ")";}}/*** Represents the binary operators available in the abstract syntax for* expressions.*/public enum BinOp {PLUS, MINUS, TIMES, DIVIDE;/*** Parse a binary operator from the given Input. Should only be called when the* current character may start an operator.** @param input* @return*/static BinOp parse(Input input) {switch (input.peek()) {case '+':input.skip();return PLUS;case '-':input.skip();return MINUS;case '*':input.skip();return TIMES;case '/':input.skip();return DIVIDE;default:return null; // shouldn't happen}}}/*** Wrapper around a Reader that provides useful abstractions for recursive* descent parsing.*/public class Input {private java.io.Reader source;private char next;private boolean atEnd;public Input(java.io.Reader source) {this.source = source;skip();}/*** @return current available character*/public char peek() {return next;}/*** @return true if no more characters available*/public boolean atEnd() {return atEnd;}/*** Read the next available character, skipping over whitespace*/public void skip() {readNext();skipWhitespace();}/*** If the current character is c, skip to the next. Throw a RuntimeException if* the character does not match.** @param c*/public void match(char c) {if (next == c) {skip();} else {throw new RuntimeException("expected " + c + " but found " + next);}}/*** Read an identifier (letter followed by zero or more letters or digits). This* should only be called when the current character is a letter.** @return the identifier*/public String readIdent() {StringBuilder builder = new StringBuilder();builder.append(next);readNext();while (!atEnd && Character.isLetterOrDigit(next)) {builder.append(next);readNext();}skipWhitespace();return builder.toString();}/*** Read an integer (digit followed by zero or more additional digits). This* should only be called when the current character is a digit.** @return the number*/public int readInt() {int result = next - '0';readNext();while (!atEnd && Character.isDigit(next)) {result = result * 10 + next - '0';readNext();}skipWhitespace();return result;}private void readNext() {try {int c = source.read();if (c != -1) {next = (char) c;atEnd = false;} else {next = '\0';atEnd = true;}} catch (java.io.IOException e) {next = '\0';atEnd = true;}}private void skipWhitespace() {while (!atEnd && Character.isWhitespace(next)) {readNext();}}}public class Demo {public static void main(String[] args) {String sample = " 3*abc + (x1 - x0) * r2d2/42 \n";Input input = new Input(new StringReader(sample));Expr e = Expr.parse(input);if (input.atEnd()) {System.out.println("Found " + e);} else {System.out.println("unscanned input after parsing " + e);}}}
Parser Combinators
Instead of giving a direct translation of the Java version into ReasonML, it is common in functional languages to use what are known as parser combinators to write recursive descent parsers. A parser is viewed as a function from input to the pair of a result plus the remaining input (since in a functional language we do not want to use side-effects to update the "current character" available from an input source). A parser combinator is then a function that can combine one or more of these parsing functions into a composite parser.
For example, given parsers p1
and p2
, the combinator <|>
produces the parser
p1 <|> p2
which attempts to parse according to p1
; if it fails, then it attempts
to use p2
instead. This corresponds to the (choice) operator in BNF (and also
in regular expressions). Some of the other combinators used below are <*>
, which
corresponds to sequencing one parser after another, and rep
, which repeats a
parser zero or more times (like the Kleene star).
Here is code for parser combinators in ReasonML, based on bs-little-parser:
Here is the parser for arithmetic expressions, corresponding to the Java example above.
Note how the definitions of expr
, term
, and factor
are very close to the original BNF: