Parsing

In addition to providing access to virtually any library of code written in Java, the standard Scala distribution also supplies a number of powerful libraries of its own. In this lab, we will get some preliminary experience using one of these, which we will be using several more times this semester: the combinator parsing library. The details of how the parsing combinators are implemented are somewhat involved, but their use is quite straightforward.

The basic idea is that the combinators can take very simple parsers, that can only recognize individual tokens such as identifiers, numbers, or punctuation, and build more complex parsers that can recognize sequences of tokens. These compound parsers, as Scala expressions, can be named and used to form ever more sophiticated parsers that bear a strong resemblance to BNF grammars and regular expressions.

S-Expressions

An S-Expression (sexpr), which stands for "Symbolic Expression", is either an "atom" (a single token such as an identifier or numeric literal) or a list of sexprs in parentheses. One of the strengths of the LISP family of languages is that this very simple formation rule is used to construct both the program code and the data on which it operates -- this enabled even the earliest versions of LISP, in the late 1950's, to implement their own interpreter in a small amount of code.

Here is the sexpr formation rule expressed using parser combinators:

Show source
> import scala.util.parsing.combinator._
import scala.util.parsing.combinator._

> object SExprParser extends JavaTokenParsers {
|   lazy val sexpr: Parser[Any] = atom | list
|   
|   lazy val atom: Parser[Any] = ident | wholeNumber
|   
|   lazy val list: Parser[Any] = "(" ~ rep(sexpr) ~ ")"
|   
|   def apply(in: String) = parseAll(sexpr, in)
| }
defined module SExprParser

The javaTokenParsers class provides the basic ident and wholeNumber parsers; other useful parsers in the class are floatingPointNumber and stringLiteral for matching Java-style numbers and quote-delimited strings. In each case, the parser's result is the matched string of characters -- if you want a number, you need to convert it yourself with toInt or toDouble.

The ~ combinator is simply sequencing, or concatenation; the parser p ~ q ~ r matches a p followed by a q followed by an r. The result will be of the form a ~ b ~ c, where a, b, and c are the results of the individual parsers p, q, and r. The | combinator gives "alternation", or choice -- p | q first tries to match p; if this fails, then it will try to match q instead, starting at the same place in the input. Finally, the parser rep(p) matches 0 or more copies of p in sequence; its result is a list of all the matches. There is a variant, rep1(p), which matches 1 or more; it is similar to p ~ rep(p), except all of the matches are placed in a single list.

This parser may be used as follows:

Show source
> SExprParser("(this is (a test))") // success
res0: SExprParser.ParseResult[Any] = [1.19] parsed: (((~List(this, is, (((~List(a, test))~))))~))

> SExprParser("(no closing parenthesis") // failure
res1: SExprParser.ParseResult[Any] = 
[1.24] failure: `)' expected but `' found

(no closing parenthesis
                       ^

Exercise: Try more examples with this parser, and see if you can figure out the information in the result.

Note that when the parse succeeds, the result contains all of the pieces of the input in parsed form. This is usually more than we want -- rather than a full "parse tree", we generally want more of an "abstract syntax tree", preserving only the essential structure and data. By using the ^^ operator, we may grab and modify the return value from each grammar production:

Show source
> sealed trait SExpr
defined trait SExpr

> case class IdentSExpr(id: String) extends SExpr
defined class IdentSExpr

> case class IntLiteralSExpr(lit: String) extends SExpr
defined class IntLiteralSExpr

> case class ListSExpr(items: List[SExpr]) extends SExpr
defined class ListSExpr

> object SExprParser2 extends JavaTokenParsers {
|   lazy val sexpr: Parser[SExpr] = atom | list
|   
|   lazy val atom: Parser[SExpr] =
|   ( ident       ^^ {case id => IdentSExpr(id)}
|   | wholeNumber ^^ {case lit => IntLiteralSExpr(lit)}
|   )
|   
|   lazy val list: Parser[SExpr] =
|     "(" ~ rep(sexpr) ~ ")" ^^ {case (_ ~ items ~ _) => ListSExpr(items)}
|     
|   def apply(in: String) = parseAll(sexpr, in)
| }
defined module SExprParser2

Now we are returning a value in the algebraic data type SExpr on success:

Show source
> SExprParser2("(this is (a test))") // success
res2: SExprParser2.ParseResult[SExpr] = [1.19] parsed: ListSExpr(List(IdentSExpr(this), IdentSExpr(is), ListSExpr(List(IdentSExpr(a), IdentSExpr(test)))))

> SExprParser2("(no closing parenthesis") // failure
res3: SExprParser2.ParseResult[SExpr] = 
[1.24] failure: `)' expected but `' found

(no closing parenthesis
                       ^

Finally, we may extract the result with get:

Show source
> SExprParser2("(this is (a test))").get
res4: SExpr = ListSExpr(List(IdentSExpr(this), IdentSExpr(is), ListSExpr(List(IdentSExpr(a), IdentSExpr(test)))))

> SExprParser2("(plus 17 (times 5 5))").get
res5: SExpr = ListSExpr(List(IdentSExpr(plus), IntLiteralSExpr(17), ListSExpr(List(IdentSExpr(times), IntLiteralSExpr(5), IntLiteralSExpr(5)))))

Once we can parse a sexpr, here is an example of a simple evaluation function:

Show source
> def eval(exp: SExpr): Int = exp match {
|   case IntLiteralSExpr(lit) => lit.toInt
|   case ListSExpr(IdentSExpr("plus") :: exps) =>
|     exps.map(eval).sum
|   case ListSExpr(IdentSExpr("times") :: exps) =>
|     exps.map(eval).product
|   case _ => error("Unrecognized expression")
| }
eval: (exp: SExpr)Int

> eval(SExprParser2("(plus 17 (times 5 5))").get)
res6: Int = 42

Exercise: Extend the parser so that it will also allow floating point literals, then extend the eval function so that it will compute a Double result.

Arithmetic Expressions

Here is a slightly larger example, based on the expression grammar given in Example 2.8 of Scott, "Programming Language Pragmatics" (3rd ed.):

Show source
> sealed trait Expr
defined trait Expr

> case class BinOpExpr(op: String, left: Expr, right: Expr) extends Expr
defined class BinOpExpr

> case class UnOpExpr(op: String, expr: Expr) extends Expr
defined class UnOpExpr

> case class IdExpr(id: String) extends Expr
defined class IdExpr

> case class NumExpr(num: Int) extends Expr
defined class NumExpr

> object ArithParser extends JavaTokenParsers with PackratParsers {
|   lazy val expr: PackratParser[Expr] =
|   ( expr ~ addop ~ term ^^ {case e ~ op ~ t => BinOpExpr(op, e, t)}
|   | term
|   )
|     
|   lazy val term: PackratParser[Expr] =
|   ( term ~ mulop ~ factor ^^ {case t ~ op ~ f => BinOpExpr(op, t, f)}
|   | factor
|   )
|     
|   lazy val factor: PackratParser[Expr] =
|   ( "-" ~ factor     ^^ {case op ~ f => UnOpExpr(op, f)}
|   | "(" ~ expr ~ ")" ^^ {case _ ~ e ~ _ => e}
|   | ident            ^^ {case id => IdExpr(id)}
|   | wholeNumber      ^^ {case numLit => NumExpr(numLit.toInt)}
|   )
|     
|   lazy val addop: PackratParser[String] = "+" | "-"
|   
|   lazy val mulop: PackratParser[String] = "*" | "/"
|   
|   def apply(in: String) = parseAll(expr, in)
| }
defined module ArithParser

> ArithParser("3+4*5")
res7: ArithParser.ParseResult[Expr] = [1.6] parsed: BinOpExpr(+,NumExpr(3),BinOpExpr(*,NumExpr(4),NumExpr(5)))

> ArithParser("10-4-3")
res8: ArithParser.ParseResult[Expr] = [1.7] parsed: BinOpExpr(-,BinOpExpr(-,NumExpr(10),NumExpr(4)),NumExpr(3))

> ArithParser("-1/(x+y)")
res9: ArithParser.ParseResult[Expr] = [1.9] parsed: BinOpExpr(/,UnOpExpr(-,NumExpr(1)),BinOpExpr(+,IdExpr(x),IdExpr(y)))

Except for the algebraic data type definitions at the top, which define our abstract syntax, and the ^^ clauses on some of the rules describing how to construct the AST, this almost exactly matches the grammar rules given in the text. The only difference is in the ordering (and the explicit use of ~ for sequencing), because Scala's combinator parsers are sensitive to the order of the rules -- in general, you should put longer rules before shorter ones, or else a shorter one might succeed too early and prevent a longer one from matching.

The PackratParsers trait mixed in here, and the PackratParser return type, indicate that we are using a variety of parser which can handle the kind of recursion present in these grammar rules.

Exercise: Add floating point numbers as before, as well as a square root function. This should involve adding two rules to the definition of a factor. The AST for the expression sqrt(2.0) should be UnOpExpr("sqrt", NumExpr(2.0)) (change the parameter of a NumExpr to a Double). Then, write an evaluation function which takes an Expr and returns a Double; for now, you may throw an error if you encounter an IdExpr.