In addition to providing access to virtually any library of code written in Java, the standard Scala distribution also supplies a number of powerful libraries of its own. In this lab, we will get some preliminary experience using one of these, which we will be using several more times this semester: the combinator parsing library. The details of how the parsing combinators are implemented are somewhat involved, but their use is quite straightforward.
The basic idea is that the combinators can take very simple parsers, that can only recognize individual tokens such as identifiers, numbers, or punctuation, and build more complex parsers that can recognize sequences of tokens. These compound parsers, as Scala expressions, can be named and used to form ever more sophiticated parsers that bear a strong resemblance to BNF grammars and regular expressions.
An S-Expression (sexpr), which stands for "Symbolic Expression", is either an "atom" (a single token such as an identifier or numeric literal) or a list of sexprs in parentheses. One of the strengths of the LISP family of languages is that this very simple formation rule is used to construct both the program code and the data on which it operates -- this enabled even the earliest versions of LISP, in the late 1950's, to implement their own interpreter in a small amount of code.
Here is the sexpr formation rule expressed using parser combinators:
> import scala.util.parsing.combinator._ import scala.util.parsing.combinator._ > object SExprParser extends JavaTokenParsers { | lazy val sexpr: Parser[Any] = atom | list | | lazy val atom: Parser[Any] = ident | wholeNumber | | lazy val list: Parser[Any] = "(" ~ rep(sexpr) ~ ")" | | def apply(in: String) = parseAll(sexpr, in) | } defined module SExprParser
The javaTokenParsers
class provides the basic ident
and wholeNumber
parsers; other useful parsers in the class are floatingPointNumber
and
stringLiteral
for matching Java-style numbers and quote-delimited strings.
In each case, the parser's result is the matched string of characters -- if
you want a number, you need to convert it yourself with toInt
or
toDouble
.
The ~
combinator is simply sequencing, or concatenation; the parser
p ~ q ~ r
matches a p
followed by a q
followed by an r
. The result
will be of the form a ~ b ~ c
, where a
, b
, and c
are the results
of the individual parsers p
, q
, and r
. The |
combinator gives "alternation", or choice -- p | q
first tries to match
p
; if this fails, then it will try to match q
instead, starting at
the same place in the input. Finally, the parser rep(p)
matches 0 or
more copies of p
in sequence; its result is a list of all the matches.
There is a variant, rep1(p)
, which matches 1 or more; it is similar to
p ~ rep(p)
, except all of the matches are placed in a single list.
This parser may be used as follows:
> SExprParser("(this is (a test))") // success res0: SExprParser.ParseResult[Any] = [1.19] parsed: (((~List(this, is, (((~List(a, test))~))))~)) > SExprParser("(no closing parenthesis") // failure res1: SExprParser.ParseResult[Any] = [1.24] failure: `)' expected but `' found (no closing parenthesis ^
Exercise: Try more examples with this parser, and see if you can figure out the information in the result.
Note that when the parse succeeds, the result contains all of the
pieces of the input in parsed form. This is usually more than we want --
rather than a full "parse tree", we generally want more of an "abstract
syntax tree", preserving only the essential structure and data. By using
the ^^
operator, we may grab and modify the return value from each
grammar production:
> sealed trait SExpr defined trait SExpr > case class IdentSExpr(id: String) extends SExpr defined class IdentSExpr > case class IntLiteralSExpr(lit: String) extends SExpr defined class IntLiteralSExpr > case class ListSExpr(items: List[SExpr]) extends SExpr defined class ListSExpr > object SExprParser2 extends JavaTokenParsers { | lazy val sexpr: Parser[SExpr] = atom | list | | lazy val atom: Parser[SExpr] = | ( ident ^^ {case id => IdentSExpr(id)} | | wholeNumber ^^ {case lit => IntLiteralSExpr(lit)} | ) | | lazy val list: Parser[SExpr] = | "(" ~ rep(sexpr) ~ ")" ^^ {case (_ ~ items ~ _) => ListSExpr(items)} | | def apply(in: String) = parseAll(sexpr, in) | } defined module SExprParser2
Now we are returning a value in the algebraic data type SExpr
on success:
> SExprParser2("(this is (a test))") // success res2: SExprParser2.ParseResult[SExpr] = [1.19] parsed: ListSExpr(List(IdentSExpr(this), IdentSExpr(is), ListSExpr(List(IdentSExpr(a), IdentSExpr(test))))) > SExprParser2("(no closing parenthesis") // failure res3: SExprParser2.ParseResult[SExpr] = [1.24] failure: `)' expected but `' found (no closing parenthesis ^
Finally, we may extract the result with get
:
> SExprParser2("(this is (a test))").get res4: SExpr = ListSExpr(List(IdentSExpr(this), IdentSExpr(is), ListSExpr(List(IdentSExpr(a), IdentSExpr(test))))) > SExprParser2("(plus 17 (times 5 5))").get res5: SExpr = ListSExpr(List(IdentSExpr(plus), IntLiteralSExpr(17), ListSExpr(List(IdentSExpr(times), IntLiteralSExpr(5), IntLiteralSExpr(5)))))
Once we can parse a sexpr, here is an example of a simple evaluation function:
> def eval(exp: SExpr): Int = exp match { | case IntLiteralSExpr(lit) => lit.toInt | case ListSExpr(IdentSExpr("plus") :: exps) => | exps.map(eval).sum | case ListSExpr(IdentSExpr("times") :: exps) => | exps.map(eval).product | case _ => error("Unrecognized expression") | } eval: (exp: SExpr)Int > eval(SExprParser2("(plus 17 (times 5 5))").get) res6: Int = 42
Exercise: Extend the parser so that it will also allow floating point
literals, then extend the eval
function so that it will compute a
Double
result.
Here is a slightly larger example, based on the expression grammar given in Example 2.8 of Scott, "Programming Language Pragmatics" (3rd ed.):
> sealed trait Expr defined trait Expr > case class BinOpExpr(op: String, left: Expr, right: Expr) extends Expr defined class BinOpExpr > case class UnOpExpr(op: String, expr: Expr) extends Expr defined class UnOpExpr > case class IdExpr(id: String) extends Expr defined class IdExpr > case class NumExpr(num: Int) extends Expr defined class NumExpr > object ArithParser extends JavaTokenParsers with PackratParsers { | lazy val expr: PackratParser[Expr] = | ( expr ~ addop ~ term ^^ {case e ~ op ~ t => BinOpExpr(op, e, t)} | | term | ) | | lazy val term: PackratParser[Expr] = | ( term ~ mulop ~ factor ^^ {case t ~ op ~ f => BinOpExpr(op, t, f)} | | factor | ) | | lazy val factor: PackratParser[Expr] = | ( "-" ~ factor ^^ {case op ~ f => UnOpExpr(op, f)} | | "(" ~ expr ~ ")" ^^ {case _ ~ e ~ _ => e} | | ident ^^ {case id => IdExpr(id)} | | wholeNumber ^^ {case numLit => NumExpr(numLit.toInt)} | ) | | lazy val addop: PackratParser[String] = "+" | "-" | | lazy val mulop: PackratParser[String] = "*" | "/" | | def apply(in: String) = parseAll(expr, in) | } defined module ArithParser > ArithParser("3+4*5") res7: ArithParser.ParseResult[Expr] = [1.6] parsed: BinOpExpr(+,NumExpr(3),BinOpExpr(*,NumExpr(4),NumExpr(5))) > ArithParser("10-4-3") res8: ArithParser.ParseResult[Expr] = [1.7] parsed: BinOpExpr(-,BinOpExpr(-,NumExpr(10),NumExpr(4)),NumExpr(3)) > ArithParser("-1/(x+y)") res9: ArithParser.ParseResult[Expr] = [1.9] parsed: BinOpExpr(/,UnOpExpr(-,NumExpr(1)),BinOpExpr(+,IdExpr(x),IdExpr(y)))
Except for the algebraic data type definitions at the top, which define our
abstract syntax, and the ^^
clauses on some of the rules describing how to
construct the AST, this almost exactly matches the grammar rules given in the
text. The only difference is in the ordering (and the explicit use of ~
for
sequencing), because Scala's combinator parsers are sensitive to the order of
the rules -- in general, you should put longer rules before shorter ones, or
else a shorter one might succeed too early and prevent a longer one from
matching.
The PackratParsers
trait mixed in here, and the PackratParser
return type,
indicate that we are using a variety of parser which can handle the kind of
recursion present in these grammar rules.
Exercise: Add floating point numbers as before, as well as a square root
function. This should involve adding two rules to the definition of a factor
.
The AST for the expression sqrt(2.0)
should be UnOpExpr("sqrt", NumExpr(2.0))
(change the parameter of a NumExpr
to a Double
). Then, write an evaluation
function which takes an Expr
and returns a Double
; for now, you may
throw an error if you encounter an IdExpr
.