BED-Con talk The Parser Scala combinators Simple calculator example

Simple calculator example

Very first step: Simple addition/substraction

 1 class Calculator1 extends Parsers {
 2   type Elem = Char
 3 
 4   def expr: Parser[Int] = addition | subtraction | number
 5 
 6   def addition: Parser[Int] = number ~ '+' ~ number ^^ { case left ~ _ ~ right => left + right }
 7 
 8   def subtraction: Parser[Int] = number ~ '-' ~ number ^^ { case left ~ _ ~ right => left - right }
 9 
10   def number: Parser[Int] = digit.+ ^^ { digits => digits.mkString("").toInt }
11 
12   def digit: Parser[Char] = elem("digit", ch => ch.isDigit)
13 
14   def parse(str: String):Int = expr(new CharSequenceReader(str)) match {
15     case Success(result, remain) if remain.atEnd => result
16     case Success(_, remain) => throw new RuntimeException(s"Unparsed input at ${remain.pos}")
17     case NoSuccess(msg, remain) => throw new RuntimeException(s"Parse error $msg at ${remain.pos}")
18   }
19 }

This is already working for some basic stuff, but breaks on multiple operants:

"42"        ---> 42
"42+54"     ---> 96
"42-54"     ---> -12
"42-54+12"  ---> "Unparsed input" exception

In depth description of the combinations used in this example:

  • elem(kind: String, condition: Elem => Boolean) is a helper to create a simple parser that consumes a single element Elem and is successful if the element matches a given condition. In the example it is used to create a parser that consumes a digit character.
  • The + postfix is a shorthand for rep1[T](p: => Parser[T]): Parser[List[T]] with creates a new parser by repeating a given parser as long as it is successful, whereas the new parser is only successful if the repeated parser is only successful at least once. In the example this is used to combine the digit parser to a list of digits parser.
  • The ^^ operator is a shorthand of map[U](f: T => U): Parser[U] which creates a new parser be converting the result of a given parser. In the example this is used to convert the lost of digits parsers to the number parser and to perform the actual calculations.
  • The ~ operator creates a new parser by combining two given parsers, the new parser is only successful if both parsers are successful in succession. I.e. if a and b are parsers, a ~ b is successful if first a successfully consumes some input and then b consumes some of the remaining input. This is the building block to define syntactical rules.
  • The | operator is a shorthand for append[U >: T](p: => Parser[U]): Parser[U] which also creates a new parser by combining two given parsers, whereas the new parser succeeds if one of the given parsers is successful on the same input. I.e. if a and b are parsers, a | b is successful if a successfully consumes some input or if a fails and b successfully consumes some input.
  • Finally it should be mentioned that the Parsers trait contains an implicit conversion, so that a single element (or character in the example) to a parser consuming exactly this element. I.e. '+' is implicitly converted to a parser consuming the + sign.

Pitfall 1: Longest match selection

On the very first try one might accidentally ein write the “expr” combinator in a different order:

1 ...
2   def expr: Parser[Int] = number | addition | subtraction
3 ...

If written this way the parser behaves entirely different:

"42"        ---> 42
"42+54"     ---> "Unparsed input" exception
"42-54"     ---> "Unparsed input" exception
"42-54+12"  ---> "Unparsed input" exception
If parsers are combined by the ‘ ’ operator, the first parser in the list with a successful result wins. Since all expressions start with a number neither “addition” nor “subtraction” is taken into consideration.
As an alternative the combinator framework offers the ‘     ’ operator, in which case all the parser the successfully consumed the most input wins (longest match combination), so the original behavior can be also achieved by:
1 ...
2   def expr: Parser[Int] = number ||| addition ||| subtraction
3 ...
The problem with this is, that each time all combined parsers have to be evaluated, which might have a huge impact on performance for more complex grammars. As a rule of thumb it is usually better to use the ‘ ’ operator and sort the parsers accordingly.

Pitfall 2: Recursion is not your friend

To support longer expression on is inclined to “fix” the addition and subtraction rule like this:

1 ...
2   def expr: Parser[Int] = addition | subtraction | number
3 
4   def addition: Parser[Int] = expr ~ '+' ~ expr ^^ { case left ~ _ ~ right => left + right }
5 
6   def subtraction : Parser[Int] = expr ~ '-' ~ expr ^^ { case left ~ _ ~ right => left - right }
7 ...

When doing so the parser will break with a stack-overflow. Actually this is quite obvious, considering how the combinator framework operates, but particularly nasty if one uses a classic yacc/bison grammar as starting point (which support recursive rule definitions like this).

The stack-overflow can be avoided, when “fixing” the rules like this

1 ...
2   def expr: Parser[Int] = addition | subtraction | number
3 
4   def addition: Parser[Int] = number ~ '+' ~ expr ^^ { case left ~ _ ~ right => left + right }
5 
6   def subtraction: Parser[Int] = number ~ '-' ~ expr ^^ { case left ~ _ ~ right => left - right }
7 ...

but does not lead to the desired result:

"42"        ---> 42
"42+54"     ---> 96
"42-54"     ---> -12
"42-54+12"  ---> -24  i.e. the expression is actually evaluated as 42 - (54 + 12)

Also, if the input is long enough (i.e. contains many ‘+’ and ‘-‘ operations), the parser might break with a stack-overflow again, which makes it unusable in a production environment.

Support multiple operations and operator precedence

The combinator framework offers various tools to handle the concatenation off operators efficiently, though somewhat different from the way in classical grammars. Instead of writing a set of recursive rules it is possible to express a chain of parsers separated by delimiters/operator and define how the results should be combined for each delimiter/operator.

 1 class Calculator2 extends Parsers {
 2   type Elem = Char
 3 
 4   def expr = addSub
 5 
 6   def addSub: Parser[Int] = mulDiv * (
 7       '+' ^^^ { (left: Int, right: Int) => left + right} 
 8     | '-' ^^^ { (left: Int, right: Int) => left - right} )
 9 
10   def mulDiv = number * (
11       '*' ^^^ { (left: Int, right: Int) => left * right } 
12     | '/' ^^^ { (left: Int, right: Int) => left / right } )
13 
14   def number = digit.+ ^^ { digits => digits.mkString("").toInt }
15 
16   def digit = elem("digit", _.isDigit)
17 }

Which already honor the precedence of ‘*’ and ‘/’ over ‘+’ and ‘-‘, i.e. it already works like a real calculator:

"42"              ---> 42
"42+54"           ---> 96
"42-54"           ---> -12
"42-54+12"        ---> 0
"42-3*3*3*2+24/2" ---> 0

Pitfall 3: Pollution of the syntax definition

Even though the previous example already works quite well, it has no support for whitespace yet. I.e. “42” works perfectly well, while “ 42 “ already breaks. To fix this one might be inclined to simple extend the number parser like this:

1 ...
2   def number = whitespace.* ~> digit.+ <~ whitespace.* ^^ { digits => digits.mkString("").toInt }
3 
4   def digit = elem("digit", _.isDigit)
5 
6   def whitespace = elem("whitespace", ch => ch == ' ' || ch == '\t')
7 ...
  • The * postfix is a shorthand for rep[T](p: => Parser[T]): Parser[List[T]] which - like the + postfix - also repeats a given parser as, though it is also successful if there are no matches at all.
  • The ~> and <~ are variants of the ~ operator if the result of the left resp. right parser is of no interest and should be dropped.

Event though this fixes the whitespace problem and might work for a simple example like this, one can imagine that a grammar will become quickly polluted if whitespace rules (and other lexical patterns) are interwoven at all places where there might occur.

Use the classic lexer/parser pattern

To prevent a pollution of the grammar it is usually a good idea to split the parser into a lexical analyzer and a syntactic parser, just like one would do with lex/flex and yacc/bison in the classic world. To support this, the combinator framework already contains a lexical analyzer that might be used out of the box in many cases:

 1 class Calculator3 extends StdTokenParsers {
 2   override type Tokens = StdLexical
 3 
 4   override val lexical = new StdLexical
 5 
 6   lexical.delimiters ++= List("(", ")", "+", "-", "*", "/")
 7 
 8   def expr: Parser[Int] = addSub
 9 
10   def addSub: Parser[Int] = mulDiv * (
11       '+' ^^^ { (left: Int, right: Int) => left + right} 
12     | '-' ^^^ { (left: Int, right: Int) => left - right} )
13 
14   def mulDiv = number * (
15       '*' ^^^ { (left: Int, right: Int) => left * right } 
16     | '/' ^^^ { (left: Int, right: Int) => left / right } )
17 
18   def term: Parser[Int] = "(" ~> expr <~ ")" | numericLit ^^ (_.toInt)
19 
20   def parse(str: String) = expr(new lexical.Scanner(str)) match {
21     case Success(result, remain) if remain.atEnd => result
22     case Success(_, remain) => throw new RuntimeException(s"Unparsed input at ${remain.pos}")
23     case NoSuccess(msg, remain) => throw new RuntimeException(s"Parse error $msg at ${remain.pos}")
24   }
25 }

As one can see the grammar is almost unchanged while there is now support for whitespaces and comments.

"42 - 3*3*3*2 + 24/2"                      ---> 0
"2 * (3 /* blah blah */ +4) / 2 + 7 * 5"   ---> 42

From this point on it becomes somewhat straight forward to extends this simple example by several other operators, support for functions or even to a full fledged programming language in itself.

Final step: Packrat parsing

Once the grammar becomes more complicate or if there a many | or even ||| rules, the overall performance of the parser might become an issue. In most cases this can be resolved by converting the parser to a packrat parser. Packrat parsing is a comparably new parsing algorithm that addresses the backtrack or look-ahead problem of traditional parsers. A packrat parser offers linear time parsing with a trade-off in the memory footprint, which should not be an issue in most cases though.

The algorithm itself is explained in detain in the original paper on 2002. Scala combinators already contain an out of the box implementation and an easy paradigm to convert most parsers to a packrat parser:

  • The parser should add the PackratParsers trait
  • All rules defined as def should become lazy val
  • Instead of returning Parser[T] the rules should return PackratParser[T]
  • The input has be wrapped by a PackratReader[Elem]
 1 class Calculator4 extends StdTokenParsers with PackratParsers {
 2   override type Tokens = StdLexical
 3 
 4   override val lexical = new StdLexical
 5 
 6   lexical.delimiters ++= List("(", ")", "+", "-", "*", "/")
 7 
 8   lazy val expr: PackratParser[Int] = addSub
 9 
10   lazy val addSub: PackratParser[Int] = mulDiv * (
11       "+" ^^^ { (left: Int, right: Int) => left + right } 
12     | "-" ^^^ { (left: Int, right: Int) => left - right } )
13 
14   lazy val mulDiv: PackratParser[Int] = term * (
15       "*" ^^^ { (left: Int, right: Int) => left * right } 
16     | "/" ^^^ { (left: Int, right: Int) => left / right } )
17 
18   lazy val term: PackratParser[Int] = "(" ~> expr <~ ")" | numericLit ^^ (_.toInt)
19 
20   def parse(str: String) = expr(new PackratReader(new lexical.Scanner(str))) match {
21     case Success(result, remain) if remain.atEnd => result
22     case Success(_, remain) => throw new RuntimeException(s"Unparsed input at ${remain.pos}")
23     case NoSuccess(msg, remain) => throw new RuntimeException(s"Parse error $msg at ${remain.pos}")
24   }
25 }

Of course this is just for demonstration purposes only. This particular grammar is far to simple to have any kind of performance issues and packrat parsing is even counter productive.