BED-Con talk Abstract syntax tree

Abstract syntax tree

What’s an abstract syntax tree

Any language parser has to generate some kind of output that is uaefull for further processing. In case if PHP the output is a series of opcodes for the Zend-Engine, which is good for performance interpretation, but not very helpful for structured code-conversion. An abstract syntax tree (AST) is a more generic output of a parser. In theory any kind of expression or statement can be written down in form of a tree of singular operations.

Consider the following C- or Java-like example:

Abstract syntax tree

The nodes of this AST break down like this:

  • The top-most ExprList serves as a simple container for an arbitrary number of expressions to be evaluated in sequence. In the the example the expressions are separated by ;, though is of no relevance for the AST.
  • Assign is a parameterized node in the AST: The result of the child expression should be assigned/stored to a variable defined by the parameter.
  • Add, Sub, Mul and Div are the AST representation of all binary operators: Addition +, subtraction -, multiplication * and division /. All of these nodes require to have exactly two child nodes (left and right).
  • Literal is a parameterized leaf (i.e. it must not have child node) in the AST representing a literal number.
  • VarGet is another parameterized leaf, to retrieve the current value of a variable defined by the parameter.

How to create an AST with combinators

Returning to the previous calculator example, one might define a simple AST structure like this

 1 trait Node { }
 2 
 3 trait Expr extends Node { }
 4 
 5 case class ExprListExpr(exprs: List[Expr]) extends Expr
 6 
 7 case class LiteralExpr(value: Int) extends Expr
 8 
 9 case class AddExpr(left: Expr, right: Expr) extends Expr
10 
11 case class SubExpr(left: Expr, right: Expr) extends Expr 
12 
13 case class MulExpr(left: Expr, right: Expr) extends Expr
14 
15 case class DivExpr(left: Expr, right: Expr) extends Expr 
16 
17 case class VarGetExpr(name: String) extends Expr
18 
19 case class AssignExpr(name: String, expr: Expr) extends Expr 

The parser for this now becomes a beauty of simplicity - at least once one can read the combinator syntax fluently.

 1 object ASTParser extends StdTokenParsers {
 2   override type Tokens = StdLexical
 3 
 4   override val lexical = new StdLexical
 5 
 6   lexical.delimiters ++= List("(", ")", "+", "-", "*", "/", "=", ";")
 7 
 8   def exprs: Parser[Expr] = repsep(expr, ";") ^^ ExprListExpr
 9 
10   def expr: Parser[Expr] = assign | addSub
11 
12   def assign: Parser[Expr] = ident ~ "=" ~ addSub ^^ { case name ~ _ ~ valueExpr => AssignExpr(name, valueExpr) }
13 
14   def addSub: Parser[Expr] = mulDiv * ("+" ^^^ AddExpr | "-" ^^^ SubExpr)
15 
16   def mulDiv: Parser[Expr] = term * ("*" ^^^ MulExpr | "/" ^^^ DivExpr)
17 
18   def term: Parser[Expr] = 
19     "(" ~> expr <~ ")" | "-" ~> expr ^^ NegExpr | ident ^^ VarGetExpr | numericLit ^^ (str => LiteralExpr(str.toInt))
20 ...
21 }

It should be pointed out, that tools like AntLR are able to generate all this - including the AST itself - from an even simpler grammar file. The drawback is that generated code is usually leas flexible.

How abstract syntax trees facilitate interpretation

Once one has an AST, interpretation becomes straight forward. First one needs some form of execution context, in this case just a place where to store and retrieve arbitrary variables:

1 class CalculatorContext {
2   private val variables = mutable.Map.empty[String, Int]
3 
4   def getVariable(name: String): Option[Int] = variables.get(name)
5 
6   def setVariable(name: String, value: Int) = variables.put(name, value)
7 }

Next one just has to add an evaluation method to the Expr nodes to produce a result for a given context

1 trait Expr extends Node {
2   def eval(implicit context: CalculatorContext): Int
3 }

… and implement it for there concrete nodes, which might be a one-liner in most cases

 1 case class LiteralExpr(value: Int) extends Expr {
 2   override def eval(implicit context: CalculatorContext) = value
 3 }
 4 
 5 case class AddExpr(left: Expr, right: Expr) extends Expr {
 6   override def eval(implicit context: CalculatorContext) = left.eval + right.eval
 7 }
 8 
 9 case class SubExpr(left: Expr, right: Expr) extends Expr {
10   override def eval(implicit context: CalculatorContext) = left.eval - right.eval
11 }
12 
13 case class VarGetExpr(name: String) extends Expr {
14   override def eval(implicit context: CalculatorContext) = context.getVariable(name).getOrElse {
15     throw new RuntimeException(s"Variable $name not defined")
16   }
17 }
18 
19 case class AssignExpr(name: String, expr: Expr) extends Expr {
20   override def eval(implicit context: CalculatorContext) = {
21 
22     val value = expr.eval
23     context.setVariable(name, value)
24     value
25   }
26 }
27 ...

Form here one it become just a matter of time and dedication to extend this simple calculator example to a full fledged language interpreter.

The problems with the PHP grammar

It might seem that creating an AST for PHP is straight forward as well. While this is true for some parts, there are some quirks that need to be addressed

  • First of all the example above works so nicely, because the structure of the parser and the AST are a perfect match for each other. While it is certainly possible to generate a generic AST from the original bison grammar of PHP, this would not be very “nice” for interpretation or conversion. There are several examples of parser rules that have to be combined to a single AST node, while some rules should produce several nodes.
  • Some runtime aspects of PHP do not fit well to the classic interpreter approach demonstrated in this example, e.g.
 1 <?php
 2 $a = 10;
 3 echo "First: $a\n";
 4 
 5 static $a = 20;
 6 echo "Second: $a\n";
 7 
 8 if ( $a < 35 ) {
 9     static $a = 30;
10     echo "Third: $a\n";
11 } else {
12     static $a = 40; 
13     echo "Fourth: $a\n";
14 }
15 ?>

will produce

First: 10
Second: 40
Fourth: 40

I.e. static assignments work somewhat differently than “normal” assignments.