BED-Con talk The Parser Comparison: flex/bison vs Scala combinators

Comparison: flex/bison vs Scala combinators

Lexer side-by-side

Actually a side-by-side comparison of flex and a lexer based one scala combinators is next to impossible. Let’s take a look at some of PHP’s flex code and its corresponding part in JBJ:

...
<ST_IN_SCRIPTING>"new" {
    return T_NEW;
}

<ST_IN_SCRIPTING>"clone" {
    return T_CLONE;
}

<ST_IN_SCRIPTING>"var" {
    return T_VAR;
}

<ST_IN_SCRIPTING>"("{TABS_AND_SPACES}("int"|
  "integer"){TABS_AND_SPACES}")" {
    return T_INT_CAST;
}

<ST_IN_SCRIPTING>"("{TABS_AND_SPACES}("real"|
  "double"|"float"){TABS_AND_SPACES}")" {
    return T_DOUBLE_CAST;
}

<ST_IN_SCRIPTING>"("{TABS_AND_SPACES}("string"|
  "binary"){TABS_AND_SPACES}")" {
    return T_STRING_CAST;
}

<ST_IN_SCRIPTING>"("{TABS_AND_SPACES}"array"
  {TABS_AND_SPACES}")" {
    return T_ARRAY_CAST;
}
...
 1 case class ScriptLexer(mode: ScriptingLexerMode) 
 2                 extends Lexer {
 3 ...
 4 val reserved = Set( ... "var", "new", "clone", ... )
 5 
 6 def processIdent(name: String) =
 7     if (reserved contains name.toLowerCase) 
 8       Keyword(name.toLowerCase) 
 9     else 
10       Identifier(name)
11 ...
12 '(' ~> tabsOrSpaces ~> (str("int") | str("integer")) 
13     <~ tabsOrSpaces <~ ')' ^^ IntegerCast
14 | '(' ~> tabsOrSpaces ~> (str("real") | str("double")
15  | str("float")) <~ tabsOrSpaces <~ ')' ^^ DoubleCast
16 | '(' ~> tabsOrSpaces ~> (str("string") | 
17   str("binary")) <~ tabsOrSpaces <~ ')' ^^ StringCast
18 | '(' ~> tabsOrSpaces ~> str("array") 
19     <~ tabsOrSpaces <~ ')' ^^ ArrayCast 
20 | '(' ~> tabsOrSpaces ~> str("bool") <~ 
21   opt(str("ean")) <~ tabsOrSpaces <~ ')' ^^ 
22     BooleanCast 
23 | '(' ~> tabsOrSpaces ~> str("unset") 
24     <~ tabsOrSpaces <~ ')' ^^ UnsetCast
25 ...
26 }

Overall there are some similarities, but the syntax is vastly different so that a one-by-one translation of the rules is in general on a feasible approach.

Parser side-by-side

The transformation of yacc rules to corresponding scala combinators is more straight-forward, as long as the the underlying lexical analyzers have a similar set of tokens. Let’s take a look at some off PHP’s yacc/bison code and the corresponding parts in JBJ:

...
interface_extends_list:
        /* empty */
    |   T_EXTENDS interface_list
;

implements_list:
        /* empty */
    |   T_IMPLEMENTS interface_list
;

interface_list:
        fully_qualified_class_name {...}
    |   interface_list ',' fully_qualified_class_name {...}
;

foreach_optional_arg:
        /* empty */                     {...}
    |   T_DOUBLE_ARROW foreach_variable {...}
;

foreach_variable:
        variable            {...}
    |   '&' variable        {...}
    |   T_LIST '(' {...} assignment_list ')' {...}
;

for_statement:
        statement
    |   ':' inner_statement_list T_ENDFOR ';'
;


foreach_statement:
        statement
    |   ':' inner_statement_list T_ENDFOREACH ';'
;
...
 1 class JbjParser extends Parsers with PackratParsers {
 2 ...
 3 lazy val interfaceExtendsList = 
 4   opt("extends" ~> interfaceList) ^^ { ... }
 5 
 6 lazy val implementsList = 
 7   opt("implements" ~> interfaceList) ^^ { ... }
 8 
 9 lazy val interfaceList = 
10   rep1sep(fullyQualifiedClassName, ",")
11 
12 lazy val foreachOptionalArg = 
13   opt("=>" ~> foreachVariable)
14 
15 lazy val foreachVariable =
16   variable ^^ { ... } | 
17   "&" ~> variable ^^ { ... } | 
18   "list" ~> "(" ~> assignmentList <~ ")" ^^ { ... }
19 
20 lazy val forStatement =
21   ":" ~> innerStatementList <~ "endfor" <~ ";" | 
22   statement ^^ (List(_)) | 
23   ";" ^^^ Nil
24 
25 lazy val foreachStatement =
26   ":" ~> innerStatementList <~ "endforeach" <~ ";" | 
27   statement ^^ (List(_)) | 
28   ";" ^^^ Nil
29 ...

Most of the differences are based on the fact that many of the yacc rules are expressed recursively, which should be converted to a more explicit form when using scala combinators.

It should be pointed out though, that some parts of the parser - especially those that handle operator precedence - are vastly different from the original bison code. In bison operator precedence can be configured/defined at a global level, white scala combinators require a more explicit implementation.

Somewhat unfair statistics

The technologies are vastly different, nevertheless one can do line counting:

Lexical analyzer

  PHP flex JBJ scala combinators
Files 1 10
Lines 2442 493

Syntactic parser

  PHP bison JBJ scala combinators
Files 1 1
Lines 1283 714