Scala parser combinators are a powerful concept which allow you to write parsers with the
standard Scala distribution. They’re implemented as an internal
DSL which makes them feel naturally in the language. As mentioned
before I’m currently writing some sort of a state machine compiler. As I plan to write some documentation for that software (which actually is kind of rare

) I need some way to describe the syntax. Besides
EBNF,
syntax diagrams are a neat way to do that. The external DSL which is used to describe the state machines is directly written using the parser combinator, so I had figure out a way to generate syntax diagrams as well as EBNF out of them.
My first approach was to pattern match the resulting Parser structure. But unfortunately the combinators themselves are mostly modeled as a function, thus do not appear in the resulting object graph. So I decided to write a Scala compiler plugin and run thru the AST. Basically all this software does is to transform trees from the AST to some intermediate expression structure to a graphing tree one. This approach suffers some limitations and I made some assumptions as well:
- The name of the class which contains the production rules must equal the filename without “.scala”
- Production rules must not be named: “stringWrapper”, “literal”, “regex”, “Predef”, “r” or “accept”
- def production rules must not have parameters, otherwise they’re not considered to be production rules
- production rules defined via val will appear twice (this is most likely a bug)
All in all this implementation works in my case, but might not work for others. This more of a quick hack than carefully engineered software, but it proofs the concept and gets the job done.
With the ability to draw syntax diagrams and having the expression tree structure at hand (which can be seen as an AST for EBNF) writing an EBNF parser (which is semi
ISO/IEC 14977 : 1996(E) compatible) and feeding the diagram generator with the result was a snap.
So let’s consider the following example (which actually is the EBNF parser I wrote):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
| def start : Parser [List [ProductionRule ]] = syntax _rule *
def syntax _rule = meta _identifier ~ '=' ~ definitions _list ~ ';' ^^ {
case name ~ _ ~ expr ~ _ => ProductionRule (name, expr )
}
def meta _identifier = """[a-zA-Z0-9]+"""r
def definitions _list :Parser [Expression ] = (single _definition ~ (('|' ~ single _definition )*)) ^^ {
case t ~ o => OrExpression (t :: o. map(_ match { case _ ~ x => x }))
}
def single _definition = (primary ~ (((","?) ~ primary )*)) ^^ {
case t ~ o => SeqExpression (t :: o. map(_ match { case _ ~ x => x }))
}
def primary : Parser [Expression ] = optional _sequence | repeated _sequence | grouped _sequence | terminal _string | meta _identifier ^^ {
case s => RuleRefExpression (s )
}
def optional _sequence = "[" ~ definitions _list ~ "]" ^^ {
case _ ~ os ~ _ => OptionExpression (os )
}
def repeated _sequence = "{" ~ definitions _list ~ "}" ^^ {
case _ ~ rs ~ _ => ZeroToManyExpression (rs )
}
def grouped _sequence = "(" ~ definitions _list ~ ")" ^^ {
case _ ~ gs ~ _ => gs
}
def terminal _string = (""""[^ "\\\r\n]*(?:\\.[^"\\\r\n ]*)*" """r ) ^^ {
case s => LiteralExpression (s )
} |
The corresponding EBNF expression generated by this software is:
1 2 3 4 5 6 7 8 9 10
| start = { syntax_rule };
syntax_rule = ( '[a-zA-Z0-9]+' '=' ( single_definition { ( '|' single_definition ) } ) ';' );
meta_identifier = '[a-zA-Z0-9]+';
definitions_list = ( ( primary { ( [ ',' ] primary ) } ) { ( '|' single_definition ) } );
single_definition = ( ( optional_sequence | repeated_sequence | grouped_sequence | terminal_string | meta_identifier ) { ( [ ',' ] primary ) } );
primary = ( ( '[' definitions_list ']' ) | ( '{' definitions_list '}' ) | ( '(' definitions_list ')' ) | '"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" ' | '[a-zA-Z0-9]+' );
optional_sequence = ( '[' ( single_definition { ( '|' single_definition ) } ) ']' );
repeated_sequence = ( '{' ( single_definition { ( '|' single_definition ) } ) '}' );
grouped_sequence = ( '(' ( single_definition { ( '|' single_definition ) } ) ')' );
terminal_string = '"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" '; |
And of course the syntax diagrams we wanted:
So far the syntax diagram generator understands two options:
- -P:virtual:rule01,...,ruleN: declares rule01,…,ruleN to be virtual rules which causes them to be resolved wherever they’re referenced. Be aware that recursive virtual rules are only resolved to one level
- -P:depth:N: resolve rule references until a depth of N (recursive rules are only resolved for one level)