
Making the world a safer place: self-verifying SMC

We all use code generators, whether we love them or hate them (or both). But for most compilers you simply have to trust that they work correctly and do what they’re supposed to do; they typically do not provide any sort of proof of their correctness. Well, the SMC is different in that regard: it can generate verification code for its C backend.
The idea of self-verifying compilers is not completely new. Several methods have been proposed for this task, proof-carrying code and the credible compiler to name a few. What I’m doing here is a mixture of both (with a bit more of the credible compiler; the SMC is no optimizing compiler, so there is only a single transformation involved). Basically, once we have generated C code out of the FSM description, we have two (possibly different) FSMs: the one we described and the one the SMC generated. The intuitive notion of those two FSMs being equal is that their languages (denoted by \(\mathcal{L}(F)\)) are equal. But how do we show that?
For our purposes it’s enough to show that the two transition functions are equal, as we only care about the generated FSM behaving exactly as the described one does. That still leaves us with the problem of checking the generated FSM unobtrusively.
Let us define \(p=(t_1, \dots, t_n)\) with \(t_i = ((s, \sigma), (s', A_{in}, A_{out}))\in\delta\) to be a sequence of transitions (called a path or trace). We’ll call a path loop free if \(\forall t, t' \in p: t \neq t' \Rightarrow s_t \neq s_{t'}\) holds, meaning that no two transitions within a path have the same start state.
Finally, we consider the two FSMs equal iff for every loop-free path in the original FSM there exists an equivalent path in the generated machine. Consequently, we accept the SMC to work correctly if the two FSMs (the original one and the generated one) are equal.
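To make the path enumeration concrete, here is a minimal sketch, assuming a simplified Transition type (the real SMC transitions also carry conditions and actions, as the checker.log excerpt below shows); this is not the actual SMC code:

case class Transition(from: String, input: String, to: String)

// enumerate all maximal loop-free paths starting at `state`; `departed` records
// the states we have already left, since leaving a state twice would mean two
// transitions in the path share a start state
def loopFreePaths(delta: List[Transition], state: String,
                  departed: Set[String]): List[List[Transition]] =
  if (departed.contains(state)) List(Nil)
  else delta.filter(_.from == state) match {
    case Nil  => List(Nil)
    case outs => outs.flatMap(t =>
      loopFreePaths(delta, t.to, departed + state).map(t :: _))
  }
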
Generating the traces of the original FSM is a rather trivial task, as we already have a good structure to work on (the SMC AST). That’s not the case with the generated FSM. First of all, generating the trace must be unobtrusive: we must be sure that our instrumentation does not alter the behavior of the FSM. We can accept that a statement of the form

printf("12345678");

generally does not alter the behavior, if placed appropriately. So we simply place print statements (each with a unique ID) at every transition and action. Then we generate code that causes the FSM to transition along the previously determined paths.
If everything goes right, executing that code causes a sequence of IDs to be printed. If the SMC works correctly, that sequence is identical to the one produced by the corresponding trace of the original FSM.
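The final comparison then boils down to matching ID sequences. A tiny hypothetical sketch of that step (the names are made up; the actual check is performed by the generated script described below):

// expected: one ID sequence per loop-free path, derived from the SMC AST
// printedLines: the lines emitted by the instrumented C program, one per path
// returns the indices of the paths whose traces disagree
def mismatches(expected: List[List[String]], printedLines: List[String]): List[Int] =
  expected.zip(printedLines.map(_.split(' ').toList))
          .zipWithIndex
          .filter { case ((e, p), _) => e != p }
          .map(_._2)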

To use this verification mechanism, run the SMC with the cverify backend and execute the generated _checker.sh script. That script either ends with “NO ERRORS FOUND” (which should always be the case) or with a list of traces which the generated FSM did not perform. Additionally, a checker.log file is generated which contains entries like this:

[ ] AST_Transition(_INIT,RT,List(AST_ConstEqCondition(CMD__MODE__RT)),AST_Actions(Some(List(clr)),None)) AST_Transition(RT,_ERR,List(AST_NegatedCondition(AST_PredicateCondition(command))),AST_Actions(None,None))
[!] A98601 T-93118478 T-1152986909 ;
[?] A98601 T-93118478 T-1152986909 ;

where the line marked by the exclamation mark denotes the trace of the original FSM, as opposed to the question mark, which marks the trace of the generated FSM.

Ah, and there is yet another new feature: the SMC can now generate a NuSMV model. Of course, that’s all in the repository.

Generating syntax diagrams from Scala parser combinators

Scala parser combinators are a powerful concept which allows you to write parsers with the standard Scala distribution. They’re implemented as an internal DSL, which makes them feel natural in the language. As mentioned before, I’m currently writing some sort of state machine compiler. As I plan to write some documentation for that software (which actually is kind of rare :-P), I need some way to describe the syntax. Besides EBNF, syntax diagrams are a neat way to do that. The external DSL used to describe the state machines is written directly using the parser combinators, so I had to figure out a way to generate syntax diagrams as well as EBNF out of them.
My first approach was to pattern match the resulting Parser structure. But unfortunately, the combinators themselves are mostly modeled as functions and thus do not appear in the resulting object graph. So I decided to write a Scala compiler plugin and run through the AST. Basically, all this software does is transform the compiler’s AST into an intermediate expression structure and from that into a graphing tree. This approach suffers from some limitations, and I made some assumptions as well:

  • The name of the class which contains the production rules must equal the filename without “.scala”
  • Production rules must not be named: “stringWrapper”, “literal”, “regex”, “Predef”, “r” or “accept”
  • def production rules must not have parameters, otherwise they’re not considered to be production rules
  • production rules defined via val will appear twice (this is most likely a bug)

All in all, this implementation works in my case, but might not work for others. This is more of a quick hack than carefully engineered software, but it proves the concept and gets the job done.

With the ability to draw syntax diagrams and the expression tree structure at hand (which can be seen as an AST for EBNF), writing an EBNF parser (which is semi ISO/IEC 14977:1996(E) compatible) and feeding the diagram generator with the result was a snap.
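For reference, the intermediate expression structure could look roughly like this; the constructor names are the ones actually used in the parser below, but the exact definitions are a reconstruction:

// an AST for EBNF, sitting between the compiler plugin and the diagram
// generator (the field names are assumptions)
sealed abstract class Expression
case class OrExpression(alternatives: List[Expression]) extends Expression
case class SeqExpression(elements: List[Expression]) extends Expression
case class OptionExpression(body: Expression) extends Expression
case class ZeroToManyExpression(body: Expression) extends Expression
case class RuleRefExpression(name: String) extends Expression
case class LiteralExpression(text: String) extends Expression

case class ProductionRule(name: String, body: Expression)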
So let’s consider the following example (which actually is the EBNF parser I wrote):

// a syntax is a list of production rules
def start: Parser[List[ProductionRule]] = syntax_rule*
def syntax_rule = meta_identifier ~ '=' ~ definitions_list ~ ';' ^^ {
  case name ~ _ ~ expr ~ _ => ProductionRule(name, expr)
}
def meta_identifier = """[a-zA-Z0-9]+"""r
// alternatives separated by '|'
def definitions_list:Parser[Expression] = (single_definition ~ (('|' ~ single_definition)*)) ^^ {
  case t ~ o => OrExpression(t :: o.map(_ match { case _ ~ x => x }))
}
// a sequence of primaries, optionally separated by ','
def single_definition = (primary ~ (((","?) ~ primary)*)) ^^ {
  case t ~ o => SeqExpression(t :: o.map(_ match { case _ ~ x => x }))
}
// the ^^ binds to the last alternative only: a bare identifier becomes a rule reference
def primary: Parser[Expression] = optional_sequence | repeated_sequence | grouped_sequence | terminal_string | meta_identifier ^^ {
  case s => RuleRefExpression(s)
}
def optional_sequence = "[" ~ definitions_list ~ "]" ^^ {
  case _ ~ os ~ _ => OptionExpression(os)
}
def repeated_sequence = "{" ~ definitions_list ~ "}" ^^ {
  case _ ~ rs ~ _ => ZeroToManyExpression(rs)
}
def grouped_sequence = "(" ~ definitions_list ~ ")" ^^ {
  case _ ~ gs ~ _ => gs
}
// the trailing space inside the regex keeps the triple-quoted string from
// ending in four consecutive double quotes
def terminal_string = (""""[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" """r) ^^ {
  case s => LiteralExpression(s)
}

The corresponding EBNF expression generated by this software is:

start = { syntax_rule };
syntax_rule = ( '[a-zA-Z0-9]+' '=' ( single_definition { ( '|' single_definition ) } ) ';' );
meta_identifier = '[a-zA-Z0-9]+';
definitions_list = ( ( primary { ( [ ',' ] primary ) } ) { ( '|' single_definition ) } );
single_definition = ( ( optional_sequence | repeated_sequence | grouped_sequence | terminal_string | meta_identifier ) { ( [ ',' ] primary ) } );
primary = ( ( '[' definitions_list ']' ) | ( '{' definitions_list '}' ) | ( '(' definitions_list ')' ) | '"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" ' | '[a-zA-Z0-9]+' );
optional_sequence = ( '[' ( single_definition { ( '|' single_definition ) } ) ']' );
repeated_sequence = ( '{' ( single_definition { ( '|' single_definition ) } ) '}' );
grouped_sequence = ( '(' ( single_definition { ( '|' single_definition ) } ) ')' );
terminal_string = '"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" ';

And of course the syntax diagrams we wanted:

So far the syntax diagram generator understands two options:

  • -P:virtual:rule01,...,ruleN: declares rule01,…,ruleN to be virtual rules which causes them to be resolved wherever they’re referenced. Be aware that recursive virtual rules are only resolved to one level
  • -P:depth:N: resolve rule references until a depth of N (recursive rules are only resolved for one level)
If you’re interested in the code you can grab it from this SVN repository (username: guest, password: guest). You’ll need Scala as well as Batik to run it.
SyntaxDiagramGenerator is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.

Y combinator in Scala

This is just a quick post to share this thing with the world: the Y combinator in Scala (yippee, anonymous recursion). Unfortunately, this can’t be written in its pure form, as Scala does not employ lazy evaluation (like Haskell does).

object YCombinator {

  // the placeholder (_:A) eta-expands the body to (x: A) => f(Y(f))(x), so the
  // recursive call Y(f) is deferred until the function is applied; without it,
  // strict evaluation would recurse forever
  def Y[A](f: (A => A) => (A => A)): (A => A) = f(Y(f))(_:A)

  def main(args: Array[String]) = {
    // anonymous recursion: fact never refers to itself by name
    println(Y( (fact:(Int => Int)) => (x:Int) => if(x == 0) 1 else x * fact(x - 1) )(5))
  }

}

Scala and building DFA’s from regular expressions

These days I wanted to do something with that cool new shiny functional-object-fusion language for the JVM called Scala [1]. So I decided to write a program which builds a DFA [2] that accepts each word of the language defined by a regular expression [3] (a type 3 language in the Chomsky hierarchy [4]). The algorithm for doing this is described in the paper written by J. A. Brzozowski [p1]. It all boils down to computing the so-called “characteristic derivatives” of a regular expression (side note: I was pretty surprised to find that the validation of Relax-NG can be done in a similar way, using derivatives [5]). Each of these derivatives is then represented as a state in the resulting DFA (for a more formal description of the process read [p1] or [p2]). During the construction of the DFA, an equivalence test between regular expressions (called REs from here on) is required. But this is not a trivial problem [p2].
So what I did was try to simplify/minimize each regular expression to its bare minimum, which in turn allows me to compare the REs based on their structure. This is more of a heuristic, as there is no guarantee that the simplification mechanism boils two equal REs down to the same structure, but in practice it works pretty well.
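To illustrate, here is a rough sketch of such a simplification pass, using the Expr classes from the parser shown below (the actual implementation may apply more or different rules):

// applies identities like R|R = R, R|X = R, E.R = R and X.R = X
// (E = empty word, X = empty set)
def simplify(expr: Expr): Expr = expr match {
  case OrExpr(l) =>
    l.map(simplify).filter(_ != EmptySetExpr).distinct match {
      case Nil     => EmptySetExpr
      case List(e) => e
      case alts    => OrExpr(alts)
    }
  case ConcatExpr(l) =>
    val parts = l.map(simplify).filter(_ != EmptyWordExpr)
    if (parts.contains(EmptySetExpr)) EmptySetExpr
    else parts match {
      case Nil     => EmptyWordExpr
      case List(e) => e
      case ps      => ConcatExpr(ps)
    }
  case GroupExpr(e)     => simplify(e)
  case IterationExpr(e) => IterationExpr(simplify(e))
  case other            => other
}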
Why did I use Scala for this? Basically because that was the purpose: I wanted to do something with Scala. But it turned out that it really did get the job done. There are some really nice features in that language, parser combinators, pattern matching and higher-order functions just to name a few. Check the Scala language tour [11] for more features.

I’m not going to post a lot of code here, but I simply have to post the parser code:

package net.t32leaves.regexpDFA;
import scala.util.parsing.combinator._

// the AST of the regular expression language
sealed abstract class Expr
case class SymbolExpr(chr: String) extends Expr
case class OrExpr(subexprs: List[Expr]) extends Expr
case class IterationExpr(subexpr: Expr) extends Expr
case class GroupExpr(subexpr: Expr) extends Expr
case class ConcatExpr(subexprs: List[Expr]) extends Expr
case object EmptyWordExpr extends Expr
case object EmptySetExpr extends Expr

object RegExpParser extends RegexParsers {

    def expr: Parser[Expr] = expr00

    // alternation: e1 | e2 | ...
    def expr00: Parser[Expr] = expr01 ~( ("|" ~> expr01)*) ^^ {
      case l~e => if (e.size == 0) l else OrExpr(l :: e)
    }

    // concatenation: e1 e2 ...
    def expr01: Parser[Expr] = (atom*) ^^ {
      case a => if (a.size == 1) a(0) else ConcatExpr(a)
    }

    def atom: Parser[Expr] = ( bracketized | chr )

    // a parenthesized group, optionally iterated: ( e ) or ( e )*
    def bracketized: Parser[Expr] = ( "(" ~ expr00 ~ ")" ~ (("*")?) ) ^^ {
      case _~expr~_~iter => iter match {
        case Some(x) => IterationExpr(GroupExpr(expr))
        case _ => GroupExpr(expr)
      }
    }

    // a single symbol, optionally iterated: a or a*
    def chr: Parser[Expr] = (symb ~ (("*")?)) ^^ {
      case expr~iter => iter match {
        case Some(x) => IterationExpr(SymbolExpr(expr))
        case _ => SymbolExpr(expr)
      }
    }

    def symb: Parser[String] = """[a-zA-Z0-9]"""r

    def parse(text : String) = parseAll(expr, text)

}

The idea behind this code is truly amazing: so-called parser combinators [12][p3]. In the world of functional programming they’re nothing new (Haskell has had them for a while already), but for someone coming from a rather object-oriented world, this is something uber-cool. Once you get the hang of those parser combinators, you’ll recognise them as an ingenious idea. The code we’re writing here looks somewhat like BNF, but is indeed valid Scala code. Those

... ^^ {
    case ... => ...
    case ... => ...
}

statements do what’s called pattern matching [13] and are used to construct the AST [14].

That pattern matching mechanism can also be used to implement some logic pretty close to its formal definition. The paper “Derivatives of Regular Expressions” [p1] defines a function \(\delta\) which basically determines whether the empty word can be produced by a regular expression. It can be implemented in Scala like this:

def containsEpsilon(expr: Expr): Boolean = {
    return expr match {
        // a concatenation produces the empty word iff all its parts do
        case ConcatExpr(l) => (true /: l)(_ && containsEpsilon(_))
        // an alternation produces the empty word iff at least one part does
        case OrExpr(l) => (false /: l)(_ || containsEpsilon(_))
        case GroupExpr(l) => containsEpsilon(l)
        case SymbolExpr(_) => false
        case IterationExpr(_) => true
        case EmptyWordExpr => true
        case EmptySetExpr => false
    }
}

def delta(expr: Expr): Expr = {
    return if (containsEpsilon(expr)) EmptyWordExpr else EmptySetExpr
}

which is pretty close to the definition in the paper. The containsEpsilon function uses two pretty neat concepts of Scala: pattern matching and the higher-order function foldLeft [15] (written /: above), which is similar to inject [16] in Ruby.
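The derivative itself can be written in the same style. Here is a minimal sketch using the Expr classes and the delta function from above; it is deliberately left unsimplified (the actual program simplifies the results, as described earlier):

// the Brzozowski derivative D_a(R): the expression matching the remainder of
// every word of R that starts with the symbol a
def derive(a: String, expr: Expr): Expr = expr match {
  case SymbolExpr(c)       => if (c == a) EmptyWordExpr else EmptySetExpr
  case OrExpr(l)           => OrExpr(l.map(derive(a, _)))
  case GroupExpr(e)        => derive(a, e)
  // D_a(R*) = D_a(R) R*
  case IterationExpr(e)    => ConcatExpr(List(derive(a, e), IterationExpr(e)))
  // D_a(R S) = D_a(R) S | delta(R) D_a(S)
  case ConcatExpr(Nil)     => EmptySetExpr
  case ConcatExpr(r :: rs) => OrExpr(List(
      ConcatExpr(derive(a, r) :: rs),
      ConcatExpr(List(delta(r), derive(a, ConcatExpr(rs))))))
  case EmptyWordExpr       => EmptySetExpr
  case EmptySetExpr        => EmptySetExpr
}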

After all the computational work, a dot [17] file is generated and rendered to an image using Graphviz [18]. That’s why some of the graphs generated might not look as beautiful as they could.
Let’s have a look at the output of this program. Let our regular expression be \(aa\). Something really simple indeed. The DFA constructed by the program looks like this:

[DFA diagram for the expression aa]

We can see from that simple example that the diagram needs some additional interpretation.

  1. If the machine reaches a state with a double circle and there is no further input to be read, the machine has accepted the input (this is why the double-circled state can still have outgoing transitions).
  2. The state which has an incoming transition from “nowhere” is the start state.
  3. An uppercase E stands for \(\varepsilon\) (the empty word) and an X for \(\phi\) (the empty set).

If you want to give this a shot yourself, I’ve created a little service out of all that. It’s basically just a Jetty [19] wrapped around the program (in Scala, of course). That service is hosted on a machine standing right under my desk in my apartment, so I cannot guarantee its availability.

In order to try this, enter your expression in the box below. The parser accepts the following meta-chars:

*	Iteration
( )	Group
|	Alternative






In case you want the code, grab it using SVN from https://guest:guest@hs.32leaves.net/svn/public/regexpToDFA. Please note the CC-BY-3.0 [20] license of the code.

References

  • [p1] J. A. Brzozowski, "Derivatives of Regular Expressions," J. ACM, vol. 11, iss. 4, pp. 481-494, 1964. http://portal.acm.org/citation.cfm?id=321239.321249
  • [p2] S. Owens, J. Reppy, and A. Turon, "Regular-expression derivatives reexamined," Journal of Functional Programming, vol. 19, iss. 2, pp. 173-190, 2009. http://www.ccs.neu.edu/home/turon/re-deriv.pdf
  • [p3] P. Wadler, "How to replace failure by a list of successes," in Proc. of a conference on Functional Programming Languages and Computer Architecture, New York, NY, USA, 1985, pp. 113-128.
