32leaves.net

List of references plugin for WordPress

In my last post, I used a little plugin which generates a list of references to the post. I wanted to have this, to make the post look kinda like a scientific paper (thereby creating the paper post :P). If you you want this plugin too, you can get it directly from the WordPress site.

Scala and building DFA’s from regular expressions

These days I wanted to do something with that cool new shiny functional-object-fusion language for the JVM called Scala [1]. So I decided to write a program which builds a DFA [2] that accepts each word of the language defined by a regular epxression [3] (type 3 language – Chomsky hierarchy [4]). The algorithm for doing this is described in the paper written by J. A. Brzozowski [p1]. It all boils down to computing the so called “characteristic derivations” of a regular expression (side-note: I was pretty suprised to find that the validation of Relax-NG can be done in a similar way: using derivations [5]). Each of these derivations is then represented as a state in the resulting DFA (for a more formal description of the process read [p1] or [p2]). During the construction of the DFA, an equivalance test between regular expressions (called RE’s in future) is required. But this is not a trivial problem [p2].
So what I did was, I tried to simplify/minimalize each regular expression to it’s bare minimum which in turn allows me to compare the RE’s based on their structure. This is more to be seen as a heuristic, as the there is no garantee that the simplify mechanism boils down two equal RE’s to the same structure – but in practice it works pretty well.
Why did I use scala for this? Basically because that was the purpose – I wanted to do something with Scala. But it turned out that it really did get the job done. There are some really nice features features in that language:

just to name a few. Check the Scala language tour [11] for more features.

I’m not going to post a lot of code here, but I simply have to post the parser code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
package net.t32leaves.regexpDFA;
import scala.util.parsing.combinator._

sealed abstract class Expr
case class SymbolExpr(chr: String) extends Expr
case class OrExpr(subsexpr: List[Expr]) extends Expr
case class IterationExpr(subexpr: Expr) extends Expr
case class GroupExpr(subsexpr: Expr) extends Expr
case class ConcatExpr(subexpr: List[Expr]) extends Expr
case object EmptyWordExpr extends Expr
case object EmptySetExpr extends Expr

object RegExpParser extends RegexParsers {

    def expr: Parser[Expr] = expr00
    def expr00: Parser[Expr] = expr01 ~( ("|" ~> expr01)*) ^^ {
      case l~e => if (e.size == 0) l else OrExpr(l :: e)
    }
 
    def expr01: Parser[Expr] = (atom*) ^^ {
      case a => if (a.size == 1) a(0) else ConcatExpr(a)
    }
   
    def atom: Parser[Expr] = ( bracketized | chr )
 
    def bracketized: Parser[Expr] = ( "(" ~ expr00 ~ ")" ~ (("*")?) ) ^^ {
      case _~expr~_~iter => iter match {
        case Some(x) => IterationExpr(GroupExpr(expr))
        case _ => GroupExpr(expr)
      }
    }
 
    def chr: Parser[Expr] = (symb ~ (("*")?)) ^^ {
      case expr~iter => iter match {
        case Some(x) => IterationExpr(SymbolExpr(expr))
        case _ => SymbolExpr(expr)
      }
    }
 
    def symb: Parser[String] = """[a-zA-Z0-9]"""r

    def parse(text : String) = parseAll(expr, text)

}

The idea behind this code is truly amazing: so called parser combinators [12][p3]. In the world of functional programming they’re nothing new (Haskell had them for a while already), but for someone coming from a rather object oriented world, this is something uber-cool. Once you got the hang of those parser combinators, you’ll recognise them as an ingenius idea. The code we’re writting here looks some kind of like BNF, but is indeed valid Scala code. Those

1
2
3
4
... ^^ {
    case ... => ...
    case ... => ...
}

statements do what’s called pattern matching [13] and are used to construct the AST [14].

That pattern matching mechanism can also be used to implement some logic pretty close to it’s formal definition. The paper “Derivatives of Regular Expressions,” [p1] defines a function

    \[\delta\]

which basically determines if the empty word can be produced by a regular expression. It can be implemented in Scala like this

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
def containsEpsilon(expr: Expr): Boolean = {
    return expr match {
        case ConcatExpr(l) => (true /: l)(_ && containsEpsilon(_))
        case OrExpr(l) => (false /: l)(_ || containsEpsilon(_))
        case GroupExpr(l) => containsEpsilon(l)
        case SymbolExpr(_) => false
        case IterationExpr(_) => true
        case EmptyWordExpr => true
        case EmptySetExpr => false
    }
}

def delta(expr: Expr): Expr = {
    return if (containsEpsilon(expr)) EmptyWordExpr else EmptySetExpr
}

which is pretty close to the definition in the paper. The

containsEpsilon(expr: Expr): Boolean

uses two pretty neat concepts of Scala: pattern matching and the high-order function foldLeft [15] which is similar to the inject [16] in Ruby.

After all the computational work, a dot [17] file is generated and rendered to an image using Graphviz [18]. That’s why some of the graphs generated might not look as beautiful as they could.
Let’s have a look at the output of this program. Let our regular expression be

    \[aa\]

. Something really simple indeed. The DFA constructed by the program looks like this:

dfa5031

We can see from that simple example that the diagram needs some additional interpretation.

  1. If the machine reaches the state with a double circle, and there is no further inscription to be read, the machine has accepted the input (hence the outgoing transitions from the double circled state).
  2. The state which has an incoming transition from “nowhere” is the start state
  3. An uppercase E stands for

        \[\varepsilon\]

    (empty word) and an X for

        \[\phi\]

    (empty set)

If you want to give this a shot yourself, I’ve created a little service out of all that. It’s basically just a Jetty [19] wrapped around the program (in Scala of course). That service is hosted on a machine standing right under my desk in my apartment, so I can not garantee for it’s availability.

In order to try this, enter your expression in the box below. The parser accepts the following meta-chars:

*	Iteration
( )	Group
|	Alternative



Disable graph rendering (recommended for ”big” expressions)



In case you want the code, grab it using SVN https://guest:guest@hs.32leaves.net/svn/public/regexpToDFA. Please not the CC-BY-3.0 [20] license of the code.

References

  • [p2] S. Owens, J. Reppy, and A. Turon, "Regular-expression derivatives reexamined," Journal of Functional Programming, vol. 19, iss. 2, pp. 173-190, 2009. Go to document
    @article{p2,
      author = {Scott Owens and John Reppy and Aaron Turon},
      title = {Regular-expression derivatives reexamined},
      journal = {Journal of Functional Programming},
      publisher = {Cambridge University Press},
      address = {Cambridge, England},
      volume = 19, number = 2, year = 2009, pages = {173-190},
      topic = {implementation},
      url = {http://www.ccs.neu.edu/home/turon/re-deriv.pdf}
    }
  • [p3] P. Wadler, "How to replace failure by a list of successes," in Proc. of a conference on Functional programming languages and computer architecture, New York, NY, USA, 1985, pp. 113-128.
    @inproceedings{p3,
      author = {Wadler, Philip},
      title = {How to replace failure by a list of successes},
      booktitle = {Proc. of a conference on Functional programming languages and computer architecture},
      year = {1985},
      isbn = {3-387-15975-4},
      pages = {113--128},
      location = {Nancy, France},
      publisher = {Springer-Verlag New York, Inc.},
      address = {New York, NY, USA},
      }
  • [p1] J. A. Brzozowski, "Derivatives of Regular Expressions," J. ACM, vol. 11, iss. 4, pp. 481-494, 1964. Go to document
    @article{p1,
      author = {Brzozowski, Janusz A.},
      title = {Derivatives of Regular Expressions},
      journal = {J. ACM},
      volume = {11},
      number = {4},
      year = {1964},
      issn = {0004-5411},
      pages = {481--494},
      doi = {http://doi.acm.org/10.1145/321239.321249},
      publisher = {ACM},
      address = {New York, NY, USA},
      url = {http://portal.acm.org/citation.cfm?id=321239.321249&coll=GUIDE&dl=ACM&CFID=37630832&CFTOKEN=98811427}
    }

Fork me on GitHub