Monday, February 13, 2006

A bit on regular expressions

From wikipedia:

Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively. Given a finite alphabet Σ the following constants are defined:

  • (empty set) denoting the set
  • (empty string) ε denoting the set {ε}
  • (literal character) a in Σ denoting the set {a}

and the following operations:

  • (concatenation) RS denoting the set { αβ | α in R and β in S }. For example {"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}.
  • (alternation) R|S denoting the set union of R and S.
  • (Kleene star) R* denoting the smallest superset of R that contains ε and is closed under string concatenation. This is the set of all strings that can be made by concatenating zero or more strings in R. For example, {"ab", "c"}* = {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", ... }.

The above constants and operators form a Kleene algebra.

Many textbooks use the symbols , +, or for alternation instead of the vertical bar.

To avoid brackets it is assumed that the Kleene star has the highest priority, then concatenation and then set union. If there is no ambiguity then brackets may be omitted. For example, (ab)c is written as abc and a|(b(c*)) can be written as a|bc*.

Examples:

  • a|b* denotes {ε, a, b, bb, bbb, ...}
  • (a|b)* denotes the set of all strings consisting of any number of a and b symbols, including the empty string
  • b*(ab*)* the same
  • ab*(c|ε) denotes the set of strings starting with a, then zero or more bs and finally optionally a c.
  • ((a|ba(aa)*b)(b(aa)*b)*(a|ba(aa)*b)|b(aa)*b)*(a|ba(aa)*b)(b(aa)*b)* denotes the set of all strings which contain an even number of bs and an odd number of as. Note that this regular expression is of the form (X Y*X U Y)*X Y* with X = a|ba(aa)*b and Y = b(aa)*b.

The formal definition of regular expressions is purposefully parsimonious and avoids defining the redundant quantifiers ? and +, which can be expressed as follows: a+= aa*, and a? = (ε|a). Sometimes the complement operator ~ is added; ~R denotes the set of all strings over Σ that are not in R. The complement operator is redundant: it can always be expressed by only using the other operators.

Regular expressions in this sense can express exactly the class of languages accepted by finite state automata: the regular languages. There is, however, a significant difference in compactness: some classes of regular languages can only be described by automata that grow exponentially in size, while the length of the required regular expressions only grow linearly. Regular expressions correspond to the type 3 grammars of the Chomsky hierarchy and may be used to describe a regular language.

No comments: