Skip to main content

Regular Expressions

(Content adapted from Critchlow & Eck)

Though we have used the term string throughout to refer to a sequence of symbols from an alphabet, an alternative term that is frequently used is word. The analogy seems fairly obvious: strings are made up of "letters" from an alphabet, just as words are in human languages like English. In English, however, there are no strict rules specifying which sequences of letters can be used to form legal English wordseven unlikely combinations like ghth and ckstr have their place. While some formal languages may simply be random collections of arbitrary strings, more interesting languages are those where the strings in the language all share some common structure: L1={x{a,b} na(x)=nb(x)}L_1 = \{ x\in \{a,b\}^* \ | n_a(x) = n_b(x)\}; L2={legal Java identifiers}L_2 = \{\textrm{legal Java identifiers}\}; L3={legal C++ programs}L_3 = \{\textrm{legal C++ programs}\}. In all of these languages, there are structural rules which determine which sequences of symbols are in the language and which aren't. So despite the terminology of "alphabet" and "word" in formal language theory, the concepts don't necessarily match "alphabet" and "word" for human languages. A better parallel is to think of the alphabet in a formal language as corresponding to the words in a human language; the words in a formal language correspond to the sentences in a human language, as there are rules (grammar rules) which determine how they can legally be constructed.

One way of describing the grammatical structure of the strings in a language is to use a mathematical formalism called a regular expression. A regular expression is a pattern that "matches" strings that have a particular form. For example, consider the language (over alphabet Σ={a,b}\Sigma = \{a,b\}) L={x  x starts and ends with a}L= \{x \ | \ x \textrm{\ starts and ends with\ } a\}. What is the symbol-by-symbol structure of strings in this language? Well, they start with an aa, followed by zero or more aa's or bb's or both, followed by an aa. The regular expression a(ab)aa \cdot (a \,|\, b)^* \cdot a is a pattern that captures this structure and matches any string in LL (\cdot and ^* have their usual meanings, and \,|\, designates or.1) Conversely, consider the regular expression (a(ab))((ab)a)a\cdot(a\,|\, b)^*) \,|\, ((a\,|\, b)^*\cdot a). This is a pattern that matches any string that either has the form "aa followed by zero or more aa's or bb's or both" (i.e., any string that starts with an aa) or has the form "zero or more aa's or bb's or both followed by an aa" (i.e. any string that ends with an aa). Thus the regular expression generates the language of all strings that start or end (or both) in an aa: this is the set of strings that match the regular expression.

Here are the formal definitions of a regular expression and the language generated by a regular expression:

Let Σ\Sigma be an alphabet. Then the following patterns are regular expressions over Σ\Sigma:

  1. Φ\Phi and ε\varepsilon are regular expressions;
  2. aa is a regular expression, for each aΣa \in \Sigma;
  3. if r1r_1 and r2r_2 are regular expressions, then so are r1r2r_1\,|\, r_2, r1r2r_1\cdot r_2, r1r_1^* and (r1)(r_1) (and of course, r2r_2^* and (r2)(r_2)). As in concatenation of strings, the \cdot is often left out of the second expression. (Note: the order of precedence of operators, from lowest to highest, is \,|\,, \cdot, *.)

No other patterns are regular expressions.

The language generated by a regular expression rr, denoted L(r)L(r), is defined as follows:

  1. L(Φ)=L(\Phi) = \emptyset, i.e., no strings match Φ\Phi;
  2. L(ε)={ε}L(\varepsilon) = \{\varepsilon\}, i.e., ε\varepsilon matches only the empty string;
  3. L(a)={a}L(a) = \{a\}, i.e., aa matches only the string aa;
  4. L(r1r2)=L(r1)L(r2)L(r_1\,|\, r_2) = L(r_1) \cup L(r_2), i.e., r1r2r_1\,|\, r_2 matches strings that match r1r_1 or r2r_2 or both;
  5. L(r1r2)=L(r1)L(r2)L(r_1r_2) = L(r_1)L(r_2), i.e., r1r2r_1r_2 matches strings of the form "something that matches r1r_1 followed by something that matches r2r_2";
  6. L(r1)=(L(r1))L(r_1^*) = (L(r_1))^*, i.e., r1r_1^* matches sequences of 0 or more strings each of which matches r1r_1.
  7. L((r1))=L(r1)L((r_1)) = L(r_1), i.e., (r1)(r_1) matches exactly those strings matched by r1r_1.

Example: Let Σ={a,b}\Sigma = \{a,b\}, and consider the regular expression r=abr=a^*b^*. What is L(r)L(r)? Well, L(a)={a}L(a) = \{a\} so L(a)=(L(a))={a}L(a^*) = (L(a))^* = \{a\}^*, and {a}\{a\}^* is the set of all strings of zero or more aa's, so L(a)={ε,a,aa,aaa,}L(a^*) = \{\varepsilon, a, aa, aaa, \ldots\}. Similarly, L(b)={ε,b,bb,bbb,}L(b^*) = \{\varepsilon, b, bb, bbb, \ldots\}. Since L(ab)=L(a)L(b)={xy  xL(a)yL(b)}L(a^*b^*) = L(a^*)L(b^*) = \{xy \ | \ x\in L(a^*)\land y\in L(b^*)\}, we have L(ab)={ε,a,b,aa,ab,bb,aaa,aab,abb,bbb,}L(a^*b^*) = \{\varepsilon, a, b, aa, ab, bb, aaa, aab, abb, bbb, \ldots\}, which is the set of all strings of the form "zero or more aa's followed by zero or more bb's".



Example: Let Σ={a,b}\Sigma = \{a,b\}, and consider the regular expression r=(aaaaaa)(bb)r=(a\,|\, aa\,|\, aaa)(bb)^*. Since L(a)={a}L(a) = \{a\}, L(aa)=L(a)L(a)={aa}L(aa) = L(a)L(a) = \{aa\}. Similarly, L(aaa)={aaa}L(aaa) = \{aaa\} and L(bb)={bb}L(bb) = \{bb\}. Now L(aaaaaa)=L(a)L(aa)L(aaa)={a,aa,aaa}L(a\,|\, aa\,|\, aaa) = L(a) \cup L(aa) \cup L(aaa) = \{a, aa, aaa\}, and L((bb))=(L((bb)))=(L(bb))L((bb)^*) = (L((bb)))^* = (L(bb))^* (the last equality is from clause 7 of the efinition), and (L(bb))={bb}={ε,bb,bbbb,}(L(bb))^* = \{bb\}^* = \{\varepsilon, bb, bbbb, \ldots\}. So L(r)L(r) is the set of strings formed by concatenating aa or aaaa or aaaaaa with zero or more pairs of bb's.


A language is regular if it is generated by a regular expression.

Regular languages, then, are languages whose strings' structure can be described in a very formal, mathematical way. The fact that a language can be "mechanically" described or generated means that we are likely to be able to get a computer to recognize strings in that language. We will pursue the question of mechanical language recognition in a later section, and subsequently will see that our first attempt to model mechanical language recognition does in fact produce a family of "machines" that recognize exactly the regular languages. But first, in the next section, we will look at some practical applications of regular expressions.

Exercises

  1. Give English-language descriptions of the languages generated by the following regular expressions:

    • (ab)(a\,|\, b)^*
      Answer

      Strings of aa's and bb's.

    • aba^*\,|\, b^*
      Answer

      Strings of all aa's or all bb's (no mixing).

    • b(abab)b^*(ab^*ab^*)^*
      Answer

      Strings with an even number of aa's (and any number of bb's).

    • b(abb)b^*(abb^*)
      Answer

      Strings of aa's and bb's where every aa is followed by a bb.

  2. Give regular expressions over Σ={a,b}\Sigma=\{a,b\} that generate the following languages:

    • L1={x  x contains 3 consecutive a’s}L_1 = \{ x \ | \ x \textrm{ contains 3 consecutive a's}\}
      Answer

      (ab)aaa(ab)(a\,|\,b)^*aaa(a\,|\,b)^*

    • L2={x  x has even length}L_2 = \{ x \ | \ x \textrm{ has even length}\}
      Answer

      (aaabbabb)(aa\,|\,ab\,|\,ba\,|\,bb)^*

    • L3={x  nb(x)=2mod3}L_3 = \{ x \ | \ n_b(x) = 2 \bmod{3}\}
      Answer

      ababa(bababa)a^*ba^*ba^*(ba^*ba^*ba^*)^*

    • L4={x  x contains the substring aaba}L_4 = \{ x \ | \ x \textrm{ contains the substring } aaba\}
      Answer

      (ab)aaba(ab)(a\,|\,b)^*aaba(a\,|\,b)^*

    • L5={x  nb(x)<2}L_5 = \{ x \ | \ n_b(x) < 2 \}
      Answer

      aabaa^*\,|\,a^*ba^*

    • L6={x  x doesn’t end in aa}L_6 = \{ x \ | \ x \textrm{ doesn't end in } aa\}
      Answer

      εab(ab)(abbabb)\varepsilon\,|\,a\,|\,b\,|\,(a\,|\,b)^*(ab\,|\,ba\,|\,bb)

  3. Prove that all finite languages are regular.

    Answer

    If the language is {w1,w2,,wn}\{w_1, w_2, \ldots, w_n\}, then it can be given by the regular expression w1w2wnw_1\,|\,w_2\,|\,\ldots\,|\,w_n.


  1. Various symbols have been used to represent the "or" operation in regular expressions. Both ++ and \cup have been used for this purpose. In this book, we use the symbol | because it is commonly used in computer implementations of regular expressions.