Skip to main content

Finite-State Automata and Regular Languages

(Content adapted from Critchlow & Eck)

We know now that our two models for mechanical language recognition actually recognize the same class of languages. The question still remains: do they recognize the same class of languages as the class generated mechanically by regular expressions? The answer turns out to be "yes". There are two parts to proving this: first that every language generated can be recognized, and second that every language recognized can be generated.

Regular Expression to NFA

Theorem: Every language generated by a regular expression can be recognized by an NFA.

Proof: The proof of this theorem is a nice example of a proof by induction on the structure of regular expressions. The definition of regular expression is inductive: Φ\Phi, ε\varepsilon, and aa are the simplest regular expressions, and then more complicated regular expressions can be built from these. We will show that there are NFAs that accept the languages generated by the simplest regular expressions, and then show how those machines can be put together to form machines that accept languages generated by more complicated regular expressions.

Consider the regular expression Φ\Phi. L(Φ)={}L(\Phi) = \{\}. Here is a machine that accepts {}\{\}:

NFA for empty language

Consider the regular expression ε\varepsilon. L(ε)={ε}L(\varepsilon) = \{\varepsilon\}. Here is a machine that accepts {ε}\{\varepsilon\}:

NFA for empty string

Consider the regular expression aa. L(a)={a}L(a) = \{a\}. Here is a machine that accepts {a}\{a\}:

NFA for a

Now suppose that you have NFAs that accept the languages generated by the regular expressions r1r_1 and r2r_2. Building a machine that accepts L(r1r2)L(r_1 | r_2) is fairly straightforward: take an NFA M1M_1 that accepts L(r1)L(r_1) and an NFA M2M_2 that accepts L(r2)L(r_2). Introduce a new state qnewq_{new}, connect it to the start states of M1M_1 and M2M_2 via ε\varepsilon-transitions, and designate it as the start state of the new machine. No other transitions are added. The final states of M1M_1 together with the final states of M2M_2 are designated as the final states of the new machine. It should be fairly clear that this new machine accepts exactly those strings accepted by M1M_1 together with those strings accepted by M2M_2: any string ww that was accepted by M1M_1 will be accepted by the new NFA by starting with an ε\varepsilon-transition to the old start state of M1M_1 and then following the accepting path through M1M_1; similarly, any string accepted by M2M_2 will be accepted by the new machine; these are the only strings that will be accepted by the new machine, as on any input ww all the new machine can do is make an ε\varepsilon-move to M1M_1's (or M2M_2's) start state, and from there ww will only be accepted by the new machine if it is accepted by M1M_1 (or M2M_2). Thus, the new machine accepts L(M1)L(M2)L(M_1) \cup L(M_2), which is L(r1)L(r2)L(r_1) \cup L(r_2), which is exactly the definition of L(r1r2)L(r_1 | r_2).

NFA for r1 | r2

(A pause before we continue: note that for the simplest regular expressions, the machines that we created to accept the languages generated by the regular expressions were in fact DFAs. In our last case above, however, we needed ε\varepsilon-transitions to build the new machine, and so if we were trying to prove that every regular language could be accepted by a DFA, our proof would be in trouble. THIS DOES NOT MEAN that the statement "every regular language can be accepted by a DFA" is false, just that we can't prove it using this kind of argument, and would have to find an alternative proof.)

Suppose you have machines M1M_1 and M2M_2 that accept L(r1)L(r_1) and L(r2)L(r_2) respectively. To build a machine that accepts L(r1)L(r2)L(r_1)L(r_2) proceed as follows. Make the start state q01q_{01} of M1M_1 be the start state of the new machine. Make the final states of M2M_2 be the final states of the new machine. Add ε\varepsilon-transitions from the final states of M1M_1 to the start state q02q_{02} of M2M_2.

NFA for r1 r2

It should be fairly clear that this new machine accepts exactly those strings of the form xyxy where xL(r1)x\in L(r_1) and yL(r2)y \in L(r_2): first of all, any string of this form will be accepted because xL(r1)x\in L(r_1) implies there is a path that consumes xx from q01q_{01} to a final state of M1M_1; a ε\varepsilon-transition moves to q02q_{02}; then yL(r2)y \in L(r_2) implies there is a path that consumes yy from q02q_{02} to a final state of M2M_2; and the final states of M2M_2 are the final states of the new machine, so xyxy will be accepted. Conversely, suppose zz is accepted by the new machine. Since the only final states of the new machine are in the old M2M_2, and the only way to get into M2M_2 is to take a ε\varepsilon-transition from a final state of M1M_1, this means that z=xyz=xy where xx takes the machine from its start state to a final state of M1M_1, a ε\varepsilon-transition occurs, and then yy takes the machine from q02q_{02} to a final state of M2M_2. Clearly, xL(r1)x\in L(r_1) and yL(r2)y \in L(r_2).

We leave the construction of an NFA that accepts L(r)L(r^*) from an NFA that accepts L(r)L(r) as an exercise.

The algorithm in this proof is commonly known as Thompson's Construction, credited to Ken Thompson who, along with Dennis Ritchie, also designed and implemented Unix and the C programming language.1 Several of the utilities that Thompson developed for Unix, such as ed and grep, make use of regular expressions for searching and replacing text.

NFA to Regular Expression

Theorem: Every language that is accepted by a DFA or an NFA is generated by a regular expression.

Proving this result is actually fairly involved and not very illuminating. Before presenting a proof, we will give an illustrative example of how one might actually go about extracting a regular expression from an NFA or a DFA. You can go on to read the proof if you are interested.


Example: Consider the DFA shown below:

Example DFA

Note that there is a loop from state q2q_2 back to state q2q_2: any number of aa's will keep the machine in state q2q_2, and so we label the transition with the regular expression aa^*. We do the same thing to the transition labeled bb from q0q_0. (Note that the result is no longer a DFA, but that doesn't concern us, we're just interested in developing a regular expression.)

Example DFA being converted to Regular Expression

Next we note that there is in fact a loop from q1q_1 to q1q_1 via q0q_0. A regular expression that matches the strings that would move around the loop is abaab^*a. So we add a transition labeled abaab^*a from q1q_1 to q1q_1, and remove the now-irrelevant aa-transition from q1q_1 to q0q_0. (It is irrelevant because it is not part of any other loop from q1q_1 to q1q_1.)

Example DFA being converted to Regular Expression

Next we note that there is also a loop from q1q_1 to q1q_1 via q2q_2. A regular expression that matches the strings that would move around the loop is babba^*b. Since the transitions in the loop are the only transitions to or from q2q_2, we simply remove q2q_2 and replace it with a transition from q1q_1 to q1q_1.

Example DFA being converted to Regular Expression

It is now clear from the diagram that strings of the form bab^*a get you to state q1q_1, and any number of repetitions of strings that match abaab^*a or babba^*b will keep you there. So the machine accepts L(ba(ababab))L(b^*a(ab^*a | ba^*b)^*).


Proof: We prove that the language accepted by a DFA is regular. The proof for NFAs follows from the equivalence between DFAs and NFAs.

Suppose that MM is a DFA, where M=(Q,Σ,q0,δ,F)M=(Q,\Sigma,q_0,\delta,F). Let nn be the number of states in MM, and write Q={q0,q1,,qn1}Q=\{q_0,q_1,\dots,q_{n-1}\}. We want to consider computations in which MM starts in some state qiq_i, reads a string ww, and ends in state qkq_k. In such a computation, MM might go through a series of intermediates states between qiq_i and qkq_k: qip1p2prqkq_i\longrightarrow p_1\longrightarrow p_2 \cdots\longrightarrow p_r\longrightarrow q_k We are interested in computations in which all of the intermediate statesp1,p2,,prp_1,p_2,\dots,p_rare in the set {q0,q1,,qj1}\{q_0,q_1,\dots,q_{j-1}\}, for some number jj. We define Ri,j,kR_{i,j,k} to be the set of all strings ww in Σ\Sigma^* that are consumed by such a computation. That is, wRi,j,kw\in R_{i,j,k} if and only if when MM starts in state qiq_i and reads ww, it ends in state qkq_k, and all the intermediate states between qiq_i and qkq_k are in the set {q0,q1,,qj1}\{q_0,q_1,\dots,q_{j-1}\}. Ri,j,kR_{i,j,k} is a language over Σ\Sigma. We show that Ri,j,kR_{i,j,k} for 0i<n0\le i < n, 0jn0\le j \le n, 0k<n0\le k < n.

Consider the language Ri,0,kR_{i,0,k}. For wRi,0,kw\in R_{i,0,k}, the set of allowable intermediate states is empty. Since there can be no intermediate states, it follows that there can be at most one step in the computation that starts in state qiq_i, reads ww, and ends in state qkq_k. So, w|w| can be at most one. This means that Ri,0,kR_{i,0,k} is finite, and hence is regular. (In fact, Ri,0,k={aΣδ(qi,a)=qk}R_{i,0,k}=\{a\in\Sigma\,|\, \delta(q_i,a)=q_k\}, for iki\ne k, and Ri,0,i={ε}{aΣδ(qi,a)=qi}R_{i,0,i}=\{\varepsilon\}\cup\{a\in\Sigma\,|\, \delta(q_i,a)=q_i\}. Note that in many cases, Ri,0,kR_{i,0,k} will be the empty set.)

We now proceed by induction on jj to show that Ri,j,kR_{i,j,k} is regular for all ii and kk. We have proved the base case, j=0j=0. Suppose that 0j<n0\le j< n we already know that Ri,j,kR_{i,j,k} is regular for all ii and all kk. We need to show that Ri,j+1,kR_{i,j+1,k} is regular for all ii and kk. In fact, Ri,j+1,k=Ri,j,k(Ri,j,jRj,j,jRj,j,k)R_{i,j+1,k}=R_{i,j,k}\cup \left( R_{i,j,j}R_{j,j,j}^*R_{j,j,k}\right) which is regular because Ri,j,kR_{i,j,k} is regular for all ii and kk, and because the union, concatenation, and Kleene star of regular languages are regular.

To see that the above equation holds, consider a string wΣw\in\Sigma^*. Now, wRi,j+1,kw\in R_{i,j+1,k} if and only if when MM starts in state qiq_i and reads ww, it ends in state qkq_k, with all intermediate states in the computation in the set {q0,q1,,qj}\{q_0,q_1,\dots,q_j\}. Consider such a computation. There are two cases: Either qjq_j occurs as an intermediate state in the computation, or it does not. If it does not occur, then all the intermediate states are in the set {q0,q1,,qj1}\{q_0,q_1,\dots,q_{j-1}\}, which means that in fact wRi,j,kw\in R_{i,j,k}. If qjq_j does occur as an intermediate state in the computation, then we can break the computation into phases, by dividing it at each point where qjq_j occurs as an intermediate state. This breaks ww into a concatenation w=xy1y2yrzw=xy_1y_2\cdots y_rz. The string xx is consumed in the first phase of the computation, during which MM goes from state qiq_i to the first occurrence of qjq_j; since the intermediate states in this computation are in the set {q0,q1,,qj1}\{q_0,q_1,\dots,q_{j-1}\}, xRi,j,jx\in R_{i,j,j}. The string zz is consumed by the last phase of the computation, in which MM goes from the final occurrence of qjq_j to qkq_k, so that zRj,j,kz\in R_{j,j,k}. And each string yty_t is consumed in a phase of the computation in which MM goes from one occurrence of qjq_j to the next occurrence of qjq_j, so that yrRj,j,jy_r\in R_{j,j,j}. This means that w=xy1y2yrzRi,j,jRj,j,jRj,j,kw=xy_1y_2\cdots y_rz\in R_{i,j,j}R_{j,j,j}^*R_{j,j,k}.

We now know, in particular, that R0,n,kR_{0,n,k} is a regular language for all kk. But R0,n,kR_{0,n,k} consists of all strings wΣw\in\Sigma^* such that when MM starts in state q0q_0 and reads ww, it ends in state qkq_k (with no restriction on the intermediate states in the computation, since every state of MM is in the set {q0,q1,,qn1}\{q_0,q_1,\dots,q_{n-1}\}). To finish the proof that L(M)L(M) is regular, it is only necessary to note that L(M)=qkFR0,n,kL(M)=\bigcup_{q_k\in F} R_{0,n,k} which is regular since it is a union of regular languages. This equation is true since a string ww is in L(M)L(M) if and only if when MM starts in state q0q_0 and reads ww, in ends in some accepting state qkFq_k\in F. This is the same as saying wR0,n,kw\in R_{0,n,k} for some kk with qkFq_k\in F.

Closure Properties for Regular Languages

We have already seen that if two languages L1L_1 and L2L_2 are regular, then so are L1L2L_1 \cup L_2, L1L2L_1L_2, and L1L_1^* (and of course L2L_2^*). We have not yet seen, however, how the common set operations intersection and complementation affect regularity. Is the complement of a regular language regular? How about the intersection of two regular languages?

Both of these questions can be answered by thinking of regular languages in terms of their acceptance by DFAs. Let's consider first the question of complementation. Suppose we have an arbitrary regular language LL. We know there is a DFA MM that accepts LL. Pause a moment and try to think of a modification that you could make to MM that would produce a new machine MM' that accepts L\overline{L}. Okay, the obvious thing to try is to make MM' be a copy of MM with all final states of MM becoming non-final states of MM' and vice versa. This is in fact exactly right: MM' does in fact accept L\overline{L}. To verify this, consider an arbitrary string ww. The transition functions for the two machines MM and MM' are identical, so δ(q0,w)\delta^* (q_0, w) is the same state in both MM and MM'; if that state is accepting in MM then it is non-accepting in MM', so if ww is accepted by MM it is not accepted by MM'; if the state is non-accepting in MM then it is accepting in MM', so if ww is not accepted by MM then it is accepted by MM'. Thus MM' accepts exactly those strings that MM does not, and hence accepts L\overline{L}.

It is worth pausing for a moment and looking at the above argument a bit longer. Would the argument have worked if we had looked at an arbitrary language LL and an arbitrary NFA MM that accepted LL? That is, if we had built a new machine MM' in which the final and non-final states had been switched, would the new NFA MM' accept the complement of the language accepted by MM? The answer is "not necessarily". Remember that acceptance in an NFA is determined based on whether or not at least one of the states reached by a string is accepting. So any string ww with the property that (q0,w)\partial^*(q_0, w) contains both accepting and non-accepting states of MM would be accepted both by MM and by MM'.

Now let's turn to the question of intersection. Given two regular languages L1L_1 and L2L_2, is L1L2L_1 \cap L_2 regular? Again, it is useful to think in terms of DFAs: given machines M1M_1 and M2M_2 that accept L1L_1 and L2L_2, can you use them to build a new machine that accepts L1L2L_1 \cap L_2? The answer is yes, and the idea behind the construction bears some resemblance to that behind the NFA-to-DFA construction. We want a new machine where transitions reflect the transitions of both M1M_1 and M2M_2 simultaneously, and we want to accept a string ww only if those sequences of transitions lead to final states in both M1M_1 and M2M_2. So we associate the states of our new machine MM with pairs of states from M1M_1 and M2M_2. For each state (q1,q2)(q_1,q_2) in the new machine and input symbol aa, define δ((q1,q2),a)\delta((q_1,q_2),a) to be the state (δ1(q1,a),δ2(q2,a))(\delta_1(q_1,a), \delta_2(q_2,a)). The start state q0q_0 of MM is (q01,q02)(q_{01}, q_{02}), where q0iq_{0i} is the start state of MiM_i. The final states of MM are the the states of the form (qf1,qf2)(q_{f1}, q_{f2}) where qf1q_{f1} is an accepting state of M1M_1 and qf2q_{f2} is an accepting state of M2M_2. You should convince yourself that MM accepts a string xx iff xx is accepted by both M1M_1 and M2M_2.

The results of the previous section and the preceding discussion are summarized by the following theorem:

Theorem: The intersection of two regular languages is a regular language.

The union of two regular languages is a regular language.

The concatenation of two regular languages is a regular language.

The complement of a regular language is a regular language.

The Kleene closure of a regular language is a regular language.

Exercises

  1. Give a DFA that accepts the intersection of the languages accepted by the machines shown below. (Suggestion: use the construction discussed just before the previous theorem.)

Example DFAs

  1. Complete Thompson's Construction by showing how to modify a machine that accepts L(r)L(r) into a machine that accepts L(r)L(r^*).

  2. Using Thompson's Construction, build an NFA that accepts L((aba)(bb))L((ab | a)^*(bb)).

  3. Prove that the reverse of a regular language is regular.

    Answer

    Given a DFA for the language, construct an NFA by reversing all of the edges. The original start state should become the only final state, and there should be a new start state with ε\varepsilon-transitions to each of the original final states. If there is an accepting path from this new start state to the final state, then it must correspond to the reverse of an accepting path in the original machine. Therefore, the new NFA will accept exactly the reverse of the given regular langauge.

  4. Show that for any DFA or NFA, there is an NFA with exactly one final state that accepts the same language.

    Answer

    Define the NFA by adding one new state. This state should be the only final state, and all of the original final states should instead have ε\varepsilon-transitions to the new state.

  5. Suppose we change the model of NFAs to allow NFAs to have multiple start states. Show that for any "NFA" with multiple start states, there is an NFA with exactly one start state that accepts the same language.

    Answer

    Given such an extended NFA, define an ordinary NFA by adding a new state. This state should be the only start state, and it should have ε\varepsilon-transitions to each of the original start states.

  6. Show that the closure of regular languages under both union and complement is enough to show that they are also closed under intersection.

    Answer

    This is just an application of DeMorgan's laws: L1L2=L1L2L_1\cap L_2 = \overline{\overline{L_1}\cup\overline{L_2}}.


  1. Thompson is actually credited with creating B, while Ritchie created its successor C.