Finite-State Automata and Regular Languages
(Content adapted from Critchlow & Eck)
We know now that our two models for mechanical language recognition actually recognize the same class of languages. The question still remains: do they recognize the same class of languages as the class generated mechanically by regular expressions? The answer turns out to be "yes". There are two parts to proving this: first that every language generated can be recognized, and second that every language recognized can be generated.
Regular Expression to NFA
Theorem: Every language generated by a regular expression can be recognized by an NFA.
Proof: The proof of this theorem is a nice example of a proof by induction on the structure of regular expressions. The definition of regular expression is inductive: , , and are the simplest regular expressions, and then more complicated regular expressions can be built from these. We will show that there are NFAs that accept the languages generated by the simplest regular expressions, and then show how those machines can be put together to form machines that accept languages generated by more complicated regular expressions.
Consider the regular expression . . Here is a machine that accepts :
Consider the regular expression . . Here is a machine that accepts :
Consider the regular expression . . Here is a machine that accepts :
Now suppose that you have NFAs that accept the languages generated by the regular expressions and . Building a machine that accepts is fairly straightforward: take an NFA that accepts and an NFA that accepts . Introduce a new state , connect it to the start states of and via -transitions, and designate it as the start state of the new machine. No other transitions are added. The final states of together with the final states of are designated as the final states of the new machine. It should be fairly clear that this new machine accepts exactly those strings accepted by together with those strings accepted by : any string that was accepted by will be accepted by the new NFA by starting with an -transition to the old start state of and then following the accepting path through ; similarly, any string accepted by will be accepted by the new machine; these are the only strings that will be accepted by the new machine, as on any input all the new machine can do is make an -move to 's (or 's) start state, and from there will only be accepted by the new machine if it is accepted by (or ). Thus, the new machine accepts , which is , which is exactly the definition of .
(A pause before we continue: note that for the simplest regular expressions, the machines that we created to accept the languages generated by the regular expressions were in fact DFAs. In our last case above, however, we needed -transitions to build the new machine, and so if we were trying to prove that every regular language could be accepted by a DFA, our proof would be in trouble. THIS DOES NOT MEAN that the statement "every regular language can be accepted by a DFA" is false, just that we can't prove it using this kind of argument, and would have to find an alternative proof.)
Suppose you have machines and that accept and respectively. To build a machine that accepts proceed as follows. Make the start state of be the start state of the new machine. Make the final states of be the final states of the new machine. Add -transitions from the final states of to the start state of .
It should be fairly clear that this new machine accepts exactly those strings of the form where and : first of all, any string of this form will be accepted because implies there is a path that consumes from to a final state of ; a -transition moves to ; then implies there is a path that consumes from to a final state of ; and the final states of are the final states of the new machine, so will be accepted. Conversely, suppose is accepted by the new machine. Since the only final states of the new machine are in the old , and the only way to get into is to take a -transition from a final state of , this means that where takes the machine from its start state to a final state of , a -transition occurs, and then takes the machine from to a final state of . Clearly, and .
We leave the construction of an NFA that accepts from an NFA that accepts as an exercise.
The algorithm in this proof is commonly known as Thompson's Construction, credited
to Ken Thompson who, along with Dennis Ritchie, also designed and implemented Unix and
the C programming language.1 Several of the utilities that Thompson developed for
Unix, such as ed
and grep
, make use of regular expressions for searching and replacing
text.
NFA to Regular Expression
Theorem: Every language that is accepted by a DFA or an NFA is generated by a regular expression.
Proving this result is actually fairly involved and not very illuminating. Before presenting a proof, we will give an illustrative example of how one might actually go about extracting a regular expression from an NFA or a DFA. You can go on to read the proof if you are interested.
Example: Consider the DFA shown below:
Note that there is a loop from state back to state : any number of 's will keep the machine in state , and so we label the transition with the regular expression . We do the same thing to the transition labeled from . (Note that the result is no longer a DFA, but that doesn't concern us, we're just interested in developing a regular expression.)
Next we note that there is in fact a loop from to via . A regular expression that matches the strings that would move around the loop is . So we add a transition labeled from to , and remove the now-irrelevant -transition from to . (It is irrelevant because it is not part of any other loop from to .)
Next we note that there is also a loop from to via . A regular expression that matches the strings that would move around the loop is . Since the transitions in the loop are the only transitions to or from , we simply remove and replace it with a transition from to .
It is now clear from the diagram that strings of the form get you to state , and any number of repetitions of strings that match or will keep you there. So the machine accepts .
Proof: We prove that the language accepted by a DFA is regular. The proof for NFAs follows from the equivalence between DFAs and NFAs.
Suppose that is a DFA, where . Let be the number of states in , and write . We want to consider computations in which starts in some state , reads a string , and ends in state . In such a computation, might go through a series of intermediates states between and : We are interested in computations in which all of the intermediate states——are in the set , for some number . We define to be the set of all strings in that are consumed by such a computation. That is, if and only if when starts in state and reads , it ends in state , and all the intermediate states between and are in the set . is a language over . We show that for , , .
Consider the language . For , the set of allowable intermediate states is empty. Since there can be no intermediate states, it follows that there can be at most one step in the computation that starts in state , reads , and ends in state . So, can be at most one. This means that is finite, and hence is regular. (In fact, , for , and . Note that in many cases, will be the empty set.)
We now proceed by induction on to show that is regular for all and . We have proved the base case, . Suppose that we already know that is regular for all and all . We need to show that is regular for all and . In fact, which is regular because is regular for all and , and because the union, concatenation, and Kleene star of regular languages are regular.
To see that the above equation holds, consider a string . Now, if and only if when starts in state and reads , it ends in state , with all intermediate states in the computation in the set . Consider such a computation. There are two cases: Either occurs as an intermediate state in the computation, or it does not. If it does not occur, then all the intermediate states are in the set , which means that in fact . If does occur as an intermediate state in the computation, then we can break the computation into phases, by dividing it at each point where occurs as an intermediate state. This breaks into a concatenation . The string is consumed in the first phase of the computation, during which goes from state to the first occurrence of ; since the intermediate states in this computation are in the set , . The string is consumed by the last phase of the computation, in which goes from the final occurrence of to , so that . And each string is consumed in a phase of the computation in which goes from one occurrence of to the next occurrence of , so that . This means that .
We now know, in particular, that is a regular language for all . But consists of all strings such that when starts in state and reads , it ends in state (with no restriction on the intermediate states in the computation, since every state of is in the set ). To finish the proof that is regular, it is only necessary to note that which is regular since it is a union of regular languages. This equation is true since a string is in if and only if when starts in state and reads , in ends in some accepting state . This is the same as saying for some with .
Closure Properties for Regular Languages
We have already seen that if two languages and are regular, then so are , , and (and of course ). We have not yet seen, however, how the common set operations intersection and complementation affect regularity. Is the complement of a regular language regular? How about the intersection of two regular languages?
Both of these questions can be answered by thinking of regular languages in terms of their acceptance by DFAs. Let's consider first the question of complementation. Suppose we have an arbitrary regular language . We know there is a DFA that accepts . Pause a moment and try to think of a modification that you could make to that would produce a new machine that accepts …. Okay, the obvious thing to try is to make be a copy of with all final states of becoming non-final states of and vice versa. This is in fact exactly right: does in fact accept . To verify this, consider an arbitrary string . The transition functions for the two machines and are identical, so is the same state in both and ; if that state is accepting in then it is non-accepting in , so if is accepted by it is not accepted by ; if the state is non-accepting in then it is accepting in , so if is not accepted by then it is accepted by . Thus accepts exactly those strings that does not, and hence accepts .
It is worth pausing for a moment and looking at the above argument a bit longer. Would the argument have worked if we had looked at an arbitrary language and an arbitrary NFA that accepted ? That is, if we had built a new machine in which the final and non-final states had been switched, would the new NFA accept the complement of the language accepted by ? The answer is "not necessarily". Remember that acceptance in an NFA is determined based on whether or not at least one of the states reached by a string is accepting. So any string with the property that contains both accepting and non-accepting states of would be accepted both by and by .
Now let's turn to the question of intersection. Given two regular languages and , is regular? Again, it is useful to think in terms of DFAs: given machines and that accept and , can you use them to build a new machine that accepts ? The answer is yes, and the idea behind the construction bears some resemblance to that behind the NFA-to-DFA construction. We want a new machine where transitions reflect the transitions of both and simultaneously, and we want to accept a string only if those sequences of transitions lead to final states in both and . So we associate the states of our new machine with pairs of states from and . For each state in the new machine and input symbol , define to be the state . The start state of is , where is the start state of . The final states of are the the states of the form where is an accepting state of and is an accepting state of . You should convince yourself that accepts a string iff is accepted by both and .
The results of the previous section and the preceding discussion are summarized by the following theorem:
Theorem: The intersection of two regular languages is a regular language.
The union of two regular languages is a regular language.
The concatenation of two regular languages is a regular language.
The complement of a regular language is a regular language.
The Kleene closure of a regular language is a regular language.
Exercises
- Give a DFA that accepts the intersection of the languages accepted by the machines shown below. (Suggestion: use the construction discussed just before the previous theorem.)
Complete Thompson's Construction by showing how to modify a machine that accepts into a machine that accepts .
Using Thompson's Construction, build an NFA that accepts .
Prove that the reverse of a regular language is regular.
Answer
Given a DFA for the language, construct an NFA by reversing all of the edges. The original start state should become the only final state, and there should be a new start state with -transitions to each of the original final states. If there is an accepting path from this new start state to the final state, then it must correspond to the reverse of an accepting path in the original machine. Therefore, the new NFA will accept exactly the reverse of the given regular langauge.
Show that for any DFA or NFA, there is an NFA with exactly one final state that accepts the same language.
Answer
Define the NFA by adding one new state. This state should be the only final state, and all of the original final states should instead have -transitions to the new state.
Suppose we change the model of NFAs to allow NFAs to have multiple start states. Show that for any "NFA" with multiple start states, there is an NFA with exactly one start state that accepts the same language.
Answer
Given such an extended NFA, define an ordinary NFA by adding a new state. This state should be the only start state, and it should have -transitions to each of the original start states.
Show that the closure of regular languages under both union and complement is enough to show that they are also closed under intersection.
Answer
This is just an application of DeMorgan's laws: .
- Thompson is actually credited with creating B, while Ritchie created its successor C.↩