Context-Free Grammars
(Content adapted from Critchlow & Eck)
Both natural languages, such as English, and the artificial languages used for programming have a structure known as grammar or syntax. In order to form legal sentences or programs, the parts of the language must be fit together according to certain rules. For natural languages, the rules are somewhat informal (although high-school English teachers might have us believe differently). For programming languages, the rules are absolute, and programs that violate the rules will be rejected by a compiler.
In this chapter, we will study formal grammars and languages defined by them. The languages we look at will, for the most part, be "toy" languages, compared to natural languages or even to programming languages, but the ideas and techniques are basic to any study of language. In fact, many of the ideas arose almost simultaneously in the 1950s in the work of linguists who were studying natural language and programmers who were looking for ways to specify the syntax of programming languages.
The grammars in this chapter are generative grammars. A generative grammar is a set of rules that can be used to generate all the legal strings in a language. We will also consider the closely related idea of parsing. To parse a string means to determine how that string can be generated according to the rules.
This chapter is a continuation of the preceding chapter. Like a regular expression, a grammar is a way to specify a possibly infinite language with a finite amount of information. In fact, we will see that every regular language can be specified by a certain simple type of grammar. We will also see that some languages that can be specified by grammars are not regular.
In its most general form, a grammar is a set of rewriting rules. A rewriting rule specifies that a certain string of symbols can be substituted for all or part of another string. If $w$ and $u$ are strings, then $w \longrightarrow u$ is a rewriting rule that specifies that the string $w$ can be replaced by the string $u$. The symbol "$\longrightarrow$" is read "can be rewritten as." Rewriting rules are also called production rules or productions, and "$\longrightarrow$" can also be read as "produces." For example, if we consider strings over the alphabet $\{a,b,c\}$, then the production rule $aba \longrightarrow cc$ can be applied to the string $abcabac$ to give the string $abcccc$. The substring $aba$ in the string $abcabac$ has been replaced with $cc$.
In a context-free grammar, every rewriting rule has the form $A \longrightarrow w$, where $A$ is a single symbol and $w$ is a string of zero or more symbols. (The grammar is "context-free" in the sense that $w$ can be substituted for $A$ wherever $A$ occurs in a string, regardless of the surrounding context in which $A$ occurs.) The symbols that occur on the left-hand sides of production rules in a context-free grammar are called non-terminal symbols. By convention, the non-terminal symbols are usually uppercase letters. The strings on the right-hand sides of the production rules can include non-terminal symbols as well as other symbols, which are called terminal symbols. By convention, the terminal symbols are usually lowercase letters. Here are some typical production rules that might occur in context-free grammars:

$$\begin{aligned} A &\longrightarrow aAbB \\ S &\longrightarrow SS \\ C &\longrightarrow Acc \\ B &\longrightarrow b \\ A &\longrightarrow \varepsilon \end{aligned}$$
In the last rule in this list, $\varepsilon$ represents the empty string, as usual. For example, this rule could be applied to the string $aBaAcA$ to produce the string $aBacA$. The first occurrence of the symbol $A$ in $aBaAcA$ has been replaced by the empty string, which is just another way of saying that the symbol $A$ has been dropped from the string.
In every context-free grammar, one of the non-terminal symbols is designated as the start symbol of the grammar. The start symbol is often, though not always, denoted by $S$. When the grammar is used to generate strings in a language, the idea is to start with a string consisting of nothing but the start symbol. Then a sequence of production rules is applied. Each application of a production rule to the string transforms the string to a new string. If and when this process produces a string that consists purely of terminal symbols, the process ends. That string of terminal symbols is one of the strings in the language generated by the grammar. In fact, the language consists precisely of all strings of terminal symbols that can be produced in this way.
As a simple example, consider a grammar that has three production rules: $S \longrightarrow aS$, $S \longrightarrow bS$, and $S \longrightarrow b$. In this example, $S$ is the only non-terminal symbol, and the terminal symbols are $a$ and $b$. Starting from the string $S$, we can apply any of the three rules of the grammar to produce either $aS$, $bS$, or $b$. Since the string $b$ contains no non-terminals, we see that $b$ is one of the strings in the language generated by this grammar. The strings $aS$ and $bS$ are not in that language, since they contain the non-terminal symbol $S$, but we can continue to apply production rules to these strings. From $aS$, for example, we can obtain $aaS$, $abS$, or $ab$. From $abS$, we go on to obtain $abaS$, $abbS$, or $abb$. The strings $ab$ and $abb$ are in the language generated by the grammar. It's not hard to see that any string of $a$'s and $b$'s that ends with a $b$ can be generated by this grammar, and that these are the only strings that can be generated. That is, the language generated by this grammar is the regular language specified by the regular expression $(a \mid b)^* b$.
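The generation process just described is easy to simulate. Here is a minimal Python sketch (my own illustration, not part of the original text) that enumerates, by breadth-first search, every string this grammar generates up to a given length:

```python
from collections import deque

# The example grammar: S -> aS, S -> bS, S -> b.
RULES = {"S": ["aS", "bS", "b"]}

def generate(max_len):
    """Enumerate every terminal string of length <= max_len derivable from S."""
    results, seen = set(), {"S"}
    queue = deque(["S"])
    while queue:
        s = queue.popleft()
        if all(c not in RULES for c in s):   # no non-terminals left: s is in the language
            results.add(s)
            continue
        for i, c in enumerate(s):
            if c not in RULES:
                continue
            for rhs in RULES[c]:
                t = s[:i] + rhs + s[i + 1:]
                # No rule shrinks a string, so longer forms can be pruned.
                if len(t) <= max_len and t not in seen:
                    seen.add(t)
                    queue.append(t)
    return results

print(sorted(generate(3)))  # every string of a's and b's ending in b, length <= 3
```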
It's time to give some formal definitions of the concepts which we have been discussing.
A context-free grammar is a 4-tuple $(V, \Sigma, P, S)$, where:

- $V$ is a finite set of symbols. The elements of $V$ are the non-terminal symbols of the grammar.
- $\Sigma$ is a finite set of symbols such that $V \cap \Sigma = \emptyset$. The elements of $\Sigma$ are the terminal symbols of the grammar.
- $P$ is a set of production rules. Each rule is of the form $A \longrightarrow w$, where $A$ is one of the symbols in $V$ and $w$ is a string in the language $(V \cup \Sigma)^*$.
- $S \in V$. $S$ is the start symbol of the grammar.
Even though this is the formal definition, grammars are often specified informally simply by listing the set of production rules. When this is done it is assumed, unless otherwise specified, that the non-terminal symbols are just the symbols that occur on the left-hand sides of production rules of the grammar. The terminal symbols are all the other symbols that occur on the right-hand sides of production rules. The start symbol is the symbol that occurs on the left-hand side of the first production rule in the list. Thus, the list of production rules

$$\begin{aligned} T &\longrightarrow TT \\ T &\longrightarrow A \\ A &\longrightarrow aAa \\ A &\longrightarrow bB \\ B &\longrightarrow bB \\ B &\longrightarrow \varepsilon \end{aligned}$$

specifies a grammar $G = (V, \Sigma, P, T)$ where $V$ is $\{T, A, B\}$, $\Sigma$ is $\{a, b\}$, and $T$ is the start symbol. $P$, of course, is a set containing the six production rules in the list.
Let $G = (V, \Sigma, P, S)$ be a context-free grammar. Suppose that $x$ and $y$ are strings in the language $(V \cup \Sigma)^*$. The notation $x \Longrightarrow_G y$ is used to express the fact that $y$ can be obtained from $x$ by applying one of the production rules in $P$. To be more exact, we say that $x \Longrightarrow_G y$ if and only if there is a production rule $A \longrightarrow w$ in the grammar and two strings $u$ and $v$ in the language $(V \cup \Sigma)^*$ such that $x = uAv$ and $y = uwv$. The fact that $x = uAv$ is just a way of saying that $A$ occurs somewhere in $x$. When the production rule $A \longrightarrow w$ is applied to substitute $w$ for $A$ in $uAv$, the result is $uwv$, which is $y$. Note that either $u$ or $v$ or both can be the empty string.
If a string $y$ can be obtained from a string $x$ by applying a sequence of zero or more production rules, we write $x \Longrightarrow^*_G y$. In most cases, the "$G$" in the notations $\Longrightarrow_G$ and $\Longrightarrow^*_G$ will be omitted, assuming that the grammar in question is understood. Note that $\Longrightarrow$ is a relation on the set $(V \cup \Sigma)^*$. The relation $\Longrightarrow^*$ is the reflexive, transitive closure of that relation. (This explains the use of "$*$", which is usually used to denote the transitive, but not necessarily reflexive, closure of a relation. In this case, $\Longrightarrow^*$ is reflexive as well as transitive since $x \Longrightarrow^* x$ is true for any string $x$.) For example, using the grammar that is defined by the above list of production rules, we have

$$aTB \Longrightarrow aTTB \Longrightarrow aTTbB \Longrightarrow aTAbB \Longrightarrow aTAb$$
From this, it follows that $aTB \Longrightarrow^* aTAb$. The relation $\Longrightarrow$ is read "yields" or "produces," while $\Longrightarrow^*$ can be read "yields in zero or more steps" or "produces in zero or more steps."
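The one-step relation is mechanical enough to implement directly. The following minimal Python sketch (mine, not from the original text) computes every string $y$ with $x \Longrightarrow y$ by trying each production rule at each position of $x$:

```python
def step(x, rules):
    """All strings y with x => y, where rules maps each non-terminal A
    to a list of right-hand sides w (so each entry encodes A -> w)."""
    results = []
    for i, symbol in enumerate(x):        # choose an occurrence of A: x = uAv
        for w in rules.get(symbol, []):   # choose a production rule A -> w
            results.append(x[:i] + w + x[i + 1:])   # y = uwv
    return results

rules = {"T": ["TT", "A"], "A": ["aAa", "bB"], "B": ["bB", ""]}
print(step("aTB", rules))   # ['aTTB', 'aAB', 'aTbB', 'aT']
```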
The following theorem states some simple facts about the relations $\Longrightarrow$ and $\Longrightarrow^*$:

Theorem: Let $G$ be the context-free grammar $(V, \Sigma, P, S)$. Then:

1. If $x$ and $y$ are strings in $(V \cup \Sigma)^*$ such that $x \Longrightarrow y$, then $x \Longrightarrow^* y$.
2. If $x$, $y$, and $z$ are strings in $(V \cup \Sigma)^*$ such that $x \Longrightarrow^* y$ and $y \Longrightarrow^* z$, then $x \Longrightarrow^* z$.
3. If $x$ and $y$ are strings in $(V \cup \Sigma)^*$ such that $x \Longrightarrow y$, and if $s$ and $t$ are any strings in $(V \cup \Sigma)^*$, then $sxt \Longrightarrow syt$.
4. If $x$ and $y$ are strings in $(V \cup \Sigma)^*$ such that $x \Longrightarrow^* y$, and if $s$ and $t$ are any strings in $(V \cup \Sigma)^*$, then $sxt \Longrightarrow^* syt$.
Proof: Parts 1 and 2 follow from the fact that $\Longrightarrow^*$ is the reflexive, transitive closure of $\Longrightarrow$. Part 4 follows easily from Part 3. (I leave this as an exercise.) To prove Part 3, suppose that $x$, $y$, $s$, and $t$ are strings such that $x \Longrightarrow y$. By definition, this means that there exist strings $u$ and $v$ and a production rule $A \longrightarrow w$ such that $x = uAv$ and $y = uwv$. But then we also have $sxt = suAvt$ and $syt = suwvt$. These two equations, along with the existence of the production rule $A \longrightarrow w$, show, by definition, that $sxt \Longrightarrow syt$.
We can use $\Longrightarrow^*$ to give a formal definition of the language generated by a context-free grammar:
Suppose that $G = (V, \Sigma, P, S)$ is a context-free grammar. Then the language generated by $G$ is the language $L(G)$ over the alphabet $\Sigma$ defined by

$$L(G) = \{\, w \in \Sigma^* \mid S \Longrightarrow^*_G w \,\}$$
That is, $L(G)$ contains any string of terminal symbols that can be obtained by starting with the string consisting of the start symbol, $S$, and applying a sequence of production rules.
A language $L$ is said to be a context-free language if there is a context-free grammar $G$ such that $L(G)$ is $L$. Note that there might be many different context-free grammars that generate the same context-free language. Two context-free grammars that generate the same language are said to be equivalent.
Suppose $G$ is a context-free grammar with start symbol $S$ and suppose $w \in L(G)$. By definition, this means that there is a sequence of one or more applications of production rules which produces the string $w$ from $S$. This sequence has the form $S \Longrightarrow x_1 \Longrightarrow x_2 \Longrightarrow \cdots \Longrightarrow w$. Such a sequence is called a derivation of $w$ (in the grammar $G$). Note that $w$ might have more than one derivation. That is, it might be possible to produce $w$ in several different ways.
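Since a derivation is just a finite sequence of rule applications, a program can print one. The following minimal Python sketch (my own illustration; the rule-dictionary format is the same one used above) rewrites the leftmost non-terminal at random until only terminals remain:

```python
import random

def derive(start, rules, max_steps=50):
    """Print a derivation start => x1 => x2 => ... => w for a random w in L(G)."""
    x = start
    print(x)
    for _ in range(max_steps):
        pos = next((i for i, c in enumerate(x) if c in rules), None)
        if pos is None:                    # no non-terminals left: x is in L(G)
            return x
        w = random.choice(rules[x[pos]])   # pick a rule for the leftmost non-terminal
        x = x[:pos] + w + x[pos + 1:]
        print("=>", x)
    raise RuntimeError("gave up after max_steps rule applications")

derive("S", {"S": ["aS", "bS", "b"]})   # e.g. prints S => bS => baS => bab
```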
Consider the language $L = \{a^n b^n \mid n \in \mathbb{N}\}$. We already know that $L$ is not a regular language. However, it is a context-free language. That is, there is a context-free grammar $G$ such that $L$ is the language generated by $G$. This gives us our first theorem about grammars:
Theorem: Let $L$ be the language $L = \{a^n b^n \mid n \in \mathbb{N}\}$. Let $G$ be the context-free grammar $(V, \Sigma, P, S)$, where $V = \{S\}$, $\Sigma = \{a, b\}$, and $P$ consists of the productions

$$\begin{aligned} S &\longrightarrow aSb \\ S &\longrightarrow \varepsilon \end{aligned}$$
Then $L = L(G)$, so that $L$ is a context-free language. In particular, there exist context-free languages which are not regular.
Proof: To show that $L = L(G)$, we must show both that $L \subseteq L(G)$ and that $L(G) \subseteq L$. To show that $L \subseteq L(G)$, let $w$ be an arbitrary element of $L$. By definition of $L$, $w = a^n b^n$ for some $n \in \mathbb{N}$. We show that $w \in L(G)$ by induction on $n$. In the case where $n = 0$, we have $w = \varepsilon$. Now, $\varepsilon \in L(G)$, since $\varepsilon$ can be produced from the start symbol $S$ by an application of the rule $S \longrightarrow \varepsilon$, so our claim is true for $n = 0$. Now, suppose that $k \in \mathbb{N}$ and that we already know that $a^k b^k \in L(G)$. We must show that $a^{k+1} b^{k+1} \in L(G)$. Since $S \Longrightarrow^* a^k b^k$, we also have, by the previous theorem, that $aSb \Longrightarrow^* a\,a^k b^k\,b$. That is, $aSb \Longrightarrow^* a^{k+1} b^{k+1}$. Combining this with the production rule $S \longrightarrow aSb$, we see that $S \Longrightarrow^* a^{k+1} b^{k+1}$. This means that $a^{k+1} b^{k+1} \in L(G)$, as we wanted to show. This completes the proof that $L \subseteq L(G)$.
To show that $L(G) \subseteq L$, suppose that $w \in L(G)$. That is, $S \Longrightarrow^* w$. We must show that $w = a^n b^n$ for some $n$. Since $S \Longrightarrow^* w$, there is a derivation $S \Longrightarrow x_1 \Longrightarrow x_2 \Longrightarrow \cdots \Longrightarrow x_n$, where $w = x_n$. We first prove by induction on $n$ that in any derivation $S \Longrightarrow x_1 \Longrightarrow x_2 \Longrightarrow \cdots \Longrightarrow x_n$, we must have either $x_n = a^n S b^n$ or $x_n = a^{n-1} b^{n-1}$. Consider the case $n = 1$. Suppose $S \Longrightarrow x_1$. Then, we must have that $S \longrightarrow x_1$ is a rule in the grammar, so $x_1$ must be either $aSb$ or $\varepsilon$. Since $aSb = a^1 S b^1$ and $\varepsilon = a^0 b^0$, $x_1$ is of the required form. Next, consider the inductive case. Suppose that $k > 1$ and we already know that in any derivation $S \Longrightarrow x_1 \Longrightarrow \cdots \Longrightarrow x_{k-1}$, we must have $x_{k-1} = a^{k-1} S b^{k-1}$ or $x_{k-1} = a^{k-2} b^{k-2}$. Suppose that $S \Longrightarrow x_1 \Longrightarrow \cdots \Longrightarrow x_{k-1} \Longrightarrow x_k$. We know by induction that $x_{k-1} = a^{k-1} S b^{k-1}$ or $x_{k-1} = a^{k-2} b^{k-2}$, but since $x_{k-1} \Longrightarrow x_k$ and $a^{k-2} b^{k-2}$ contains no non-terminal symbols, we must have $x_{k-1} = a^{k-1} S b^{k-1}$. Since $x_k$ is obtained by applying one of the production rules $S \longrightarrow aSb$ or $S \longrightarrow \varepsilon$ to $x_{k-1}$, $x_k$ is either $a^{k-1}\,aSb\,b^{k-1}$ or $a^{k-1} b^{k-1}$. That is, $x_k$ is either $a^k S b^k$ or $a^{k-1} b^{k-1}$, as we wanted to show. This completes the induction. Turning back to $w$, we see that $w$ must be of the form $a^n S b^n$ or of the form $a^{n-1} b^{n-1}$. But since $w \in L(G)$, it can contain no non-terminal symbols, so $w$ must be of the form $a^{n-1} b^{n-1}$, as we wanted to show. This completes the proof that $L(G) \subseteq L$.
I have given a very formal and detailed proof of this theorem, to show how it can be done and to show how induction plays a role in many proofs about grammars. However, a more informal proof of the theorem would probably be acceptable and might even be more convincing. To show that $L \subseteq L(G)$, we could just note that the derivation $S \Longrightarrow aSb \Longrightarrow aaSbb \Longrightarrow \cdots \Longrightarrow a^n S b^n \Longrightarrow a^n b^n$ demonstrates that $a^n b^n \in L(G)$. On the other hand, it is clear that every derivation for this grammar must be of this form, so every string in $L(G)$ is of the form $a^n b^n$.
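Since the grammar has only the rules $S \longrightarrow aSb$ and $S \longrightarrow \varepsilon$, a recognizer for $L(G)$ can simply mirror them. Here is a minimal Python sketch (my own, not from the text):

```python
def in_L(w):
    """True iff w can be derived using S -> aSb and S -> eps, i.e. w = a^n b^n."""
    if w == "":                               # matches S -> eps
        return True
    if w.startswith("a") and w.endswith("b"):
        return in_L(w[1:-1])                  # matches S -> aSb; recurse on the middle
    return False

assert in_L("") and in_L("aabb") and not in_L("abab")
```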
For another example, consider the language $\{a^n b^m \mid n \geq m \geq 0\}$. Let's try to design a grammar that generates this language. This is similar to the previous example, but now we want to include strings that contain more $a$'s than $b$'s. The production rule $S \longrightarrow aSb$ always produces the same number of $a$'s and $b$'s. Can we modify this idea to produce more $a$'s than $b$'s?
One approach would be to produce a string containing just as many $a$'s as $b$'s, and then to add some extra $a$'s. A rule that can generate any number of $a$'s is $A \longrightarrow aA$. After applying the rule $S \longrightarrow aSb$ for a while, we want to move to a new state in which we apply the rule $A \longrightarrow aA$. We can get to the new state by applying a rule $S \longrightarrow A$ that changes the $S$ into an $A$. We still need a way to finish the process, which means getting rid of all non-terminal symbols in the string. For this, we can use the rule $A \longrightarrow \varepsilon$. Putting these rules together, we get the grammar

$$\begin{aligned} S &\longrightarrow aSb \\ S &\longrightarrow A \\ A &\longrightarrow aA \\ A &\longrightarrow \varepsilon \end{aligned}$$
This grammar does indeed generate the language $\{a^n b^m \mid n \geq m \geq 0\}$. With slight variations on this grammar, we can produce other related languages. For example, if we replace the rule $A \longrightarrow \varepsilon$ with $A \longrightarrow a$, we get the language $\{a^n b^m \mid n > m \geq 0\}$.
There are other ways to generate the language $\{a^n b^m \mid n \geq m \geq 0\}$. For example, the extra non-terminal symbol, $A$, is not really necessary, if we allow $S$ to sometimes produce a single $a$ without a $b$. This leads to the grammar

$$\begin{aligned} S &\longrightarrow aSb \\ S &\longrightarrow aS \\ S &\longrightarrow \varepsilon \end{aligned}$$
(But note that the rule $S \longrightarrow Sa$ would not work in place of $S \longrightarrow aS$, since it would allow the production of strings in which an $a$ can follow a $b$, and there are no such strings in the language $\{a^n b^m \mid n \geq m \geq 0\}$.) And here are two more grammars that generate this language:

$$\begin{aligned} S &\longrightarrow AB \\ A &\longrightarrow aA \\ A &\longrightarrow \varepsilon \\ B &\longrightarrow aBb \\ B &\longrightarrow \varepsilon \end{aligned}$$

and

$$\begin{aligned} S &\longrightarrow aS \\ S &\longrightarrow C \\ C &\longrightarrow aCb \\ C &\longrightarrow \varepsilon \end{aligned}$$
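With several candidate grammars for the same language, a quick sanity check is to enumerate all short strings each one generates and compare the results. The sketch below (my own; the names g1 and g2 are arbitrary, and it reuses the rule-dictionary convention from earlier) compares the grammar with the helper symbol $A$ against the three-rule grammar, expanding only the leftmost non-terminal, which suffices for context-free grammars:

```python
from collections import deque

def language_upto(rules, start, max_len):
    """All terminal strings of length <= max_len generated by the grammar."""
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        s = queue.popleft()
        pos = next((i for i, c in enumerate(s) if c in rules), None)
        if pos is None:                       # only terminals remain
            found.add(s)
            continue
        for w in rules[s[pos]]:               # leftmost derivations are enough
            t = s[:pos] + w + s[pos + 1:]
            # Prune: terminals already in t can never disappear.
            if sum(c not in rules for c in t) <= max_len and t not in seen:
                seen.add(t)
                queue.append(t)
    return found

g1 = {"S": ["aSb", "A"], "A": ["aA", ""]}   # grammar with the helper symbol A
g2 = {"S": ["aSb", "aS", ""]}               # three-rule grammar
assert language_upto(g1, "S", 6) == language_upto(g2, "S", 6)
```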
Consider another variation on the language $\{a^n b^n\}$, in which the $a$'s and $b$'s can occur in any order, but the number of $a$'s is still equal to the number of $b$'s. This language can be defined as $L = \{ w \in \{a,b\}^* \mid n_a(w) = n_b(w) \}$, where $n_a(w)$ is the number of $a$'s in $w$ and $n_b(w)$ is the number of $b$'s. This language includes strings such as $abab$, $baab$, and $bbbaaa$.
Let's start with the grammar containing the rules $S \longrightarrow aSb$ and $S \longrightarrow \varepsilon$. We can try adding the rule $S \longrightarrow bSa$. Every string that can be generated using these three rules is in the language $L$. However, not every string in $L$ can be generated. A derivation that starts with $S \Longrightarrow aSb$ can only produce strings that begin with $a$ and end with $b$. A derivation that starts with $S \Longrightarrow bSa$ can only generate strings that begin with $b$ and end with $a$. There is no way to generate the strings $abba$ or $baab$, which are in the language $L$. But we shall see that any string in $L$ that begins and ends with the same letter can be written in the form $xy$, where $x$ and $y$ are shorter strings in $L$. To produce strings of this form, we need one more rule, $S \longrightarrow SS$. The complete set of production rules for the language $L$ is

$$\begin{aligned} S &\longrightarrow \varepsilon \\ S &\longrightarrow aSb \\ S &\longrightarrow bSa \\ S &\longrightarrow SS \end{aligned}$$
It's easy to see that every string that can be generated using these rules is in $L$, since each rule introduces the same number of $a$'s as $b$'s. But we also need to check that every string $w$ in $L$ can be generated by these rules. This can be done by induction on the length of $w$, using the second form of the principle of mathematical induction. In the base case, $|w| = 0$ and $w = \varepsilon$. In this case, $w \in L(G)$ since $S \Longrightarrow \varepsilon$ in one step. Suppose $|w| = k$, where $k > 0$, and suppose that we already know that for any $x \in L$ with $|x| < k$, $S \Longrightarrow^* x$. To finish the induction we must show, based on this induction hypothesis, that $S \Longrightarrow^* w$.
Suppose that the first and last characters of $w$ are different. Then $w$ is either of the form $axb$ or of the form $bxa$, for some string $x$. Let's assume that $w$ is of the form $axb$. (The case where $w$ is of the form $bxa$ is handled in a similar way.) Since $w$ has the same number of $a$'s and $b$'s, and since $x$ has one fewer $a$ than $w$ and one fewer $b$ than $w$, $x$ must also have the same number of $a$'s as $b$'s. That is, $x \in L$. But $|x| = |w| - 2 < k$, so by the induction hypothesis, $S \Longrightarrow^* x$. By the theorem above, we then get $aSb \Longrightarrow^* axb$. Combining this with the fact that $S \Longrightarrow aSb$, we get that $S \Longrightarrow^* axb$, that is, $S \Longrightarrow^* w$. This proves that $w \in L(G)$.
Finally, suppose that the first and last characters of $w$ are the same. Let's say that $w$ begins and ends with $a$. (The case where $w$ begins and ends with $b$ is handled in a similar way.) I claim that $w$ can be written in the form $xy$, where $x \in L$ and $y \in L$ and neither $x$ nor $y$ is the empty string. This will finish the induction, since we will then have by the induction hypothesis that $S \Longrightarrow^* x$ and $S \Longrightarrow^* y$, and we can derive $xy$ from $S$ by first applying the rule $S \longrightarrow SS$ and then using the first $S$ on the right-hand side to derive $x$ and the second to derive $y$.
It only remains to figure out how to divide $w$ into two strings $x$ and $y$ which are both in $L$. The technique that is used is one that is more generally useful. Suppose that $w = c_1 c_2 \cdots c_k$, where each $c_i$ is either $a$ or $b$. Consider the sequence of integers $r_1$, $r_2$, …, $r_k$, where for each $i$, $r_i$ is the number of $a$'s in $c_1 c_2 \cdots c_i$ minus the number of $b$'s in $c_1 c_2 \cdots c_i$. Since $c_1 = a$, $r_1 = 1$. Since $w \in L$, $r_k = 0$. And since $c_k = a$, we must have $r_{k-1} = -1$. Furthermore, the difference between $r_{i+1}$ and $r_i$ is either $+1$ or $-1$, for each $i = 1, 2, \dots, k-1$.
Since $r_1 = 1$ and $r_{k-1} = -1$ and the value of $r_i$ goes up or down by 1 when $i$ increases by 1, $r_i$ must be zero for some $i$ between 1 and $k-1$. That is, $r_i$ cannot get from 1 to $-1$ unless it passes through zero. Let $i$ be a number between 1 and $k-1$ such that $r_i = 0$. Let $x = c_1 c_2 \cdots c_i$ and let $y = c_{i+1} \cdots c_k$. Note that $xy = w$. The fact that $r_i = 0$ means that the string $c_1 c_2 \cdots c_i$ has the same number of $a$'s and $b$'s, so $x \in L$. It follows automatically that $y \in L$ also. Since $i$ is strictly between 1 and $k$, neither $x$ nor $y$ is the empty string. This is all that we needed to show to finish the proof that $L = L(G)$.
The basic idea of this proof is that if $w$ contains the same number of $a$'s as $b$'s, then an $a$ at the beginning of $w$ must have a "matching" $b$ somewhere in $w$. This $b$ matches the $a$ in the sense that the corresponding $r_i$ is zero, and the $b$ marks the end of a string which contains the same number of $a$'s as $b$'s. For example, in the string $aababbab$, the $a$ at the beginning of the string is matched by the third $b$, since $aababb$ is the shortest prefix of $aababbab$ that has an equal number of $a$'s and $b$'s.
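The proof's splitting step is constructive: compute the running counts $r_i$ and cut at the first proper prefix where the count returns to zero. A minimal Python sketch (mine, not from the text):

```python
def split_balanced(w):
    """Split w (equal a's and b's, first and last letters equal) as w = x + y,
    where x and y are non-empty and each has equal numbers of a's and b's."""
    r = 0
    for i, c in enumerate(w, start=1):
        r += 1 if c == "a" else -1        # r is r_i, the running #a - #b count
        if r == 0 and i < len(w):         # first proper prefix with r_i == 0
            return w[:i], w[i:]
    raise ValueError("no proper balanced split exists")

print(split_balanced("aababbab"))   # ('aababb', 'ab')
```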
Closely related to this idea of matching $a$'s and $b$'s is the idea of balanced parentheses. Consider a string made up of parentheses, such as $(()(()))(())$. The parentheses in this sample string are balanced because each left parenthesis has a matching right parenthesis, and the matching pairs are properly nested. A careful definition uses the sort of integer sequence introduced in the above proof. Let $w$ be a string of parentheses. Write $w = c_1 c_2 \cdots c_n$, where each $c_i$ is either $($ or $)$. Define a sequence of integers $r_1$, $r_2$, …, $r_n$, where $r_i$ is the number of left parentheses in $c_1 c_2 \cdots c_i$ minus the number of right parentheses. We say that the parentheses in $w$ are balanced if $r_n = 0$ and $r_i \geq 0$ for all $i = 1, 2, \dots, n$. The fact that $r_n = 0$ says that $w$ contains the same number of left parentheses as right parentheses. The fact that $r_i \geq 0$ means that the nesting of pairs of parentheses is correct: you can't have a right parenthesis unless it is balanced by a left parenthesis in the preceding part of the string. The language that consists of all balanced strings of parentheses is context-free. It is generated by the grammar

$$\begin{aligned} S &\longrightarrow (S) \\ S &\longrightarrow SS \\ S &\longrightarrow \varepsilon \end{aligned}$$
The proof is similar to the preceding proof about strings of $a$'s and $b$'s. (It might seem that I've made an awfully big fuss about matching and balancing. The reason is that this is one of the few things that we can do with context-free languages that we can't do with regular languages.)
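The definition in terms of the sequence $r_i$ translates directly into a one-pass membership test. A minimal Python sketch (my own illustration):

```python
def is_balanced(w):
    """True iff r_i >= 0 for every prefix of w and r_n == 0."""
    r = 0
    for c in w:
        r += 1 if c == "(" else -1
        if r < 0:              # a ')' appears with no unmatched '(' before it
            return False
    return r == 0

assert is_balanced("(()(()))(())") and not is_balanced("())(")
```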
Before leaving this section, we should look at a few more general results. Since we know that most operations on regular languages produce languages that are also regular, we can ask whether a similar result holds for context-free languages. We will see later that the intersection of two context-free languages is not necessarily context-free. Also, the complement of a context-free language is not necessarily context-free. However, some other operations on context-free languages do produce context-free languages.
Theorem: Suppose that $L$ and $M$ are context-free languages. Then the languages $L \cup M$, $LM$, and $L^*$ are also context-free.
Proof: I will prove only the first claim of the theorem, that $L \cup M$ is context-free. In the exercises for this section, you are asked to construct grammars for $LM$ and $L^*$ (without giving formal proofs that your answers are correct).
Let $G = (V, \Sigma, P, S)$ and $H = (W, \Gamma, Q, T)$ be context-free grammars such that $L = L(G)$ and $M = L(H)$. We can assume that $V \cap W = \emptyset$, since otherwise we could simply rename the non-terminal symbols in $W$. The idea of the proof is that to generate a string in $L \cup M$, we first decide whether we want a string in $L$ or a string in $M$. Once that decision is made, to make a string in $L$, we use production rules from $G$, while to make a string in $M$, we use rules from $H$. We have to design a grammar, $K$, to represent this process.
Let $R$ be a symbol that is not in any of the alphabets $V$, $W$, $\Sigma$, or $\Gamma$. $R$ will be the start symbol of $K$. The production rules for $K$ consist of all the production rules from $G$ and $H$ together with two new rules:

$$\begin{aligned} R &\longrightarrow S \\ R &\longrightarrow T \end{aligned}$$
Formally, $K$ is defined to be the grammar

$$K = (V \cup W \cup \{R\},\; \Sigma \cup \Gamma,\; P \cup Q \cup \{R \longrightarrow S,\, R \longrightarrow T\},\; R)$$
Suppose that $w \in L$. That is, $w \in L(G)$, so there is a derivation $S \Longrightarrow^*_G w$. Since every rule from $G$ is also a rule in $K$, it follows that $S \Longrightarrow^*_K w$. Combining this with the fact that $R \Longrightarrow_K S$, we have that $R \Longrightarrow^*_K w$, and $w \in L(K)$. This shows that $L \subseteq L(K)$. In an exactly similar way, we can show that $M \subseteq L(K)$. Thus, $L \cup M \subseteq L(K)$.
It remains to show that $L(K) \subseteq L \cup M$. Suppose $w \in L(K)$. Then there is a derivation $R \Longrightarrow^*_K w$. This derivation must begin with an application of one of the rules $R \longrightarrow S$ or $R \longrightarrow T$, since these are the only rules in which $R$ appears. If the first rule applied in the derivation is $R \longrightarrow S$, then the remainder of the derivation shows that $S \Longrightarrow^*_K w$. Starting from $S$, the only rules that can be applied are rules from $G$, so in fact we have $S \Longrightarrow^*_G w$. This shows that $w \in L$. Similarly, if the first rule applied in the derivation is $R \longrightarrow T$, then $w \in M$. In any case, $w \in L \cup M$. This proves that $L(K) \subseteq L \cup M$.
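The grammar $K$ built in this proof is purely mechanical to construct. Here is an illustrative Python sketch (mine; it assumes the non-terminal renaming has already been done, as in the proof):

```python
def union_grammar(g_rules, g_start, h_rules, h_start, new_start="R"):
    """Build a grammar for L(G) ∪ L(H) from rule dictionaries for G and H.

    Assumes G and H share no non-terminals and that new_start occurs in
    neither grammar (rename symbols beforehand if necessary)."""
    k_rules = dict(g_rules)                  # all production rules of G
    k_rules.update(h_rules)                  # ... plus all rules of H
    k_rules[new_start] = [g_start, h_start]  # the two new rules R -> S, R -> T
    return k_rules, new_start

# Union of {a^n b^n} with the earlier language of strings ending in b:
k_rules, k_start = union_grammar({"S": ["aSb", ""]}, "S",
                                 {"T": ["aT", "bT", "b"]}, "T")
```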
Finally, we should clarify the relationship between context-free languages and regular languages. We have already seen that there are context-free languages which are not regular. On the other hand, it turns out that every regular language is context-free. That is, given any regular language, there is a context-free grammar that generates that language. This means that any syntax that can be expressed by a regular expression, by a DFA, or by an NFA could also be expressed by a context-free grammar. In fact, we only need a certain restricted type of context-free grammar to duplicate the power of regular expressions.
A right-regular grammar is a context-free grammar in which the right-hand side of every production rule has one of the following forms: the empty string; a string consisting of a single non-terminal symbol; or a string consisting of a single terminal symbol followed by a single non-terminal symbol.
Examples of the types of production rule that are allowed in a right-regular grammar are $C \longrightarrow \varepsilon$, $A \longrightarrow B$, and $D \longrightarrow aC$. The idea of the proof is that given a right-regular grammar, we can build a corresponding NFA, and vice versa. The states of the NFA correspond to the non-terminal symbols of the grammar. The start symbol of the grammar corresponds to the starting state of the NFA. A production rule of the form $A \longrightarrow bC$ corresponds to a transition in the NFA from state $A$ to state $C$ while reading the symbol $b$. A production rule of the form $A \longrightarrow B$ corresponds to an $\varepsilon$-transition from state $A$ to state $B$ in the NFA. And a production rule of the form $A \longrightarrow \varepsilon$ exists in the grammar if and only if $A$ is a final state in the NFA. With this correspondence, a derivation of a string $w$ in the grammar corresponds to an execution path through the NFA as it accepts the string $w$. I won't give a complete proof here. You are welcome to work through the details if you want.
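To make the correspondence concrete, here is an illustrative Python sketch (my own; a sketch of the translation, not a proof) that turns a right-regular grammar into the transitions, start state, and final states of an NFA. Transitions on the symbol None stand for $\varepsilon$-transitions:

```python
def right_regular_to_nfa(rules, start):
    """Translate a right-regular grammar into an NFA.

    Each right-hand side is "", a single non-terminal, or a terminal
    followed by a non-terminal. Returns (transitions, start_state, finals),
    where transitions maps (state, symbol) to a set of next states."""
    transitions, finals = {}, set()
    for A, rhss in rules.items():
        for w in rhss:
            if w == "":                                   # A -> eps: A is final
                finals.add(A)
            elif w in rules:                              # A -> B: eps-transition
                transitions.setdefault((A, None), set()).add(w)
            else:                                         # A -> bC: read b, go to C
                transitions.setdefault((A, w[0]), set()).add(w[1:])
    return transitions, start, finals

# S -> aS | bS | bF, F -> eps is right-regular and generates (a|b)*b.
nfa = right_regular_to_nfa({"S": ["aS", "bS", "bF"], "F": [""]}, "S")
```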
But the important fact is:

Theorem: A language $L$ is regular if and only if there is a right-regular grammar $G$ such that $L = L(G)$. In particular, every regular language is context-free.
Exercises
Show that Part 4 of the theorem about $\Longrightarrow$ and $\Longrightarrow^*$ follows from Part 3.
Answer
If $x \Longrightarrow^* y$, then there is a sequence of strings $x_0, x_1, \dots, x_n$, for some $n \geq 0$, such that $x = x_0$, $y = x_n$, and $x_i \Longrightarrow x_{i+1}$ for each $i < n$. By Part 3 of the theorem, we also know that $s x_i t \Longrightarrow s x_{i+1} t$ for each $i < n$, so we may conclude $sxt \Longrightarrow^* syt$.
Give a careful proof that the language is generated by the context-free grammar
Identify the language generated by each of the following context-free grammars.
Answer
Answer
Answer
Answer
For each of the following languages find a context-free grammar that generates the language:
Answer
Answer
Answer
Answer
Answer
Answer
Answer
Answer
We must either have or . In the first case, we match the 's with 's and 's by expanding and , and then generate at least one unmatched by expanding to . In the second case, if then when we stop generating 's we will force at least one unmatched by expanding to , and then can produce additional 's preceded by an arbitrary number of 's. Alternately, if , we will match all of the 's with 's, and some 's with 's, by expanding S and T, and then switch to generating at least one unmatched by expanding to .
Find a context-free grammar that generates the language .
Find a context-free grammar that generates the language .
A palindrome is a string that reads the same backwards and forwards, such as "mom", "radar", or "aabccbccbaa". That is, $w$ is a palindrome if $w = w^R$. Let $L = \{ w \in \{a, b, c\}^* \mid w = w^R \}$. Show that $L$ is a context-free language by finding a context-free grammar that generates $L$.
Answer
Let $\Sigma = \{\,(,\ ),\ [,\ ]\,\}$. That is, $\Sigma$ is the alphabet consisting of the four symbols $($, $)$, $[$, and $]$. Let $L$ be the language over $\Sigma$ consisting of strings in which both parentheses and brackets are balanced. For example, the string $([()])[\,]$ is in $L$ but $([)]$ is not. Find a context-free grammar that generates the language $L$.
Answer
Suppose that $G$ and $H$ are context-free grammars. Let $L = L(G)$ and let $M = L(H)$. Explain how to construct a context-free grammar for the language $LM$. You do not need to give a formal proof that your grammar is correct.
Answer
If the start symbols for $G$ and $H$ are $S$ and $T$, respectively, then the desired grammar will contain all of the rules for $G$ and $H$ combined (with non-terminals renamed to avoid duplicates), plus a new start symbol $R$ whose only rule is $R \longrightarrow ST$.
Suppose that $G$ is a context-free grammar. Let $L = L(G)$. Explain how to construct a context-free grammar for the language $L^*$. You do not need to give a formal proof that your grammar is correct.
Answer
If the start symbol for $G$ is $S$, then the desired grammar will have all of the rules of $G$ plus a new start symbol (call it $R$) with the rules $R \longrightarrow SR$ and $R \longrightarrow \varepsilon$.
Suppose that $L$ is a context-free language. Prove that $L^R$ is a context-free language. (Hint: Given a context-free grammar $G$ for $L$, make a new grammar, $G^R$, by reversing the right-hand side of each of the production rules in $G$. That is, $A \longrightarrow w$ is a production rule in $G^R$ if and only if $A \longrightarrow w^R$ is a production rule in $G$.)
Define a left-regular grammar to be a context-free grammar in which the right-hand side of every production rule is of one of the following forms: the empty string; a single non-terminal symbol; or a non-terminal symbol followed by a terminal symbol. Show that a language is regular if and only if it can be generated by a left-regular grammar. (Hint: Use the preceding exercise and the theorem about right-regular grammars.)