
Context-Free Grammars

(Content adapted from Critchlow & Eck)

Both natural languages, such as English, and the artificial languages used for programming have a structure known as grammar or syntax. In order to form legal sentences or programs, the parts of the language must be fit together according to certain rules. For natural languages, the rules are somewhat informal (although high-school English teachers might have us believe differently). For programming languages, the rules are absolute, and programs that violate the rules will be rejected by a compiler.

In this chapter, we will study formal grammars and languages defined by them. The languages we look at will, for the most part, be "toy" languages, compared to natural languages or even to programming languages, but the ideas and techniques are basic to any study of language. In fact, many of the ideas arose almost simultaneously in the 1950s in the work of linguists who were studying natural language and programmers who were looking for ways to specify the syntax of programming languages.

The grammars in this chapter are generative grammars. A generative grammar is a set of rules that can be used to generate all the legal strings in a language. We will also consider the closely related idea of parsing. To parse a string means to determine how that string can be generated according to the rules.

This chapter is a continuation of the preceding chapter. Like a regular expression, a grammar is a way to specify a possibly infinite language with a finite amount of information. In fact, we will see that every regular language can be specified by a certain simple type of grammar. We will also see that some languages that can be specified by grammars are not regular.

In its most general form, a grammar is a set of rewriting rules. A rewriting rule specifies that a certain string of symbols can be substituted for all or part of another string. If $w$ and $u$ are strings, then $w\longrightarrow u$ is a rewriting rule that specifies that the string $w$ can be replaced by the string $u$. The symbol "$\longrightarrow$" is read "can be rewritten as." Rewriting rules are also called production rules or productions, and "$\longrightarrow$" can also be read as "produces." For example, if we consider strings over the alphabet $\{a,b,c\}$, then the production rule $aba\longrightarrow cc$ can be applied to the string $abbabac$ to give the string $abbccc$. The substring $aba$ in the string $abbabac$ has been replaced with $cc$.
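The effect of applying a rewriting rule can be sketched in a few lines of code. This is only an illustrative helper (the function name and string representation are mine, not from the text); it applies a rule to the first occurrence of its left-hand side:

```python
def apply_rule(s, lhs, rhs):
    """Apply the rewriting rule lhs -> rhs to the first occurrence
    of lhs in s; return s unchanged if lhs does not occur in s."""
    i = s.find(lhs)
    if i == -1:
        return s
    return s[:i] + rhs + s[i + len(lhs):]

# The rule aba -> cc applied to abbabac gives abbccc, as in the text.
print(apply_rule("abbabac", "aba", "cc"))  # abbccc
```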

In a context-free grammar, every rewriting rule has the form $A\longrightarrow w$, where $A$ is a single symbol and $w$ is a string of zero or more symbols. (The grammar is "context-free" in the sense that $w$ can be substituted for $A$ wherever $A$ occurs in a string, regardless of the surrounding context in which $A$ occurs.) The symbols that occur on the left-hand sides of production rules in a context-free grammar are called non-terminal symbols. By convention, the non-terminal symbols are usually uppercase letters. The strings on the right-hand sides of the production rules can include non-terminal symbols as well as other symbols, which are called terminal symbols. By convention, the terminal symbols are usually lowercase letters. Here are some typical production rules that might occur in context-free grammars:

$$\begin{aligned} &A\longrightarrow aAbB\\ &S\longrightarrow SS\\ &C\longrightarrow Acc\\ &B\longrightarrow b\\ &A\longrightarrow\varepsilon \end{aligned}$$

In the last rule in this list, $\varepsilon$ represents the empty string, as usual. For example, this rule could be applied to the string $aBaAcA$ to produce the string $aBacA$. The first occurrence of the symbol $A$ in $aBaAcA$ has been replaced by the empty string, which is just another way of saying that the symbol has been dropped from the string.

In every context-free grammar, one of the non-terminal symbols is designated as the start symbol of the grammar. The start symbol is often, though not always, denoted by $S$. When the grammar is used to generate strings in a language, the idea is to start with a string consisting of nothing but the start symbol. Then a sequence of production rules is applied. Each application of a production rule to the string transforms the string to a new string. If and when this process produces a string that consists purely of terminal symbols, the process ends. That string of terminal symbols is one of the strings in the language generated by the grammar. In fact, the language consists precisely of all strings of terminal symbols that can be produced in this way.

As a simple example, consider a grammar that has three production rules: $S\longrightarrow aS$, $S\longrightarrow bS$, and $S\longrightarrow b$. In this example, $S$ is the only non-terminal symbol, and the terminal symbols are $a$ and $b$. Starting from the string $S$, we can apply any of the three rules of the grammar to produce either $aS$, $bS$, or $b$. Since the string $b$ contains no non-terminals, we see that $b$ is one of the strings in the language generated by this grammar. The strings $aS$ and $bS$ are not in that language, since they contain the non-terminal symbol $S$, but we can continue to apply production rules to these strings. From $aS$, for example, we can obtain $aaS$, $abS$, or $ab$. From $abS$, we go on to obtain $abaS$, $abbS$, or $abb$. The strings $ab$ and $abb$ are in the language generated by the grammar. It's not hard to see that any string of $a$'s and $b$'s that ends with a $b$ can be generated by this grammar, and that these are the only strings that can be generated. That is, the language generated by this grammar is the regular language specified by the regular expression $(a\,|\,b)^*b$.
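This grammar is small enough to explore by brute force. The sketch below (representation and function names are mine, not from the text) searches all derivations for a target string, and then compares the result with the regular expression $(a\,|\,b)^*b$ on every string of length at most 4:

```python
import re
from itertools import product

RULES = {"S": ["aS", "bS", "b"]}   # the grammar from the text

def derives(target):
    """Breadth-first search over sentential forms.  No rule of this
    grammar shrinks the string, so any form longer than the target
    can be pruned."""
    frontier = {"S"}
    while frontier:
        nxt = set()
        for s in frontier:
            if s == target:
                return True
            if len(s) > len(target) or "S" not in s:
                continue
            i = s.index("S")
            for rhs in RULES["S"]:
                nxt.add(s[:i] + rhs + s[i+1:])
        frontier = nxt
    return False

# The grammar agrees with (a|b)*b on all strings up to length 4.
for n in range(5):
    for tup in product("ab", repeat=n):
        w = "".join(tup)
        assert derives(w) == bool(re.fullmatch("(a|b)*b", w))
print("grammar matches (a|b)*b on all strings up to length 4")
```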

It's time to give some formal definitions of the concepts which we have been discussing.

A context-free grammar is a 4-tuple $(V,\Sigma,P,S)$, where:

  1. $V$ is a finite set of symbols. The elements of $V$ are the non-terminal symbols of the grammar.

  2. $\Sigma$ is a finite set of symbols such that $V\cap\Sigma=\emptyset$. The elements of $\Sigma$ are the terminal symbols of the grammar.

  3. $P$ is a set of production rules. Each rule is of the form $A\longrightarrow w$ where $A$ is one of the symbols in $V$ and $w$ is a string in the language $(V\cup\Sigma)^*$.

  4. $S\in V$. $S$ is the start symbol of the grammar.

Even though this is the formal definition, grammars are often specified informally simply by listing the set of production rules. When this is done it is assumed, unless otherwise specified, that the non-terminal symbols are just the symbols that occur on the left-hand sides of production rules of the grammar. The terminal symbols are all the other symbols that occur on the right-hand sides of production rules. The start symbol is the symbol that occurs on the left-hand side of the first production rule in the list. Thus, the list of production rules

$$\begin{aligned} &T\longrightarrow TT\\ &T\longrightarrow A\\ &A\longrightarrow aAa\\ &A\longrightarrow bB\\ &B\longrightarrow bB\\ &B\longrightarrow \varepsilon \end{aligned}$$

specifies a grammar $G=(V,\Sigma,P,T)$ where $V$ is $\{T,A,B\}$, $\Sigma$ is $\{a,b\}$, and $T$ is the start symbol. $P$, of course, is the set containing the six production rules in the list.

Let $G=(V,\Sigma,P,S)$ be a context-free grammar. Suppose that $x$ and $y$ are strings in the language $(V\cup\Sigma)^*$. The notation $x\Longrightarrow_G y$ is used to express the fact that $y$ can be obtained from $x$ by applying one of the production rules in $P$. To be more exact, we say that $x\Longrightarrow_G y$ if and only if there is a production rule $A\longrightarrow w$ in the grammar and two strings $u$ and $v$ in the language $(V\cup\Sigma)^*$ such that $x=uAv$ and $y=uwv$. The fact that $x=uAv$ is just a way of saying that $A$ occurs somewhere in $x$. When the production rule $A\longrightarrow w$ is applied to substitute $w$ for $A$ in $uAv$, the result is $uwv$, which is $y$. Note that either $u$ or $v$ or both can be the empty string.

If a string $y$ can be obtained from a string $x$ by applying a sequence of zero or more production rules, we write $x\Longrightarrow_G^* y$. In most cases, the "$G$" in the notations $\Longrightarrow_G$ and $\Longrightarrow_G^*$ will be omitted, assuming that the grammar in question is understood. Note that $\Longrightarrow$ is a relation on the set $(V\cup\Sigma)^*$. The relation $\Longrightarrow^*$ is the reflexive, transitive closure of that relation. (This explains the use of "$*$", which is usually used to denote the transitive, but not necessarily reflexive, closure of a relation. In this case, $\Longrightarrow^*$ is reflexive as well as transitive since $x\;\Longrightarrow^* x$ is true for any string $x$.) For example, using the grammar that is defined by the above list of production rules, we have

$$\begin{aligned} aTB&\Longrightarrow aTTB\\ &\Longrightarrow aTAB\\ &\Longrightarrow aTAbB\\ &\Longrightarrow aTbBbB\\ &\Longrightarrow aTbbB \end{aligned}$$

From this, it follows that $aTB\;\Longrightarrow^* aTbbB$. The relation $\Longrightarrow$ is read "yields" or "produces," while $\Longrightarrow^*$ can be read "yields in zero or more steps" or "produces in zero or more steps." The following theorem states some simple facts about the relations $\Longrightarrow$ and $\Longrightarrow^*$:
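The one-step relation $\Longrightarrow$ is easy to compute by brute force. The sketch below (dictionary representation mine, with `""` standing for $\varepsilon$) lists every string reachable in one step, and checks each step of the derivation above:

```python
RULES = {
    "T": ["TT", "A"],
    "A": ["aAa", "bB"],
    "B": ["bB", ""],   # "" stands for the empty string epsilon
}

def yields(x):
    """All strings y with x => y: one rule applied at one position."""
    out = set()
    for i, c in enumerate(x):
        for rhs in RULES.get(c, []):
            out.add(x[:i] + rhs + x[i+1:])
    return out

# Verify every step of the derivation aTB => aTTB => ... => aTbbB.
chain = ["aTB", "aTTB", "aTAB", "aTAbB", "aTbBbB", "aTbbB"]
for x, y in zip(chain, chain[1:]):
    assert y in yields(x)
print("derivation verified")
```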

Theorem: Let $G$ be the context-free grammar $(V,\Sigma,P,S)$. Then:

  1. If $x$ and $y$ are strings in $(V\cup\Sigma)^*$ such that $x\Longrightarrow y$, then $x\;\Longrightarrow^* y$.
  2. If $x$, $y$, and $z$ are strings in $(V\cup\Sigma)^*$ such that $x\;\Longrightarrow^* y$ and $y\;\Longrightarrow^* z$, then $x\;\Longrightarrow^* z$.
  3. If $x$ and $y$ are strings in $(V\cup\Sigma)^*$ such that $x\Longrightarrow y$, and if $s$ and $t$ are any strings in $(V\cup\Sigma)^*$, then $sxt\Longrightarrow syt$.
  4. If $x$ and $y$ are strings in $(V\cup\Sigma)^*$ such that $x\;\Longrightarrow^* y$, and if $s$ and $t$ are any strings in $(V\cup\Sigma)^*$, then $sxt\;\Longrightarrow^* syt$.

Proof: Parts 1 and 2 follow from the fact that $\Longrightarrow^*$ is the transitive closure of $\Longrightarrow$. Part 4 follows easily from Part 3. (I leave this as an exercise.) To prove Part 3, suppose that $x$, $y$, $s$, and $t$ are strings such that $x\Longrightarrow y$. By definition, this means that there exist strings $u$ and $v$ and a production rule $A\longrightarrow w$ such that $x=uAv$ and $y=uwv$. But then we also have $sxt=suAvt$ and $syt=suwvt$. These two equations, along with the existence of the production rule $A\longrightarrow w$, show, by definition, that $sxt\Longrightarrow syt$.

We can use $\Longrightarrow^*$ to give a formal definition of the language generated by a context-free grammar:

Suppose that $G=(V,\Sigma,P,S)$ is a context-free grammar. Then the language generated by $G$ is the language $L(G)$ over the alphabet $\Sigma$ defined by

$$L(G)=\{w\in \Sigma^*\;|\; S\Longrightarrow_G^* w\}$$

That is, $L(G)$ contains any string of terminal symbols that can be obtained by starting with the string consisting of the start symbol, $S$, and applying a sequence of production rules.

A language $L$ is said to be a context-free language if there is a context-free grammar $G$ such that $L(G)$ is $L$. Note that there might be many different context-free grammars that generate the same context-free language. Two context-free grammars that generate the same language are said to be equivalent.

Suppose $G$ is a context-free grammar with start symbol $S$ and suppose $w\in L(G)$. By definition, this means that there is a sequence of one or more applications of production rules which produces the string $w$ from $S$. This sequence has the form $S\Longrightarrow x_1\Longrightarrow x_2\Longrightarrow\cdots\Longrightarrow w$. Such a sequence is called a derivation of $w$ (in the grammar $G$). Note that $w$ might have more than one derivation. That is, it might be possible to produce $w$ in several different ways.

Consider the language $L=\{a^nb^n\;|\; n\in\N\}$. We already know that $L$ is not a regular language. However, it is a context-free language. That is, there is a context-free grammar $G$ such that $L$ is the language generated by $G$. This gives us our first theorem about grammars:

Theorem: Let $L$ be the language $L=\{a^nb^n\;|\; n\in\N\}$. Let $G$ be the context-free grammar $(V,\Sigma,P,S)$ where $V=\{S\}$, $\Sigma=\{a,b\}$, and $P$ consists of the productions

$$\begin{aligned} &S\longrightarrow aSb\\ &S\longrightarrow \varepsilon \end{aligned}$$

Then $L=L(G)$, so that $L$ is a context-free language. In particular, there exist context-free languages which are not regular.

Proof: To show that $L=L(G)$, we must show both that $L\subseteq L(G)$ and that $L(G)\subseteq L$. To show that $L\subseteq L(G)$, let $w$ be an arbitrary element of $L$. By definition of $L$, $w=a^nb^n$ for some $n\in\N$. We show that $w\in L(G)$ by induction on $n$. In the case where $n=0$, we have $w=\varepsilon$. Now, $\varepsilon\in L(G)$ since $\varepsilon$ can be produced from the start symbol $S$ by an application of the rule $S\longrightarrow\varepsilon$, so our claim is true for $n=0$. Now, suppose that $k\in\N$ and that we already know that $a^kb^k\in L(G)$. We must show that $a^{k+1}b^{k+1}\in L(G)$. Since $S\;\Longrightarrow^* a^kb^k$, we also have, by the previous theorem, that $aSb\;\Longrightarrow^* aa^kb^kb$. That is, $aSb\;\Longrightarrow^* a^{k+1}b^{k+1}$. Combining this with the production rule $S\longrightarrow aSb$, we see that $S\;\Longrightarrow^* a^{k+1}b^{k+1}$. This means that $a^{k+1}b^{k+1}\in L(G)$, as we wanted to show. This completes the proof that $L\subseteq L(G)$.

To show that $L(G)\subseteq L$, suppose that $w\in L(G)$. That is, $S\;\Longrightarrow^* w$. We must show that $w=a^nb^n$ for some $n$. Since $S\;\Longrightarrow^* w$, there is a derivation $S\Longrightarrow x_0\Longrightarrow x_1\Longrightarrow\cdots\Longrightarrow x_n$, where $w=x_n$. We first prove by induction on $n$ that in any derivation $S\Longrightarrow x_0\Longrightarrow x_1\Longrightarrow\cdots\Longrightarrow x_n$, we must have either $x_n=a^nb^n$ or $x_n=a^{n+1}Sb^{n+1}$. Consider the case $n=0$. Suppose $S\Longrightarrow x_0$. Then $S\longrightarrow x_0$ must be a rule in the grammar, so $x_0$ must be either $\varepsilon$ or $aSb$. Since $\varepsilon=a^0b^0$ and $aSb=a^{0+1}Sb^{0+1}$, $x_0$ is of the required form. Next, consider the inductive case. Suppose that $k\ge 0$ and we already know that in any derivation $S\Longrightarrow x_0\Longrightarrow x_1\Longrightarrow\cdots\Longrightarrow x_k$, we must have $x_k=a^kb^k$ or $x_k=a^{k+1}Sb^{k+1}$. Suppose that $S\Longrightarrow x_0\Longrightarrow x_1\Longrightarrow\cdots\Longrightarrow x_k\Longrightarrow x_{k+1}$. We know by induction that $x_k=a^kb^k$ or $x_k=a^{k+1}Sb^{k+1}$, but since $x_k\Longrightarrow x_{k+1}$ and $a^kb^k$ contains no non-terminal symbols, we must have $x_k=a^{k+1}Sb^{k+1}$. Since $x_{k+1}$ is obtained by applying one of the production rules $S\longrightarrow\varepsilon$ or $S\longrightarrow aSb$ to $x_k$, $x_{k+1}$ is either $a^{k+1}\varepsilon b^{k+1}$ or $a^{k+1}aSbb^{k+1}$. That is, $x_{k+1}$ is either $a^{k+1}b^{k+1}$ or $a^{k+2}Sb^{k+2}$, as we wanted to show. This completes the induction. Turning back to $w$, we see that $w$ must be of the form $a^nb^n$ or of the form $a^{n+1}Sb^{n+1}$. But since $w\in L(G)$, it can contain no non-terminal symbols, so $w$ must be of the form $a^nb^n$, as we wanted to show. This completes the proof that $L(G)\subseteq L$.

I have given a very formal and detailed proof of this theorem, to show how it can be done and to show how induction plays a role in many proofs about grammars. However, a more informal proof of the theorem would probably be acceptable and might even be more convincing. To show that $L\subseteq L(G)$, we could just note that the derivation $S\Longrightarrow aSb\Longrightarrow a^2Sb^2\Longrightarrow\cdots\Longrightarrow a^nSb^n\Longrightarrow a^nb^n$ demonstrates that $a^nb^n\in L(G)$. On the other hand, it is clear that every derivation for this grammar must be of this form, so every string in $L(G)$ is of the form $a^nb^n$.
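As an empirical check on the theorem, we can enumerate all terminal strings that the grammar $S\longrightarrow aSb$, $S\longrightarrow\varepsilon$ derives, up to a length bound. A minimal sketch (representation mine; it relies on the fact, shown in the proof, that every sentential form of this grammar contains exactly one $S$):

```python
RULES = {"S": ["aSb", ""]}   # "" stands for epsilon

def terminal_strings(maxlen):
    """BFS over sentential forms.  Every form here has exactly one S,
    so its terminal count is len(s) - 1; terminals never disappear,
    so forms with more than maxlen terminals can be pruned."""
    result, frontier = set(), {"S"}
    while frontier:
        nxt = set()
        for s in frontier:
            if "S" not in s:
                result.add(s)
                continue
            if len(s) - 1 > maxlen:
                continue
            i = s.index("S")
            for rhs in RULES["S"]:
                nxt.add(s[:i] + rhs + s[i+1:])
        frontier = nxt
    return result

print(sorted(terminal_strings(6), key=len))  # ['', 'ab', 'aabb', 'aaabbb']
```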

For another example, consider the language $\{a^nb^m\;|\; n\ge m\ge0\}$. Let's try to design a grammar that generates this language. This is similar to the previous example, but now we want to include strings that contain more $a$'s than $b$'s. The production rule $S\longrightarrow aSb$ always produces the same number of $a$'s and $b$'s. Can we modify this idea to produce more $a$'s than $b$'s?

One approach would be to produce a string containing just as many $a$'s as $b$'s, and then to add some extra $a$'s. A rule that can generate any number of $a$'s is $A\longrightarrow aA$. After applying the rule $S\longrightarrow aSb$ for a while, we want to move to a new state in which we apply the rule $A\longrightarrow aA$. We can get to the new state by applying a rule $S\longrightarrow A$ that changes the $S$ into an $A$. We still need a way to finish the process, which means getting rid of all non-terminal symbols in the string. For this, we can use the rule $A\longrightarrow\varepsilon$. Putting these rules together, we get the grammar

$$\begin{aligned} &S\longrightarrow aSb\\ &S\longrightarrow A\\ &A\longrightarrow aA\\ &A\longrightarrow \varepsilon \end{aligned}$$

This grammar does indeed generate the language $\{a^nb^m\;|\; n\ge m\ge 0\}$. With slight variations on this grammar, we can produce other related languages. For example, if we replace the rule $A\longrightarrow \varepsilon$ with $A\longrightarrow a$, we get the language $\{a^nb^m\;|\; n> m\ge 0\}$.

There are other ways to generate the language $\{a^nb^m\;|\; n\ge m\ge 0\}$. For example, the extra non-terminal symbol, $A$, is not really necessary, if we allow $S$ to sometimes produce a single $a$ without a $b$. This leads to the grammar

$$\begin{aligned} &S\longrightarrow aSb\\ &S\longrightarrow aS\\ &S\longrightarrow \varepsilon \end{aligned}$$

(But note that the rule $S\longrightarrow Sa$ would not work in place of $S\longrightarrow aS$, since it would allow the production of strings in which an $a$ can follow a $b$, and there are no such strings in the language $\{a^nb^m\;|\; n\ge m\ge 0\}$.) And here are two more grammars that generate this language:

$$\begin{aligned} &S\longrightarrow AB\\ &A\longrightarrow aA\\ &B\longrightarrow aBb\\ &A\longrightarrow \varepsilon\\ &B\longrightarrow \varepsilon \end{aligned}$$

and

$$\begin{aligned} &S\longrightarrow ASb\\ &A\longrightarrow aA\\ &S\longrightarrow\varepsilon\\ &A\longrightarrow a \end{aligned}$$

Consider another variation on the language $\{a^nb^n\;|\; n\in\N\}$, in which the $a$'s and $b$'s can occur in any order, but the number of $a$'s is still equal to the number of $b$'s. This language can be defined as $L=\{w\in\{a,b\}^*\;|\; n_a(w) = n_b(w)\}$. This language includes strings such as $abbaab$, $baab$, and $bbbaaa$.

Let's start with the grammar containing the rules $S\longrightarrow aSb$ and $S\longrightarrow\varepsilon$. We can try adding the rule $S\longrightarrow bSa$. Every string that can be generated using these three rules is in the language $L$. However, not every string in $L$ can be generated. A derivation that starts with $S\Longrightarrow aSb$ can only produce strings that begin with $a$ and end with $b$. A derivation that starts with $S\Longrightarrow bSa$ can only generate strings that begin with $b$ and end with $a$. There is no way to generate the strings $baab$ or $abbbabaaba$, which are in the language $L$. But we shall see that any string in $L$ that begins and ends with the same letter can be written in the form $xy$ where $x$ and $y$ are shorter strings in $L$. To produce strings of this form, we need one more rule, $S\longrightarrow SS$. The complete set of production rules for the language $L$ is

$$\begin{aligned} &S\longrightarrow aSb\\ &S\longrightarrow bSa\\ &S\longrightarrow SS\\ &S\longrightarrow \varepsilon \end{aligned}$$

It's easy to see that every string that can be generated using these rules is in $L$, since each rule introduces the same number of $a$'s as $b$'s. But we also need to check that every string $w$ in $L$ can be generated by these rules. This can be done by induction on the length of $w$, using the second form of the principle of mathematical induction. In the base case, $|w|=0$ and $w=\varepsilon$. In this case, $w\in L(G)$ since $S\Longrightarrow\varepsilon$ in one step. Suppose $|w|=k$, where $k>0$, and suppose that we already know that for any $x\in L$ with $|x|<k$, $S\;\Longrightarrow^* x$. To finish the induction we must show, based on this induction hypothesis, that $S\;\Longrightarrow^* w$.

Suppose that the first and last characters of $w$ are different. Then $w$ is either of the form $axb$ or of the form $bxa$, for some string $x$. Let's assume that $w$ is of the form $axb$. (The case where $w$ is of the form $bxa$ is handled in a similar way.) Since $w$ has the same number of $a$'s and $b$'s, and since $x$ has one fewer $a$ than $w$ and one fewer $b$ than $w$, $x$ must also have the same number of $a$'s as $b$'s. That is, $x\in L$. But $|x|=|w|-2<k$, so by the induction hypothesis, $S\;\Longrightarrow^* x$. By the theorem above, we then get $aSb\;\Longrightarrow^* axb$. Combining this with the fact that $S\Longrightarrow aSb$, we get that $S\;\Longrightarrow^* axb$, that is, $S\;\Longrightarrow^* w$. This proves that $w\in L(G)$.

Finally, suppose that the first and last characters of $w$ are the same. Let's say that $w$ begins and ends with $a$. (The case where $w$ begins and ends with $b$ is handled in a similar way.) I claim that $w$ can be written in the form $xy$ where $x\in L$ and $y\in L$ and neither $x$ nor $y$ is the empty string. This will finish the induction, since we will then have by the induction hypothesis that $S\;\Longrightarrow^* x$ and $S\;\Longrightarrow^* y$, and we can derive $xy$ from $S$ by first applying the rule $S\longrightarrow SS$ and then using the first $S$ on the right-hand side to derive $x$ and the second to derive $y$.

It only remains to figure out how to divide $w$ into two strings $x$ and $y$ which are both in $L$. The technique that is used is one that is more generally useful. Suppose that $w=c_1c_2\cdots c_k$, where each $c_i$ is either $a$ or $b$. Consider the sequence of integers $r_1$, $r_2$, \dots, $r_k$ where for each $i = 1, 2, \dots, k$, $r_i$ is the number of $a$'s in $c_1c_2\cdots c_i$ minus the number of $b$'s in $c_1c_2\cdots c_i$. Since $c_1=a$, $r_1=1$. Since $w\in L$, $r_k=0$. And since $c_k=a$, we must have $r_{k-1}= r_k-1 = -1$. Furthermore, the difference between $r_{i+1}$ and $r_i$ is either $1$ or $-1$, for $i=1,2,\dots,k-1$.

Since $r_1=1$ and $r_{k-1}=-1$ and the value of $r_i$ goes up or down by 1 when $i$ increases by 1, $r_i$ must be zero for some $i$ between 1 and $k-1$. That is, $r_i$ cannot get from 1 to $-1$ unless it passes through zero. Let $i$ be a number between 1 and $k-1$ such that $r_i=0$. Let $x=c_1c_2\cdots c_i$ and let $y=c_{i+1}c_{i+2}\cdots c_k$. Note that $xy=w$. The fact that $r_i=0$ means that the string $c_1c_2\cdots c_i$ has the same number of $a$'s and $b$'s, so $x\in L$. It follows automatically that $y\in L$ also. Since $i$ is strictly between 1 and $k-1$, neither $x$ nor $y$ is the empty string. This is all that we needed to show to finish the proof that $L=L(G)$.

The basic idea of this proof is that if $w$ contains the same number of $a$'s as $b$'s, then an $a$ at the beginning of $w$ must have a "matching" $b$ somewhere in $w$. This $b$ matches the $a$ in the sense that the corresponding $r_i$ is zero, and the $b$ marks the end of a string $x$ which contains the same number of $a$'s as $b$'s. For example, in the string $aababbabba$, the $a$ at the beginning of the string is matched by the third $b$, since $aababb$ is the shortest prefix of $aababbabba$ that has an equal number of $a$'s and $b$'s.
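The splitting technique from the proof translates directly into code. A small sketch (function name mine), which assumes its input is in $L$ and begins and ends with the same letter, so that the running count $r_i$ must hit zero before the end of the string:

```python
def split_balanced(w):
    """Split w as (x, y), where x is the shortest nonempty proper prefix
    of w with equally many a's and b's, as in the proof in the text.
    Assumes w is in L and begins and ends with the same letter."""
    r = 0
    for i, c in enumerate(w, start=1):
        r += 1 if c == "a" else -1   # r is the quantity r_i from the proof
        if r == 0 and i < len(w):
            return w[:i], w[i:]

# The example from the text: the leading a is matched by the third b.
print(split_balanced("aababbabba"))  # ('aababb', 'abba')
```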

Closely related to this idea of matching $a$'s and $b$'s is the idea of balanced parentheses. Consider a string made up of parentheses, such as $(()(()))(())$. The parentheses in this sample string are balanced because each left parenthesis has a matching right parenthesis, and the matching pairs are properly nested. A careful definition uses the sort of integer sequence introduced in the above proof. Let $w$ be a string of parentheses. Write $w=c_1c_2\cdots c_n$, where each $c_i$ is either $($ or $)$. Define a sequence of integers $r_1$, $r_2$, \dots, $r_n$, where $r_i$ is the number of left parentheses in $c_1c_2\cdots c_i$ minus the number of right parentheses. We say that the parentheses in $w$ are balanced if $r_n=0$ and $r_i\ge0$ for all $i=1,2,\dots,n$. The fact that $r_n=0$ says that $w$ contains the same number of left parentheses as right parentheses. The fact that $r_i\ge0$ means that the nesting of pairs of parentheses is correct: you can't have a right parenthesis unless it is balanced by a left parenthesis in the preceding part of the string. The language that consists of all balanced strings of parentheses is context-free. It is generated by the grammar

$$\begin{aligned} &S\longrightarrow (\,S\,)\\ &S\longrightarrow SS\\ &S\longrightarrow \varepsilon \end{aligned}$$

The proof is similar to the preceding proof about strings of $a$'s and $b$'s. (It might seem that I've made an awfully big fuss about matching and balancing. The reason is that this is one of the few things that we can do with context-free languages that we can't do with regular languages.)
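The definition of balance translates directly into a checker, since the sequence $r_1,\dots,r_n$ can be computed in a single pass. A minimal sketch:

```python
def balanced(w):
    """w is balanced iff r_n == 0 and every r_i >= 0, where r_i counts
    left parentheses minus right parentheses in the prefix c_1...c_i."""
    r = 0
    for c in w:
        r += 1 if c == "(" else -1
        if r < 0:            # some r_i went negative: bad nesting
            return False
    return r == 0            # equal numbers of ( and )

print(balanced("(()(()))(())"))   # True
print(balanced(")("))             # False: equal counts, but r_1 = -1
```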


Before leaving this section, we should look at a few more general results. Since we know that most operations on regular languages produce languages that are also regular, we can ask whether a similar result holds for context-free languages. We will see later that the intersection of two context-free languages is not necessarily context-free. Also, the complement of a context-free language is not necessarily context-free. However, some other operations on context-free languages do produce context-free languages.

Theorem: Suppose that $L$ and $M$ are context-free languages. Then the languages $L\cup M$, $LM$, and $L^*$ are also context-free.

Proof: I will prove only the first claim of the theorem, that $L\cup M$ is context-free. In the exercises for this section, you are asked to construct grammars for $LM$ and $L^*$ (without giving formal proofs that your answers are correct).

Let $G=(V,\Sigma,P,S)$ and $H=(W,\Gamma,Q,T)$ be context-free grammars such that $L=L(G)$ and $M=L(H)$. We can assume that $W\cap V=\emptyset$, since otherwise we could simply rename the non-terminal symbols in $W$. The idea of the proof is that to generate a string in $L\cup M$, we first decide whether we want a string in $L$ or a string in $M$. Once that decision is made, to make a string in $L$, we use production rules from $G$, while to make a string in $M$, we use rules from $H$. We have to design a grammar, $K$, to represent this process.

Let $R$ be a symbol that is not in any of the alphabets $V$, $W$, $\Sigma$, or $\Gamma$. $R$ will be the start symbol of $K$. The production rules for $K$ consist of all the production rules from $G$ and $H$ together with two new rules:

$$\begin{aligned} &R\longrightarrow S\\ &R\longrightarrow T \end{aligned}$$

Formally, $K$ is defined to be the grammar

$$(V\cup W\cup\{R\},\ \Sigma\cup\Gamma,\ P\cup Q\cup \{R\longrightarrow S,\ R\longrightarrow T\},\ R)$$

Suppose that $w\in L$. That is, $w\in L(G)$, so there is a derivation $S\Longrightarrow_G^*w$. Since every rule from $G$ is also a rule in $K$, it follows that $S\Longrightarrow_K^* w$. Combining this with the fact that $R\Longrightarrow_K S$, we have that $R\Longrightarrow_K^* w$, and $w\in L(K)$. This shows that $L\subseteq L(K)$. In an exactly similar way, we can show that $M\subseteq L(K)$. Thus, $L\cup M\subseteq L(K)$.

It remains to show that $L(K)\subseteq L\cup M$. Suppose $w\in L(K)$. Then there is a derivation $R\Longrightarrow_K^*w$. This derivation must begin with an application of one of the rules $R\longrightarrow S$ or $R\longrightarrow T$, since these are the only rules in which $R$ appears. If the first rule applied in the derivation is $R\longrightarrow S$, then the remainder of the derivation shows that $S\Longrightarrow_K^*w$. Starting from $S$, the only rules that can be applied are rules from $G$, so in fact we have $S\Longrightarrow_G^*w$. This shows that $w\in L$. Similarly, if the first rule applied in the derivation $R\Longrightarrow_K^*w$ is $R\longrightarrow T$, then $w\in M$. In any case, $w\in L\cup M$. This proves that $L(K)\subseteq L\cup M$.
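The construction in this proof is entirely mechanical, which a short sketch makes plain (the dict-of-rules representation is mine; it assumes the two grammars have disjoint non-terminal symbols and that `"R"` is not already in use):

```python
def union_grammar(rules_g, start_g, rules_h, start_h):
    """Build the grammar K for L(G) union L(H): keep all rules of both
    grammars and add a fresh start symbol R with R -> S and R -> T.
    Assumes the non-terminal sets are disjoint and 'R' is unused."""
    rules = {"R": [start_g, start_h]}
    rules.update(rules_g)
    rules.update(rules_h)
    return rules

G = {"S": ["aSb", ""]}   # generates a^n b^n ("" is epsilon)
H = {"T": ["cT", "c"]}   # generates c^n for n >= 1
K = union_grammar(G, "S", H, "T")
print(K)  # {'R': ['S', 'T'], 'S': ['aSb', ''], 'T': ['cT', 'c']}
```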

Finally, we should clarify the relationship between context-free languages and regular languages. We have already seen that there are context-free languages which are not regular. On the other hand, it turns out that every regular language is context-free. That is, given any regular language, there is a context-free grammar that generates that language. This means that any syntax that can be expressed by a regular expression, by a DFA, or by an NFA could also be expressed by a context-free grammar. In fact, we only need a certain restricted type of context-free grammar to duplicate the power of regular expressions.

A right-regular grammar is a context-free grammar in which the right-hand side of every production rule has one of the following forms: the empty string; a string consisting of a single non-terminal symbol; or a string consisting of a single terminal symbol followed by a single non-terminal symbol.

Examples of the types of production rule that are allowed in a right-regular grammar are $A\longrightarrow\varepsilon$, $B\longrightarrow C$, and $D\longrightarrow aE$. The idea of the proof is that given a right-regular grammar, we can build a corresponding NFA, and vice versa. The states of the NFA correspond to the non-terminal symbols of the grammar. The start symbol of the grammar corresponds to the starting state of the NFA. A production rule of the form $A\longrightarrow bC$ corresponds to a transition in the NFA from state $A$ to state $C$ while reading the symbol $b$. A production rule of the form $A\longrightarrow B$ corresponds to an $\varepsilon$-transition from state $A$ to state $B$ in the NFA. And a production rule of the form $A\longrightarrow\varepsilon$ exists in the grammar if and only if $A$ is a final state in the NFA. With this correspondence, a derivation of a string $w$ in the grammar corresponds to an execution path through the NFA as it accepts the string $w$. I won't give a complete proof here. You are welcome to work through the details if you want. But the important fact is:
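The correspondence can be tried out on the earlier example grammar for $(a\,|\,b)^*b$. Written in right-regular form its rules are $S\longrightarrow aS$, $S\longrightarrow bS$, $S\longrightarrow bF$, $F\longrightarrow\varepsilon$ (the non-terminal $F$ is introduced here because $S\longrightarrow b$ is not one of the allowed forms). A sketch of the resulting NFA simulation; this simple version skips $\varepsilon$-closure, since the example has no rules of the form $A\longrightarrow B$:

```python
# Each rule A -> cB is stored as the pair (c, B); F -> epsilon makes F accepting.
RULES = {
    "S": [("a", "S"), ("b", "S"), ("b", "F")],
    "F": [],
}
ACCEPT = {"F"}   # non-terminals with an epsilon rule are the final states

def accepts(w):
    """Simulate the NFA built from the right-regular grammar: states are
    non-terminals, and A -> cB is a transition from A to B on symbol c."""
    states = {"S"}   # start state = start symbol of the grammar
    for c in w:
        states = {B for A in states for (t, B) in RULES[A] if t == c}
    return bool(states & ACCEPT)

print(accepts("abb"), accepts("ba"))  # True False
```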

Theorem: A language $L$ is regular if and only if there is a right-regular grammar $G$ such that $L=L(G)$. In particular, every regular language is context-free.

Exercises

  1. Show that Part 4 of the theorem about $\Longrightarrow^*$ follows from Part 3.

    Answer

    If $x\Longrightarrow^* y$, then there is a sequence of strings $z_0, z_1,\ldots, z_n$, for some $n\geq 0$, such that $x=z_0$, $y=z_n$, and $z_i\Longrightarrow z_{i+1}$ for each $i<n$. By Part 3 of the theorem, we also know that $sz_it\Longrightarrow sz_{i+1}t$ for each $i<n$, so we may conclude $sxt\Longrightarrow^* syt$.

  2. Give a careful proof that the language \{a^nb^m\;|\; n\ge m\ge 0\} is generated by the context-free grammar

    \begin{aligned} &S\longrightarrow aSb\\ &S\longrightarrow A\\ &A\longrightarrow aA\\ &A\longrightarrow \varepsilon \end{aligned}
  3. Identify the language generated by each of the following context-free grammars.

    • \begin{aligned} &S\longrightarrow aaSb\\ &S\longrightarrow \varepsilon \end{aligned}
      Answer

      \{a^mb^n\;|\; m=2n\}

    • \begin{aligned} &S\longrightarrow aSb\\ &S\longrightarrow aaSb\\ &S\longrightarrow \varepsilon \end{aligned}
      Answer

      \{a^mb^n\;|\; n\le m\le 2n\}

    • \begin{aligned} &S\longrightarrow TS\\ &S\longrightarrow \varepsilon\\ &T\longrightarrow aTb\\ &T\longrightarrow \varepsilon \end{aligned}
      Answer

      \{a^{n_1}b^{n_1}a^{n_2}b^{n_2}\cdots a^{n_k}b^{n_k}\;|\; k\ge 0\land\forall i(n_i\ge 0)\}

    • \begin{aligned} &S\longrightarrow ABA\\ &A\longrightarrow aA\\ &A\longrightarrow a\\ &B\longrightarrow bB\\ &B\longrightarrow cB\\ &B\longrightarrow\varepsilon \end{aligned}
      Answer

      aa^*(b\;|\;c)^*aa^*

  4. For each of the following languages find a context-free grammar that generates the language:

    • \{a^nb^m\;|\; n\ge m > 0\}
      Answer
      \begin{aligned} &S\longrightarrow aSb\\ &S\longrightarrow aS\\ &S\longrightarrow ab \end{aligned}
    • \{a^nb^m\;|\; n, m\in\N \}
      Answer
      \begin{aligned} &S\longrightarrow aS\\ &S\longrightarrow Sb\\ &S\longrightarrow\varepsilon \end{aligned}
    • \{a^nb^m\;|\; n\ge0\land m=n+1\}
      Answer
      \begin{aligned} &S\longrightarrow aSb\\ &S\longrightarrow b \end{aligned}
    • \{a^nb^mc^n\;|\; n,m\in\N \}
      Answer
      \begin{aligned} &S\longrightarrow aSc\\ &S\longrightarrow T\\ &T\longrightarrow bT\\ &T\longrightarrow\varepsilon \end{aligned}
    • \{ a^nb^mc^k \;|\; n=m+k \}
      Answer
      \begin{aligned} &S\longrightarrow aSc\\ &S\longrightarrow T\\ &T\longrightarrow aTb\\ &T\longrightarrow\varepsilon \end{aligned}
    • \{a^nb^m\;|\; n\not=m\}
      Answer
      \begin{aligned} &S\longrightarrow aSb\\ &S\longrightarrow aT\\ &S\longrightarrow bU\\ &T\longrightarrow aT\\ &T\longrightarrow\varepsilon\\ &U\longrightarrow bU\\ &U\longrightarrow\varepsilon \end{aligned}
    • \{ a^nb^mc^rd^t\;|\; n+m=r+t \}
      Answer
      \begin{aligned} &S\longrightarrow aSd\\ &S\longrightarrow T\\ &S\longrightarrow U\\ &T\longrightarrow aTc\\ &T\longrightarrow V\\ &U\longrightarrow bUd\\ &U\longrightarrow V\\ &V\longrightarrow bVc\\ &V\longrightarrow\varepsilon \end{aligned}
    • \{ a^nb^mc^k \;|\; n\not=m+k \}
      Answer

      We must either have n>m+k or n<m+k. In the first case, we match the a's with b's and c's by expanding S and T, and then generate at least one unmatched a by expanding T to aU. In the second case, if n<k then when we stop generating a's we will force at least one unmatched c by expanding S to Vc, and then V can produce additional c's preceded by an arbitrary number of b's. Alternatively, if k\le n<m+k, we will match all of the a's with c's, and some a's with b's, by expanding S and T, and then switch to generating at least one unmatched b by expanding T to bW.

      \begin{aligned} &S\longrightarrow aSc\\ &S\longrightarrow T\\ &S\longrightarrow Vc\\ &T\longrightarrow aTb\\ &T\longrightarrow aU\\ &T\longrightarrow bW\\ &U\longrightarrow aU\\ &U\longrightarrow\varepsilon\\ &V\longrightarrow Vc\\ &V\longrightarrow W\\ &W\longrightarrow bW\\ &W\longrightarrow\varepsilon\\ \end{aligned}
  5. Find a context-free grammar that generates the language \{w\in\{a,b\}^*\;|\; n_a(w) > n_b(w)\}.

  6. Find a context-free grammar that generates the language \{w\in\{a,b,c\}^*\;|\; n_a(w) = n_b(w)\}.

  7. A palindrome is a string that reads the same backwards and forwards, such as "mom", "radar", or "aabccbccbaa". That is, w is a palindrome if w=w^R. Let L=\{w\in\{a,b,c\}^*\;|\; w\textrm{ is a palindrome}\}. Show that L is a context-free language by finding a context-free grammar that generates L.

    Answer
    \begin{aligned} &S\longrightarrow aSa\\ &S\longrightarrow bSb\\ &S\longrightarrow cSc\\ &S\longrightarrow a\\ &S\longrightarrow b\\ &S\longrightarrow c\\ &S\longrightarrow\varepsilon \end{aligned}
  8. Let \Sigma=\{\,\texttt{(},\,\texttt{)},\,\texttt{[},\,\texttt{]}\,\}. That is, \Sigma is the alphabet consisting of the four symbols \texttt{(}, \texttt{)}, \texttt{[}, and \texttt{]}. Let L be the language over \Sigma consisting of strings in which both parentheses and brackets are balanced. For example, the string \texttt{([][()()])([])} is in L but \texttt{[(])} is not. Find a context-free grammar that generates the language L.

    Answer
    \begin{aligned} &S\longrightarrow \texttt{(}S\texttt{)}S\\ &S\longrightarrow \texttt{[}S\texttt{]}S\\ &S\longrightarrow\varepsilon \end{aligned}
  9. Suppose that G and H are context-free grammars. Let L=L(G) and let M=L(H). Explain how to construct a context-free grammar for the language LM. You do not need to give a formal proof that your grammar is correct.

    Answer

    If the start symbols for G and H are S and T, respectively, then the desired grammar will contain all of the rules for G and H combined (with non-terminals renamed to avoid duplicates), plus a new start symbol R whose only rule is R\longrightarrow ST.

  10. Suppose that G is a context-free grammar. Let L=L(G). Explain how to construct a context-free grammar for the language L^*. You do not need to give a formal proof that your grammar is correct.

    Answer

    If the start symbol for G is S, then the desired grammar will have all of the rules of G plus a new start symbol (call it R) with the rules R\longrightarrow SR and R\longrightarrow\varepsilon.

  11. Suppose that L is a context-free language. Prove that L^R is a context-free language. (Hint: Given a context-free grammar G for L, make a new grammar, G^R, by reversing the right-hand side of each of the production rules in G. That is, A\longrightarrow w is a production rule in G if and only if A\longrightarrow w^R is a production rule in G^R.)

  12. Define a left-regular grammar to be a context-free grammar in which the right-hand side of every production rule is of one of the following forms: the empty string; a single non-terminal symbol; or a non-terminal symbol followed by a terminal symbol. Show that a language is regular if and only if it can be generated by a left-regular grammar. (Hint: Use the preceding exercise and the theorem about right-regular grammars.)


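The grammar answers above can also be checked mechanically. The following brute-force enumerator is my own sketch, not part of the text: it lists every terminal string of length at most max_len derivable from a grammar, by breadth-first search over sentential forms, so a grammar's output can be compared against the intended language on short strings. The upper-case/lower-case convention and the function name are assumptions.

```python
# Sketch: brute-force enumeration of L(G) up to a length bound, for
# sanity-checking grammar answers.  Convention (an assumption): non-terminals
# are upper-case letters, terminals are lower-case letters.

from collections import deque

def generated(rules, start, max_len):
    """Return the set of terminal strings of length <= max_len in L(G)."""
    seen, queue, result = {start}, deque([start]), set()
    while queue:
        form = queue.popleft()
        # Locate the leftmost non-terminal, if any.
        i = next((j for j, c in enumerate(form) if c.isupper()), None)
        if i is None:
            result.add(form)  # all terminals: a string of L(G)
            continue
        for rhs in rules.get(form[i], []):
            new = form[:i] + rhs + form[i + 1:]
            # Terminals never disappear under rewriting, so any sentential
            # form with more than max_len terminals can be pruned.
            if sum(c.islower() for c in new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return result

# Check the grammar for {a^n b^m | n >= m >= 0} from Exercise 2:
rules = {"S": ["aSb", "A"], "A": ["aA", ""]}
expected = {"a" * n + "b" * m for n in range(7) for m in range(n + 1) if n + m <= 6}
```

For the grammars in these exercises every rule keeps the number of non-terminals in a sentential form bounded, so the pruned search space is finite and the function terminates; for an arbitrary grammar (for example one with a rule like S\longrightarrow SS) an additional bound on the length of sentential forms would be needed.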