Skip to main content

Languages

(Content adapted from Critchlow & Eck)

With the set of mathematical tools from the first two chapters, we are now ready to study languages and formal language theory. Our intent is to examine the question of how, and which, languages can be mechanically generated and recognized; and, ultimately, to see what this tells us about what computers can and can't do.

In formal language theory, an alphabet is a finite, non-empty set. The elements of the set are called symbols. A finite sequence of symbols a1a2ana_1a_2\ldots a_n from an alphabet is called a string over that alphabet.


Example: Σ={0,1}\Sigma = \{0,1\} is an alphabet, and 011, 1010, and 1 are all strings over Σ\Sigma.


Note that strings really are sequences of symbols, which implies that order matters. Thus 011, 101, and 110 are all different strings, though they are made up of the same symbols. The strings x=a1a2amx=a_1a_2\ldots a_m and y=b1b2bny=b_1b_2\ldots b_n are equal only if m=nm=n (i.e., the strings contain the same number of symbols) and ai=bia_i=b_i for all 1in1 \leq i \leq n.

String Operations

Just as there are operations defined on numbers, truth values, sets, and other mathematical entities, there are operations defined on strings. Some important operations are:

  • length: the length of a string xx is the number of symbols in it. The notation for the length of xx is x|x|. Note that this is consistent with other uses of  |\ |, all of which involve some notion of size: number|number| measures how big a number is (in terms of its distance from 0); set|set| measures the size of a set (in terms of the number of elements).

    We will occasionally refer to a length-n string. This is a slightly awkward, but concise, shorthand for "a string whose length is nn".

  • concatenation: the concatenation of two strings x=a1a2amx=a_1 a_2\ldots a_m and y=b1b2bny=b_1b_2\ldots b_n is the sequence of symbols a1amb1bna_1\ldots a_mb_1\ldots b_n. Sometimes \cdot is used to denote concatenation, but it is far more usual to see the concatenation of xx and yy denoted by xyxy than by xyx\cdot y. You can easily convince yourself that concatenation is associative (i.e., (xy)z=x(yz)(xy)z = x(yz) for all strings x,yx,y and zz). Concatenation is not commutative (i.e., it is not always true that xy=yxxy = yx: for example, if x=ax=a and y=by=b then xy=abxy=ab while yx=bayx=ba and, as discussed above, these strings are not equal).

  • reversal: the reverse of a string x=a1a2anx=a_1a_2\ldots a_n is the string xR=anan1a2a1x^R = a_na_{n-1}\ldots a_2a_1.


Example: Let Σ={a,b}\Sigma = \{a,b\}, x=ax=a, y=abaay=abaa, and z=babz=bab. Then x=1|x| = 1, y=4|y| = 4, and z=3|z|=3. Also, xx=aaxx = aa, xy=aabaaxy = aabaa, xz=ababxz = abab, and zx=babazx = baba. Finally, xR=ax^R = a, yR=aabay^R = aaba, and zR=babz^R=bab.


By the way, the previous example illustrates a naming convention standard throughout language theory texts: if a letter is intended to represent a single symbol in an alphabet, the convention is to use a letter from the beginning of the English alphabet (aa, bb, cc, dd); if a letter is intended to represent a string, the convention is to use a letter from the end of the English alphabet (uu, vv, etc.).

Empty String

In set theory, we have a special symbol to designate the set that contains no elements. Similarly, language theory has a special symbol ε\varepsilon which is used to represent the empty string, the string with no symbols in it. (Some texts use the symbol λ\lambda instead.) It is worth noting that ε=0|\varepsilon| = 0, that εR=ε\varepsilon^R = \varepsilon, and that εx=xε=x\varepsilon \cdot x = x \cdot \varepsilon = x for all strings xx. (This last fact may appear a bit confusing. Remember that ε\varepsilon is not a symbol in a string with length 1, but rather the name given to the string made up of 0 symbols. Pasting those 0 symbols onto the front or back of a string xx still produces xx.)

The set of all strings over an alphabet Σ\Sigma is denoted Σ\Sigma^*. (In language theory, the symbol ^* is typically used to denote "zero or more", so Σ\Sigma^* is the set of strings made up of zero or more symbols from Σ\Sigma.) Note that while an alphabet Σ\Sigma is by definition a finite set of symbols, and strings are by definition finite sequences of those symbols, the set Σ\Sigma^* is always infinite (as long as Σ\Sigma is not empty). Why is this? Suppose Σ\Sigma contains nn elements. Then there is one string over Σ\Sigma with 0 symbols, nn strings with 1 symbol, n2n^2 strings with 2 symbols (since there are nn choices for the first symbol and nn choices for the second), n3n^3 strings with 3 symbols, etc.


Example: If Σ={1}\Sigma = \{1\}, then Σ={ε,1,11,111,}\Sigma^* = \{\varepsilon, 1, 11, 111, \ldots\}. If Σ={a,b}\Sigma = \{a,b\}, then Σ={ε,a,b,aa,ab,ba,bb,aaa,aab,}\Sigma^* = \{ \varepsilon, a, b, aa, ab, ba, bb, aaa, aab, \ldots\}.


We now come to the definition of a language in the formal language theoretical sense.

A language over an alphabet Σ\Sigma is a subset of Σ\Sigma^*. Thus, a language over Σ\Sigma is an element of P(Σ){\mathscr P}(\Sigma^*), the power set of Σ\Sigma^*.

In other words, any set of strings (over alphabet Σ\Sigma) constitutes a language (over alphabet Σ\Sigma).


Example: Let Σ={0,1}\Sigma = \{0,1\}. Then the following are all languages over Σ\Sigma:

  • L1={011,1010,111}L_1 = \{011, 1010, 111\}

  • L2={0,10,110,1110,11110,}L_2 = \{0, 10, 110, 1110, 11110, \ldots\}

  • L3={xΣ  n0(x)=n1(x)}L_3 = \{x \in \Sigma^* \ | \ n_0(x) = n_1(x) \}, where the notation n0(x)n_0(x) stands for the number of 0's in the string xx, and similarly for n1(x)n_1(x).

  • L4={x x represents a multiple of 5 in binary}L_4 = \{x \ | x\textrm{ represents a multiple of 5 in binary}\}


Note that languages can be either finite or infinite. Because Σ\Sigma^* is infinite, it clearly has an infinite number of subsets, and so there are an infinite number of languages over Σ\Sigma.

A programming language, such as Java or ReasonML, may be viewed as a formal language: if Σ\Sigma is the set of all ASCII (or, more generally, Unicode) characters, then the set of all valid programs in a particular programming language is a language over Σ\Sigma. We will see later how to formally specify such languages, but one step is often to describe smaller languages, such as the set of all keywords in the language (for Java, it is the 51-element set {\{abstract, assert, boolean, , while, _}\}), or the set of valid identifiers (in Java, a valid identifier starts with a letter, $, or _, and continues with zero or more letters, digits, $, or _1, except it may not be a keyword or reserved word such as true, false, or null).

Languages are sets and therefore, as for any sets, it makes sense to talk about the union, intersection, and complement of languages. (When taking the complement of a language over an alphabet Σ\Sigma, we always consider the univeral set to be Σ\Sigma^*, the set of all strings over Σ\Sigma.) Because languages are sets of strings, there are additional operations that can be defined on languages, operations that would be meaningless on more general sets. For example, the idea of concatenation can be extended from strings to languages.

For two sets of strings SS and TT, we define the concatenation of SS and TT (denoted STS\cdot T or just STST) to be the set ST={st  sStT}ST = \{ st \ | \ s \in S \land t \in T \}. For example, if S={ab,aab}S = \{ab, aab\} and T={ε,110,1010}T=\{\varepsilon, 110, 1010\}, then ST={ab,ab110,ab1010,aab,aab110,aab1010}ST = \{ab, ab110, ab1010, aab, aab110, aab1010\}. Note in particular that abSTab \in ST, because abSab \in S, εT\varepsilon \in T, and abε=abab \cdot \varepsilon = ab. Because concatenation of sets is defined in terms of the concatenation of the strings that the sets contain, concatenation of sets is associative and not commutative. (This can easily be verified.)

When a set SS is concatenated with itself, the notation SSSS is usually scrapped in favour of S2S^2; if S2S^2 is concatenated with SS, we write S3S^3 for the resulting set, etc. So S2S^2 is the set of all strings formed by concatenating two (possibly different, possibly identical) strings from SS, S3S^3 is the set of strings formed by concatenating three strings from SS, etc. Extending this notation, we take S1S^1 to be the set of strings formed from one string in SS (i.e., S1S^1 is SS itself), and S0S^0 to be the set of strings formed from zero strings in SS (i.e., S0={ε}S^0 = \{\varepsilon\}). If we take the union S0S1S2S^0 \cup S^1 \cup S^2 \cup \ldots, then the resulting set is the set of all strings formed by concatenating zero or more strings from SS, and is denoted SS^*. The set SS^* is called the Kleene closure2 of SS, and the ^* operator is called the Kleene star operator.


Example: Let S={01,ba}S = \{01, ba\}. Then

  • S0={ε}S^0 = \{\varepsilon\}

  • S1={01,ba}S^1 = \{01, ba\}

  • S2={0101,01ba,ba01,baba}S^2 = \{0101, 01ba, ba01, baba\}

  • S3={010101,0101ba,01ba01,01baba,ba0101,ba01ba,baba01,bababa}S^3 = \{010101, 0101ba, 01ba01, 01baba, ba0101, ba01ba, baba01, bababa\}

etc., so S={ε,01,ba,0101,01ba,ba01,baba,010101,0101ba,}.S^* =\{\varepsilon,01,ba,0101,01ba,ba01,baba,010101,0101ba,\ldots\}.


Note that this is the second time we have seen the notation something\textit{something}^*. We have previously seen that for an alphabet Σ\Sigma, Σ\Sigma^* is defined to be the set of all strings over Σ\Sigma. If you think of Σ\Sigma as being a set of length-1 strings, and take its Kleene closure, the result is once again the set of all strings over Σ\Sigma, and so the two notions of ^* coincide.


Example: Let Σ={a,b}\Sigma = \{a,b\}. Then

  • Σ0={ε}\Sigma^0 = \{\varepsilon\}

  • Σ1={a,b}\Sigma^1 = \{a,b\}

  • Σ2={aa,ab,ba,bb}\Sigma^2 = \{aa, ab, ba, bb\}

  • Σ3={aaa,aab,aba,abb,baa,bab,bba,bbb}\Sigma^3 = \{aaa, aab, aba, abb, baa, bab, bba, bbb\}

etc., so Σ={ε,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,}.\Sigma^* =\{\varepsilon,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,\ldots\}.


Exercises

  1. Let S={ε,ab,abab}S = \{\varepsilon, ab, abab\} and T={aa,aba,abba,abbba,}T = \{aa, aba, abba, abbba, \ldots\}. Find the following:

    • S2S^2
      Answer

      {ε,ab,abab,ababab,abababab}\{\varepsilon, ab, abab, ababab, abababab\}

    • S3S^3
      Answer

      {ε,ab,abab,ababab,abababab,ababababab,abababababab}\{\varepsilon, ab, abab, ababab, abababab, ababababab, abababababab\}

    • SS^*
      Answer

      {ε,ab,(ab)2,(ab)3,}\{\varepsilon, ab, (ab)^2, (ab)^3, \ldots\}

    • STST
      Answer

      {aa,abaa,(ab)2aa,aba,ababa,(ab)2aba,ab2a,abab2a,(ab)2ab2a,ab3a,abab3a,(ab)2ab3a,}\{aa, abaa, (ab)^2aa, aba, ababa, (ab)^2aba, ab^2a, abab^2a, (ab)^2ab^2a, ab^3a, abab^3a, (ab)^2ab^3a, \ldots\}

    • TSTS
      Answer

      {abia(ab)j  0i0j2}\{ab^ia(ab)^j\ |\ 0\leq i\land 0\leq j\leq 2\}

  2. The reverse of a language LL is defined to be LR={xR  xL}L^R = \{ x^R \ | \ x \in L\}. Find SRS^R and TRT^R for the SS and TT in the preceding problem.

    Answer

    SR={ε,ba,baba}S^R = \{\varepsilon, ba, baba\} and TR=TT^R = T.

  3. Give an example of a language LL such that L=LL=L^*.

    Answer

    Any language LL where L=ML=M^* for some language MM will satisfy the property. The only finite language LL that satisfies the property is {ε}\{\varepsilon\}, because if LL contains any non-empty string ww then it must also contain the infinite set of strings w2w^2, w3w^3, etc.


  1. Technically there are a few more classes of characters that are allowed in Java identifiers, such as other currency symbols and "letter numbers", and the categories of "letters" and "digits" themselves contain thousands of choices in Unicode.
  2. Kleene is pronounced KLAY-nee; the operation is named after the mathematician Stephen Kleene.