Languages
(Content adapted from Critchlow & Eck)
With the set of mathematical tools from the first two chapters, we are now ready to study languages and formal language theory. Our intent is to examine the question of how, and which, languages can be mechanically generated and recognized; and, ultimately, to see what this tells us about what computers can and can't do.
In formal language theory, an alphabet is a finite, non-empty set. The elements of the set are called symbols. A finite sequence of symbols from an alphabet is called a string over that alphabet.
Example: is an alphabet, and 011
,
1010
, and 1
are all strings over .
Note that strings really are sequences of symbols, which
implies that
order matters. Thus 011
, 101
, and 110
are all
different strings, though they are made up of the same symbols.
The strings and are equal only
if (i.e., the strings contain the same number of symbols) and
for all
.
String Operations
Just as there are operations defined on numbers, truth values, sets, and other mathematical entities, there are operations defined on strings. Some important operations are:
length: the length of a string is the number of symbols in it. The notation for the length of is . Note that this is consistent with other uses of , all of which involve some notion of size: measures how big a number is (in terms of its distance from 0); measures the size of a set (in terms of the number of elements).
We will occasionally refer to a length-n string. This is a slightly awkward, but concise, shorthand for "a string whose length is ".
concatenation: the concatenation of two strings and is the sequence of symbols . Sometimes is used to denote concatenation, but it is far more usual to see the concatenation of and denoted by than by . You can easily convince yourself that concatenation is associative (i.e., for all strings and ). Concatenation is not commutative (i.e., it is not always true that : for example, if and then while and, as discussed above, these strings are not equal).
reversal: the reverse of a string is the string .
Example: Let , , , and . Then , , and . Also, , , , and . Finally, , , and .
By the way, the previous example illustrates a naming convention standard throughout language theory texts: if a letter is intended to represent a single symbol in an alphabet, the convention is to use a letter from the beginning of the English alphabet (, , , ); if a letter is intended to represent a string, the convention is to use a letter from the end of the English alphabet (, , etc.).
Empty String
In set theory, we have a special symbol to designate the set that contains no elements. Similarly, language theory has a special symbol which is used to represent the empty string, the string with no symbols in it. (Some texts use the symbol instead.) It is worth noting that , that , and that for all strings . (This last fact may appear a bit confusing. Remember that is not a symbol in a string with length 1, but rather the name given to the string made up of 0 symbols. Pasting those 0 symbols onto the front or back of a string still produces .)
The set of all strings over an alphabet is denoted . (In language theory, the symbol is typically used to denote "zero or more", so is the set of strings made up of zero or more symbols from .) Note that while an alphabet is by definition a finite set of symbols, and strings are by definition finite sequences of those symbols, the set is always infinite (as long as is not empty). Why is this? Suppose contains elements. Then there is one string over with 0 symbols, strings with 1 symbol, strings with 2 symbols (since there are choices for the first symbol and choices for the second), strings with 3 symbols, etc.
Example: If , then . If , then .
We now come to the definition of a language in the formal language theoretical sense.
A language over an alphabet is a subset of . Thus, a language over is an element of , the power set of .
In other words, any set of strings (over alphabet ) constitutes a language (over alphabet ).
Example: Let . Then the following are all languages over :
, where the notation stands for the number of 0's in the string , and similarly for .
Note that languages can be either finite or infinite. Because is infinite, it clearly has an infinite number of subsets, and so there are an infinite number of languages over .
A programming language, such as Java or ReasonML, may be viewed as a formal language:
if is the set of all ASCII (or, more generally, Unicode) characters, then
the set of all valid programs in a particular programming language is a language over
. We will see later how to formally specify such languages, but one step is
often to describe smaller languages, such as the set of all keywords in the language
(for Java, it is the 51-element set abstract
, assert
, boolean
, …, while
, _
),
or the set of valid identifiers (in Java, a valid identifier starts with a letter, $
, or _
, and
continues with zero or more letters, digits, $
, or _
1, except it may not be a keyword or reserved word such as true
, false
, or null
).
Languages are sets and therefore, as for any sets, it makes sense to talk about the union, intersection, and complement of languages. (When taking the complement of a language over an alphabet , we always consider the univeral set to be , the set of all strings over .) Because languages are sets of strings, there are additional operations that can be defined on languages, operations that would be meaningless on more general sets. For example, the idea of concatenation can be extended from strings to languages.
For two sets of strings and , we define the concatenation of and (denoted or just ) to be the set . For example, if and , then . Note in particular that , because , , and . Because concatenation of sets is defined in terms of the concatenation of the strings that the sets contain, concatenation of sets is associative and not commutative. (This can easily be verified.)
When a set is concatenated with itself, the notation is usually scrapped in favour of ; if is concatenated with , we write for the resulting set, etc. So is the set of all strings formed by concatenating two (possibly different, possibly identical) strings from , is the set of strings formed by concatenating three strings from , etc. Extending this notation, we take to be the set of strings formed from one string in (i.e., is itself), and to be the set of strings formed from zero strings in (i.e., ). If we take the union , then the resulting set is the set of all strings formed by concatenating zero or more strings from , and is denoted . The set is called the Kleene closure2 of , and the operator is called the Kleene star operator.
Example: Let . Then
etc., so
Note that this is the second time we have seen the notation . We have previously seen that for an alphabet , is defined to be the set of all strings over . If you think of as being a set of length-1 strings, and take its Kleene closure, the result is once again the set of all strings over , and so the two notions of coincide.
Example: Let . Then
etc., so
Exercises
Let and . Find the following:
Answer
Answer
Answer
Answer
Answer
The reverse of a language is defined to be . Find and for the and in the preceding problem.
Answer
and .
Give an example of a language such that .
Answer
Any language where for some language will satisfy the property. The only finite language that satisfies the property is , because if contains any non-empty string then it must also contain the infinite set of strings , , etc.
- Technically there are a few more classes of characters that are allowed in Java identifiers, such as other currency symbols and "letter numbers", and the categories of "letters" and "digits" themselves contain thousands of choices in Unicode.↩
- Kleene is pronounced KLAY-nee; the operation is named after the mathematician Stephen Kleene.↩