(Content adapted from Critchlow & Eck)
Context-free grammars are used to describe some aspects of
the syntax of programming languages. However, a notation
that is often used for grammars in the context of programming languages
is somewhat different from the notation introduced in the
preceding section. The notation that is used is called
Backus-Naur Form or BNF. It is named after computer
scientists John Backus and Peter Naur, who developed the
notation. Actually, several variations of BNF exist.
I will discuss one of them here. BNF can be used to describe
the syntax of natural languages, as well as programming languages,
and some of the examples in this section will deal with the
syntax of English.
Like context-free grammars, BNF grammars make use of production rules, non-terminals,
and terminals. The non-terminals are usually given meaningful,
multi-character names. Here, I will follow a common practice
of enclosing non-terminals in angle brackets, so that they can
be easily distinguished. For example, ⟨noun⟩ and ⟨sentence⟩
could be non-terminals in a BNF grammar for English, while
⟨program⟩ and ⟨if-statement⟩ might be used in a BNF grammar
for a programming language. Note that a BNF non-terminal
usually represents a meaningful syntactic category,
that is, a certain type of building block in the syntax of
the language that is being described, such as an adverb,
a prepositional phrase, or a variable declaration statement.
The terminals of a BNF grammar are the things that actually
appear in the language that is being described. In the case
of natural language, the terminals are individual words.
In BNF production rules, I will use the symbol "::="
in place of the "⟶" that is used in context-free grammars.
BNF production rules are more powerful than the production rules in
context-free grammars. That is, one BNF rule might be equivalent to
several context-free grammar rules. As for context-free grammars,
the left-hand side of a BNF production rule is a single
non-terminal symbol. The right hand side can include terminals
and non-terminals, and can also use the following notations,
which should remind you of notations used in regular expressions:
A vertical bar, ∣, indicates a choice of
alternatives. For example,
⟨digit⟩ ::= 0∣1∣2∣3∣4∣5∣6∣7∣8∣9 indicates that the non-terminal ⟨digit⟩ can be replaced
by any one of the terminal symbols 0, 1, \dots, 9.
Items enclosed in brackets are optional. For example,
⟨declaration⟩ ::= ⟨type⟩ ⟨variable⟩ [ = ⟨expression⟩ ] ; says that ⟨declaration⟩ can be replaced either
by "⟨type⟩ ⟨variable⟩ ;" or by "⟨type⟩ ⟨variable⟩ = ⟨expression⟩ ;".
(The symbols "=" and ";" are terminal symbols in this rule.)
Items enclosed between "[" and "]…"
can be repeated zero or more times. (This has the same effect
as a "∗"in a regular expression.) For example,
⟨integer⟩ ::= ⟨digit⟩ [ ⟨digit⟩ ]… says that an ⟨integer⟩ consists of a ⟨digit⟩ followed
optionally by any number of additional ⟨digit⟩'s. That is,
the non-terminal ⟨integer⟩ can be replaced by ⟨digit⟩ or
by ⟨digit⟩ ⟨digit⟩ or by ⟨digit⟩ ⟨digit⟩ ⟨digit⟩, and so on.
Parentheses can be used as usual, for grouping.
All these notations can be expressed in a context-free grammar
by introducing additional production rules. For example, the
BNF rule "⟨sign⟩ ::= +∣−" is equivalent
to the two rules, "⟨sign⟩ ::= +"
and "⟨sign⟩ ::= −". A rule that contains an
optional item can also be replaced by two rules. For example,
⟨declaration⟩ ::= ⟨type⟩ ⟨variable⟩ [ = ⟨expression⟩ ] ; can be replaced by the two rules
⟨declaration⟩ ::= ⟨type⟩ ⟨variable⟩ ;⟨declaration⟩ ::= ⟨type⟩ ⟨variable⟩ = ⟨expression⟩ ; In context-free grammars, repetition can be expressed by
using a recursive rule such as "S⟶aS", in which the
same non-terminal symbol appears both on the left-hand side and on the right-hand
side of the rule. BNF-style notation using "[" and "]…" can
be eliminated by replacing it with a new non-terminal symbol and adding
a recursive rule to allow that symbol to repeat zero or more times.
For example, the production rule
⟨integer⟩ ::= ⟨digit⟩ [ ⟨digit⟩ ]… can be replaced by three rules using a new non-terminal symbol
⟨digit-list⟩ to represent a string of zero or more ⟨digit⟩'s:
⟨integer⟩ ::= ⟨digit⟩ ⟨digit-list⟩⟨digit-list⟩ ::= ⟨digit⟩ ⟨digit-list⟩⟨digit-list⟩ ::= ε As an example of a complete BNF grammar, let's look at a BNF grammar for a very small
subset of English. The start symbol for the grammar
is ⟨sentence⟩, and the terminal symbols are English words.
All the sentences that can be produced from this grammar
are syntactically correct English sentences, although you wouldn't
encounter many of them in conversation. Here is the grammar:
⟨sentence⟩ ::= ⟨simple-sentence⟩ [ and ⟨simple-sentence⟩ ]…⟨simple-sentence⟩ ::= ⟨noun-part⟩ ⟨verb-part⟩⟨noun-part⟩ ::= ⟨article⟩ ⟨noun⟩ [ who ⟨verb-part⟩ ]…⟨verb-part⟩ ::= ⟨intransitive-verb⟩∣( ⟨transitive-verb⟩ ⟨noun-part⟩ )⟨article⟩ ::= the∣a⟨noun⟩ ::= man∣woman∣dog∣cat∣computer⟨intransitive-verb⟩ ::= runs∣jumps∣hides⟨transitive-verb⟩ ::= knows∣loves∣chases∣owns This grammar can generate sentences such as "A dog chases the cat and
the cat hides" and "The man loves a woman who runs."
The second sentence, for example, is generated by the derivation
⟨sentence⟩ ⟹ ⟨simple-sentence⟩⟹ ⟨noun-part⟩ ⟨verb-part⟩⟹ ⟨article⟩ ⟨noun⟩ ⟨verb-part⟩⟹ the ⟨noun⟩ ⟨verb-part⟩⟹ the man ⟨verb-part⟩⟹ the man ⟨transitive-verb⟩ ⟨noun-part⟩⟹ the man loves ⟨noun-part⟩⟹ the man loves ⟨article⟩ ⟨noun⟩ who ⟨verb-part⟩⟹ the man loves a ⟨noun⟩ who ⟨verb-part⟩⟹ the man loves a woman who ⟨verb-part⟩⟹ the man loves a woman who ⟨intransitive-verb⟩⟹ the man loves a woman who runs BNF is most often used to specify the syntax of programming languages.
Most programming languages are not, in fact, context-free languages, and
BNF is not capable of expressing all aspects of their syntax.
For example, BNF cannot express the fact that a variable must
be declared before it is used or the fact that the number of
actual parameters in a subroutine call statement must match the number
of formal parameters in the declaration of the subroutine.
So BNF is used to express the context-free aspects of the syntax
of a programming language, and other restrictions on the syntax—such
as the rule about declaring a variable before it is used—are expressed
using informal English descriptions.
When BNF is applied to programming languages, the terminal symbols
are generally "tokens," which are the minimal meaningful units in
a program. For example, the pair of symbols <=
constitute a
single token, as does a string such as "Hello World"
.
Every number is represented by a single token. (The actual value
of the number is stored as a so-called "attribute" of the token,
but the value plays no role in the context-free syntax of the
language.) I will use the
symbol number to represent a numerical token.
Similarly, every variable name, subroutine name, or other identifier
in the program is represented by the same token, which I will denote
as ident. One final complication: Some symbols
used in programs, such as "]" and "(", are also used with
a special meaning in BNF grammars. When such a symbol occurs as
a terminal symbol, I will enclose it in double quotes. For
example, in the BNF production rule
⟨array-reference⟩ ::= ident ‘‘[" ⟨expression⟩ ‘‘]" the "[" and "]" are terminal symbols in the language
that is being described, rather than the BNF notation for an
optional item. With this notation, here is part of a BNF
grammar that describes statements in the Java programming
language:
⟨statement⟩ ::= ⟨block-statement⟩∣⟨if-statement⟩∣⟨while-statement⟩∣⟨assignment-statement⟩∣⟨null-statement⟩⟨block-statement⟩ ::= { [ ⟨statement⟩ ]… }⟨if-statement⟩ ::= if ‘‘(" ⟨condition⟩ ‘‘)" ⟨statement⟩ [ else ⟨statement⟩ ]⟨while-statement⟩ ::= while ‘‘(" ⟨condition⟩ ‘‘)" ⟨statement⟩⟨assignment-statement⟩ ::= ⟨variable⟩ = ⟨expression⟩ ;⟨null-statement⟩ ::= ε The non-terminals ⟨condition⟩, ⟨variable⟩, and
⟨expression⟩ would, of course, have to be defined by other
production rules in the grammar. Here is a set of rules that
define simple expressions, made up of numbers, identifiers,
parentheses and the arithmetic operators +, −, ∗ and /:
⟨expression⟩ ::= ⟨term⟩ [ ( +∣− ) ⟨term⟩ ]…⟨term⟩ ::= ⟨factor⟩ [ ( ∗∣/ ) ⟨factor⟩ ]…⟨factor⟩ ::= ident∣number∣‘‘(" ⟨expression⟩ ‘‘)" The first rule says that an ⟨expression⟩ is a sequence of
one or more ⟨term⟩'s, separated by plus or minus signs.
The second rule defines a ⟨term⟩ to be a sequence of one or more
⟨factor⟩'s, separated by multiplication or division operators.
The last rule says that a ⟨factor⟩ can be either an identifier
or a number or an ⟨expression⟩ enclosed in parentheses.
This small BNF grammar can generate expressions
such as "3∗5" and "x∗(x+1)−3/(z+2∗(3−x))+7".
The latter expression is made up of three terms: x∗(x+1),
3/(z+2∗(3−x)), and 7. The first of these terms is made
up of two factors, x and (x+1). The factor (x+1) consists
of the expression x+1 inside a pair of parentheses.
The nice thing about this grammar is that the precedence rules
for the operators are implicit in the grammar. For example, according
to the grammar, the expression 3+5∗7 is seen as ⟨term⟩ + ⟨term⟩
where the first term is 3 and the second term is 5∗7.
The 5∗7 occurs as a group, which must be evaluated before the
result is added to 3. Parentheses can change the order of
evaluation. For example, (3+5)∗7 is generated by the grammar
as a single ⟨term⟩ of the form ⟨factor⟩ ∗ ⟨factor⟩.
The first ⟨factor⟩ is (3+5). When (3+5)∗7 is evaluated,
the value of (3+5) is computed first and then multiplied
by 7. This is an example of how a grammar that describes
the syntax of a language can also reflect its meaning.
Although this section has not introduced any really new ideas
or theoretical results, I hope it has demonstrated how
context-free grammars can be applied in practice.
Exercises
One of the examples in this section was a grammar for
a subset of English. Give five more examples of sentences that
can be generated from that grammar. Your examples should, collectively,
use all the rules of the grammar.
Rewrite the example BNF grammar for a subset of English as
a context-free grammar.
Write a single BNF production rule that is equivalent to
the following context-free grammar:
SSBB⟶aSa⟶bB⟶bB⟶ε Answer
S ::= a(S∣b[b]…)a Write a BNF production rule that specifies the syntax of
real numbers, as they appear in programming languages such as Java and C.
Real numbers can include a sign, a decimal point and an exponential part.
Some examples are: 17.3
, .73
, 23.1e67
, -1.34E-12
, +0.2
, and 100E+100
.
Variable references in the Java programming language can be
rather complicated. Some examples include:
x
, list.next
, A[7]
, a.b.c
, S[i+1].grid[r][c].red
, ….
Write a BNF production rule for Java variables.
You can use the token ident and the non-terminal
⟨expression⟩ in your rule.
Use BNF to express the syntax of the try ... catch
statement in the
Java programming language.
Give a BNF grammar for compound propositions made up
of propositional variables, parentheses, and the logical operators
∧, ∨, and ¬. Use the non-terminal symbol ⟨pv⟩ to represent
a propositional variable. You do not have to give a definition of
⟨pv⟩.
Answer
⟨prop⟩ ::= ⟨term⟩ [∨ ⟨term⟩]…⟨term⟩ ::= ⟨factor⟩ [∧ ⟨factor⟩]…⟨factor⟩ ::= ⟨pv⟩ ∣ ¬ ⟨factor⟩ ∣ ‘‘(" ⟨prop⟩ ‘‘)"