Syntax definition
The syntax of textual programming languages is usually defined using a combination of
regular expressions (for
lexical structure) and
Backus-Naur Form (for
grammatical structure) to inductively specify
syntactic categories (nonterminals) and
terminal symbols. Syntactic categories are defined by rules called
productions, which specify the values that belong to a particular syntactic category.
[1] Terminal symbols are the concrete characters or strings of characters (for example
keywords such as
define,
if,
let, or
void) from which syntactically valid programs are constructed.
Below is a simple grammar, based on
Lisp, which defines productions for the syntactic categories
expression,
atom,
number,
symbol, and
list:
expression ::= atom | list
atom ::= number | symbol
number ::= [+-]?['0'-'9']+
symbol ::= ['A'-'Z''a'-'z'].*
list ::= '(' expression* ')'
This grammar specifies the following:
- an expression is either an atom or a list;
- an atom is either a number or a symbol;
- a number is an unbroken sequence of one or more decimal digits, optionally preceded by a plus or minus sign;
- a symbol is a letter followed by zero or more of any characters (excluding whitespace); and
- a list is a matched pair of parentheses, with zero or more expressions inside it.
Here the decimal digits, upper- and lower-case characters, and parentheses are terminal symbols.
The following are examples of well-formed token sequences in this grammar: '
12345
', '
()
', '
(a b c232 (1))
'
The grammar needed to specify a programming language can be classified by its position in the
Chomsky hierarchy. The syntax of most programming languages can be specified using a Type-2 grammar, i.e., they are
context-free grammars.
[2] However, there are exceptions. In some languages like Perl and Lisp the specification (or implementation) of the language allows constructs that execute during the parsing phase. Furthermore, these languages have constructs that allow the programmer to alter the behavior of the parser. This combination effectively blurs the distinction between parsing and execution, and makes syntax analysis an
undecidable problem in these languages, meaning that the parsing phase may not finish. For example, in Perl it is possible to execute code during parsing using a
BEGIN
statement, and Perl function prototypes may alter the syntactic interpretation, and possibly even the syntactic validity of the remaining code.
[3] Similarly, Lisp
macros introduced by the
defmacro
syntax also execute during parsing, meaning that a Lisp compiler must have an entire Lisp run-time system present. In contrast C macros are merely string replacements, and do not require code execution.
[4][5]
Syntax versus semantics
The syntax of a language describes the form of a valid program, but does not provide any information about the meaning of the program or the results of executing that program. The meaning given to a combination of symbols is handled by semantics (either
formal or hard-coded in a
reference implementation). Not all syntactically correct programs are semantically correct. Many syntactically correct programs are nonetheless ill-formed, per the language's rules; and may (depending on the language specification and the soundness of the implementation) result in an error on translation or execution. In some cases, such programs may exhibit
undefined behavior. Even when a program is well-defined within a language, it may still have a meaning that is not intended by the person who wrote it.
Using
natural language as an example, it may not be possible to assign a meaning to a grammatically correct sentence or the sentence may be false:
- "Colorless green ideas sleep furiously." is grammatically well-formed but has no generally accepted meaning.
- "John is a married bachelor." is grammatically well-formed but expresses a meaning that cannot be true.
The following C language fragment is syntactically correct, but performs an operation that is not semantically defined (because
p is a
null pointer, the operations
p->real and
p->im have no meaning):
complex *p = NULL;
complex abs_p = sqrt (p->real * p->real + p->im * p->im);
No comments:
Post a Comment