Wednesday, February 08, 2006

Class IV: Describing Syntax - BNF

Formal definition of the syntax.
Why?
1) feedback cycle in language formation
2) implementors of compilers/interpreters
3) users of language

syntax = form of the language, vs. semantics = meaning of language

syntax of if:
if ({expr}) {statement}
but this tells us nothing about semantics.

ALGOL difficulties.

Describing Syntax:
1) Languages (natural and programming) are sets of strings of chars from an alphabet. These strings are sentences or statements. Syntax rules: which sentences are in the language.

Formal descriptions often don't deal with basic parts - lexemes, which might be specified by a separate lexical specification. Lex vs. Yacc. Lexemes: identifiers, literals, ops, keywords. programs = string of chars, or string of lexemes.

token = category of its lexemes. the output of Lex.
identifier might be total, count, etc..
plus_op just has + as the literal

index = 2 * count + 17;

Q: what are the tokens, what are the lexemes?

2) Two ways of defining lang: lang recognizer vs. lang generator. R (recognizer) takes in input and tells if it valid program. G generates all possible programs. These are related problems.

3) Formal methods of Describing Syntax:

BNF - Backus Naur Form. related to Chomsky's context free grammars, in context of natural language. ALGOL 58, ALGOL 60. Panini to describe Sanskrit.

series of rules, productions.
Alas, the limits of blogspot. {} instead of pointy brackets.
{assign} -> {var} = {expression}

terminals vs. nonterminals. capital vs. lowercase. BNF grammar = collection of rules.

| for "or." Pascal:
{if_statement} -> if {logic_expression} then {statement}
|
{if_statement} -> if {logic_expression} then {statement} else {statement}


use recursion to describe lists:
{identifier_list} -> identifier | identifier, {identifier_list}

Grammars, derivations:
Start symbol S, {program}. Sentence generation called derivation.

{program} -> begin {statement_list} end
{statement_list} -> {statement} | {statement} ; {statement_list}
{statement} -> {var} = {expression}
{var} -> A | B | C
{expression} -> {var} + {var} | {var} - {var} | {var}

show derivation for begin A = B + C; B = C end

derivation order matters not, but this common approach is leftmost derivation.
HW: To implement such a grammar in C++ using STL and rand function, of by labelling rules.

Simple Grammar for Assignment:
{assign} -> {id} = {expr}
{id} -> A | B | C
{expr} -> {id} + {expr} | {id} * {expr} | ( {expr} ) | {id}

show derivation for A = B * (A+C)

4) Parse trees
show parse tree for above. parse tree matches a derivation.

5) Ambiguity
An ambiguous grammar:

{assign} -> {id} = {expr}
{id} -> A | B | C
{expr} -> {expr} + {expr} | {expr} * {expr} | ( {expr} ) | {id}

without using parentheses, what is the meaning of A = B + C * A?
Two parse trees, one with the first expansion of {expr} being the *, the other with the first being the +. Lower down in the tree, the higher its precedence.

No way to uniquely define the meaning of the structure.

6) Operator precedence:
An unambiguous grammar.
First, simple, example where went to {id} was unambiguous, but wrong precedence. Rightmost is lowest in tree, thus highest precedence.
Solution:

{assign} -> {id} = {expr}
{id} -> A | B | C
{expr} -> {expr} + {term} | {term}
{term} -> {term} * {factor} | {factor}
{factor} -> ({expr}) | {id}

show derivation of A = B + C * A on board, and simultaneously draw parse tree.

7) Associativity.
A = B + C + A
does grouping, ordering matter? Based on prev grammar, would do B + C at lowest level, first. Does in cases of large floats with limited precision. Using our unambiguous grammar, we get a left associative interpretation. (Right associative not possible because we do not have {expr} going into two {expr}s. Left associative because recursion all the way to the left. Left recursive. Can make right recursive and thus right associative.

e.g.

{factor} --> {expr} ** {factor} | {expr}
{expr} --> ({expr}) | {id}

Unambiguous if-else
look at if-else definition above. if have {statement} -> {if_statement}, ambiguity. example in C. show the two parse trees. can require begin and end
an unambiguous grammar:
the rule is that an else is matched with nearest previous unmatched then.

{statement} -> {matched} | {unmatched}
{matched} -> if {logic_expr} then {matched} else {matched} | any non-if statement
{unmatched} -> if {logic_expr} then {statement} |
if {logic_expr} then {matched} else {unmatched}

EBNF
{selection} -> if ({expr} ) {statement} [else {statement} ]
brackets for optional component rather than |. Optional recurring via actual curly braces, rather than using recursion:
{ident_list} -> {identifier} {{identifier}}

place options in non-terminal parentheses and separate via |.

Some allow number after right } to indicate upper limit on repetitions, or + to mean 1+.


No comments: