Compiler I: Syntax Analysis

(1)

www.nand2tetris.org

Building a Modern Computer From First Principles


(2)

Course map

(Figure: the course's abstraction hierarchy, top to bottom. Each level is built on the abstract interface of the level below it; the tool or topic that bridges two levels is shown indented, with its chapters.)

Software hierarchy:

Human Thought
   Abstract design (Chapters 9, 12)
H.L. Language & Operating Sys.
   Compiler (Chapters 10 - 11)
Virtual Machine
   VM Translator (Chapters 7 - 8)
Assembly Language
   Assembler (Chapter 6)

Hardware hierarchy:

Machine Language
   Computer Architecture (Chapters 4 - 5)
Hardware Platform
   Gate Logic (Chapters 1 - 3)
Chips & Logic Gates
   Electrical Engineering
Physics

(3)

Motivation: Why study compilers?

One of the first compilers was the FORTRAN compiler, developed by an IBM team led by John Backus (Turing Award, 1977) and delivered in 1957. It took about 18 person-years to build.

Because compilers …

• Are an essential part of applied computer science

• Are very relevant to computational linguistics

• Are implemented using classical programming techniques

• Employ important software engineering principles

• Train you in developing software for transforming one structure to another (programs, files, transactions, …)

• Train you to think in terms of "description languages".

• Parsing files of some complex syntax is very common in many

(4)

The big picture

(Figure: several source languages (Jack language, Some language, Some Other language, ...) are each translated by their own compiler (Jack compiler, Some compiler, Some Other compiler, ...) into a common intermediate code. The intermediate code is then handled by a VM implementation over CISC platforms, a VM implementation over RISC platforms, a VM implementation over the Hack platform, or a VM emulator. These produce CISC machine language, RISC machine language and Hack machine language respectively; the VM emulator is written in a high-level language. The compiler lectures (Projects 10, 11) cover the top tier, the VM lectures (Projects 7-8) the middle tier, and the HW lectures the bottom tier.)

Modern compilers are two-tiered:

• Front-end: from high-level language to some intermediate language

• Back-end: from the intermediate language to binary code.

(5)

Compiler architecture (front end)

(Figure: the big-picture diagram from the previous slide, with the Jack compiler highlighted: Jack language -> Jack compiler -> intermediate code.)

• Syntax analysis: understanding the structure of the source code

   • Tokenizing: creating a stream of "atoms"

   • Parsing: matching the atom stream with the language grammar

   XML output = one way to demonstrate that the syntax analyzer works

• Code generation: reconstructing the semantics using the

(Figure: Jack Program (source) -> Jack Compiler -> VM code (target). Inside the Jack Compiler, the Syntax Analyzer (Chapter 10) consists of the Tokenizer (scanner) followed by the Parser and can emit XML code; Code Generation (Chapter 11) follows the parser and emits the VM code.)

(6)

Tokenizing / Lexical analysis / scanning

• Remove white space

• Construct a token list (language atoms)

• Things to worry about:

   • Language-specific rules: e.g. how to treat "++"

   • Language-specific classifications: keyword, symbol, identifier, integerConstant, stringConstant, ...

(7)

C function to split a string into tokens

• char* strtok(char* str, const char* delimiters);

   • str: string to be broken into tokens

   • delimiters: string containing the delimiter characters

(8)

Jack Tokenizer

Source code:

if (x < 153) {let city = "Paris";}

Tokenizer's output:

<tokens>
<keyword> if </keyword>
<symbol> ( </symbol>
<identifier> x </identifier>
<symbol> &lt; </symbol>
<integerConstant> 153 </integerConstant>
<symbol> ) </symbol>
<symbol> { </symbol>
<keyword> let </keyword>
<identifier> city </identifier>
<symbol> = </symbol>
<stringConstant> Paris </stringConstant>
<symbol> ; </symbol>
<symbol> } </symbol>
</tokens>
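The tokenizing step on this slide can be sketched in a few lines of Python. This is only an illustration, not the course's reference tokenizer: the keyword set is a small subset picked for the example, and the names `tokenize` and `to_xml` are hypothetical. It classifies each atom with one regular expression per token class and escapes `<` as `&lt;` so the output stays well-formed XML.

```python
import re
from xml.sax.saxutils import escape

# Assumption: a small subset of Jack keywords, enough for the slide's example.
KEYWORDS = {"class", "method", "var", "int", "let", "do", "if", "else", "while", "return"}

# One named alternative per token class; whitespace is matched so it can be skipped.
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)
  | (?P<integerConstant>\d+)
  | (?P<stringConstant>"[^"\n]*")
  | (?P<word>[A-Za-z_]\w*)
  | (?P<symbol>[{}()\[\].,;+\-*/&|<>=~])
""", re.VERBOSE)

def tokenize(source):
    """Return a list of (kind, text) pairs for the given source string."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = m.end()
        kind, text = m.lastgroup, m.group()
        if kind == "ws":
            continue                      # white space is removed, not emitted
        if kind == "word":
            kind = "keyword" if text in KEYWORDS else "identifier"
        elif kind == "stringConstant":
            text = text[1:-1]             # drop the surrounding double quotes
        tokens.append((kind, text))
    return tokens

def to_xml(tokens):
    """Render the token list in the <tokens> ... </tokens> layout of the slide."""
    lines = ["<tokens>"]
    lines += [f"<{kind}> {escape(text)} </{kind}>" for kind, text in tokens]
    lines.append("</tokens>")
    return "\n".join(lines)

print(to_xml(tokenize('if (x < 153) {let city = "Paris";}')))
```

A single combined regular expression keeps the sketch short; a hand-written character loop would work equally well and may be easier to extend with comment handling.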

(9)

Parsing

• The tokenizer discussed thus far is part of a larger program called the parser

• Each language is characterized by a grammar. The parser is implemented to recognize this grammar in given texts

• The parsing process:

   • A text is given and tokenized

   • The parser determines whether or not the text can be generated from the grammar

   • In the process, the parser performs a complete structural analysis of the text

• The text can be an expression in a:

   • Natural language (English, …)

(10)

Parsing examples

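English: He ate an apple on the desk.

(Figure: parse tree of the sentence, with the nodes "ate", "he", "an apple", "on" and "the desk".)

Jack: (5+3)*2 - sqrt(9*4)

(Figure: parse tree of the expression, rooted at "-", with inner nodes "*", "+", "sqrt" and "*".)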

(11)

Regular expressions

• a|b*

{ε, "a", "b", "bb", "bbb", …}

• (a|b)*

{ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", …}

• ab*(c|ε)

{"a", "ac", "ab", "abc", "abb", "abbc", …}
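The three languages above can be checked mechanically. The following sketch uses Python's standard `re` module, writing ε as an empty alternative (`(c|)`); the helper name `in_language` and the lists of non-members are chosen here for illustration only.

```python
import re

def in_language(pattern, s):
    """True if the whole string s is matched by the regular expression."""
    return re.fullmatch(pattern, s) is not None

# pattern -> (strings in the language, strings outside it)
examples = {
    r"a|b*":    (["", "a", "b", "bb", "bbb"], ["ab", "aa", "ba"]),
    r"(a|b)*":  (["", "a", "b", "aa", "ab", "ba", "bb", "aaa"], ["c", "abc"]),
    r"ab*(c|)": (["a", "ac", "ab", "abc", "abb", "abbc"], ["", "b", "acc"]),
}

for pattern, (members, non_members) in examples.items():
    assert all(in_language(pattern, s) for s in members)
    assert not any(in_language(pattern, s) for s in non_members)
    print(pattern, "ok")
```

Note that `a|b*` means "either a single a, or any number of b's", which is why "ab" is not in the first language but is in the second.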

(12)

Lex

 A computer program that generates lexical analyzers (scanners or lexers)

 Commonly used with the yacc parser generator.

 Structure of a Lex file

Definition section

%%

Rules section

%%

C code section

(13)

Example of a Lex file

/* Definition section */

%{
/* C code to be copied verbatim */
#include <stdio.h>
%}

/* This tells flex to read only one input file */
%option noyywrap

/* Rules section */

%%

[0-9]+ {
    /* yytext is a string containing the matched text. */
    printf("Saw an integer: %s\n", yytext);
}

.|\n { /* Ignore all other characters. */ }

(14)

Example of a Lex file

%%

/* C Code section */

int main(void) {

/* Call the lexer, then quit. */

yylex();

return 0;

}

(15)

Example of a Lex file

> flex test.lex

(a file lex.yy.c with 1,763 lines is generated)

> gcc lex.yy.c

(an executable file a.out is generated)

> ./a.out < test.txt
Saw an integer: 123
Saw an integer: 2
Saw an integer: 6

test.txt:

abc123z.!&*2gj6

(16)

Another Lex example

%{
#include <stdio.h>
int num_lines = 0, num_chars = 0;
%}

%option noyywrap

%%

\n ++num_lines; ++num_chars;
. ++num_chars;

%%

int main(void) {
    yylex();
    printf( "# of lines = %d, # of chars = %d\n", num_lines, num_chars );
    return 0;
}

(17)

A more complex Lex example

%{
/* need these for printf() and for the calls to atoi() and atof() below */
#include <stdio.h>
#include <stdlib.h>
%}

%option noyywrap

DIGIT [0-9]
ID [a-z][a-z0-9]*

%%

{DIGIT}+ {
    printf( "An integer: %s (%d)\n", yytext, atoi( yytext ) );
}

{DIGIT}+"."{DIGIT}* {
    printf( "A float: %s (%g)\n", yytext, atof( yytext ) );
}

(18)

A more complex Lex example

if|then|begin|end|procedure|function {
    printf( "A keyword: %s\n", yytext );
}

{ID} printf( "An identifier: %s\n", yytext );

"+"|"-"|"="|"("|")" printf( "Symbol: %s\n", yytext );

[ \t\n]+ /* eat up whitespace */

. printf( "Unrecognized char: %s\n", yytext );

%%

int main(int argc, char **argv) {
    /* read from the file named on the command line, else from stdin */
    if ( argc > 1 ) yyin = fopen( argv[1], "r" );
    else yyin = stdin;
    yylex();
    return 0;
}

(19)

A more complex Lex example

pascal.txt:

if (a+b) then foo=3.1416 else
foo=12

output:

A keyword: if
Symbol: (
An identifier: a
Symbol: +
An identifier: b
Symbol: )
A keyword: then
An identifier: foo
Symbol: =
A float: 3.1416 (3.1416)
An identifier: else
An identifier: foo
Symbol: =
An integer: 12 (12)

(20)

Context-free grammar

• Terminals: 0, 1, #

• Non-terminals: A, B

• Start symbol: A

• Rules:

   • A → 0A1

   • A → B

   • B → #

• Simple (terminal) forms / complex (non-terminal) forms

• Grammar = set of rules on how to construct complex forms from simpler forms

• Highly recursive.
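To make the recursion concrete, here is a minimal Python sketch of a recognizer for the grammar above (A → 0A1, A → B, B → #), which generates the strings #, 0#1, 00#11, and so on. The function names are hypothetical; each non-terminal becomes one function that returns the index just past what it consumed, or -1 on failure.

```python
def parse_A(s, i=0):
    """Derive a prefix of s[i:] from A; return the index after it, or -1."""
    if i < len(s) and s[i] == "0":           # rule A -> 0 A 1
        j = parse_A(s, i + 1)                # the recursive part
        if j != -1 and j < len(s) and s[j] == "1":
            return j + 1
        return -1
    return parse_B(s, i)                     # rule A -> B

def parse_B(s, i):
    """Rule B -> #."""
    return i + 1 if i < len(s) and s[i] == "#" else -1

def generated_by_grammar(s):
    """True if the whole string can be derived from the start symbol A."""
    return parse_A(s) == len(s)

for text in ["#", "0#1", "00#11", "0#11", "00#1", ""]:
    print(repr(text), generated_by_grammar(text))
```

The number of leading 0s must equal the number of trailing 1s, a property that no regular expression can enforce but that falls out of the recursive rule directly.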

(21)

Examples of context-free grammar

 S() S  (S) SSS

 Sa|aS|bS

strings ending with ‘a’

 S  x S  y S  S+S S  S-S S  S*S S  S/S S  (S)

(x+y)x-xy/(x+x)

(22)

Examples of context-free grammar

• non-terminals: S, E, Elist

• terminals: ID, NUM, PRINT, +, :=, (, ), ;

• rules:

S → S ; S
S → ID := E
S → PRINT ( Elist )

E → ID
E → NUM
E → E + E
E → ( S , Elist )

Elist → E
Elist → Elist , E

(23)

Examples of context-free grammar


left-most derivation (always expand the left-most non-terminal):

S
⇒ S ; S
⇒ ID := E ; S
⇒ ID := NUM ; S
⇒ ID := NUM ; PRINT ( Elist )
⇒ ID := NUM ; PRINT ( E )
⇒ ID := NUM ; PRINT ( NUM )

right-most derivation (always expand the right-most non-terminal):

S
⇒ S ; S
⇒ S ; PRINT ( Elist )
⇒ S ; PRINT ( E )
⇒ S ; PRINT ( NUM )
⇒ ID := E ; PRINT ( NUM )
⇒ ID := NUM ; PRINT ( NUM )

(24)

Parse tree

 Two derivations, but 1 tree

Both the left-most and the right-most derivation of ID := NUM ; PRINT ( NUM ) from the previous slide correspond to the same parse tree:

S
├─ S
│  ├─ ID
│  ├─ :=
│  └─ E
│     └─ NUM
├─ ;
└─ S
   ├─ PRINT
   ├─ (
   ├─ Elist
   │  └─ E
   │     └─ NUM
   └─ )

(25)

Ambiguous Grammars

• A grammar is ambiguous if the same sequence of tokens can give rise to two or more parse trees

• non-terminals: E

• terminals: ID, NUM, PLUS (+), MUL (*)

• rules:

E → ID
E → NUM
E → E + E
E → E * E

characters: 4 + 5 * 6

tokens: NUM(4) PLUS NUM(5) MUL NUM(6)

(26)

Ambiguous Grammars

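E → ID
E → NUM
E → E + E
E → E * E

characters: 4 + 5 * 6

tokens: NUM(4) PLUS NUM(5) MUL NUM(6)

(Figure: two different parse trees for the same token sequence.)

Tree 1, read as (4 + 5) * 6:

E
├─ E
│  ├─ E: NUM(4)
│  ├─ +
│  └─ E: NUM(5)
├─ *
└─ E: NUM(6)

Tree 2, read as 4 + (5 * 6):

E
├─ E: NUM(4)
├─ +
└─ E
   ├─ E: NUM(5)
   ├─ *
   └─ E: NUM(6)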

(27)

Ambiguous Grammars

• Problem: compilers use parse trees to interpret the meaning of parsed expressions

   • different parse trees have different meanings

   • e.g. (4 + 5) * 6 is not 4 + (5 * 6)

   • languages with ambiguous grammars are DISASTROUS: the meaning of programs isn't well-defined! You can't tell what your program might do!

• Solution: rewrite the grammar to eliminate ambiguity

   • fold precedence rules into the grammar to disambiguate

   • fold associativity rules into the grammar to disambiguate

   • other tricks as well
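As an illustration of folding precedence and associativity into a grammar, the ambiguous rules E → E + E | E * E can be replaced by a layered grammar such as E → T ('+' T)*, T → F ('*' F)*, F → NUM. This particular rewrite is a common textbook choice and is an assumption here, not something taken from the slides. The Python sketch below parses with that grammar and builds the single tree it allows; all function names are hypothetical.

```python
import re

def tokenize(text):
    """Split the input into number and operator tokens (assumed token set)."""
    return re.findall(r"\d+|[+*]", text)

def parse_E(tokens, pos=0):
    """E -> T ('+' T)*; returns (tree, next position)."""
    tree, pos = parse_T(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_T(tokens, pos + 1)
        tree = ("+", tree, right)        # grouping to the left: left-associative
    return tree, pos

def parse_T(tokens, pos):
    """T -> F ('*' F)*; '*' sits one level lower, so it binds tighter than '+'."""
    tree, pos = parse_F(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_F(tokens, pos + 1)
        tree = ("*", tree, right)
    return tree, pos

def parse_F(tokens, pos):
    """F -> NUM."""
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise SyntaxError(f"number expected at token {pos}")
    return int(tokens[pos]), pos + 1

def evaluate(tree):
    """Interpret the parse tree: its shape fixes the meaning."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b

tree, end = parse_E(tokenize("4 + 5 * 6"))
print(tree)            # ('+', 4, ('*', 5, 6))
print(evaluate(tree))  # 34
```

With this grammar the token sequence NUM(4) PLUS NUM(5) MUL NUM(6) has exactly one parse tree, the one that reads as 4 + (5 * 6).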

(28)

Recursive descent parser

• Recursive Descent Parsing

   • aka: predictive parsing; top-down parsing

   • simple, efficient

   • can be coded by hand in ML quickly

   • parses many, but not all, CFGs

      • parses LL(1) grammars

      • Left-to-right parse; Leftmost derivation; 1 symbol of lookahead

   • key ideas:

      • one recursive function for each non-terminal

      • each production becomes one clause in the

(29)

Recursive descent parser

 Non-terminals: S, E, L

 Terminals: NUM, IF, THEN, ELSE, BEGIN, END, PRINT, =, ;

 Rules:

1. S -> IF E THEN S ELSE S

2. | BEGIN S L

3. | PRINT E

4. L -> END

5. | ; S L

6. E -> NUM = NUM

(30)

Recursive descent parser

 Non-terminals: S, E, L

 Terminals: NUM, IF, THEN, ELSE, BEGIN, END, PRINT, =, ;

 Rules:

1. S -> IF E THEN S ELSE S

2. | BEGIN S L

3. | PRINT E

4. L -> END

5. | ; S L

6. E -> NUM = NUM

S() {
   switch (next()) {
      case IF:
         eat(IF); E(); eat(THEN);
         S(); eat(ELSE); S();
         break;
      case BEGIN:
         eat(BEGIN); S(); L();
         break;
      case PRINT:
         eat(PRINT); E();
         break;
      default:
         error();
   }
}

(31)

Recursive descent parser

 Non-terminals: S, E, L

 Terminals: NUM, IF, THEN, ELSE, BEGIN, END, PRINT, EQ(=), SEMI(;)

 Rules:

1. S -> IF E THEN S ELSE S

2. | BEGIN S L

3. | PRINT E

4. L -> END

5. | ; S L

6. E -> NUM = NUM

L() {
   switch (next()) {
      case END:
         eat(END);
         break;
      case SEMI:
         eat(SEMI); S(); L();
         break;
      default:
         error();
   }
}

(32)

Recursive descent parser

 Non-terminals: S, E, L

 Terminals: NUM, IF, THEN, ELSE, BEGIN, END, PRINT, EQ(=), SEMI(;)

 Rules:

1. S -> IF E THEN S ELSE S

2. | BEGIN S L

3. | PRINT E

4. L -> END

5. | ; S L

6. E -> NUM = NUM

E() {

eat(NUM);

eat(EQ);

eat(NUM);

}

(33)

Recursive descent parser

 Non-terminals: S, A, E, L

 Terminals: EOF, ID, NUM, ASSIGN(:=), PRINT,

LPAREN((), RPAREN())

 Rules:

1. S -> A EOF

2. A -> ID := E

3. | PRINT(L)

4. E -> ID

5. | NUM

6. L -> E

7. | L, E

(34)

Recursive descent parser

 Non-terminals: S, A, E, L

 Terminals: EOF, ID, NUM, ASSIGN(:=), PRINT,

LPAREN((), RPAREN())

 Rules:

1. S -> A EOF

2. A -> ID := E

3. | PRINT(L)

4. E -> ID

5. | NUM

6. L -> E

S() {

A();

eat(EOF);

}

(35)

Recursive descent parser

 Non-terminals: S, A, E, L

 Terminals: EOF, ID, NUM, ASSIGN(:=), PRINT,

LPAREN((), RPAREN())

 Rules:

1. S -> A EOF

2. A -> ID := E

3. | PRINT(L)

4. E -> ID

5. | NUM

6. L -> E

7. | L, E

A() {
   switch (next()) {
      case ID:
         eat(ID); eat(ASSIGN);
         E();
         break;
      case PRINT:
         eat(PRINT); eat(LPAREN);
         L(); eat(RPAREN);
         break;
      default:
         error();
   }
}

(36)

Recursive descent parser

 Non-terminals: S, A, E, L

 Terminals: EOF, ID, NUM, ASSIGN(:=), PRINT,

LPAREN((), RPAREN())

 Rules:

1. S -> A EOF

2. A -> ID := E

3. | PRINT(L)

4. E -> ID

5. | NUM

6. L -> E

E() {
   switch (next()) {
      case ID:
         eat(ID);
         break;
      case NUM:
         eat(NUM);
         break;
      default:
         error();
   }
}

(37)

Recursive descent parser

 Non-terminals: S, A, E, L

 Terminals: EOF, ID, NUM, ASSIGN(:=), PRINT,

LPAREN((), RPAREN())

 Rules:

1. S -> A EOF

2. A -> ID := E

3. | PRINT(L)

4. E -> ID

5. | NUM

6. L -> E

7. | L, E

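L() {
   switch (next()) {
      case ID:
         ???
      case NUM:
         ???
   }
}

Problem: one token of lookahead cannot choose between rule 6 (L -> E) and rule 7 (L -> L, E), because both can start with the same token:

E could be ID

L could be E could be ID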

(38)

Recursive descent parser

 Non-terminals: S, A, E, L

 Terminals: EOF, ID, NUM, ASSIGN(:=), PRINT,

LPAREN((), RPAREN())

 Rules:

1. S -> A EOF

2. A -> ID := E

3. | PRINT(L)

4. E -> ID

5. | NUM

6. L -> E

Problem:

E could be ID

L could be E could be ID

Solution: remove the left recursion by replacing the rules for L with

L -> E M

M -> , E M
   | ε (empty)
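The rewritten grammar can now be parsed with one token of lookahead. The sketch below is a Python rendering of the slides' S(), A(), E() and L() functions plus the new M(); it assumes M also has an empty alternative, and the token names (including COMMA) and the `accepts` helper are illustrative choices rather than part of the slides.

```python
class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["EOF"]   # EOF sentinel ends the stream
        self.pos = 0

    def next(self):
        """Look at the current token without consuming it."""
        return self.tokens[self.pos]

    def eat(self, expected):
        """Consume the current token, which must be `expected`."""
        if self.next() != expected:
            raise ParseError(f"expected {expected}, found {self.next()}")
        self.pos += 1

    def S(self):                               # S -> A EOF
        self.A()
        self.eat("EOF")

    def A(self):
        if self.next() == "ID":                # A -> ID := E
            self.eat("ID"); self.eat("ASSIGN"); self.E()
        elif self.next() == "PRINT":           # A -> PRINT ( L )
            self.eat("PRINT"); self.eat("LPAREN"); self.L(); self.eat("RPAREN")
        else:
            raise ParseError(f"unexpected token {self.next()}")

    def E(self):                               # E -> ID | NUM
        if self.next() in ("ID", "NUM"):
            self.eat(self.next())
        else:
            raise ParseError(f"ID or NUM expected, found {self.next()}")

    def L(self):                               # L -> E M
        self.E()
        self.M()

    def M(self):                               # M -> , E M | (empty)
        if self.next() == "COMMA":
            self.eat("COMMA"); self.E(); self.M()
        # any other token: take the empty alternative and consume nothing

def accepts(tokens):
    """True if the token sequence is derivable from S."""
    try:
        Parser(tokens).S()
        return True
    except ParseError:
        return False

print(accepts(["PRINT", "LPAREN", "ID", "COMMA", "NUM", "RPAREN"]))
print(accepts(["PRINT", "LPAREN", "ID", "COMMA", "RPAREN"]))
```

M() decides by looking only at the next token: a comma means another list element follows, anything else ends the list. That single-token decision is what the left-recursive rule L -> L, E did not allow.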

(39)

A typical grammar of a typical C-like language

Code samples

while (expression) {
   if (expression)
      statement;
   while (expression) {
      statement;
      if (expression)
         statement;
   }
   while (expression) {
      statement;
      statement;
   }
}

if (expression) {
   statement;
   while (expression)
      statement;
   statement;
}

if (expression)

if (expression) statement;

}

(40)

A typical grammar of a typical C-like language

program: statement;

statement: whileStatement
         | ifStatement
         | // other statement possibilities ...
         | '{' statementSequence '}'

whileStatement: 'while' '(' expression ')' statement

ifStatement: simpleIf
           | ifElse

simpleIf: 'if' '(' expression ')' statement

ifElse: 'if' '(' expression ')' statement
        'else' statement

statementSequence: '' // null, i.e. the empty sequence

(41)

Parse tree

Input Text:

while (count<=100) { /** demonstration */
   count++;
   // ...

Tokenized:

while
(
count
<=
100
)
{
count
++
;
...

(Figure: parse tree of the input. The root statement expands to a whileStatement, whose children include the expression and a statement; that statement expands to a statementSequence made of a statement followed by another statementSequence.)

program: statement;

statement: whileStatement
         | ifStatement
         | // other statement possibilities ...
         | '{' statementSequence '}'

whileStatement: 'while' '(' expression ')' statement

...

(42)

Recursive descent parsing

Parser implementation: a set of parsing methods, one for each rule:

 parseStatement()

 parseWhileStatement()

 Highly recursive

• LL(1) grammars: the first token determines in which rule we are

• In other grammars you have to

code sample:

while (expression) {
   statement;
   statement;
   while (expression) {
      while (expression)
         statement;
   }
}

code sample

(43)

The Jack grammar

'x': x appears verbatim

x: x is a language construct

x?: x appears 0 or 1 times

x*: x appears 0 or more times

x|y: either x or y appears


(47)

Jack syntax analyzer in action

class Bar {
   method Fraction foo(int y) {
      var int temp; // a variable
      let temp = (xxx+12)*-63;
      ...

Syntax analyzer

• With the grammar, we can write a syntax analyzer program (parser)

• The syntax analyzer takes a source text file and attempts to match it on the language grammar

• If successful, it can generate a parse tree in some structured format,

Output of the syntax analyzer:

<varDec>

<keyword> var </keyword>

<keyword> int </keyword>

<identifier> temp </identifier>

<symbol> ; </symbol>

</varDec>

<statements>

<letStatement>

<keyword> let </keyword>

<identifier> temp </identifier>

<symbol> = </symbol>

<expression>

<term>

<symbol> ( </symbol>

<expression>

<term>

<identifier> xxx </identifier>

</term>

<symbol> + </symbol>

<term>

<int.Const.> 12 </int.Const.>

</term>

(48)

Jack syntax analyzer in action

class Bar {
   method Fraction foo(int y) {
      var int temp; // a variable
      let temp = (xxx+12)*-63;
      ...

Syntax analyzer

• If xxx is non-terminal, output:

<xxx>
   Recursive code for the body of xxx
</xxx>

• If xxx is terminal (keyword, symbol, constant, or identifier), output:

<varDec>

<keyword> var </keyword>

<keyword> int </keyword>

<identifier> temp </identifier>

<symbol> ; </symbol>

</varDec>

<statements>

<letStatement>

<keyword> let </keyword>

<identifier> temp </identifier>

<symbol> = </symbol>

<expression>

<term>

<symbol> ( </symbol>

<expression>

<term>

<identifier> xxx </identifier>

</term>

<symbol> + </symbol>

<term>


(50)

Recursive descent parser (simplified expression)

• EXP → TERM (OP TERM)*

• TERM → integer | variable

• OP → + | - | * | /

(51)

From parsing to code generation

• EXP → TERM (OP TERM)*

• TERM → integer | variable

• OP → + | - | * | /

EXP() {
   TERM();
   while (next() == OP) {   // repeat for every (OP TERM) pair
      OP();
      TERM();
   }
}

(52)

From parsing to code generation

• EXP → TERM (OP TERM)*

• TERM → integer | variable

• OP → + | - | * | /

EXP() {
   TERM();
   while (next() == OP) {
      OP();
      TERM();
   }
}

TERM() {
   switch (next()) {
      case INT: eat(INT); break;
      case VAR: eat(VAR); break;
   }
}

(53)

From parsing to code generation

• EXP → TERM (OP TERM)*

• TERM → integer | variable

• OP → + | - | * | /

EXP() {
   TERM();
   while (next() == OP) {
      OP();
      TERM();
   }
}

OP() {
   switch (next()) {
      case +: eat(ADD); break;
      case -: eat(SUB); break;
      case *: eat(MUL); break;
      case /: eat(DIV); break;
   }
}

TERM() {
   switch (next()) {
      case INT: eat(INT); break;
      case VAR: eat(VAR); break;
   }
}


(55)

From parsing to code generation

• EXP → TERM (OP TERM)*

• TERM → integer | variable

• OP → + | - | * | /

EXP() {
   print('<exp>');
   TERM();
   while (next() == OP) {
      OP();
      TERM();
   }
   print('</exp>');
}

OP() {
   print('<op>');
   switch (next()) {
      case +: eat(ADD); print('<sym> + </sym>'); break;
      case -: eat(SUB); print('<sym> - </sym>'); break;
      case *: eat(MUL); print('<sym> * </sym>'); break;
      case /: eat(DIV); print('<sym> / </sym>'); break;
   }
   print('</op>');
}

TERM() {
   print('<term>');
   switch (next()) {
      // print the value of the current token, then consume it
      case INT: print('<int> ' + next() + ' </int>'); eat(INT); break;
      case VAR: print('<id> ' + next() + ' </id>'); eat(VAR); break;
   }
   print('</term>');
}
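For readers who want to run the idea, here is a small Python sketch of the same parser for the simplified expression grammar. It is an illustration under assumptions: the class name `ExpParser`, the token regular expression and the `parse_to_xml` helper are invented for this example, and the output is collected in a list instead of being printed directly.

```python
import re

OPS = {"+", "-", "*", "/"}

def tokenize(text):
    """Integers, variable names and the four operators (assumed token set)."""
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/]", text)

class ExpParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.out = []                    # XML lines, in output order

    def next(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self):
        tok = self.next()
        self.pos += 1
        return tok

    def exp(self):                       # EXP -> TERM (OP TERM)*
        self.out.append("<exp>")
        self.term()
        while self.next() in OPS:
            self.op()
            self.term()
        self.out.append("</exp>")

    def op(self):                        # OP -> + | - | * | /
        self.out.append("<op>")
        self.out.append(f"<sym> {self.eat()} </sym>")
        self.out.append("</op>")

    def term(self):                      # TERM -> integer | variable
        tok = self.next()
        if tok is None or tok in OPS:
            raise SyntaxError(f"integer or variable expected at token {self.pos}")
        tag = "int" if tok.isdigit() else "id"
        self.out.append("<term>")
        self.out.append(f"<{tag}> {self.eat()} </{tag}>")
        self.out.append("</term>")

def parse_to_xml(text):
    parser = ExpParser(tokenize(text))
    parser.exp()
    if parser.next() is not None:
        raise SyntaxError("unexpected trailing input")
    return parser.out

print("\n".join(parse_to_xml("x + 12 * y")))
```

Each parsing method opens its tag before recursing and closes it afterwards, so the nesting of the XML mirrors the parse tree without the tree ever being stored.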

(56)

Summary and next step

(Figure: Jack Program -> Jack Compiler -> VM code. Inside the Jack Compiler, the Syntax Analyzer (Chapter 10) consists of the Tokenizer followed by the Parser and can emit XML code; Code Generation (Chapter 11) follows the parser and emits the VM code.)

• Syntax analysis: understanding syntax

• Code generation: constructing semantics

The code generation challenge:

• Extend the syntax analyzer into a full-blown compiler that,

(57)

Perspective

• The parse tree can be constructed on the fly

• The Jack language is intentionally simple:

   • Statement prefixes: let, do, ...

   • No operator priority

   • No error checking

   • Basic data types, etc.

• The Jack compiler: designed to illustrate the key ideas that underlie modern compilers, leaving advanced
