Compiler I: Syntax Analysis

Academic year: 2022

(1)

www.nand2tetris.org

Building a Modern Computer From First Principles

Compiler I: Syntax Analysis

(2)

Course map

[Figure: the layers of abstraction covered in the course; each layer exposes an abstract interface to the layer above it.

Software hierarchy: Human Thought (abstract design, Chapters 9, 12) > High-Level Language & Operating System > Compiler (Chapters 10 - 11) > Virtual Machine > VM Translator (Chapters 7 - 8) > Assembly Language > Assembler (Chapter 6) > Machine Language.

Hardware hierarchy: Machine Language > Computer Architecture (Chapters 4 - 5) > Hardware Platform > Gate Logic (Chapters 1 - 3) > Chips & Logic Gates > Electrical Engineering > Physics.]

(3)

Motivation: Why study about compilers?

The first widely used compiler was the FORTRAN compiler, developed by an IBM team led by John Backus (Turing Award, 1977) and released in 1957. It took about 18 person-years to build.

Because compilers …

• Are an essential part of applied computer science

• Are very relevant to computational linguistics

• Are implemented using classical programming techniques

• Employ important software engineering principles

• Train you in developing software for transforming one structure to another (programs, files, transactions, …)

• Train you to think in terms of "description languages"

• Teach parsing: handling files with a complex syntax is very common in many applications.

(4)

The big picture

[Figure: programs written in a high-level language (Jack, or some other language) are translated by the corresponding compiler (the Jack compiler, or some other compiler) into intermediate code. The intermediate code runs on a VM implementation over CISC platforms, over RISC platforms, or over the Hack platform, or on the VM emulator. Each VM implementation targets its platform's machine language: CISC machine language, RISC machine language, Hack machine language, or that of any other digital platform equipped with its own VM implementation. Hardware lectures: Projects 1-6. VM lectures: Projects 7-8. Compiler lectures: Projects 10, 11.]

Modern compilers are two-tiered:

• Front-end: from the high-level language to some intermediate language

• Back-end: from the intermediate language to binary code.

(5)

Compiler architecture (front end)

[Figure: the same big-picture diagram, here focusing on the compilers that translate each high-level language into intermediate code.]

• Syntax analysis: understanding the structure of the source code

    • Tokenizing: creating a stream of "atoms"

    • Parsing: matching the atom stream with the language grammar

    XML output = one way to demonstrate that the syntax analyzer works

• Code generation: reconstructing the semantics using the syntax of the target code.

[Figure: the Jack Compiler. Jack program (source) > Syntax Analyzer (Chapter 10), made of a Tokenizer (scanner) followed by a Parser, which produces XML code > Code Generation (Chapter 11), which produces VM code (target).]

(6)

Tokenizing / Lexical analysis / scanning

• Remove white space

• Construct a token list (language atoms)

• Things to worry about:

    • Language-specific rules: e.g. how to treat "++"

    • Language-specific classifications: keyword, symbol, identifier, integerConstant, stringConstant, ...

• While we are at it, we can have the tokenizer record not only the token, but also its lexical classification (as defined by the source language grammar).
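The steps listed on this slide can be sketched as a small hand-written scanner. This is only an illustration under stated assumptions: the names `Token`, `TokenType`, and `next_token`, the short keyword list, and the 63-character token limit are all hypothetical and are not part of the course's tokenizer API. Comments in the source text are not handled.

```c
#include <ctype.h>
#include <string.h>

/* Lexical classes, following the classification named on the slide. */
typedef enum { TOK_KEYWORD, TOK_SYMBOL, TOK_IDENTIFIER,
               TOK_INT_CONST, TOK_STRING_CONST, TOK_END } TokenType;

typedef struct {
    TokenType type;
    char text[64];          /* assumed limit: longer tokens get split */
} Token;

/* A small subset of keywords, enough for a short demonstration. */
static const char *KEYWORDS[] = { "if", "let", "while", "else", "do", "return" };

static int is_keyword(const char *s)
{
    for (size_t i = 0; i < sizeof KEYWORDS / sizeof KEYWORDS[0]; i++)
        if (strcmp(s, KEYWORDS[i]) == 0) return 1;
    return 0;
}

/* Read one token starting at *src, advance *src past it, and return it. */
Token next_token(const char **src)
{
    Token t = { TOK_END, "" };
    const char *p = *src;
    size_t n = 0;
    const size_t max = sizeof t.text - 1;

    while (isspace((unsigned char)*p)) p++;            /* remove white space */

    if (*p == '\0') {
        /* end of input: t.type stays TOK_END */
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p) && n < max) t.text[n++] = *p++;
        t.type = TOK_INT_CONST;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while ((isalnum((unsigned char)*p) || *p == '_') && n < max)
            t.text[n++] = *p++;
        t.text[n] = '\0';
        t.type = is_keyword(t.text) ? TOK_KEYWORD : TOK_IDENTIFIER;
    } else if (*p == '"') {
        p++;                                           /* skip opening quote */
        while (*p != '\0' && *p != '"' && n < max) t.text[n++] = *p++;
        if (*p == '"') p++;                            /* skip closing quote */
        t.type = TOK_STRING_CONST;
    } else {
        t.text[n++] = *p++;                            /* one-character symbol */
        t.type = TOK_SYMBOL;
    }
    t.text[n] = '\0';
    *src = p;
    return t;
}
```

Calling `next_token` repeatedly until it returns `TOK_END` yields the token list together with each token's classification. Treating every symbol as a single character is a design choice that suits Jack; a language with operators such as "++" would need extra rules at that point.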

(7)

C function to split a string into tokens

char* strtok(char* str, const char* delimiters);

str : string to be broken into tokens

delimiters : string containing the delimiter characters
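A minimal sketch of how `strtok` is typically called: the first call receives the buffer, and every later call receives `NULL` to continue in the same buffer. The helper name `split_tokens` and its parameters are made up for this illustration.

```c
#include <string.h>

/* Split `text` in place on the characters in `delims`, storing up to
 * `max_tokens` pointers in `out`. Returns the number of tokens stored.
 * strtok() overwrites delimiters with '\0', so `text` must be a writable
 * buffer, not a string literal. */
int split_tokens(char *text, const char *delims, char *out[], int max_tokens)
{
    int count = 0;

    /* First call passes the buffer; later calls pass NULL to continue. */
    for (char *tok = strtok(text, delims);
         tok != NULL && count < max_tokens;
         tok = strtok(NULL, delims)) {
        out[count++] = tok;
    }
    return count;
}
```

Note that `strtok` only splits on delimiter characters. Source text such as `if (x<153)` contains tokens that are not separated by white space, so a tokenizer for a programming language generally needs more than `strtok` alone.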

(8)

Jack Tokenizer

Source code:

if (x < 153) {let city = "Paris";}

Tokenizer's output:

<tokens>
  <keyword> if </keyword>
  <symbol> ( </symbol>
  <identifier> x </identifier>
  <symbol> &lt; </symbol>
  <integerConstant> 153 </integerConstant>
  <symbol> ) </symbol>
  <symbol> { </symbol>
  <keyword> let </keyword>
  <identifier> city </identifier>
  <symbol> = </symbol>
  <stringConstant> Paris </stringConstant>
  <symbol> ; </symbol>
  <symbol> } </symbol>
</tokens>
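In the output above, the symbol `<` is written as `&lt;` because a raw `<` would be read as XML markup. A hedged sketch of how one token line could be formatted follows; the function name `format_terminal`, its signature, and the 256-byte scratch buffer are assumptions made for this example only.

```c
#include <stdio.h>
#include <string.h>

/* Write "<tag> value </tag>" into `out`, escaping the characters that
 * are markup in XML. Returns the number of characters written, or -1
 * if the result does not fit. */
int format_terminal(char *out, size_t cap, const char *tag, const char *value)
{
    char escaped[256];      /* assumed upper bound for one token */
    size_t n = 0;

    for (const char *p = value; *p != '\0'; p++) {
        char single[2] = { *p, '\0' };
        const char *rep;

        switch (*p) {
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        case '&': rep = "&amp;"; break;
        default:  rep = single;  break;   /* ordinary character */
        }

        size_t len = strlen(rep);
        if (n + len >= sizeof escaped) return -1;   /* value too long */
        memcpy(escaped + n, rep, len);
        n += len;
    }
    escaped[n] = '\0';

    int written = snprintf(out, cap, "<%s> %s </%s>", tag, escaped, tag);
    return (written < 0 || (size_t)written >= cap) ? -1 : written;
}
```

For example, calling it with the tag `symbol` and the value `<` fills the buffer with `<symbol> &lt; </symbol>`, matching the fourth line of the output above.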

(9)

Parsing

• The tokenizer discussed thus far is part of a larger program called a parser

• Each language is characterized by a grammar. The parser is implemented to recognize this grammar in given texts

• The parsing process:

    • A text is given and tokenized

    • The parser determines whether or not the text can be generated from the grammar

    • In the process, the parser performs a complete structural analysis of the text

• The text can be an expression in a:

    • Natural language (English, …)

    • Programming language (Jack, …).

(10)

Parsing examples

English: He ate an apple on the desk.

Parse tree (children indented under their parent):

ate
    he
    an apple
    on
        the desk

Jack: (5+3)*2 - sqrt(9*4)

Parse tree (children indented under their parent):

-
    *
        +
            5
            3
        2
    sqrt
        *
            9
            4

(11)

Regular expressions

• a|b*

  {ε, "a", "b", "bb", "bbb", …}

• (a|b)*

  {ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", …}

• ab*(c|ε)

  {"a", "ac", "ab", "abc", "abb", "abbc", …}
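A regular expression can be checked with a single left-to-right scan and no recursion. As a small sketch, the hand-written function below (its name is hypothetical) accepts exactly the strings of the last example: one 'a', then any number of 'b', then an optional 'c'.

```c
#include <stdbool.h>

/* Returns true if `s` belongs to the language of ab*(c|ε). */
bool matches_ab_star_opt_c(const char *s)
{
    if (*s != 'a') return false;   /* a    : exactly one 'a'       */
    s++;
    while (*s == 'b') s++;         /* b*   : zero or more 'b'      */
    if (*s == 'c') s++;            /* c|ε  : an optional 'c'       */
    return *s == '\0';             /* nothing else may follow      */
}
```

Each part of the expression maps to one statement: concatenation becomes sequencing, `*` becomes a loop, and the alternative with ε becomes an optional step.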

(12)

Context-free grammar

• S → ()    S → (S)    S → SS

  non-empty strings of balanced parentheses

• S → a | aS | bS

  strings ending with 'a'

• S → x    S → y    S → S+S    S → S-S    S → S*S    S → S/S    S → (S)

  e.g. (x+y)*x-x*y/(x+x)

• Simple (terminal) forms / complex (non-terminal) forms

• Grammar = set of rules on how to construct complex forms from simpler forms

• Highly recursive.
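To illustrate why grammars are described as recursive, here is a sketch of a recognizer for the parentheses grammar S → () | (S) | SS. The rule S → SS cannot be coded directly as a function that first calls itself, so this sketch assumes the equivalent form S → '(' S? ')' S?, where `?` marks an optional part. The function names are made up for the example.

```c
#include <stdbool.h>

/* Recognize one S starting at **p, using the rewritten rule
 *     S -> '(' S? ')' S?
 * On success, *p is advanced past the recognized text. */
static bool parse_S(const char **p)
{
    if (**p != '(') return false;
    (*p)++;                                        /* '('               */
    if (**p == '(' && !parse_S(p)) return false;   /* optional inner S  */
    if (**p != ')') return false;
    (*p)++;                                        /* ')'               */
    if (**p == '(') return parse_S(p);             /* optional next S   */
    return true;
}

/* True if the whole string can be generated from S. */
bool is_balanced(const char *s)
{
    return parse_S(&s) && *s == '\0';
}
```

The function calls itself wherever the rule mentions S, so the nesting depth of the input is mirrored by the depth of the call stack. Unlike the regular expressions above, this language cannot be recognized by a plain loop without some form of counting or recursion.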

(13)

Recursive descent parser

• A → bB | cC

A() {
    if (next() == 'b') {
        eat('b');
        B();
    } else if (next() == 'c') {
        eat('c');
        C();
    } else {
        error();   // neither alternative of A applies
    }
}

• A → (bB)*

A() {
    while (next() == 'b') {
        eat('b');
        B();
    }
}

(14)

A typical grammar of a typical C-like language

Code samples:

while (expression) {
    if (expression)
        statement;
    while (expression) {
        statement;
        if (expression)
            statement;
    }
    while (expression) {
        statement;
        statement;
    }
}

if (expression) {
    statement;
    while (expression)
        statement;
    statement;
}

if (expression)
    if (expression)
        statement;

(15)

A typical grammar of a typical C-like language

program: statement;

statement: whileStatement
         | ifStatement
         | // other statement possibilities ...
         | '{' statementSequence '}'

whileStatement: 'while' '(' expression ')' statement

ifStatement: simpleIf
           | ifElse

simpleIf: 'if' '(' expression ')' statement

ifElse: 'if' '(' expression ')' statement
        'else' statement

statementSequence: '' // null, i.e. the empty sequence
                 | statement ';' statementSequence

expression: // definition of an expression comes here

(16)

Parse tree

Input text:

while (count<=100) { /** demonstration */
    count++;
    // ...

Tokenized:

while ( count <= 100 ) { count ++ ; ...

[Figure: parse tree of the tokenized input. The root statement derives a whileStatement, whose children are 'while', '(', an expression (count <= 100), ')' and a statement. That statement derives '{' statementSequence '}', and the statementSequence derives a statement (count ++), ';' and a further statementSequence.]

Grammar used:

program: statement;

statement: whileStatement
         | ifStatement
         | // other statement possibilities ...
         | '{' statementSequence '}'

whileStatement: 'while' '(' expression ')' statement

...

(17)

Recursive descent parsing

Parser implementation: a set of parsing methods, one for each rule:

    parseStatement()
    parseWhileStatement()
    parseIfStatement()
    parseStatementSequence()

• Highly recursive

• LL(1) grammars: the first token determines which rule we are in

• In other grammars you have to look ahead more than one token

• Jack is almost LL(1).

Code sample:

while (expression) {
    statement;
    statement;
    while (expression) {
        while (expression)
            statement;
        statement;
    }
}
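The "one parsing method per rule" idea can be sketched in C as follows. This is a simplified illustration, not the course's parser: it assumes the input has already been tokenized into an array of strings, it treats the words `expression` and `statement` as single placeholder tokens (as the code samples do), and it reads a statement sequence as zero or more statements up to the closing brace. The `Parser` struct and all function names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

typedef struct {
    const char **tokens;   /* token texts, already tokenized   */
    int count;             /* number of tokens                 */
    int pos;               /* index of the current token       */
} Parser;

static bool parse_statement(Parser *ps);

/* Current token, or "" once the input is exhausted. */
static const char *peek(const Parser *ps)
{
    return ps->pos < ps->count ? ps->tokens[ps->pos] : "";
}

static bool at(const Parser *ps, const char *text)
{
    return strcmp(peek(ps), text) == 0;
}

/* Consume the current token if it equals `text`. */
static bool eat(Parser *ps, const char *text)
{
    if (!at(ps, text)) return false;
    ps->pos++;
    return true;
}

/* whileStatement: 'while' '(' expression ')' statement */
static bool parse_while_statement(Parser *ps)
{
    return eat(ps, "while") && eat(ps, "(") && eat(ps, "expression")
        && eat(ps, ")") && parse_statement(ps);
}

/* ifStatement: 'if' '(' expression ')' statement ('else' statement)? */
static bool parse_if_statement(Parser *ps)
{
    if (!(eat(ps, "if") && eat(ps, "(") && eat(ps, "expression")
          && eat(ps, ")") && parse_statement(ps)))
        return false;
    if (eat(ps, "else")) return parse_statement(ps);
    return true;
}

/* statementSequence: statement*  (stops at the closing brace) */
static bool parse_statement_sequence(Parser *ps)
{
    while (ps->pos < ps->count && !at(ps, "}"))
        if (!parse_statement(ps)) return false;
    return true;
}

/* The first token alone selects the rule: one token of lookahead. */
static bool parse_statement(Parser *ps)
{
    if (at(ps, "while")) return parse_while_statement(ps);
    if (at(ps, "if"))    return parse_if_statement(ps);
    if (eat(ps, "{"))    return parse_statement_sequence(ps) && eat(ps, "}");
    return eat(ps, "statement") && eat(ps, ";");   /* placeholder statement */
}

/* program: statement; succeeds only if every token is consumed. */
bool parse_program(const char **tokens, int count)
{
    Parser ps = { tokens, count, 0 };
    return parse_statement(&ps) && ps.pos == count;
}
```

`parse_statement` never backtracks: it inspects one token and commits to a rule, which is the property the bullets above describe. This sketch only answers yes or no; a syntax analyzer would additionally emit output (for example XML) at the points where tokens are consumed.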

(18)

The Jack grammar

Notation:

'x': x appears verbatim
x: x is a language construct
x?: x appears 0 or 1 times
x*: x appears 0 or more times
x|y: either x or y appears
(x,y): x appears, then y.


(22)

Jack syntax analyzer in action

Source code:

class Bar {
    method Fraction foo(int y) {
        var int temp; // a variable
        let temp = (xxx+12)*-63;
        ...
    ...

• With the grammar, we can write a syntax analyzer program (parser)

• The syntax analyzer takes a source text file and attempts to match it against the language grammar

• If successful, it can generate a parse tree in some structured format, e.g. XML.

Syntax analyzer output:

<varDec>
  <keyword> var </keyword>
  <keyword> int </keyword>
  <identifier> temp </identifier>
  <symbol> ; </symbol>
</varDec>
<statements>
  <letStatement>
    <keyword> let </keyword>
    <identifier> temp </identifier>
    <symbol> = </symbol>
    <expression>
      <term>
        <symbol> ( </symbol>
        <expression>
          <term>
            <identifier> xxx </identifier>
          </term>
          <symbol> + </symbol>
          <term>
            <integerConstant> 12 </integerConstant>
          </term>
        </expression>
        ...

(23)

Jack syntax analyzer in action

The XML output for the source code shown on the previous slide is produced by two rules:

• If xxx is non-terminal, output:

    <xxx>
        Recursive code for the body of xxx
    </xxx>

• If xxx is terminal (keyword, symbol, constant, or identifier), output:

    <xxx>
        xxx value
    </xxx>

(24)

Summary and next step

[Figure: the Jack Compiler. Jack program > Syntax Analyzer (Chapter 10), made of a Tokenizer followed by a Parser, which produces XML code > Code Generation (Chapter 11), which produces VM code.]

• Syntax analysis: understanding syntax

• Code generation: constructing semantics

The code generation challenge:

• Extend the syntax analyzer into a full-blown compiler that, instead of passive XML code, generates executable VM code

• Two challenges: (a) handling data, and (b) handling commands.

(25)

Perspective

• The parse tree can be constructed on the fly

• The Jack language is intentionally simple:

    • Statement prefixes: let, do, ...

    • No operator priority

    • No error checking

    • Basic data types, etc.

• The Jack compiler: designed to illustrate the key ideas that underlie modern compilers, leaving advanced features to more advanced courses

• Richer languages require more powerful compilers

(26)

Perspective

• Syntax analyzers can be built using:

    • Lex tool for tokenizing (flex)

    • Yacc tool for parsing (bison)

    • Do everything from scratch (our approach ...)

• Industrial-strength compilers (e.g. LLVM):

    • Have good error diagnostics

    • Generate tight and efficient code

    • Support parallel (multi-core) processors.

(27)

Lex (from Wikipedia)

• A computer program that generates lexical analyzers (scanners or lexers)

• Commonly used with the yacc parser generator.

• Structure of a Lex file:

    Definition section
    %%
    Rules section
    %%
    C code section

(28)

Example of a Lex file

/*** Definition section ***/

%{

/* C code to be copied verbatim */

#include <stdio.h>

%}

/* This tells flex to read only one input file */

%option noyywrap

/*** Rules section ***/

%%

[0-9]+ {

/* yytext is a string containing the matched text. */

printf("Saw an integer: %s\n", yytext);

}

.|\n { /* Ignore all other characters. */ }

(29)

Example of a Lex file

%%

/*** C Code section ***/

int main(void) {

/* Call the lexer, then quit. */

yylex();

return 0;

}

(30)

Example of a Lex file

test.txt:

abc123z.!&*2gj6

> flex test.lex
(a file lex.yy.c with 1,763 lines is generated)

> gcc lex.yy.c
(an executable file a.out is generated)

> ./a.out < test.txt
Saw an integer: 123
Saw an integer: 2
Saw an integer: 6

(31)

Another Lex example

%{

int num_lines = 0, num_chars = 0;

%}

%option noyywrap

%%

\n ++num_lines; ++num_chars;

. ++num_chars;

%%

int main(void) {
    yylex();
    printf( "# of lines = %d, # of chars = %d\n", num_lines, num_chars );
    return 0;
}

(32)

A more complex Lex example

%{

/* need this for the calls to atoi() and atof() below */

#include <stdlib.h>

%}

%option noyywrap

DIGIT [0-9]

ID [a-z][a-z0-9]*

%%

{DIGIT}+ {

printf( "An integer: %s (%d)\n", yytext, atoi( yytext ) );

}

{DIGIT}+"."{DIGIT}* {

printf( "A float: %s (%g)\n", yytext, atof( yytext ) );

}

(33)

A more complex Lex example

if|then|begin|end|procedure|function {

printf( "A keyword: %s\n", yytext );

}

{ID} printf( "An identifier: %s\n", yytext );

"+"|"-"|"="|"("|")" printf( "Symbol: %s\n", yytext );

[ \t\n]+ /* eat up whitespace */

. printf("Unrecognized char: %s\n", yytext );

%%

int main( int argc, char **argv ) {
    if ( argc > 1 ) yyin = fopen( argv[1], "r" );
    else yyin = stdin;
    yylex();
    return 0;
}

(34)

A more complex Lex example

pascal.txt:

if (a+b) then foo=3.1416 else
foo=12

Output:

A keyword: if
Symbol: (
An identifier: a
Symbol: +
An identifier: b
Symbol: )
A keyword: then
An identifier: foo
Symbol: =
A float: 3.1416 (3.1416)
An identifier: else
An identifier: foo
Symbol: =
An integer: 12 (12)

Note that else is reported as an identifier, because the keyword rule in this example does not list it.
