Compiler I: Syntax Analysis

Academic year: 2022

(1)

www.nand2tetris.org

Building a Modern Computer From First Principles

Compiler I: Syntax Analysis

(2)

Course map

[Figure: the course's abstraction layers, from top to bottom. Each layer exposes an abstract interface that is implemented by the layer below it.]

Software hierarchy:
• Human Thought → (abstract design, Chapters 9, 12) → H.L. Language & Operating Sys.
• H.L. Language & Operating Sys. → (Compiler, Chapters 10 - 11) → Virtual Machine
• Virtual Machine → (VM Translator, Chapters 7 - 8) → Assembly Language
• Assembly Language → (Assembler, Chapter 6) → Machine Language

Hardware hierarchy:
• Machine Language → (Computer Architecture, Chapters 4 - 5) → Hardware Platform
• Hardware Platform → (Gate Logic, Chapters 1 - 3) → Chips & Logic Gates
• Chips & Logic Gates → (Electrical Engineering) → Physics

(3)

Motivation: Why study compilers?

One of the first compilers was the FORTRAN compiler, developed by an IBM team led by John Backus (Turing Award, 1977) and released in 1957. It took about 18 person-years of work.

Because compilers …

• Are an essential part of applied computer science
• Are very relevant to computational linguistics
• Are implemented using classical programming techniques
• Employ important software engineering principles
• Train you in developing software for transforming one structure to another (programs, files, transactions, …)
• Train you to think in terms of "description languages"
• Parsing files with a complex syntax is very common in many applications.

(4)

The big picture

[Figure: Some language, Some Other language, and the Jack language are each translated by their own compiler (compiler lectures, Projects 10, 11) into intermediate code. The intermediate code runs on a VM implementation over CISC platforms, over RISC platforms, or over the Hack platform, or on a VM emulator (VM lectures, Projects 7-8). Each VM implementation targets its platform's machine language: CISC machine, RISC machine, other digital platforms, or the Hack computer (HW lectures, Projects 1-6). The VM emulator is written in a high-level language and runs on any computer.]

Modern compilers are two-tiered:

• Front end: from a high-level language to some intermediate language
• Back end: from the intermediate language to binary code.

(5)

Compiler architecture (front end)

[Figure: the big-picture diagram from the previous slide, repeated.]

• Syntax analysis: understanding the structure of the source code
    • Tokenizing: creating a stream of "atoms"
    • Parsing: matching the atom stream with the language grammar
• Code generation: reconstructing the semantics using the syntax of the target code.

XML output is one way to demonstrate that the syntax analyzer works.

[Figure: the Jack Compiler pipeline. Jack program (source) → Syntax Analyzer (Chapter 10), made of the Tokenizer (scanner) followed by the Parser → Code Generation (Chapter 11) → VM code (target). The Syntax Analyzer on its own emits XML code.]

(6)

Tokenizing / Lexical analysis / scanning

• Remove white space
• Construct a token list (language atoms)
• Things to worry about:
    • Language-specific rules: e.g. how to treat "++"
    • Language-specific classifications: keyword, symbol, identifier, integerConstant, stringConstant, ...
• While we are at it, we can have the tokenizer record not only the token, but also its lexical classification (as defined by the source language grammar).

(7)

C function to split a string into tokens

char* strtok(char* str, const char* delimiters);

str : string to be broken into tokens

delimiters : string containing the delimiter characters
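A minimal usage sketch of `strtok`, assuming whitespace as the only delimiter. The helper name `split_tokens` and its parameters are illustrative and not from the slides. Two properties of `strtok` matter here: it overwrites delimiters in the input with `'\0'`, so it needs a writable buffer rather than a string literal, and it keeps internal state between calls, so it is not reentrant.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split `text` in place on the characters in `delims`.
   Stores up to `max` token pointers in `out` and returns how many were stored. */
int split_tokens(char *text, const char *delims, char *out[], int max) {
    int n = 0;
    /* The first call passes the buffer; later calls pass NULL to continue
       scanning the same buffer. strtok returns NULL when no tokens remain. */
    for (char *tok = strtok(text, delims);
         tok != NULL && n < max;
         tok = strtok(NULL, delims)) {
        out[n++] = tok;
    }
    return n;
}
```

Calling `split_tokens` on a writable copy of `let city = Paris ;` with the delimiters `" \t\n"` yields the five tokens `let`, `city`, `=`, `Paris`, and `;`. Because `strtok` only splits on delimiters, it cannot separate `x<153` into three tokens, which is one reason a real tokenizer scans character by character instead.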

(8)

Jack Tokenizer

Source code:

if (x < 153) {let city = "Paris";}

Tokenizer's output:

<tokens>
  <keyword> if </keyword>
  <symbol> ( </symbol>
  <identifier> x </identifier>
  <symbol> &lt; </symbol>
  <integerConstant> 153 </integerConstant>
  <symbol> ) </symbol>
  <symbol> { </symbol>
  <keyword> let </keyword>
  <identifier> city </identifier>
  <symbol> = </symbol>
  <stringConstant> Paris </stringConstant>
  <symbol> ; </symbol>
  <symbol> } </symbol>
</tokens>
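The tokenizer's behavior on this slide can be sketched in C roughly as follows. This is a simplified illustration, not the course's reference implementation: the function name `jack_tokens_to_xml` and the helpers are hypothetical, comments in the source are not handled, string constants are assumed to use plain ASCII double quotes, and the output is not indented.

```c
#include <ctype.h>
#include <string.h>

static const char *KEYWORDS[] = {
    "class", "constructor", "function", "method", "field", "static", "var",
    "int", "char", "boolean", "void", "true", "false", "null", "this",
    "let", "do", "if", "else", "while", "return"
};

static int is_keyword(const char *s, size_t len) {
    for (size_t i = 0; i < sizeof KEYWORDS / sizeof KEYWORDS[0]; i++)
        if (strlen(KEYWORDS[i]) == len && strncmp(KEYWORDS[i], s, len) == 0)
            return 1;
    return 0;
}

/* Append s to out without exceeding the buffer capacity cap. */
static void append(char *out, size_t cap, const char *s) {
    size_t used = strlen(out);
    if (used + 1 < cap) strncat(out, s, cap - used - 1);
}

/* Append one "<kind> text </kind>" line, escaping XML markup characters. */
static void emit(char *out, size_t cap, const char *kind,
                 const char *text, size_t len) {
    append(out, cap, "<"); append(out, cap, kind); append(out, cap, "> ");
    for (size_t i = 0; i < len; i++) {
        char one[2] = { text[i], '\0' };
        switch (text[i]) {
            case '<': append(out, cap, "&lt;");   break;
            case '>': append(out, cap, "&gt;");   break;
            case '&': append(out, cap, "&amp;");  break;
            case '"': append(out, cap, "&quot;"); break;
            default:  append(out, cap, one);      break;
        }
    }
    append(out, cap, " </"); append(out, cap, kind); append(out, cap, ">\n");
}

/* Tokenize Jack source into XML. Returns 0 on success, or -1 on an
   unexpected character or an unterminated string constant. */
int jack_tokens_to_xml(const char *src, char *out, size_t cap) {
    const char *p = src;
    if (cap == 0) return -1;
    out[0] = '\0';
    append(out, cap, "<tokens>\n");
    while (*p != '\0') {
        if (isspace((unsigned char)*p)) { p++; continue; }  /* drop white space */
        const char *start = p;
        if (isdigit((unsigned char)*p)) {
            while (isdigit((unsigned char)*p)) p++;
            emit(out, cap, "integerConstant", start, (size_t)(p - start));
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            size_t len = (size_t)(p - start);
            emit(out, cap, is_keyword(start, len) ? "keyword" : "identifier",
                 start, len);
        } else if (*p == '"') {
            start = ++p;                       /* skip the opening quote */
            while (*p != '\0' && *p != '"' && *p != '\n') p++;
            if (*p != '"') return -1;          /* unterminated string */
            emit(out, cap, "stringConstant", start, (size_t)(p - start));
            p++;                               /* skip the closing quote */
        } else if (strchr("{}()[].,;+-*/&|<>=~", *p) != NULL) {
            emit(out, cap, "symbol", p, 1);
            p++;
        } else {
            return -1;
        }
    }
    append(out, cap, "</tokens>\n");
    return 0;
}
```

Each branch both recognizes a token and records its lexical classification, which is the point made on the tokenizing slide. The escaping step is what turns `<` into `&lt;` in the XML output.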

(9)

Parsing

• The tokenizer discussed thus far is part of a larger program called the parser
• Each language is characterized by a grammar. The parser is implemented to recognize this grammar in given texts
• The parsing process:
    • A text is given and tokenized
    • The parser determines whether or not the text can be generated from the grammar
    • In the process, the parser performs a complete structural analysis of the text
• The text can be an expression in a:
    • Natural language (English, …)
    • Programming language (Jack, …).

(10)

Parsing examples

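English: "He ate an apple on the desk."

[Figure: parse tree rooted at "ate", with children "he", "an apple", and "on"; "the desk" hangs under "on".]

Jack: (5+3)*2 - sqrt(9*4)

[Figure: parse tree rooted at "-". Its left child is "*", with children "+" (over 5 and 3) and 2. Its right child is "sqrt", applied to "*" (over 9 and 4).]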

(11)

Regular expressions

• a|b*
  {ε, "a", "b", "bb", "bbb", …}

• (a|b)*
  {ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", …}

• ab*(c|ε)
  {"a", "ac", "ab", "abc", "abb", "abbc", …}
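The three languages above can be checked mechanically. The sketch below assumes a POSIX system that provides `<regex.h>`; the helper name `matches` is illustrative. Patterns are anchored with `^` and `$` so that the whole string must match, and the slide's `(c|ε)` is written as `c?` because an empty alternative is not well defined in POSIX extended regular expressions.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `s` matches the POSIX extended regular
   expression `pattern`, 0 if it does not, and -1 if the pattern is invalid. */
int matches(const char *pattern, const char *s) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, s, 0, NULL, 0);   /* 0 means a match was found */
    regfree(&re);
    return rc == 0;
}
```

For example, `matches("^(a|b*)$", "bbb")` returns 1 while `matches("^(a|b*)$", "ab")` returns 0, since `a|b*` accepts either a single "a" or a run of "b"s but not a mix.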

(12)

Context-free grammar

• S → ()   S → (S)   S → SS

• S → a | aS | bS
  strings ending with 'a'

• S → x   S → y   S → S+S   S → S-S   S → S*S   S → S/S   S → (S)
  e.g. (x+y)*x-x*y/(x+x)

• Simple (terminal) forms / complex (non-terminal) forms
• Grammar = set of rules on how to construct complex forms from simpler forms
• Highly recursive.

(13)

Recursive descent parser

• A = bB | cC

A() {
    if (next() == 'b') {
        eat('b');
        B();
    } else if (next() == 'c') {
        eat('c');
        C();
    }
}

• A = (bB)*

A() {
    while (next() == 'b') {
        eat('b');
        B();
    }
}
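To make the first rule runnable, the sketch below fills in the pieces the slide leaves open. Only the rule for A comes from the slide; the rules chosen for B and C, the error flag `ok`, and the entry point `parse_A` are assumptions made for illustration.

```c
#include <stddef.h>

/* Grammar used in this sketch (B and C are hypothetical):
     A = 'b' B | 'c' C
     B = 'd'
     C = 'e' A                                                    */
static const char *input;   /* remaining, not yet consumed input  */
static int ok;              /* cleared on the first syntax error  */

/* Peek at the next character without consuming it. */
static char next(void) { return *input; }

/* Consume the expected character, or record an error. */
static void eat(char expected) {
    if (ok && *input == expected) input++;
    else ok = 0;
}

static void A(void);
static void B(void) { eat('d'); }
static void C(void) { eat('e'); A(); }

/* One parsing routine per rule; the next character selects the alternative. */
static void A(void) {
    if (!ok) return;
    if (next() == 'b') {
        eat('b');
        B();
    } else if (next() == 'c') {
        eat('c');
        C();
    } else {
        ok = 0;             /* neither alternative applies */
    }
}

/* Returns 1 if the whole text can be derived from A, else 0. */
int parse_A(const char *text) {
    input = text;
    ok = 1;
    A();
    return ok && *input == '\0';
}
```

With these rules, `parse_A("bd")` and `parse_A("cebd")` succeed, while `parse_A("b")` fails because B expects a 'd' that is not there. Unlike the slide's version, this A reports an error when neither alternative matches.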

(14)

A typical grammar of a typical C-like language

Code samples:

while (expression) {
    if (expression)
        statement;
    while (expression) {
        statement;
        if (expression)
            statement;
    }
    while (expression) {
        statement;
        statement;
    }
}

if (expression) {
    statement;
    while (expression)
        statement;
    statement;
}

if (expression)
    if (expression)
        statement;

(15)

A typical grammar of a typical C-like language

program: statement;

statement: whileStatement
         | ifStatement
         | // other statement possibilities ...
         | '{' statementSequence '}'

whileStatement: 'while' '(' expression ')' statement

ifStatement: simpleIf
           | ifElse

simpleIf: 'if' '(' expression ')' statement

ifElse: 'if' '(' expression ')' statement
        'else' statement

statementSequence: '' // null, i.e. the empty sequence
                 | statement ';' statementSequence

expression: // definition of an expression comes here

(16)

Parse tree

Input text:

while (count<=100) { /** demonstration */
    count++;
    // ...

Tokenized:

while ( count <= 100 ) { count ++ ; ...

[Figure: parse tree of the token stream under the grammar from the previous slide. The root statement expands to a whileStatement, whose children are 'while', '(', an expression (count <= 100), ')', and a statement. That statement expands to '{' statementSequence '}', and the statementSequence expands to a statement (count ++), ';', and a further statementSequence.]

(17)

Recursive descent parsing

Parser implementation: a set of parsing methods, one for each rule:

parseStatement()
parseWhileStatement()
parseIfStatement()
parseStatementSequence()

• Highly recursive
• LL(1) grammars: the next token alone determines which rule applies
• In other grammars you have to look ahead more than one token
• Jack is almost LL(1).

Code sample:

while (expression) {
    statement;
    statement;
    while (expression) {
        while (expression)
            statement;
        statement;
    }
}
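The parsing methods listed above can be sketched in C as a recognizer for the statement grammar of the "typical C-like language". Several things here are assumptions for illustration: tokens arrive as a NULL-terminated array of strings, the placeholder tokens `expression` and `statement` stand in for rules the slides leave undefined, and the entry point `parse_program` is a made-up name. The sketch follows the grammar literally, so every statement inside braces, including a nested block, must be followed by ';'.

```c
#include <stddef.h>
#include <string.h>

static const char **toks;   /* NULL-terminated token array */
static int pos;             /* index of the current token  */
static int ok;              /* cleared on the first error  */

/* Is the current token equal to t? */
static int at(const char *t) {
    return toks[pos] != NULL && strcmp(toks[pos], t) == 0;
}

/* Consume the expected token, or record an error. */
static void eat(const char *t) {
    if (ok && at(t)) pos++;
    else ok = 0;
}

static void parseStatement(void);

/* Placeholder: a real parser would descend into the expression rule. */
static void parseExpression(void) { eat("expression"); }

static void parseWhileStatement(void) {
    eat("while"); eat("("); parseExpression(); eat(")");
    parseStatement();
}

/* Covers both simpleIf and ifElse: the else part is optional. */
static void parseIfStatement(void) {
    eat("if"); eat("("); parseExpression(); eat(")");
    parseStatement();
    if (ok && at("else")) {
        eat("else");
        parseStatement();
    }
}

/* statementSequence: '' | statement ';' statementSequence */
static void parseStatementSequence(void) {
    while (ok && toks[pos] != NULL && !at("}")) {
        parseStatement();
        eat(";");
    }
}

/* The current token alone selects the rule to apply. */
static void parseStatement(void) {
    if (!ok) return;
    if (at("while"))   parseWhileStatement();
    else if (at("if")) parseIfStatement();
    else if (at("{"))  { eat("{"); parseStatementSequence(); eat("}"); }
    else               eat("statement");   /* placeholder simple statement */
}

/* Returns 1 if the token array is exactly one statement, else 0. */
int parse_program(const char **tokens) {
    toks = tokens;
    pos = 0;
    ok = 1;
    parseStatement();
    return ok && toks[pos] == NULL;
}
```

Each function mirrors one grammar rule, and the recursion in the grammar (a statement containing statements) becomes recursion between the functions. The recognizer only answers yes or no; a full syntax analyzer would also emit output as it descends.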

(18)

The Jack grammar

Notation:

'x': x appears verbatim
x: x is a language construct
x?: x appears 0 or 1 times
x*: x appears 0 or more times
x|y: either x or y appears
(x,y): x appears, then y.

(22)

Jack syntax analyzer in action

class Bar {
    method Fraction foo(int y) {
        var int temp; // a variable
        let temp = (xxx+12)*-63;
        ...
...

• With the grammar, we can write a syntax analyzer program (parser)
• The syntax analyzer takes a source text file and attempts to match it against the language grammar
• If successful, it can generate a parse tree in some structured format, e.g. XML.

Syntax analyzer output:

<varDec>
  <keyword> var </keyword>
  <keyword> int </keyword>
  <identifier> temp </identifier>
  <symbol> ; </symbol>
</varDec>
<statements>
  <letStatement>
    <keyword> let </keyword>
    <identifier> temp </identifier>
    <symbol> = </symbol>
    <expression>
      <term>
        <symbol> ( </symbol>
        <expression>
          <term>
            <identifier> xxx </identifier>
          </term>
          <symbol> + </symbol>
          <term>
            <integerConstant> 12 </integerConstant>
          </term>
        </expression>
        ...

(23)

Jack syntax analyzer in action

• If xxx is non-terminal, output:

<xxx>
    Recursive code for the body of xxx
</xxx>

• If xxx is terminal (keyword, symbol, constant, or identifier), output:

<xxx> xxx value </xxx>
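The output rule for terminals and non-terminals can be sketched as a recursive tree writer. The `Node` structure and the function `write_xml` below are hypothetical, introduced only for illustration, and the sketch assumes values have already been XML-escaped (for example `<` as `&lt;`).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical parse-tree node: a terminal carries a value, a non-terminal
   carries a NULL-terminated list of children. */
typedef struct Node {
    const char *tag;            /* e.g. "varDec" or "keyword"      */
    const char *value;          /* token text, NULL if non-terminal */
    const struct Node **kids;   /* children, NULL if terminal       */
} Node;

/* Append the XML for node n to out (capacity cap), indented by depth. */
void write_xml(const Node *n, int depth, char *out, size_t cap) {
    size_t used = strlen(out);
    if (n->value != NULL) {
        /* Terminal: <xxx> xxx value </xxx> on a single line. */
        snprintf(out + used, cap - used, "%*s<%s> %s </%s>\n",
                 depth * 2, "", n->tag, n->value, n->tag);
        return;
    }
    /* Non-terminal: opening tag, recursive body, closing tag. */
    snprintf(out + used, cap - used, "%*s<%s>\n", depth * 2, "", n->tag);
    for (const Node **k = n->kids; k != NULL && *k != NULL; k++)
        write_xml(*k, depth + 1, out, cap);
    used = strlen(out);
    snprintf(out + used, cap - used, "%*s</%s>\n", depth * 2, "", n->tag);
}
```

Given a `varDec` node whose children are the terminals `var`, `int`, `temp`, and `;`, and an empty output buffer, this writes the same `<varDec>` element shown in the analyzer output for `var int temp;`. In practice the tree does not need to be built first: each parsing routine can print its opening tag, call the routines for its parts, and print its closing tag.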

(24)

Summary and next step

[Figure: the Jack Compiler pipeline. Jack program → Syntax Analyzer (Chapter 10), made of the Tokenizer followed by the Parser → Code Generation (Chapter 11) → VM code. The Syntax Analyzer on its own emits XML code.]

• Syntax analysis: understanding syntax
• Code generation: constructing semantics

The code generation challenge:

• Extend the syntax analyzer into a full-blown compiler that, instead of passive XML code, generates executable VM code
• Two challenges: (a) handling data, and (b) handling commands.

(25)

Perspective

• The parse tree can be constructed on the fly
• The Jack language is intentionally simple:
    • Statement prefixes: let, do, ...
    • No operator priority
    • No error checking
    • Basic data types, etc.
• The Jack compiler: designed to illustrate the key ideas that underlie modern compilers, leaving advanced features to more advanced courses
• Richer languages require more powerful compilers

(26)

Perspective

• Syntax analyzers can be built using:
    • The Lex tool for tokenizing (flex)
    • The Yacc tool for parsing (bison)
    • Do everything from scratch (our approach ...)
• Industrial-strength compilers (e.g. LLVM):
    • Have good error diagnostics
    • Generate tight and efficient code
    • Support parallel (multi-core) processors.

(27)

Lex (from Wikipedia)

• A computer program that generates lexical analyzers (scanners or lexers)
• Commonly used with the yacc parser generator.
• Structure of a Lex file:

Definition section
%%
Rules section
%%
C code section

(28)

Example of a Lex file

/*** Definition section ***/

%{
/* C code to be copied verbatim */
#include <stdio.h>
%}

/* This tells flex to read only one input file */
%option noyywrap

/*** Rules section ***/
%%

[0-9]+ {
    /* yytext is a string containing the matched text. */
    printf("Saw an integer: %s\n", yytext);
}

.|\n { /* Ignore all other characters. */ }

%%
/*** C Code section ***/

int main(void) {
    /* Call the lexer, then quit. */
    yylex();
    return 0;
}

(30)

Example of a Lex file

test.txt:

abc123z.!&*2gj6

> flex test.lex
(a file lex.yy.c with 1,763 lines is generated)
> gcc lex.yy.c
(an executable file a.out is generated)
> ./a.out < test.txt
Saw an integer: 123
Saw an integer: 2
Saw an integer: 6

(31)

Another Lex example

%{
#include <stdio.h>
int num_lines = 0, num_chars = 0;
%}

%option noyywrap

%%

\n      ++num_lines; ++num_chars;
.       ++num_chars;

%%

int main(void) {
    yylex();
    printf( "# of lines = %d, # of chars = %d\n", num_lines, num_chars );
    return 0;
}

(32)

A more complex Lex example

%{
/* need these for printf() and for the calls to atoi() and atof() below */
#include <stdio.h>
#include <stdlib.h>
%}

%option noyywrap

DIGIT [0-9]
ID [a-z][a-z0-9]*

%%

{DIGIT}+ {
    printf( "An integer: %s (%d)\n", yytext, atoi( yytext ) );
}

{DIGIT}+"."{DIGIT}* {
    printf( "A float: %s (%g)\n", yytext, atof( yytext ) );
}

if|then|begin|end|procedure|function {
    printf( "A keyword: %s\n", yytext );
}

{ID} printf( "An identifier: %s\n", yytext );

"+"|"-"|"="|"("|")" printf( "Symbol: %s\n", yytext );

[ \t\n]+ /* eat up whitespace */

. printf( "Unrecognized char: %s\n", yytext );

%%

int main( int argc, char **argv ) {
    if ( argc > 1 ) yyin = fopen( argv[1], "r" );
    else yyin = stdin;
    yylex();
    return 0;
}

(34)

A more complex Lex example

pascal.txt:

if (a+b) then foo=3.1416 else
foo=12

Output:

A keyword: if
Symbol: (
An identifier: a
Symbol: +
An identifier: b
Symbol: )
A keyword: then
An identifier: foo
Symbol: =
A float: 3.1416 (3.1416)
An identifier: else
An identifier: foo
Symbol: =
An integer: 12 (12)
