Compiler I: Syntax Analysis

Academic year: 2022

www.nand2tetris.org

Building a Modern Computer From First Principles

Compiler I: Syntax Analysis

(2)

Course map

[Figure: course map. Each layer exposes an abstract interface to the layer above it.]

Software hierarchy: Human Thought → Abstract design (Chapters 9, 12) → H.L. Language & Operating Sys. → Compiler (Chapters 10 - 11) → Virtual Machine → VM Translator (Chapters 7 - 8) → Assembly Language

Hardware hierarchy: Assembly Language → Assembler (Chapter 6) → Machine Language → Computer Architecture (Chapters 4 - 5) → Hardware Platform → Gate Logic (Chapters 1 - 3) → Chips & Logic Gates → Electrical Engineering → Physics

(3)

Motivation: Why study about compilers?

Because compilers …

• Are an essential part of applied computer science

• Are very relevant to computational linguistics

• Are implemented using classical programming techniques

• Employ important software engineering principles

• Train you in developing software for transforming one structure to another (programs, files, transactions, …)

• Train you to think in terms of "description languages"

• Parsing files of some complex syntax is very common in many applications.

(4)

The big picture

[Figure: the big picture. Several source languages (Some language, Some Other language, the Jack language, ...) are each translated by their own compiler (Some compiler, Some Other compiler, the Jack compiler, ...) into a common intermediate code; this tier is covered by the compiler lectures (Projects 10, 11). The intermediate code runs on a VM emulator written in a high-level language, or on VM implementations over CISC platforms, RISC platforms, and the Hack platform, which target CISC, RISC, and Hack machine language respectively; this tier is covered by the VM lectures (Projects 7-8). The hardware is covered by the HW lectures (Projects 1-6).]

Modern compilers are two-tiered:

• Front-end: from high-level language to some intermediate language

• Back-end: from the intermediate language to binary code.

(5)

Compiler architecture (front end)

[Figure: the big-picture diagram again: source languages → compilers → intermediate code → VM implementations → machine languages.]

 Syntax analysis: understanding the structure of the source code

 Tokenizing: creating a stream of “atoms”

 Parsing: matching the atom stream with the language grammar

XML output = one way to demonstrate that the syntax analyzer works

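[Figure: structure of the Jack compiler. A Jack program (source) enters the syntax analyzer (Chapter 10), which consists of a tokenizer (also called a scanner) followed by a parser and can emit XML code. The parser feeds code generation (Chapter 11), which emits VM code (target).]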

(6)

Tokenizing / Lexical analysis / scanning

 Remove white space

 Construct a token list (language atoms)

 Things to worry about:

 Language specific rules:

e.g. how to treat “++”

 Language-specific classifications:

keyword, symbol, identifier, integerConstant, stringConstant, ...

• While we are at it, we can have the tokenizer record not only the token, but also its lexical classification (as defined by the source language grammar).

(7)

C function to split a string into tokens

 char* strtok ( char* str, const char* delimiters );

 str: string to be broken into tokens

 delimiters: string containing the delimiter characters
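The delimiter-based splitting that strtok performs can be sketched in Python as follows. This is only an analogue written for illustration, not the C library function: split_tokens is a hypothetical name, and the sketch copies just two behaviours of strtok, namely that runs of delimiters count as one separator and that empty tokens are never returned.

```python
def split_tokens(text, delimiters):
    """Split text on any character in delimiters, skipping empty tokens."""
    tokens = []
    current = []
    for ch in text:
        if ch in delimiters:
            # A delimiter ends the current token (if there is one).
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(split_tokens("let  city = Paris;", " ;"))
print(split_tokens("if (x<153)", " "))
```

The second call shows why pure delimiter splitting is not enough for a programming language: "(x<153)" stays one chunk, although it contains five separate language atoms.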

(8)

Jack Tokenizer

if (x < 153) {let city = "Paris";}

Source code

<tokens>

<keyword> if </keyword>

<symbol> ( </symbol>

<identifier> x </identifier>

<symbol> &lt; </symbol>

<integerConstant> 153 </integerConstant>

<symbol> ) </symbol>

<symbol> { </symbol>

<keyword> let </keyword>

<identifier> city </identifier>

<symbol> = </symbol>

<stringConstant> Paris </stringConstant>

<symbol> ; </symbol>

<symbol> } </symbol>

</tokens>

Tokenizer’s output

Tokenizer
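A minimal sketch of such a tokenizer in Python, assuming the standard Jack keyword and symbol sets. The names tokenize and tokens_to_xml are made up for this example, and comments in the source are not handled.

```python
import re
from xml.sax.saxutils import escape

KEYWORDS = {
    "class", "constructor", "function", "method", "field", "static", "var",
    "int", "char", "boolean", "void", "true", "false", "null", "this",
    "let", "do", "if", "else", "while", "return",
}

# One named group per lexical class; "skip" swallows white space.
TOKEN_RE = re.compile(r"""
    (?P<integerConstant>\d+)
  | (?P<stringConstant>"[^"\n]*")
  | (?P<identifier>[A-Za-z_]\w*)
  | (?P<symbol>[{}()\[\].,;+\-*/&|<>=~])
  | (?P<skip>\s+)
""", re.VERBOSE)


def tokenize(source):
    """Return a list of (classification, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "skip":
            continue
        if kind == "identifier" and text in KEYWORDS:
            kind = "keyword"
        if kind == "stringConstant":
            text = text[1:-1]  # drop the surrounding double quotes
        tokens.append((kind, text))
    return tokens


def tokens_to_xml(tokens):
    """Render the token list in the XML layout shown on the slide."""
    lines = ["<tokens>"]
    lines += [f"<{kind}> {escape(text)} </{kind}>" for kind, text in tokens]
    lines.append("</tokens>")
    return "\n".join(lines)


print(tokens_to_xml(tokenize('if (x < 153) {let city = "Paris";}')))
```

escape turns < into &lt; so that the output stays well-formed XML, which is why the slide shows &lt; for the less-than symbol.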

(9)

Parsing

• The tokenizer discussed thus far is part of a larger program called a parser

• Each language is characterized by a grammar. The parser is implemented to recognize this grammar in given texts

• The parsing process:

    • A text is given and tokenized

    • The parser determines whether or not the text can be generated from the grammar

    • In the process, the parser performs a complete structural analysis of the text

• The text can be an expression in a:

    • Natural language (English, …)

    • Programming language (Jack, …).

(10)

Parsing examples

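English: He ate an apple on the desk.

[Parse tree: root "ate", with children "he", "an apple", and "on"; "on" has the child "the desk".]

Jack: (5+3)*2 - sqrt(9*4)

[Parse tree: root "-"; its left child is "*", with children "+" (over 5 and 3) and 2; its right child is "sqrt", applied to "*" (over 9 and 4).]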
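One way to see what the expression's parse tree buys us is to write it down as data and walk it. The sketch below assumes a nested-tuple representation (operator first, operands after), chosen only for this example; evaluate is a hypothetical helper.

```python
import math

# (5+3)*2 - sqrt(9*4), written as (operator, operand, ...) tuples.
TREE = ("-",
        ("*", ("+", 5, 3), 2),
        ("sqrt", ("*", 9, 4)))


def evaluate(node):
    """Evaluate a tree bottom-up: leaves are numbers, inner nodes are tuples."""
    if not isinstance(node, tuple):
        return node
    op, *operands = node
    values = [evaluate(operand) for operand in operands]
    if op == "sqrt":
        return math.sqrt(values[0])
    if op == "+":
        return values[0] + values[1]
    if op == "-":
        return values[0] - values[1]
    if op == "*":
        return values[0] * values[1]
    raise ValueError(f"unknown operator {op!r}")


print(evaluate(TREE))  # 16 - 6.0 = 10.0
```

The tree makes the grouping explicit, so no parentheses or precedence rules are needed once parsing is done.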

(11)

Regular expressions

• a|b*
    {ε, "a", "b", "bb", "bbb", …}

• (a|b)*
    {ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", …}

• ab*(c|ε)
    {"a", "ac", "ab", "abc", "abb", "abbc", …}
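The three languages can be checked with Python's re module. In this sketch, language is a hypothetical helper that lists every string up to a given length that the whole pattern matches; the empty string ε is written as an empty alternative, (c|).

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """All strings over alphabet, up to max_len long, matched in full by pattern."""
    found = []
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            candidate = "".join(letters)
            if re.fullmatch(pattern, candidate):
                found.append(candidate)
    return found


print(language(r"a|b*", "ab", 3))      # ['', 'a', 'b', 'bb', 'bbb']
print(language(r"(a|b)*", "ab", 2))    # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print(language(r"ab*(c|)", "abc", 3))  # ['a', 'ab', 'ac', 'abb', 'abc']
```

Note that a|b* means "a, or any number of b's", not "(a or b) repeated": the star binds tighter than the bar.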

(12)

Context-free grammar

• S → () | (S) | SS

• S → a | aS | bS
    strings ending with 'a'

• S → x | y | S+S | S-S | S*S | S/S | (S)
    e.g. (x+y)*x-x*y/(x+x)

• Simple (terminal) forms / complex (non-terminal) forms

• Grammar = set of rules on how to construct complex forms from simpler forms

• Highly recursive.
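The recursive nature of grammar rules carries over directly to code. The two recognizers below are a sketch written for this example (the function names are made up): each one mirrors the rules S → a | aS | bS and S → () | (S) | SS alternative by alternative. They are deliberately naive, and the second one takes exponential time on long inputs.

```python
def ends_with_a(s):
    """Recognize S -> a | aS | bS."""
    if s == "a":                                   # S -> a
        return True
    # S -> aS | bS: one leading a or b, then a shorter S
    return len(s) > 1 and s[0] in "ab" and ends_with_a(s[1:])


def balanced(s):
    """Recognize S -> () | (S) | SS."""
    if s == "()":                                  # S -> ()
        return True
    if len(s) >= 4 and s[0] == "(" and s[-1] == ")" and balanced(s[1:-1]):
        return True                                # S -> (S)
    # S -> SS: try every split into two non-empty, even-length parts
    return any(balanced(s[:i]) and balanced(s[i:])
               for i in range(2, len(s) - 1, 2))


print(ends_with_a("abba"), ends_with_a("ab"))
print(balanced("(()())"), balanced("())("))
```

The second grammar cannot be expressed as a regular expression, because matching nested parentheses needs unbounded counting; this is why parsing uses context-free grammars.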

(13)

Recursive descent parser

• A = bB | cC

A() {
    if (next() == 'b') {
        eat('b');
        B();
    } else if (next() == 'c') {
        eat('c');
        C();
    }
}

• A = (bB)*

A() {
    while (next() == 'b') {
        eat('b');
        B();
    }
}
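The two routines above can be made runnable with a small amount of scaffolding. This is a sketch under stated assumptions: the slide does not define B and C, so here B = 'x' and C = 'y' are invented purely so that the example has something to parse, and Input, A_star and accepts are hypothetical names. Unlike the slide's version, A raises an error when neither alternative applies.

```python
class Input:
    """Character stream with the next()/eat() interface used on the slide."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def eat(self, ch):
        if self.next() != ch:
            raise SyntaxError(f"expected {ch!r}, found {self.next()!r}")
        self.pos += 1


def B(inp):          # assumption for the example: B = 'x'
    inp.eat("x")


def C(inp):          # assumption for the example: C = 'y'
    inp.eat("y")


def A(inp):          # A = bB | cC
    if inp.next() == "b":
        inp.eat("b")
        B(inp)
    elif inp.next() == "c":
        inp.eat("c")
        C(inp)
    else:
        raise SyntaxError(f"expected 'b' or 'c', found {inp.next()!r}")


def A_star(inp):     # A = (bB)*
    while inp.next() == "b":
        inp.eat("b")
        B(inp)


def accepts(rule, text):
    """True if rule consumes exactly the whole text."""
    inp = Input(text)
    try:
        rule(inp)
    except SyntaxError:
        return False
    return inp.next() is None


print(accepts(A, "bx"), accepts(A, "by"), accepts(A_star, "bxbx"))
```

Alternation in the grammar becomes an if/else chain on the next token, and repetition becomes a while loop.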

(14)

A typical grammar of a typical C-like language

Code sample:

while (expression) {
    if (expression)
        statement;
    while (expression) {
        statement;
        if (expression)
            statement;
    }
    while (expression) {
        statement;
        statement;
    }
}

if (expression) {
    statement;
    while (expression)
        statement;
    statement;
}

if (expression)
    if (expression)
        statement;
}

Grammar:

program:            statement;
statement:          whileStatement
                  | ifStatement
                  | // other statement possibilities ...
                  | '{' statementSequence '}'
whileStatement:     'while' '(' expression ')' statement
ifStatement:        simpleIf
                  | ifElse
simpleIf:           'if' '(' expression ')' statement
ifElse:             'if' '(' expression ')' statement
                    'else' statement
statementSequence:  ''  // null, i.e. the empty sequence
                  | statement ';' statementSequence
expression:         // definition of an expression comes here
// more definitions follow

(15)

Parse tree

[Figure: parse tree for the input below. The root statement expands to a whileStatement, whose children include an expression and a statement; that statement expands to a statementSequence, which in turn expands to statement and statementSequence nodes.]

Input text:

while (count<=100) { /** demonstration */
    count++;
    // ...

Tokenized:

while ( count <= 100 ) { count ++ ; ...

Grammar (excerpt):

program:         statement;
statement:       whileStatement
               | ifStatement
               | // other statement possibilities ...
               | '{' statementSequence '}'
whileStatement:  'while' '(' expression ')' statement
...

(16)

Recursive descent parsing

Parser implementation: a set of parsing methods, one for each rule:

parseStatement()

parseWhileStatement()

parseIfStatement()

• Highly recursive

• LL(1) grammars: the first token determines which rule we are in

• In other grammars you have to look ahead more than one token

Code sample:

while (expression) {
    statement;
    statement;
    while (expression) {
        while (expression)
            statement;
        statement;
    }
}
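A sketch of the one-method-per-rule idea in Python, for the while/if grammar of the earlier slide. Several things here are assumptions made for the example: the input is already tokenized, the words expression and statement stand in for whole expressions and simple statements, the tree is returned as nested tuples, and the class and method names are hypothetical. Since simpleIf and ifElse start with the same tokens, the sketch parses the common prefix once and then checks for 'else'.

```python
class StatementParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.peek()!r}")
        self.pos += 1

    def parse_statement(self):
        # The first token decides which rule applies.
        token = self.peek()
        if token == "while":
            return self.parse_while_statement()
        if token == "if":
            return self.parse_if_statement()
        if token == "{":
            self.eat("{")
            body = self.parse_statement_sequence()
            self.eat("}")
            return ("block", body)
        self.eat("statement")  # placeholder for the other statement kinds
        return ("simple",)

    def parse_while_statement(self):
        for expected in ("while", "(", "expression", ")"):
            self.eat(expected)
        return ("while", self.parse_statement())

    def parse_if_statement(self):
        for expected in ("if", "(", "expression", ")"):
            self.eat(expected)
        then_branch = self.parse_statement()
        if self.peek() == "else":
            self.eat("else")
            return ("ifElse", then_branch, self.parse_statement())
        return ("simpleIf", then_branch)

    def parse_statement_sequence(self):
        # statementSequence: '' | statement ';' statementSequence
        items = []
        while self.peek() not in ("}", None):
            items.append(self.parse_statement())
            self.eat(";")
        return items


def parse(text):
    parser = StatementParser(text.split())
    tree = parser.parse_statement()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return tree


print(parse("while ( expression ) { statement ; while ( expression ) statement ; }"))
```

The input in the last line was chosen to follow the grammar literally, in which every statement inside braces is followed by ';'.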

(17)

The Jack grammar

'x': x appears verbatim

x: x is a language construct

x?: x appears 0 or 1 times

(18)

The Jack grammar (cont.)

'x': x appears verbatim

x: x is a language construct

x?: x appears 0 or 1 times

x*: x appears 0 or more times

x|y: either x or y appears

(19)

Jack syntax analyzer in action

class Bar {
    method Fraction foo(int y) {
        var int temp; // a variable
        let temp = (xxx+12)*-63;
        ...
...

[Figure: the source code above is fed to the syntax analyzer, which produces the XML output shown below.]

• Using the language grammar, a programmer can write a syntax analyzer program (parser)

• The syntax analyzer takes a source text file and attempts to match it against the language grammar

• If successful, it can generate a parse tree in some structured format, e.g. XML.

<varDec>

<keyword> var </keyword>

<keyword> int </keyword>

<identifier> temp </identifier>

<symbol> ; </symbol>

</varDec>

<statements>

<letStatement>

<keyword> let </keyword>

<identifier> temp </identifier>

<symbol> = </symbol>

<expression>

<term>

<symbol> ( </symbol>

<expression>

<term>

<identifier> xxx </identifier>

</term>

<symbol> + </symbol>

<term>

The syntax analyzer's algorithm shown in this slide:

• If xxx is non-terminal, output:
    <xxx>
        Recursive code for the body of xxx
    </xxx>

• If xxx is terminal (keyword, symbol, constant, or identifier), output:
    <xxx> xxx value </xxx>
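The output rule can be sketched as a short recursive function. The tree representation is an assumption made for this example: a node is a (name, body) pair, where body is a string for a terminal and a list of child nodes for a non-terminal; to_xml is a hypothetical name.

```python
from xml.sax.saxutils import escape


def to_xml(node, depth=0):
    """Return the XML lines for a parse-tree node."""
    name, body = node
    pad = "  " * depth
    if isinstance(body, str):
        # Terminal: the tag and the token value on one line.
        return [f"{pad}<{name}> {escape(body)} </{name}>"]
    # Non-terminal: opening tag, the children recursively, closing tag.
    lines = [f"{pad}<{name}>"]
    for child in body:
        lines.extend(to_xml(child, depth + 1))
    lines.append(f"{pad}</{name}>")
    return lines


var_dec = ("varDec", [
    ("keyword", "var"),
    ("keyword", "int"),
    ("identifier", "temp"),
    ("symbol", ";"),
])
print("\n".join(to_xml(var_dec)))
```

Apart from the added indentation, the printed lines match the varDec element in the slide's XML output.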

(20)

JackTokenizer: a tokenizer for the Jack language (proposed implementation)

(21)

JackTokenizer (cont.)

(22)

CompilationEngine: a recursive top-down parser for Jack

The CompilationEngine effects the actual compilation output.

It gets its input from a JackTokenizer and emits its parsed structure into an output file/stream.

The output is generated by a series of compilexxx() routines, one for every syntactic element xxx of the Jack grammar.

The contract between these routines is that each compilexxx() routine should read the syntactic construct xxx from the input, advance() the tokenizer exactly beyond xxx, and output the parsing of xxx.

Thus, compilexxx() may only be called if indeed xxx is the next syntactic element of the input.

In the first version of the compiler, which we now build, this module emits a structured printout of the code, wrapped in XML tags (defined in the specs of project 10). In the final version of the compiler, this module generates executable VM code (defined in the specs of project 11).

In both cases, the parsing logic and module API are exactly the same.
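The contract between the compilexxx() routines can be illustrated with a single routine. This is a sketch only, not the proposed API: it works on a list of (classification, text) pairs instead of a JackTokenizer, the class and helper names are made up, and it covers just the Jack rule varDec: 'var' type varName (',' varName)* ';'.

```python
from xml.sax.saxutils import escape


class MiniCompilationEngine:
    def __init__(self, tokens):
        self.tokens = tokens   # list of (classification, text) pairs
        self.pos = 0           # index of the next unread token
        self.output = []       # emitted XML lines

    def _text(self):
        return self.tokens[self.pos][1]

    def _eat(self, expected=None):
        """Emit the current token as a terminal and advance past it."""
        kind, text = self.tokens[self.pos]
        if expected is not None and text != expected:
            raise SyntaxError(f"expected {expected!r}, found {text!r}")
        self.output.append(f"<{kind}> {escape(text)} </{kind}>")
        self.pos += 1

    def compile_var_dec(self):
        """Call only when the next token is 'var'; stops right after ';'."""
        self.output.append("<varDec>")
        self._eat("var")
        self._eat()               # type
        self._eat()               # varName
        while self._text() == ",":
            self._eat(",")
            self._eat()           # another varName
        self._eat(";")
        self.output.append("</varDec>")


tokens = [("keyword", "var"), ("keyword", "int"), ("identifier", "a"),
          ("symbol", ","), ("identifier", "b"), ("symbol", ";"),
          ("keyword", "let")]
engine = MiniCompilationEngine(tokens)
engine.compile_var_dec()
print(engine.tokens[engine.pos])   # the token just beyond the varDec
```

After the call, the engine is positioned on the let keyword, so a routine for the let statement could take over without any re-reading.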


(26)

Summary and next step

[Figure: structure of the Jack compiler. A Jack program enters the syntax analyzer (Chapter 10: tokenizer, then parser), which can emit XML code; code generation (Chapter 11) emits VM code.]

 Syntax analysis: understanding syntax

 Code generation: constructing semantics

The code generation challenge:

 Extend the syntax analyzer into a full-blown compiler that, instead of generating passive XML code, generates executable VM code

Two challenges: (a) handling data, and (b) handling commands.

(27)

Perspective

• The parse tree can be constructed on the fly

• Syntax analyzers can be built using:

    • Lex tool for tokenizing (flex)

    • Yacc tool for parsing (bison)

    • Do everything from scratch (our approach ...)

• The Jack language is intentionally simple:

    • Statement prefixes: let, do, ...

    • No operator priority

    • No error checking

    • Basic data types, etc.

• Richer languages require more powerful compilers

• The Jack compiler: designed to illustrate the key ideas that underlie modern compilers, leaving advanced features to more advanced courses

• Industrial-strength compilers: (LLVM)
