Method for allocating registers for a processor


(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 752 days.

(21) Appl. No.: 11/463,538

(22) Filed: Aug. 9, 2006

(65) Prior Publication Data

US 2008/0052694 A1 Feb. 28, 2008

(51) Int. Cl. G06F 12/00 (2006.01)

(52) U.S. Cl. 717/144; 717/151; 717/156; 717/157

(58) Field of Classification Search 717/144, 717/151, 156, 157; 711/109

See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

4,571,678 A 2/1986 Chaitin
5,367,696 A * 11/1994 Abe 717/153
5,890,000 A * 3/1999 Aizikowitz et al. 717/154
5,901,317 A * 5/1999 Goebel 717/156
6,009,272 A * 12/1999 Goebel
6,090,156 A * 7/2000 MacLeod

OTHER PUBLICATIONS

G. J. Chaitin, “Register Allocation & Spilling Via Graph Coloring”, ACM 0-89791-074-5/82/006/0098, 1982, pp. 98-105, IBM Research, Yorktown Heights, NY.

* cited by examiner

Primary Examiner: Thomas K Pham

(74) Attorney, Agent, or Firm: Egbert Law Offices PLLC

(57) ABSTRACT

A method of allocating registers for a PAC processor. The PAC processor has a first cluster and a second cluster. Each cluster includes a first functional unit, a second functional unit, a first local register file connected to the first functional unit, a second local register file connected to the second functional unit, and a global register file having a ping-pong structure formed by a first register bank and a second register bank. After building a Component/Register Type Associated Data Dependency Graph (CRTA-DDG), a functional unit assignment, register file assignment, ping-pong register bank assignment, and cluster assignment of the invention are performed to take full advantage of the properties of a PAC processor.

14 Claims, 15 Drawing Sheets

FIG. 2 (flow chart): function unit assignment (202); register file assignment (203); ping-pong register bank assignment (204); cluster assignment (205); communication code insertion (206).

FIG. 4(i): group A and group B.

FIG. 4(j) legend: 1: first register bank; 2: second register bank; 3: local register file.

FIG. 4(k) legend: Ma, Ia: first cluster; Mb, Ib: second cluster.

FIG. 4(l): nodes 490a and 490b.

Not applicable.

REFERENCE TO MICROFICHE APPENDIX

Not applicable.

FIELD OF THE INVENTION

The present invention relates to a method for allocating registers for a processor, and more particularly, to a method for allocating registers for a Parallel Architecture Core (PAC) processor.

BACKGROUND OF THE INVENTION

Most computers contain a form of high-performance data storage elements called registers, which need to be used effectively to achieve high performance at runtime. The process of choosing which language elements to allocate to registers, together with the data movement required to use them, is called “register allocation.” Register allocation has a major impact on the ultimate quality and performance of code. A poor allocation can degrade both code size and runtime performance. However, finding a truly optimal solution has been proven to be computationally intractable. Several general approaches for register allocation have been proposed. For example, register allocation by graph coloring was described by Chaitin, et al. in Computer Languages, Vol. 6, pp 47-57, and in U.S. Pat. No. 4,571,678, titled “Register Allocation and Spilling via Graph Coloring.”

While there are register allocation algorithms for finding good solutions in the prior art, they cannot be applied directly to a machine that utilizes multiple register files and complex access constraints, because code insertion or replacement is required during register allocation to keep the code valid with the allocated registers. This increases the complexity of the register allocation problem on such a machine.

BRIEF SUMMARY OF THE INVENTION

The objective of the present invention is to provide a method for allocating registers for a PAC processor with multiple register files and access constraints.

The PAC processor comprises a first cluster and a second cluster. Each cluster comprises a first functional unit, a second functional unit, a first local register file connected to the first functional unit, a second local register file connected to the second functional unit, and a global register file having a ping-pong structure formed by a first register bank and a second register bank. The global register file comprises a single set of access ports shared by the first and second functional units.

The method for allocating registers comprises steps (a)-(e).

In step (a), a Component/Register Type Associated Data Dependency Graph (CRTA-DDG) comprising nodes, circles, and edges is built, wherein each node represents an operator, each circle represents an operand, the operand is a constant or a virtual register required to be allocated to a physical register in the machine level, and each edge represents a data dependency [...]

[...] inserted to make the operation work in the PAC DSP structure.

In addition, the method could further comprise a step of performing a cluster assignment to partition the nodes on the CRTA-DDG into two groups, assigning one group to the first cluster and the other group to the second cluster.

The advantage of using the CRTA-DDG is that it clarifies the allocation and schedule restrictions for each node with the consideration of complex constraints in the PAC architecture.

The functional unit assignment, the register file assignment, the ping-pong register bank assignment, and the cluster assignment of the invention are performed to take full advantage of the properties of a PAC processor, so as to obtain a good performance of allocating registers.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The objectives and advantages of the present invention will become apparent upon reading the following description and upon reference to the accompanying drawings.

FIG. 1 is a schematic view illustrating the architecture of a PAC processor.

FIG. 2 is a schematic view of a flow chart illustrating the method for allocating registers according to one embodiment of the present invention.

FIG. 3 shows a schematic view of an illustration of a CRTA-DDG built from an input program fragment.

FIGS. 4(a) through 4(l) are schematic views illustrating the process of allocating registers according to one embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 illustrates the architecture of a Parallel Architecture Core (PAC) processor 10. The PAC processor 10 comprises a first cluster 12A and a second cluster 12B, wherein each cluster 12A or 12B comprises a first functional unit 20, a second functional unit 30, a first local register file 14 connected to the first functional unit 20, a second local register file 16 connected to the second functional unit 30, and a global register file 22 having a ping-pong structure formed by a first register bank B1 and a second register bank B2. Each register file includes a plurality of registers.

Also, the PAC processor 10 comprises a third functional unit 40, which is placed independently and outside the first cluster 12A and the second cluster 12B. A third local register file 18 is connected to the third functional unit 40.

The first functional unit 20 is a load/store unit (M-Unit), the second functional unit 30 is an arithmetic unit (I-Unit), and the third functional unit 40 is a scalar unit (B-Unit). The third functional unit 40 is in charge of branch operations and is also capable of performing simple load/store and address arithmetic.

The global register files 22 are used to communicate across clusters 12A and 12B; only the third functional unit 40, being able to access all global register files 22, is capable of executing such copy operations across clusters 12A and 12B.

The first local register file 14, the second local register file 16, and the third local register file 18 are only accessible by the M-Unit 20, I-Unit 30, and B-Unit 40, respectively.
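The access rules described above can be summarized as a small lookup table. The following Python sketch is illustrative only: the labels `L_M`, `L_I`, `L_B`, and `G`, and the names `ACCESSIBLE` and `can_access`, are assumptions introduced here and do not appear in the patent.

```python
# Hypothetical labels (not from the patent):
#   "L_M", "L_I", "L_B" = local register files 14, 16, 18
#   "G"                 = a global register file 22
ACCESSIBLE = {
    "M": {"L_M", "G"},  # load/store unit: its own local file plus the global file
    "I": {"L_I", "G"},  # arithmetic unit: its own local file plus the global file
    "B": {"L_B", "G"},  # scalar unit: its own local file; it can reach all global files
}


def can_access(unit: str, register_file: str) -> bool:
    """Return True if `unit` ("M", "I", or "B") may read or write `register_file`."""
    return register_file in ACCESSIBLE[unit]


# Local register files are private to their unit; the global file is shared.
print(can_access("M", "L_M"), can_access("M", "L_I"), can_access("I", "G"))
```

The table does not model the per-cluster distinction or the ping-pong banks; it only captures which kind of register file each unit type may touch.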

Referring to FIG. 2, the method for allocating registers requires building a Component/Register Type Associated Data Dependency Graph (CRTA-DDG) (step 201) first, which preserves the information of the execution and storage relationship for processor constraint analysis. The advantage of using the CRTA-DDG is that it clarifies the allocation and schedule restrictions for each node with the consideration of complex constraints in the PAC architecture.

FIG. 3 shows a CRTA-DDG 30 built from an input program fragment. The CRTA-DDG 30 comprises nodes, circles, and edges. Each rectangular node is labeled with its component-type association and represents an operator, whereas each circle is labeled with its register-type association and represents an operand. Each edge represents a data dependency between two operands.

The component-type association indicates which functional unit is scheduled for this node. The M-Unit is scheduled for a node 301 and the I-Unit is scheduled for nodes 302 and 303. The nodes 301, 302 and 303 represent operators of instruction 1 (movi), 2 (movi) and 3 (add), respectively.

The register-type association annotates the appropriate physical register file/bank to which the operands will be allocated. A Temporary Name (TN) represents a virtual register required to be allocated to a physical register in the machine-level intermediate representation used by the Open Research Compiler (ORC). The operand is a constant or a virtual register. A circle 301a represents a constant 5 in the instruction 1. A circle 302a represents a constant 6 in the instruction 2. A circle 301b represents a virtual register TN1 in the instruction 1. A circle 302b represents a virtual register TN2 in the instruction 2. Circles 303a, 303b, and 303c represent virtual registers TN1, TN2, and TN3 in the instruction 3, respectively. Edge 305 linking the circles 301b, 303a and edge 306 linking the circles 302b, 303b represent data dependency that serializes the execution order to be followed in the scheduled code sequence.
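To make the graph structure concrete, the FIG. 3 fragment described above can be encoded as plain data. This is a minimal sketch, not the patent's representation: the class names `Circle`, `Node`, and `CrtaDdg` and their fields are assumptions chosen for illustration, while the reference numerals, opcodes, and operand values come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Circle:
    """An operand: a constant or a virtual register (TN)."""
    label: str                           # reference numeral, e.g. "301b"
    value: str                           # "5", "TN1", ...
    register_type: Optional[str] = None  # register file/bank; None until allocated


@dataclass
class Node:
    """An operator, labeled with its component-type association."""
    label: str                           # reference numeral, e.g. "301"
    opcode: str                          # "movi", "add", ...
    unit: str                            # "M", "I", "B", or "M/I" while undecided
    circles: List[Circle] = field(default_factory=list)


@dataclass
class CrtaDdg:
    nodes: List[Node]
    edges: List[Tuple[str, str]]         # (producer circle label, consumer circle label)

    def circle(self, label: str) -> Circle:
        """Look up a circle by its reference numeral."""
        for node in self.nodes:
            for c in node.circles:
                if c.label == label:
                    return c
        raise KeyError(label)


# The FIG. 3 fragment as described in the text.
fig3 = CrtaDdg(
    nodes=[
        Node("301", "movi", "M", [Circle("301a", "5"), Circle("301b", "TN1")]),
        Node("302", "movi", "I", [Circle("302a", "6"), Circle("302b", "TN2")]),
        Node("303", "add", "I",
             [Circle("303a", "TN1"), Circle("303b", "TN2"), Circle("303c", "TN3")]),
    ],
    edges=[("301b", "303a"), ("302b", "303b")],  # edges 305 and 306
)

# Each edge joins two circles that name the same virtual register.
edge_values = [(fig3.circle(a).value, fig3.circle(b).value) for a, b in fig3.edges]
```

Keeping operands as separate `Circle` objects, rather than folding them into the operator, mirrors the text's point that register-type information is attached to operands while component-type information is attached to operators.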

FIGS. 4(a) through 4(l) illustrate the process for allocating registers according to one embodiment of the present invention.

FIG. 4(a) shows a CRTA-DDG built from an input program fragment. Nodes marked with M, I, B in the CRTA-DDG represent operators assigned to the M-Units, I-Units and B-Units, respectively. Nodes marked with M/I indicate that they are not assigned to the M-Units or I-Units yet. Black circles indicate constants or virtual registers that have been allocated to dedicated registers, whereas white circles indicate virtual registers not yet allocated to register files.

Referring to FIG. 2, after building a CRTA-DDG (step 201), a functional unit assignment is performed for unassigned nodes on the CRTA-DDG (step 202) to determine which functional units are assigned to execute the unassigned nodes.

FIGS. 4(b) through 4(g) illustrate the process of the step 202 performed on the CRTA-DDG. The main concept of the step 202 is performing a functional unit assignment that attempts to utilize as many local register files as possible, to distribute operations to the M-Unit and I-Unit in roughly equal amounts, and to increase instruction-level parallelism.

[...] M-Unit, as shown in FIG. 4(b).

Then, excluding the data-flow path 41, another longest data-flow path 42 is found in FIG. 4(b). The data-flow path 42 comprises nodes 421, 422, and 423, wherein the nodes 421, 422 and 423 can be operated either on the M-Unit or I-Unit, and no functional unit is assigned to them yet. The nodes 421, 422 and 423 are determined to be operated on the I-Unit, as shown in FIG. 4(c).

Excluding the data-flow paths 41 and 42, another longest data-flow path 43 is found in FIG. 4(c). The data-flow path 43 comprises nodes 431 and 432, wherein the node 431 can be operated either on the M-Unit or I-Unit, and its functional unit is not assigned yet. The node 431 is determined to be operated on the M-Unit, as shown in FIG. 4(d).

Likewise, another data-flow path 44 is found in FIG. 4(d). The data-flow path 44 comprises nodes 441 and 442, wherein the nodes 441 and 442 can be operated either on the M-Unit or I-Unit, and no functional unit is assigned to them yet. The nodes 441 and 442 are determined to be operated on the I-Unit, as shown in FIG. 4(e).

Moreover, another longest data-flow path 45 is found in FIG. 4(e). The data-flow path 45 comprises nodes 451 and 452, wherein the node 451 can be operated either on the M-Unit or I-Unit, and its functional unit is not assigned yet. The node 451 is determined to be operated on the M-Unit, as shown in FIG. 4(f).

Finally, a node 461 whose functional unit is not assigned yet is found in FIG. 4(f), and the node 461 is determined to be operated on the I-Unit, as shown in FIG. 4(g).

Given the above, the functional unit assignment alternates between the M-Unit and I-Unit in each iteration so as to balance the number of nodes on the M-Unit and I-Unit.
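The alternating, longest-path-first behavior walked through above can be sketched as follows. This is a simplified illustration under stated assumptions, not the patent's procedure: the graph is taken to be a DAG given as a successor map, paths are searched only among still-unassigned nodes (the text's paths may also pass through nodes that were already assigned), and ties are broken by sorted node name. The function names `longest_path` and `assign_units` are hypothetical.

```python
def longest_path(succ, candidates):
    """Longest path (as a list of nodes) in the DAG `succ`, restricted to `candidates`."""
    memo = {}

    def best_from(n):
        # Longest path starting at n, memoized so each node is expanded once.
        if n not in memo:
            tails = [best_from(s) for s in succ.get(n, []) if s in candidates]
            memo[n] = [n] + max(tails, key=len, default=[])
        return memo[n]

    return max((best_from(n) for n in sorted(candidates)), key=len, default=[])


def assign_units(succ, unassigned):
    """Assign "M" and "I" alternately to successive longest paths of unassigned nodes."""
    assignment = {}
    remaining = set(unassigned)
    turn = 0
    while remaining:
        path = longest_path(succ, remaining)
        unit = "M" if turn % 2 == 0 else "I"  # alternate between the two units
        for n in path:
            assignment[n] = unit
        remaining -= set(path)
        turn += 1
    return assignment


# Hypothetical graph: a -> b -> c, d -> e, and an isolated node f.
succ = {"a": ["b"], "b": ["c"], "d": ["e"], "f": []}
result = assign_units(succ, {"a", "b", "c", "d", "e", "f"})
# a, b, c go to M; d, e go to I; f goes to M.
```

Placing a whole path on one unit keeps its intermediate values eligible for that unit's local register file, which is the stated goal of using as many local register files as possible.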

After determining the functional unit type of all unassigned nodes (the step 202), a register file assignment for unallocated circles in the CRTA-DDG (step 203) is performed to determine which register files are allocated to the unallocated circles.

First, the global register file 22 is allocated to the virtual registers with data dependency across the M-Unit and I-Unit. This avoids unnecessary communication codes caused by data sharing between different functional units. FIG. 4(h) shows an example of a global register file assignment based on FIG. 4(g). The global register file 22 is allocated to the virtual registers on M-I pairs 481, 482, 483 and 484, where an M-I pair indicates that an M node links to an I node through an edge and the circles at the ends of the edge.

Then, the first local register file 14, the second local register file 16, and the third local register file 18 are assigned to the other unassigned virtual registers with data dependency across the M-Unit and M-Unit, I-Unit and I-Unit, and B-Unit and B-Unit, respectively.
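The rule in step 203 depends only on which functional units sit at the two ends of a data dependency. A minimal sketch, assuming each dependency is summarized by its producer and consumer unit; the function name `register_file_for` and the returned strings are illustrative, not from the patent:

```python
from typing import Optional


def register_file_for(producer_unit: str, consumer_unit: str) -> Optional[str]:
    """Pick a register file for a virtual register from the units at both ends of its edge."""
    pair = {producer_unit, consumer_unit}
    if pair == {"M", "I"}:
        return "global register file 22"       # shared between M-Unit and I-Unit
    if pair == {"M"}:
        return "first local register file 14"
    if pair == {"I"}:
        return "second local register file 16"
    if pair == {"B"}:
        return "third local register file 18"
    # Assumption: a B node linked to an M or I node is not covered by this step
    # and is left for the later communication code insertion.
    return None


print(register_file_for("M", "I"))
print(register_file_for("I", "I"))
```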

After performing register file assignment for the unallocated circles (the step 203), a ping-pong register bank assignment (step 204) is performed.

First, an inverse graph is built based on the M-I pairs with their related virtual registers that are allocated to global register files in the step 203. Referring to FIG. 4(i), an inverse [...]

[...] B2, respectively.

FIG. 4(j) shows the results of the register file and ping-pong register bank assignments. Virtual registers (circles) marked with 1, 2, and 3 are allocated to the first register bank B1, the second register bank B2, and the corresponding local register files of the functional units, respectively.
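The claims describe the bank assignment as partitioning the inverse graph into two groups with a minimal number of cuts, but the surviving text does not say how the partition is computed. The sketch below therefore substitutes a brute-force search, which is only practical for small graphs. It assumes both groups must be non-empty (otherwise a zero-cut partition is trivial) and that there are at least two vertices; the function name `min_cut_bipartition` and the example edges are hypothetical.

```python
from itertools import combinations


def min_cut_bipartition(vertices, edges):
    """Try every split into two non-empty groups; return (group_a, group_b, cuts).

    Returns None when fewer than two vertices are given.
    """
    vertices = sorted(vertices)
    first, rest = vertices[0], vertices[1:]
    best = None
    # group_a always contains `first`, so each split is enumerated once;
    # k < len(rest) guarantees that group_b is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            group_a = {first, *extra}
            cuts = sum((u in group_a) != (v in group_a) for u, v in edges)
            if best is None or cuts < best[2]:
                best = (group_a, set(vertices) - group_a, cuts)
    return best


# Vertices are the M-I pairs named in the text; the edges here are made up.
vertices = ["481", "482", "483", "484"]
edges = [("481", "482"), ("482", "483"), ("483", "484")]
group_a, group_b, cuts = min_cut_bipartition(vertices, edges)

# One group is mapped to the first register bank B1, the other to B2.
banks = {v: ("B1" if v in group_a else "B2") for v in vertices}
```

Each cut edge corresponds to a node that would touch both banks at once, so a smaller cut count means fewer extra copy nodes in the later communication code insertion.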

After step 204 of ping-pong register bank assignment, an optional step 205 for cluster assignment is performed to take advantage of the two-cluster property of the PAC DSP. The step 205 partitions all nodes of the M-Unit and I-Unit into two groups based on the whole graph without the nodes assigned to the B-Unit.

FIG. 4(k) shows the result of the cluster assignment based on FIG. 4(j). Nodes marked with Ma and Ia are assigned to the cluster 12A, whereas nodes marked with Mb and Ib are assigned to the cluster 12B.

The last step before real register allocation is communication code insertion (step 206). FIG. 4(l) shows the result of the communication code insertion based on FIG. 4(k). In the PAC scheme, communication code insertion is performed in the following situations to make the operation work.

When a B node is linked to an M node or an I node, another B node (490a) and another I node (490b) are inserted for the communication because a B node can only use its own third local register file 18.

When a possible communication is generated by the cluster assignment step 205, a communication link is generated between the first cluster 12A and the second cluster 12B. M nodes (492a and 492b) are inserted for the inter-cluster communication.

Another case may occur in the ping-pong bank assignment (the step 204) while cutting on an edge of the inverse graph, which means a node simultaneously accesses two banks of the global register file. Therefore, an additional node needs to be inserted to copy data from one register bank of the global register file to the local register file so as to make the operation work.
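The first two situations above can be detected from the placement of the two nodes on a dependency. The following sketch classifies a dependency given each node's unit and cluster; it does not insert any nodes and does not cover the third (bank-cut) case. The name `communication_needed`, the `Placement` tuple, and the returned labels are assumptions for illustration.

```python
from typing import Optional, Tuple

# (unit, cluster); cluster is None for the B-Unit, which sits outside both clusters.
Placement = Tuple[str, Optional[str]]


def communication_needed(producer: Placement, consumer: Placement) -> Optional[str]:
    """Return which kind of communication code a dependency needs, or None."""
    (unit_p, cluster_p), (unit_c, cluster_c) = producer, consumer
    if (unit_p == "B") != (unit_c == "B"):
        return "B-to-M/I"        # a B node linked to an M or I node
    if unit_p != "B" and cluster_p != cluster_c:
        return "inter-cluster"   # dependency crosses clusters 12A and 12B
    return None


print(communication_needed(("B", None), ("M", "12A")))
print(communication_needed(("M", "12A"), ("I", "12B")))
print(communication_needed(("M", "12A"), ("I", "12A")))
```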

The functional unit assignment, the register file assignment, the ping-pong register bank assignment, and the cluster assignment of the invention are performed to take full advantage of the properties of a PAC processor, so as to obtain a good performance of allocating registers.

The above-described embodiments of the present invention are intended to be illustrative only. Numerous alternative embodiments may be devised by those skilled in the art without departing from the scope of the following claims.

We claim:

1. A method for allocating registers for a processor, said processor comprising a first cluster and a second cluster, each cluster comprising a first functional unit, a second functional unit, a first local register file connected to said first functional unit, a second local register file connected to said second functional unit, and a global register file having a ping-pong structure formed by a first register bank and a second register bank, said global register file being connected to the first and second functional units, said method comprising steps of:

(a) building a graph comprising nodes, circles and first edges, wherein each node is labeled with at least one of said first functional unit and said second functional unit, each circle indicating whether the register is allocated, [...]

(d) allocating said first register bank and said second register bank to the circles allocated to the global register file based on whether the circles allocated to the global register file are linked through only one node in the graph; and

(e) adding at least one node to communicate between the first cluster and the second cluster or between the global register file and the first and second local register files.

2. The method for allocating registers of claim 1, further comprising a step of:

partitioning the nodes of the first functional unit and the second functional unit into a first group assigned to the first cluster and a second group assigned to the second cluster.

3. The method for allocating registers of claim 1, wherein the step (b) comprises steps of:

(b1) finding a first data-flow path having a largest number of nodes and determining the nodes on the first data path to be of the first functional unit; and

(b2) finding a second data-flow path whose number of nodes is equal to or less than the number of the nodes on the first data-flow path, and determining the nodes on the second data-flow path to be of the second functional unit;

wherein the steps (b1) and (b2) are repeated for the undetermined nodes.

4. The method for allocating registers of claim 1, wherein the step (c) comprises steps of:

(c1) allocating the global register file to the unallocated circles with the first edges linking the nodes of the first functional unit and the nodes of the second functional unit;

(c2) allocating the first local register file to the unallocated circles with the first edges linking the nodes of the first functional unit; and

(c3) allocating the second local register file to the unallocated circles with the first edges linking the nodes of the second functional unit.

5. The method for allocating registers of claim 4, wherein said processor further comprises a third functional unit connected between the first cluster and the second cluster, and a third local register file connected to the third functional unit, the step (c) further comprising a step of:

allocating the third local register file to the unallocated circles with the first edges linking the nodes of the third functional unit.

6. The method for allocating registers of claim 4, wherein the step (d) comprises steps of:

(d1) building an inverse graph comprising vertices and second edges connecting the vertices for the circles allocated to the global register file in the step (c1), each vertex being converted from a combination of two of the circles and the first edge connected therebetween, the nodes corresponding to the two circles being of different functional units, and each second edge being converted from the node between two of the combinations;

(d2) partitioning the inverse graph into a first group and a second group with a minimal number of cuts; and

[...]

the step (e) is comprised of:

further inserting a pair of nodes on the graph for communication between the first cluster and the second cluster.

9. The method for allocating registers of claim 1, wherein said processor further comprises a third functional unit connected between the first cluster and second cluster, and a first node allocated to the third functional unit links to a second node allocated to the first or the second functional unit, the step (e) being comprised of: [...]

[...] shared by the first and second functional units that can only access different register banks in each operation cycle.

14. The method for allocating registers of claim 1, wherein each node represents an operator, each circle represents an operand, the operand is a constant or a virtual register required to be allocated to a physical register in the machine level, and each edge represents a data dependency between two of the operands.
