Analysis of shaders - Dynamically Reconfigurable Processing Unit for Vertex and Pixel Processing

Chapter 3 Design

3.1 Analysis of shaders

The architecture of a vertex/pixel shader in DirectX (the GPU specification) is shown below:


1. Program counter: a register which stores the address of the instruction being executed.

2. Instruction slot: a storage unit which stores all shader codes for vertex/pixel shader(s).

3. Instruction decoder: a combinational circuit to translate an instruction into the control signals of the data path.

4. Register file: a storage unit […] for each vertex or pixel.

5. Source register modifier: a simple computation unit […] source data.

6. Computation unit: the m[…] (add, mul, mad …).

7. Destination register modifier: a simple computation unit which is similar to the source register modifier […].

The DR-shader is meant to be a multi-function shader, so it must support all functions of both shader types. Therefore, it also contains those units, and it must double the units which can’t be shared between the vertex shader type and the pixel shader type.

Firstly, we consider which units in the DR-shader can be shared between the vertex shader type and the pixel shader type to reduce the hardware overhead of the DR-shader. The sharing policies are:

1. If and only if a storage unit must store data, which may be states, instructions or temporary results, for the vertex shader and the pixel shader simultaneously, it can’t be shared.

2. All logic units are sharable.

Under these policies, we decide that the source register modifier, the computation unit, and the destination register modifier are sharable units because all of them are logic units. The instruction slot is a non-sharable unit, for it must store vertex shader codes and pixel shader codes at the same time. Besides, we can’t yet decide whether the program counter and the register file can be shared. We will make the decision for them when we discuss the architecture and flexibility of the DR-shader.

Secondly, we deliberate upon how to design those sharable units for both shader types, vertex and pixel. Among those sharable units, the source modifier and the destination modifier are the same in the vertex shader type and the pixel shader type. Therefore, we will focus on how to design a sharable computation unit in the following sections. There are some assumptions about the vertex and pixel shaders’ architecture for us to design a sharable computation unit, listed below:

- Single issue and single execution: because shaders expose the parallelism of data better than the parallelism of instructions, we use single issue, and multiple shaders execute separately instead of using multi-execution.

- The widths of all operations in the computation unit are i*v bits in vector form, where i is currently 32 (most probable), and v may be 1 or 4: for the precision …

3.2 Analysis of Computation requirements

Before designing a sharable computation unit for the vertex shader type and the pixel shader type, we need to understand the data used, the function units and the processing flow of every vertex and pixel instruction individually, in order to decide how to design the computation unit in the DR-shader. We divide all vertex and pixel shader instructions in DirectX into three types by the data they use and their processing flows, which are:

1. Vector type: separately computes the four fields (x, y, z, w) of the source registers and produces four results.

2. Scalar type: only does a computation on one field of a source register and produces one result. In this type of instructions we use a modified second Taylor formula to reduce the complexity of their computations. (See Appendix A)

3. Non-computation type: only sends the data of the source register to a bus without any computation.

Below, we show which instructions are in the three types, with their operations and computation requirements as described in DirectX. In the Operations column, c stands for each of the four fields x, y, z and w.

Instruction | Belong | Operations | Requirements
add, sub | VS, PS | Dst.c = Src0.c + Src1.c | 2in-fpSUM₃₂ ×4
cmp | PS | Dst.c = (Src0.c >= 0)? Src1.c : Src2.c | 2in-MUX₃₂ ×4
dp2add | VS | Dst = Src0.x * Src1.x + Src0.y * Src1.y + Src2.w | fpMUL₃₂ ×2, 3in-fpSUM₃₂ ×1
dp3 (vs) | VS | Dst = Src0.x * Src1.x + Src0.y * Src1.y + Src0.z * Src1.z | fpMUL₃₂ ×3, 3in-fpSUM₃₂ ×1
dp3 (ps) | … | … | …
dp4 | VS, PS | Dst = Src0.x * Src1.x + Src0.y * Src1.y + Src0.z * Src1.z + Src0.w * Src1.w | fpMUL₃₂ ×4, 4in-fpSUM₃₂ ×1
max | VS, PS | Dst.c = (Src0.c > Src1.c)? Src0.c : Src1.c | 2in-fpSUM₃₂ ×4, 2in-MUX₃₂ ×4
min | VS, PS | Dst.c = (Src0.c < Src1.c)? Src0.c : Src1.c | 2in-fpSUM₃₂ ×4, 2in-MUX₃₂ ×4
mul | VS, PS | Dst.c = Src0.c * Src1.c | fpMUL₃₂ ×4
mad | VS, PS | Dst.c = Src0.c * Src1.c + Src2.c | fpMUL₃₂ ×4, 2in-fpSUM₃₂ ×4
sge | VS | Dst.c = (Src0.c >= Src1.c)? 1.0f : 0.0f | 2in-fpSUM₃₂ ×4, 2in-MUX₃₂ ×4
slt | VS | Dst.c = (Src0.c < Src1.c)? 1.0f : 0.0f | 2in-fpSUM₃₂ ×4, 2in-MUX₃₂ ×4
sgn | VS | Dst.c = (Src0.c > 0)? 1.0f : (Src0.c == 0)? 0.0f : -1.0f | Compare to 0₃₂ ×4, 2in-MUX₃₂ ×8

Table.3-1 Requirements for vector type instructions
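As a rough illustration of the vector-type semantics listed in Table.3-1, the following Python sketch evaluates a few of the instructions on four-field registers. The function names, the tuple representation of a register and the use of Python floats are assumptions made for this example only; they are not part of the DR-shader design.

```python
from typing import Tuple

# A register is modelled as a tuple of its four fields (x, y, z, w).
Vec4 = Tuple[float, float, float, float]


def add(src0: Vec4, src1: Vec4) -> Vec4:
    """add: field-wise sum, one 2in-fpSUM per field."""
    return tuple(a + b for a, b in zip(src0, src1))


def mad(src0: Vec4, src1: Vec4, src2: Vec4) -> Vec4:
    """mad: field-wise multiply then add."""
    return tuple(a * b + c for a, b, c in zip(src0, src1, src2))


def cmp(src0: Vec4, src1: Vec4, src2: Vec4) -> Vec4:
    """cmp: pick Src1 where Src0 >= 0, otherwise Src2."""
    return tuple(b if a >= 0 else c for a, b, c in zip(src0, src1, src2))


def sge(src0: Vec4, src1: Vec4) -> Vec4:
    """sge: 1.0 where Src0 >= Src1, otherwise 0.0."""
    return tuple(1.0 if a >= b else 0.0 for a, b in zip(src0, src1))


def sgn(src0: Vec4) -> Vec4:
    """sgn: sign of each field as 1.0, 0.0 or -1.0."""
    return tuple(1.0 if a > 0 else (0.0 if a == 0 else -1.0) for a in src0)


def dp4(src0: Vec4, src1: Vec4) -> float:
    """dp4: four multiplies followed by one four-input sum."""
    return sum(a * b for a, b in zip(src0, src1))
```

The field-wise instructions map one hardware unit to each field, which is why their requirements in the table come in multiples of four, while dp4 funnels four products into a single four-input adder.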


Table.3-2 Requirements for scalar type instructions

Instruction | Belong | Operations | Requirements
branch | VS | PC = (Src0 != 0)? PC+1 : Src1 | Bus to program counter
texld | PS | Dst = Mem#Src1(Src0) | Bus to texture memory

Table.3-3 Requirements for non-computation type instructions

From the above tables, we conclude that there are only five kinds of computations, each with a maximum requirement that suffices to execute any instruction, as shown in the following table:

Computation | fpMUL | 2in-fpSUM | 3in-fpSUM | 4in-fpSUM | Compare to 0
Maximum requirement | 4 | 4 | 1 | 1 | 4

Table.3-4 Maximum requirement of each computation
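The maximum requirements can be re-derived mechanically by taking, for each kind of computation, the largest count demanded by any single instruction. The sketch below does this for the vector-type rows of Table.3-1 only; the dictionary layout and the function name are assumptions of this example, and leaving the multiplexers out of the result is likewise an assumption, made to mirror the five kinds listed in Table.3-4.

```python
# Per-instruction requirements, transcribed from the vector-type table.
REQUIREMENTS = {
    "add/sub": {"2in-fpSUM": 4},
    "cmp": {"2in-MUX": 4},
    "dp2add": {"fpMUL": 2, "3in-fpSUM": 1},
    "dp3": {"fpMUL": 3, "3in-fpSUM": 1},
    "dp4": {"fpMUL": 4, "4in-fpSUM": 1},
    "max": {"2in-fpSUM": 4, "2in-MUX": 4},
    "min": {"2in-fpSUM": 4, "2in-MUX": 4},
    "mul": {"fpMUL": 4},
    "mad": {"fpMUL": 4, "2in-fpSUM": 4},
    "sge": {"2in-fpSUM": 4, "2in-MUX": 4},
    "slt": {"2in-fpSUM": 4, "2in-MUX": 4},
    "sgn": {"Compare to 0": 4, "2in-MUX": 8},
}


def maximum_requirements(table, ignore=("2in-MUX",)):
    """Return, per computation, the largest count any one instruction needs."""
    result = {}
    for units in table.values():
        for unit, count in units.items():
            if unit in ignore:
                continue  # multiplexers are treated as routing, not computation
            result[unit] = max(result.get(unit, 0), count)
    return result


if __name__ == "__main__":
    for unit, count in sorted(maximum_requirements(REQUIREMENTS).items()):
        print(unit, count)
```

Because only one instruction executes per cycle under the single-issue assumption, the maximum over instructions, rather than the sum, is the number of units that must exist.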

3.3 Design of computation unit

In this section, we want to implement the computation unit with its area as small as possible while keeping one-cycle execution. The tradeoff in the design of the computation unit is that the more sharing we want, the more routing overhead we may have. Therefore, we must carefully decide whether the functions of any computation unit can be shared by others. To solve this problem, we divide each computation into sub-function nodes, each with its own requirement, to discover potential sharing possibilities, and then use an algorithm to choose nodes covering all computations. The divided computations are called the trees of computation requirements.

[Figure: trees of computation requirements for fpMUL₃₂, 2in-fpSUM₃₂, 3in-fpSUM₃₂, 4in-fpSUM₃₂ and Compare to 0₃₂. Legend: a node can be divided into sub-function nodes; another way to be divided; each node shows its logic and its maximum requirement.]

Fig.3-2 Trees of computation requirements

Covering means that once we choose a node in the tree of computation requirements, that node is covered. Besides, if all children of a node have been covered, the node is also covered. We will compare the average and maximum area requirements over all vertex and pixel instructions and choose the result with the smallest average and maximum area requirements.
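The covering rule can be sketched as a short recursive check. The `Node` class, the node names and the example tree below are hypothetical and simplified; in particular, the alternative ways of dividing a node that the trees allow are not modelled here.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """One node of a tree of computation requirements (illustrative only)."""
    name: str
    children: List["Node"] = field(default_factory=list)


def is_covered(node: Node, chosen: Set[str]) -> bool:
    """A node is covered if it is chosen, or if all of its children are covered."""
    if node.name in chosen:
        return True
    # A leaf that was not chosen has no children to cover it.
    return bool(node.children) and all(is_covered(c, chosen) for c in node.children)


# Hypothetical, simplified tree for a two-input floating-point sum.
FPSUM2 = Node("2in-fpSUM", [
    Node("CMP&SWAP"),
    Node("ALIGN+INV"),
    Node("2in-adder"),
    Node("normalize"),
])
```

Choosing the root alone covers the computation, and so does choosing all four leaves; choosing only three of them does not.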

3.3.1 Sharing all units within nin-fpSUM

In nin-fpSUM, we find some logic that can possibly be shared when we divide nin-fpSUM into sub-function nodes. There are two possible partitions of nin-fpSUM, which are:

1. Partition 3in- or 4in-fpSUM into several 2in-fpSUMs

[Figure: sources Src1 to Src4 feeding 2in-fpSUM₃₂ units that together produce the Result of a 4in-fpSUM.]

Fig.3-3 How three 2in-fpSUM₃₂s are reconfigured into one 4in-fpSUM₃₂

2. Sharing all units of nin-fpSUM within each other

[Figure: CMP, ALIGN, 2in-adder₂₄, 2in-adder₂₅, normalize₂₅ and normalize₂₆ blocks producing the Result, with the path labelled "Largest for fpadd4".]

Fig.3-4 How three 2in-fpSUM₃₂s are reconfigured into one 4in-fpSUM₃₂

Although “CMP&SWAP”, “ALIGN+INV” and “normalize” can be easily shared within nin-fpSUM, the problem is in the adders, especially when we use three 2in-adders to form a 4in-adder. Of these three 2in-adders, two will act as carry-save adders and the last one will be a normal adder that adds the carry and sum of the second carry-save adder. However, there are three problems we need to overcome for this kind of design, which are:

1. We need to add four 24-bit numbers with two 24-bit carry-save adders and one 24-bit normal adder. Does any adder need to be extended?

2. After “ALIGN + INV”, there may be three carry-ins from the inverters. How do we add the three carry-ins with the existing adders?

3. The result has 24+2 bits and the sources may be negative because of the inverters. How do we handle the sign extension of the negative sources?

In problem1, the first carry-save adder adds three 24-bit summands from “ALIGN + INV”, so it doesn’t need to be extended. Besides, the normal adder must give a 26-bit result, so it must be extended to a 25-bit adder. However, do we need to extend the second carry-save adder to a 25-bit adder? The answer is no, because we don’t need to process the highest bit of the carry from the first carry-save adder. The figure below illustrates the idea in more detail:

[Figure: carry-save adder 1 (24 bits) adds Src1, Src2 and Src3 into Sum1 and Carry1; carry-save adder 2 (24 bits) adds Sum1, Carry1 and Src4 into Sum2 and Carry2; the normal adder (24+1 bits) adds Sum2 and Carry2 into the Result (24+2 bits).]

Fig.3-5 The solution of problem1

In problem2, “ALIGN + INV” may send a 1-bit carry-in to the adder for the negation in 2’s complement. How can we add the carry-ins, of which there are at most three, to the summands without any additional logic? To solve this problem, we use the vacant positions in the carries of the two carry-save adders to add two carry-ins. Then, the last carry-in is added as the normal carry-in of the normal adder.

[Figure: Carryin1 and Carryin2 are placed in the vacant positions of the carries of the two carry-save adders; Carryin3 is added as the normal carry-in of the normal adder (24+1 bits), which produces the Result (24+2 bits).]

Fig.3-6 The solution of problem2

In problem3, the final result is 26 bits and the summands may be negative. Therefore, we must add a compensation, which we call sign-compensation, to the temporary result to get the correct result.

Number of minuses | 0 | 1 | 2 | 3
Sign-compensation | 00 | 11 | 10 | 01

[Figure: the 24-bit sources Src1 to Src4 are added into a temporary result (24+2 bits); adding the sign-compensation (01 in the example with three minuses) gives the real result (24+2 bits).]

Fig.3-7 The solution of problem3
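The three solutions can be checked together with a small functional model. The sketch below is an assumption-laden illustration, not the circuit itself: it uses unbounded Python integers instead of modelling the exact 24- and 25-bit adder widths, the function names are invented for this example, and the result is compared modulo 2^26. It injects two carry-ins into the vacant least significant bits of the carry vectors, passes the third as an ordinary carry-in, and adds the sign-compensation to the top two bits.

```python
WIDTH = 24
MASK24 = (1 << WIDTH) - 1
MASK26 = (1 << (WIDTH + 2)) - 1

# Sign-compensation for the two extra bits, indexed by the number of minuses.
SIGN_COMPENSATION = {0: 0b00, 1: 0b11, 2: 0b10, 3: 0b01}


def carry_save_add(a, b, c):
    """3:2 compression: returns (sum, carry) with a + b + c == sum + carry."""
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1  # bit 0 of the carry is always vacant
    return partial_sum, carry


def add4(magnitudes, negatives):
    """Add four 24-bit magnitudes, some of them negated; result is modulo 2**26."""
    assert len(magnitudes) == 4 and len(negatives) == 4
    minuses = sum(1 for n in negatives if n)
    assert minuses <= 3, "the text assumes at most three carry-ins"

    # ALIGN+INV: a negative summand is inverted and owes one carry-in.
    srcs = [(~m & MASK24) if n else m for m, n in zip(magnitudes, negatives)]
    carry_ins = [1] * minuses + [0] * (3 - minuses)

    sum1, carry1 = carry_save_add(srcs[0], srcs[1], srcs[2])
    carry1 |= carry_ins[0]            # problem2: use the vacant bit of Carry1
    sum2, carry2 = carry_save_add(sum1, carry1, srcs[3])
    carry2 |= carry_ins[1]            # problem2: use the vacant bit of Carry2
    temp = sum2 + carry2 + carry_ins[2]  # normal adder with a normal carry-in

    # problem3: add the sign-compensation to the two extra bits.
    return (temp + (SIGN_COMPENSATION[minuses] << WIDTH)) & MASK26
```

The compensation values follow from the fact that sign-extending one negative 24-bit summand to 26 bits contributes binary 11 to the top two bits, so k negative summands contribute 3k modulo 4, which gives 00, 11, 10 and 01 for k = 0 to 3.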

3.3.2 Algorithm1 & 2 to choose nodes

Here, we propose two algorithms to choose nodes covering all computations:

- Algorithm1 - minimum routing overhead: use the fewest choices to cover all computation requirements. The advantage of this algorithm is that there may be less routing overhead while still sharing enough logic. However, the disadvantage is that it may lose some sharing opportunities that would give a smaller area requirement. The steps of the algorithm are described below:

Step1: collect nodes with the same logic (sharable nodes) and mark the maximum requirement among them.

Step2: group the nodes into several sets such that there are no links or identical nodes between different sets, and ignore the sets which only have one computation. See below:

[Figure: the computation trees grouped into sets, including the 32-bit fpSUM2, fpSUM3 and fpSUM4 trees with their partial sort, CMP&SWAP, ALIGN+INV, 24-bit adder and normalize nodes.]

Fig.3-8 Different sets of computation trees

Step3: For each set, we first choose a sharable node in the highest level and check whether all computations have been covered. If some computations have not been covered, we delete the chosen nodes with all their children and choose another sharable node in the highest level. We repeat this until all computation requirements have been covered or no sharable node remains.

Fig.3-9 The result of minimum routing overhead

- Algorithm2 - maximum sharing logic: find more sharing choices to cover all computation requirements. The advantage of this algorithm is that it shares the most logic. However, the routing overhead may become more serious.

Step1, 2: the same as Step1 and Step2 in minimum routing overhead, grouping the nodes into several sets.

Step3: For each set, we first choose a sharable node in the lowest level and check whether all computations have been covered. If some computations have not been covered, we delete the chosen nodes with all their children and choose another sharable node in the lowest level. We repeat this until all computation requirements have been covered or no sharable node remains.


Fig.3-10 The result of maximum sharing logic

The following figures show, as an example, how three 2in-fpSUM₃₂s are reconfigured into one 4in-fpSUM₃₂ under these two algorithms.

3.3.3 Algorithm3-optimal area-time:

In minimum routing overhead and maximum sharing logic, we find that some factors for sharing logic haven’t been considered. In these two algorithms, we choose nodes as the basic unit, but we don’t consider the different proportions within nodes. Besides, the silicon area is not the same for all nodes. Therefore, we use a new algorithm.

- Search by integer programming: weight each sharable node with its hardware cost and use integer programming to minimize the total cost. Here, we estimate the hardware cost of each node as the number of multiplexer areas it may need. The cost function has two cases. If a sharable node has the maximum requirement, its logic will be shared with the other nodes which need the same logic. Therefore, its cost is its implementation area without routing overhead divided by the area of a multiplexer. All other sharable nodes have a cost equal to three, meaning the routing overhead of two input multiplexers and one output multiplexer.

$$
\text{cost}(node) =
\begin{cases}
\dfrac{\text{area of } node \text{ without routing overhead}}{\text{area of a multiplexer}}, & \text{for one randomly chosen node with the maximum requirement} \\[2ex]
3 \ (\text{meaning routing overhead}), & \text{otherwise}
\end{cases}
$$

Fig.3-11 Cost function of search by integer programming

The advantage of this algorithm is that it considers both sharing logic and routing overhead. However, the disadvantage is that the quality of the results depends on the precision of the cost estimates. For the integer programming, we change the display of the computation trees and give more information.

[Figure: computation trees in which every node is labelled "Logic (R_i) cost" and every root carries its maximum requirement. Legend: a function node can be divided into several sub-function nodes; a function node can have several kinds of design.]

Node labels shown in the figure:

fpMUL₃₂ (1) 50.7; Multiply₂₄ (1) 43.7; adder₈ (2) 3.5
2in-fpSUM₃₂ (1) 48; CMP&SWAP (1) 2.3; ALIGN+INV (1) 3.6; 2in-adder₂₄ (1) 2.6; normalize₂₅ (1) 3.4
3in-fpSUM₃₂ (1) 52.7; 3in-partial sort (1) 18.4; CMP&SWAP (2) 3; ALIGN+INV (2) 3; 3in-adder₂₄ (1) 21.6; 2in-adder₂₄ (2) 3; normalize₂₆ (1) 3; 2in-fpSUM₃₂ (2) 3
4in-fpSUM₃₂ (1) 76.1; 4in-partial sort (1) 27.6; CMP&SWAP (3) 3; ALIGN+INV (3) 3; 4in-adder₂₄ (1) 32.4; 2in-adder₂₄ (3) 3; normalize₂₆ (1) 3; 2in-fpSUM₃₂ (3) 3
Compare to 0₃₂ (1) 1.2
normalize₂₆ 5.3; normalize₂₅ 3.4

Fig.3-12 The computation trees for integer programming

Step1, 2: the same as Step1 and Step2 in minimum routing overhead, grouping the nodes into several sets.

Step3: To find the optimal result using integer programming, we set

- Variables
- Constraints from children nodes
- Objective: minimize ∑ (cost of node ∗ …) over all nodes, in terms of R_i and I_i

Step4: reduce all Reqs from leaves to roots and get all constraints for integer programming.

Fig.3-13 An example for Reqs reducing

Step5: apply integer programming and get the result with minimum cost.

The result of optimal area-time is shown below:

Fig.3-14 The result of optimal area-time algorithm
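To make the idea of the cost-weighted search concrete, the sketch below enumerates a tiny 0/1 selection problem by brute force instead of calling an integer programming solver. Everything about the model is an assumption of this example: each computation is given two hypothetical design alternatives, exactly one alternative must be chosen per computation, the cost of a selection is the plain sum of the chosen costs, and any weighting by R_i is left out. The cost values are the ones annotated in Fig.3-12.

```python
from itertools import product

# Hypothetical design alternatives with costs expressed in multiplexer areas.
ALTERNATIVES = {
    "3in-fpSUM": {"dedicated 3in-fpSUM": 52.7, "reuse 2in-fpSUM units": 3.0},
    "4in-fpSUM": {"dedicated 4in-fpSUM": 76.1, "reuse 2in-fpSUM units": 3.0},
}


def cheapest_selection(alternatives):
    """Try every way of picking one design per computation; keep the cheapest."""
    names = list(alternatives)
    best_choice, best_cost = None, float("inf")
    for choice in product(*(alternatives[name].items() for name in names)):
        cost = sum(design_cost for _, design_cost in choice)
        if cost < best_cost:
            best_choice = {name: design for name, (design, _) in zip(names, choice)}
            best_cost = cost
    return best_choice, best_cost


if __name__ == "__main__":
    selection, total = cheapest_selection(ALTERNATIVES)
    print(selection, total)
```

With so few alternatives the search space is tiny, which is one way to see why a cost-driven search can end up with the same selection as a simpler heuristic.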

Then, we compare the results of the three algorithms.

3.3.4 Comparison within algorithms

Firstly, we show the comparison among the three algorithms. We find that the result of the optimal area-time algorithm is the same as that of minimum routing overhead because there are too few possible solutions.

Algorithm | Average area requirement (um²) | Maximum area requirement (um²)
Minimum routing overhead | 985,793.5938 | 2,000,547.5
Maximum sharing logic | 986,095.4688 | 2,105,722.5
Optimal area-time | 985,793.5938 | 2,000,547.5

Table.3-5 Average and maximum area requirement of three algorithms

Finally, we choose the result of minimum routing overhead/optimal area-time because it has the smallest average and maximum area requirements. Comparing the two kinds of result in detail, we find they only differ in how 2in-fpSUM₃₂s are reconfigured into 3in-fpSUM₃₂ or 4in-fpSUM₃₂. In addition, we analyze the sharing logic between 2in-fpSUM₃₂s and 4in-fpSUM₃₂ and find that the weakness of the second result (maximum sharing logic) is in its routing overhead.

The critical path of minimum routing overhead/optimal area-time is:

Delay time = CMP&SWAP → ALIGN+INV → 2in-add₂₄ → normalize₂₅ → MUX₃₂ → CMP&SWAP → ALIGN+INV → 2in-add₂₄ → normalize₂₅

= 3.93 + 6.14 + 4.47 + 4.72 + 0.76 + 3.87 + 6.01 + 2.54 + 7.56 (ns)

= 40ns with time overhead 0.76ns (1.9%)

The area requirement of minimum routing overhead/optimal area-time is:

Area = 2in-fpSUM₃₂ * 3 + MUX₃₂ * 2

= 313425 + 8837.5 (um²)

= 332027.5um² with area overhead 8837.5um² (2.66%)

The critical path of maximum sharing logic is:

Delay time = MUX₃₂ → CMP&SWAP → MUX₃₂ → CMP&SWAP → MUX₃₂ → CMP&SWAP → ALIGN+INV → 2in-add₂₄ → MUX₃₂ → 2in-add₂₄ → MUX₃₂ → 2in-add₂₅ → normalize₂₆

= 0.6 + 3.9 + 0.75 + 3.59 + 0.77 + 5.08 + 6.24 + 1.15 + 0.86 + 1.42 + 0.85 + 9.31 + 6.63 (ns)

= 43.86ns with time overhead 2.12ns (4.83%)

The area requirement of maximum sharing logic is:

Area = 2in-fpSUM₃₂ * 3 + MUX₃₂ * 6

= 108972.5 + 106872.5 + 107345.0 + 18882.5 (um²)

= 332307.5um² with area overhead 18882.5um² (5.68%)

Because the time overhead and area overhead of maximum sharing logic are much larger than those of minimum routing overhead/optimal area-time, we finally choose the result of minimum routing overhead/optimal area-time as the computation unit.
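The delay quoted above for the minimum routing overhead/optimal area-time path can be re-added from its per-stage figures. In the sketch below, the pairing of each delay with a stage name simply follows the order in which both are listed, which is an assumption of this example, as are the function names; rounding is used because the inputs are decimal fractions.

```python
# Stages of the critical path and their delays in ns, in the order listed.
MIN_ROUTING_PATH = [
    ("CMP&SWAP", 3.93),
    ("ALIGN+INV", 6.14),
    ("2in-add24", 4.47),
    ("normalize25", 4.72),
    ("MUX32", 0.76),
    ("CMP&SWAP", 3.87),
    ("ALIGN+INV", 6.01),
    ("2in-add24", 2.54),
    ("normalize25", 7.56),
]


def path_delay(path):
    """Total delay of the path in ns."""
    return sum(delay for _, delay in path)


def routing_overhead(path, prefix="MUX"):
    """Delay spent in multiplexers, i.e. the time overhead of sharing."""
    return sum(delay for name, delay in path if name.startswith(prefix))


if __name__ == "__main__":
    total = path_delay(MIN_ROUTING_PATH)
    overhead = routing_overhead(MIN_ROUTING_PATH)
    print(round(total, 2), round(overhead, 2), round(100 * overhead / total, 1))
```

Only one multiplexer sits on this path, so the sharing costs a single multiplexer delay out of the whole cycle.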

3.4

The architecture of the DR-shader is below:

Fig.3-15 The architecture of DR-shader

- … in the sharable computation unit to support all vertex and pixel instructions with routing overhead
- Context memory to store the configuration of each instruction
- One more instruction slot to store vertex and pixel shader codes simultaneously

Therefore, the area of the DR-shader may be larger than the area of a vertex shader or a pixel shader.
