Chapter 3 Design
3.1 Analysis of shaders
The architecture of a vertex/pixel shader in DirectX (the GPU specification) is shown below:
[Figure: shader architecture made of program counter (PC), instruction slot, instruction decoder, register file, source register modifier, computation unit and destination register modifier; storage units and logic units are marked separately.]
1. Program counter: a register which stores the address of the instruction being executed.
2. Instruction slot: a storage unit which stores all shader codes for vertex/pixel shader(s).
3. Instruction decoder: a combinational circuit to translate an instruction into the control signals of the data path.
4. Register file: a storage unit which stores data for each vertex or pixel.
5. Source register modifier: a simple computation unit whose target is the source data.
6. Computation unit: the main unit which performs the computations of instructions (add, mul, mad …).
7. Destination register modifier: a simple computation unit which is similar to the source register modifier, but its target is the destination register.
DR-shader must support all functions of both shader types to be a multi-function shader. Therefore, it also contains those units, and it must double the units which can't be shared between vertex shader type and pixel shader type.
Firstly, we consider which units in DR-shader can be shared between vertex shader type and pixel shader type to reduce the hardware overhead of DR-shader. The sharing policies are:
1. If and only if a storage unit must store data, which may be states, instructions or temporary results, for vertex shader and pixel shader simultaneously, it can't be shared.
2. All logic units are sharable.
Under these policies, we decide that source register modifier, computation unit, and destination register modifier are sharable units because all of them are logic units. Instruction slot is a non-sharable unit, for it must store vertex shader codes and pixel shader codes at the same time. Besides, we can't yet decide whether program counter and register file can be shared. We will make the decision for them when we discuss the architecture and flexibility of DR-shader.
Secondly, we deliberate upon how to design those sharable units for both shader types. Among those sharable units, source modifier and destination modifier are the same in vertex shader type and pixel shader type. Therefore, we will focus on how to design a sharable computation unit in the following sections. There are some assumptions about the architecture of vertex and pixel shaders for us to design a sharable computation unit, listed below:
• Single issue and single execution: because shaders expose the parallelism of data better than the parallelism of instructions, we assume single issue, with multiple shaders executing separately instead of multi-execution.
• The widths of all operations in the computation unit are i*v bits in vector form, where i is currently 32 (most probable), and v may be 1 or 4, for the precision requirement described in DirectX.
3.2 Analysis of Computation requirements
Before designing a sharable computation unit for vertex shader type and pixel shader type, we need to understand the data used, the function units and the processing flow of all vertex and pixel instructions individually to decide how to design the computation unit in DR-shader. We divide all vertex and pixel shader instructions in DirectX into three types by the data they use and their processing flows, which are:
1. Vector type: separately computes four fields (x, y, z, w) of source registers and produces four results.
2. Scalar type: only does a computation on one field of a source register and produces one result. In this type of instructions we use a modified second-order Taylor formula to reduce the complexity of their computations. (See Appendix A)
3. Non-computation type: only sends the data of the source register to a bus without any computation.
Below, we will show what instructions are in the three types with their operations and computation requirements.
In the operations below, .c stands for each of the four fields x, y, z and w.
Instruction | Belong | Operations | Requirements
add, sub | VS, PS | Dst.c = Src0.c + Src1.c | 2in-fpSUM32 *4
cmp | PS | Dst.c = (Src0.c >= 0)? Src1.c : Src2.c | 2in-MUX32 *4
dp2add | VS | Dst = Src0.x * Src1.x + Src0.y * Src1.y + Src2.w | fpMUL32 *2, 3in-fpSUM32 *1
dp3 (vs), dp3 (ps) | VS, PS | Dst = Src0.x * Src1.x + Src0.y * Src1.y + Src0.z * Src1.z | fpMUL32 *3, 3in-fpSUM32 *1
dp4 | VS, PS | Dst = Src0.x * Src1.x + Src0.y * Src1.y + Src0.z * Src1.z + Src0.w * Src1.w | fpMUL32 *4, 4in-fpSUM32 *1
max | VS, PS | Dst.c = (Src0.c > Src1.c)? Src0.c : Src1.c | 2in-fpSUM32 *4, 2in-MUX32 *4
min | VS, PS | Dst.c = (Src0.c < Src1.c)? Src0.c : Src1.c | 2in-fpSUM32 *4, 2in-MUX32 *4
mul | VS, PS | Dst.c = Src0.c * Src1.c | fpMUL32 *4
mad | VS, PS | Dst.c = Src0.c * Src1.c + Src2.c | fpMUL32 *4, 2in-fpSUM32 *4
sge | VS | Dst.c = (Src0.c >= Src1.c)? 1.0f : 0.0f | 2in-fpSUM32 *4, 2in-MUX32 *4
slt | VS | Dst.c = (Src0.c < Src1.c)? 1.0f : 0.0f | 2in-fpSUM32 *4, 2in-MUX32 *4
sgn | VS | Dst.c = (Src0.c > 0)? 1.0f : (Src0.c == 0)? 0.0f : -1.0f | Compare to 0 (32-bit) *4, 2in-MUX32 *8
Table.3-1 Requirements for vector type instructions
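As a reading aid, the behaviour of a few vector type instructions can be sketched in software. This is only an illustrative model, not the hardware design: the helper name `per_field` and the use of Python lists for registers are assumptions made here, and floating-point rounding of the real units is ignored.

```python
from typing import Callable, List

Vec4 = List[float]  # fields ordered x, y, z, w


def per_field(op: Callable[..., float], *srcs: Vec4) -> Vec4:
    """Apply op separately to the x, y, z, w fields of the source registers."""
    return [op(*fields) for fields in zip(*srcs)]


def add(src0: Vec4, src1: Vec4) -> Vec4:
    return per_field(lambda a, b: a + b, src0, src1)


def mad(src0: Vec4, src1: Vec4, src2: Vec4) -> Vec4:
    return per_field(lambda a, b, c: a * b + c, src0, src1, src2)


def cmp(src0: Vec4, src1: Vec4, src2: Vec4) -> Vec4:
    # Select Src1 where Src0 >= 0, otherwise Src2.
    return per_field(lambda a, b, c: b if a >= 0 else c, src0, src1, src2)


def sge(src0: Vec4, src1: Vec4) -> Vec4:
    return per_field(lambda a, b: 1.0 if a >= b else 0.0, src0, src1)


def sgn(src0: Vec4) -> Vec4:
    return per_field(lambda a: 1.0 if a > 0 else (0.0 if a == 0 else -1.0), src0)


def dp4(src0: Vec4, src1: Vec4) -> float:
    # Four multiplications feeding one four-input sum.
    return sum(a * b for a, b in zip(src0, src1))
```

The per-field instructions map onto four parallel units (the "*4" in the requirements), while dp4 needs four multipliers and a single 4in-fpSUM.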
Table.3-2 Requirements for scalar type instructions
Instruction | Belong | Operations | Requirements
branch | VS | PC = (Src0 != 0)? PC+1 : Src1 | Bus to program counter
texld | PS | Dst = Mem#Src1(Src0) | Bus to texture memory
Table.3-3 Requirements for non-computation type instructions
From the above tables, we conclude that there are only five kinds of computations, each with a maximum requirement needed to execute any instruction, as shown in the following table:
Computation | fpMUL | 2in-fpSUM | 3in-fpSUM | 4in-fpSUM | Compare to 0
Maximum requirement | 4 | 4 | 1 | 1 | 4
Table.3-4 Maximum requirement of each computation
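The maximum requirements can be reproduced by taking, for each kind of computation, the largest count over all instructions. The sketch below uses only the vector type rows of Table 3-1 as input (the scalar type rows are not included here), and the dictionary layout and function name are assumptions made for illustration.

```python
# Per-instruction requirements, vector type rows of Table 3-1 only.
REQUIREMENTS = {
    "add/sub": {"2in-fpSUM": 4},
    "cmp": {"2in-MUX": 4},
    "dp2add": {"fpMUL": 2, "3in-fpSUM": 1},
    "dp3": {"fpMUL": 3, "3in-fpSUM": 1},
    "dp4": {"fpMUL": 4, "4in-fpSUM": 1},
    "max": {"2in-fpSUM": 4, "2in-MUX": 4},
    "min": {"2in-fpSUM": 4, "2in-MUX": 4},
    "mul": {"fpMUL": 4},
    "mad": {"fpMUL": 4, "2in-fpSUM": 4},
    "sge": {"2in-fpSUM": 4, "2in-MUX": 4},
    "slt": {"2in-fpSUM": 4, "2in-MUX": 4},
    "sgn": {"Compare to 0": 4, "2in-MUX": 8},
}

# The five kinds of computations listed in Table 3-4.
COMPUTATIONS = ["fpMUL", "2in-fpSUM", "3in-fpSUM", "4in-fpSUM", "Compare to 0"]


def maximum_requirements(requirements, computations):
    """Largest number of units of each kind that any single instruction needs."""
    return {
        kind: max(req.get(kind, 0) for req in requirements.values())
        for kind in computations
    }


result = maximum_requirements(REQUIREMENTS, COMPUTATIONS)
```

Because only one instruction executes at a time, the unit count per kind is a maximum rather than a sum.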
3.3 Design of computation unit
In this section, we want to implement the computation unit with its area as small as possible while keeping one-cycle execution. The tradeoff in the design of the computation unit is that the more sharing we want, the more routing overhead we may have. Therefore, we must carefully decide whether the functions of any computation unit can be shared by others. To solve this problem, we divide each computation into sub-function nodes, each with its own requirement, to discover potential sharing possibilities, and then use an algorithm to choose nodes covering all computations. The divided computations are called the trees of computation requirements.
[Figure: trees of computation requirements for fpMUL32, 2in-fpSUM32, 3in-fpSUM32, 4in-fpSUM32 and Compare to 0. Each node shows its logic (for example CMP&SWAP, ALIGN+INV, 2in-adder24, normalize25, normalize26, multiply24, adder8) and its maximum requirement. A node can be divided into sub-function nodes, and some nodes have another way to be divided, for example into 2in-fpSUM32s.]
Fig.3-2 Trees of computation requirements
Covering means that if we choose a node in the tree of computation requirements, that node has been covered. Besides, if all children of a node have been covered, the node is also covered. We will compare the average and maximum area requirements of all vertex and pixel instructions for each result and choose the one with the smallest average and maximum area requirement.
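The covering rule can be written as a short recursive check. The sketch below is a minimal model that assumes each node has a single way to be divided; the `Node` class and the function name are hypothetical, and the example tree uses the four sub-function nodes of 2in-fpSUM32.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def is_covered(node: Node, chosen: Set[str]) -> bool:
    """A node is covered if it is chosen, or if it has children and all are covered."""
    if node.name in chosen:
        return True
    return bool(node.children) and all(is_covered(c, chosen) for c in node.children)


fpsum2 = Node("2in-fpSUM32", [
    Node("CMP&SWAP"), Node("ALIGN+INV"), Node("2in-adder24"), Node("normalize25"),
])
```

Choosing the root alone covers it, and so does choosing all four children; leaving out any child leaves the root uncovered.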
3.3.1 Sharing all units within nin-fpSUM
In nin-fpSUM, we find some logic that can possibly be shared when we divide nin-fpSUM into many sub-function nodes. There are two possible partitions of nin-fpSUM, which are:
1. Partition 3in-fpSUM or 4in-fpSUM into several 2in-fpSUMs
[Figure: three 2in-fpSUM32 units, fed by Src1 to Src4, connected to form one 4in-fpSUM32 that produces Result.]
Fig.3-3 How three 2in-fpSUM32s are reconfigured to one 4in-fpSUM32
2. Sharing all units of nin-fpSUM within each other
[Figure: the CMP, ALIGN, 2in-adder24, 2in-adder25, normalize25 and normalize26 units of the 2in-fpSUM32s shared to produce the Result of one 4in-fpSUM32.]
Fig.3-4 How three 2in-fpSUM32s are reconfigured to one 4in-fpSUM32
Although "CMP&SWAP", "ALIGN+INV" and "normalize" can be easily shared within nin-fpSUM, the problem is in the adders, especially when we use three 2in-adders to form a 4in-adder. Of these three 2in-adders, two will be carry-save adders and the last one will be a normal adder which adds the carry and sum of the second carry-save adder. However, there are three problems we need to overcome for this kind of design, which are:
1. We need to add four 24-bit numbers with two 24-bit carry-save adders and one 24-bit normal adder. Does any adder need to be extended?
2. After "ALIGN+INV", there may be three carry-ins from the inverters. How do we add the three carry-ins with the existing adders?
3. The result has 24+2 bits and the sources may be negative because of the inverters. How do we handle the sign extension of the negative sources?
In problem 1, the first carry-save adder adds three 24-bit summands from "ALIGN+INV", so it doesn't need to be extended. Besides, the normal adder must give a 26-bit result, so it must be extended to a 25-bit adder. However, do we need to extend the second carry-save adder to a 25-bit adder? The answer is no, because it doesn't need to process the highest bit of the carry from the first carry-save adder. The figure below illustrates the idea:
[Figure: carry-save adder 1 (24 bits) adds Src1, Src2 and Src3 into Sum1 and Carry1; carry-save adder 2 (24 bits) adds Sum1, Carry1 and Src4 into Sum2 and Carry2; the normal adder (24+1 bits) adds Sum2 and Carry2 into Result (24+2 bits).]
Fig.3-5 The solution of problem 1
In problem 2, "ALIGN+INV" may send a 1-bit carry-in to the adder for the negation in 2's complement. How can we add the carry-ins, which are at most three, to the summands without any additional logic? To solve this problem, we use the vacant positions in the carries of the two carry-save adders to add two carry-ins. Then, the last carry-in is added as a normal carry-in by the normal adder.
[Figure: the adders of Fig.3-5 with Carryin1 and Carryin2 placed in the vacant positions of the carries of the carry-save adders, and Carryin3 used as the normal carry-in of the normal adder.]
Fig.3-6 The solution of problem 2
In problem 3, the final result has 26 bits and the summands may be negative. Therefore, we must add a compensation, which we call sign-compensation, to the temporary result to get the correct result.
Number of minuses | 0 | 1 | 2 | 3
Sign-compensation | 00 | 11 | 10 | 01
[Figure: Src1 to Src4 (24 bits each, some of them negative) are added into a temporary result (24+2 bits); adding the sign-compensation, 01 in the example with three negative sources, gives the real result (24+2 bits).]
Fig.3-7 The solution of problem 3
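The three solutions can be checked together with a bit-level model. The sketch below is an illustration under stated assumptions, not the actual circuit: it assumes that a negative source is formed by inverting its 24 bits and supplying a carry-in of 1, that the highest bit of Carry1 bypasses the 24-bit second carry-save adder and joins the normal adder, and that the sign-compensation is added to the two highest bits of the 26-bit temporary result. The function names are hypothetical. The result is the 26-bit pattern congruent to the true signed sum modulo 2^26.

```python
MASK24 = (1 << 24) - 1
MASK26 = (1 << 26) - 1

# Sign-compensation indexed by the number of negative sources.
SIGN_COMPENSATION = {0: 0b00, 1: 0b11, 2: 0b10, 3: 0b01}


def carry_save(a, b, c):
    """3-to-2 compression: a + b + c == s + cy, and bit 0 of cy is vacant."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy


def add4(magnitudes, negative):
    """Add four 24-bit magnitudes, each optionally negated, into 26 bits."""
    k = sum(negative)
    assert len(magnitudes) == 4 and len(negative) == 4 and k <= 3
    # INV: a negative source is sent inverted; its +1 arrives as a carry-in.
    srcs = [(~m & MASK24) if n else (m & MASK24) for m, n in zip(magnitudes, negative)]
    carry_ins = [1] * k + [0] * (3 - k)

    sum1, carry1 = carry_save(srcs[0], srcs[1], srcs[2])
    carry1 |= carry_ins[0]            # first carry-in uses the vacant bit 0
    top = carry1 >> 24                # highest carry bit skips the second adder
    sum2, carry2 = carry_save(sum1, carry1 & MASK24, srcs[3])
    carry2 |= carry_ins[1]            # second carry-in uses the vacant bit 0
    # Normal adder (24+1 bits wide inputs), third carry-in as normal carry-in.
    temp = (((top << 24) | sum2) + carry2 + carry_ins[2]) & MASK26
    # Sign-compensation on the two highest bits gives the real result.
    return (temp + (SIGN_COMPENSATION[k] << 24)) & MASK26
```

Each inverted source contributes an extra 2^24 to the temporary result, so with k negative sources the compensation is (-k) mod 4 in the two highest bits, which gives exactly the values 00, 11, 10 and 01.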
3.3.2 Algorithm1 & 2 to choose nodes
Here, we propose two algorithms to choose nodes covering all computations:
• Algorithm1-minimum routing overhead: use the fewest choices to cover all computation requirements. The advantage of this algorithm is that there may be less routing overhead with enough sharing logic. However, the disadvantage is that it may lose some possible sharing opportunities for a smaller area requirement. The steps of the algorithm are described below:
Step1: collect nodes with the same logic (sharable nodes) and mark the maximum requirement among them.
Step2: group nodes into several sets such that there are no links or identical nodes between different sets, and ignore the sets which only have one computation. See below:
[Figure: the computation trees grouped into sets; the trees of fpSUM2, fpSUM3 and fpSUM4, which contain CMP&SWAP, ALIGN+INV, 2in 24-bit adder and normalize nodes, fall into one set, apart from 32-bit fpMUL and Compare.]
Fig.3-8 Different sets of computation trees
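Step1 can be sketched as a scan over the trees that records which computations use each logic and the largest requirement seen. The data layout and the function name below are assumptions for illustration, and the requirement numbers in the example are illustrative values rather than figures taken from the design.

```python
from collections import defaultdict


def collect_sharable(trees):
    """Return {logic: maximum requirement} for logic used by more than one computation.

    trees maps a computation name to a list of (logic, requirement) nodes.
    """
    users = defaultdict(set)
    max_req = defaultdict(int)
    for computation, nodes in trees.items():
        for logic, requirement in nodes:
            users[logic].add(computation)
            max_req[logic] = max(max_req[logic], requirement)
    return {logic: max_req[logic] for logic in users if len(users[logic]) > 1}


# Illustrative example: two trees that share three kinds of logic.
example_trees = {
    "2in-fpSUM32": [("CMP&SWAP", 4), ("ALIGN+INV", 4), ("2in-adder24", 4), ("normalize25", 4)],
    "4in-fpSUM32": [("CMP&SWAP", 3), ("ALIGN+INV", 3), ("2in-adder24", 3), ("normalize26", 1)],
}
sharable = collect_sharable(example_trees)
```

Logic that appears in only one tree, such as normalize25 and normalize26 in this example, is not a sharing candidate and is left out.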
Step3: For each set, we first choose a sharable node in the highest level and see if all computations have been covered. If some computations haven't been covered, we delete the chosen nodes with all their children and choose another sharable node in the highest level. We repeat this recursively until all computation requirements have been covered or no sharable node is left.
Fig.3-9 The result of minimum routing overhead
• Algorithm2-maximum sharing logic: find more sharing choices to cover all computation requirements. The advantage of this algorithm is that it has the most sharing logic. However, the routing overhead may become more serious.
Step1, 2: the same as Step1 and Step2 in minimum routing overhead, to group nodes into several sets.
Step3: For each set, we first choose a sharable node in the lowest level and see if all computations have been covered. If some computations haven't been covered, we delete the chosen nodes with all their children and choose another sharable node in the lowest level. We repeat this recursively until all computation requirements have been covered or no sharable node is left.
The following figures show, as an example, how three 2in-fpSUM32s are reconfigured to one 4in-fpSUM32 in these two algorithms:
Fig.3-10 The result of maximum sharing logic
3.3.3 Algorithm3-optimal area-time
In minimum routing overhead and maximum sharing logic, we find that some factors of sharing logic haven't been considered. In these two algorithms, we choose nodes as the basic unit but we don't consider the different proportions within nodes. Besides, the silicon area is not the same for all nodes. Therefore, we use a new algorithm.
• Search by integer programming: weight each sharable node with its hardware cost and use integer programming to minimize the total cost. Here, we estimate the hardware cost of each node by the number of multiplexer areas it may need. The cost function has two cases. If a sharable node has the maximum requirement, the logic of the node will be shared with the other nodes which need the same logic. Therefore, its cost will be its implementation area without routing overhead divided by the area of a multiplexer. Except for those nodes, the other sharable nodes will have a cost equal to three, meaning the routing overhead of two input multiplexers and one output multiplexer.
$$\mathrm{cost}(node) = \begin{cases} \dfrac{\mathrm{area}(node)}{\mathrm{area}(MUX)} & \text{for one randomly chosen node with the maximum requirement} \\ 3 & \text{otherwise (meaning routing overhead)} \end{cases}$$
Fig.3-11 Cost function of search by integer programming
The advantage of this algorithm is that it considers both sharing logic and routing overhead. However, the disadvantage is that the quality of the results depends on the precision of the cost. For the integer programming, we change the display of the computation trees and give more information.
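The two-case cost function can be sketched as follows. The function name, the argument layout and the area values are assumptions for illustration. Where the text chooses one node with the maximum requirement at random, this sketch takes the first such node so that the result is reproducible.

```python
from typing import Dict

ROUTING_COST = 3.0  # two input multiplexers plus one output multiplexer


def node_costs(requirements: Dict[str, int], area: float, mux_area: float) -> Dict[str, float]:
    """Cost of every node in one group of nodes that share the same logic.

    requirements maps each node of the group to its requirement. The node
    holding the maximum requirement pays for the implementation, measured
    in multiplexer areas; every other node only pays the routing overhead.
    """
    owner = max(requirements, key=requirements.get)  # first maximum, not random
    return {
        name: (area / mux_area if name == owner else ROUTING_COST)
        for name in requirements
    }


# Hypothetical group: the same logic appearing in three computation trees.
costs = node_costs(
    {"in 2in-fpSUM32": 4, "in 3in-fpSUM32": 2, "in 4in-fpSUM32": 3},
    area=23.0,
    mux_area=10.0,
)
```

Expressing the implementation area in multiplexer areas puts both cases on the same scale, so the integer program can trade logic area against routing overhead directly.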
[Figure: the computation trees redrawn for integer programming; each node is annotated with its logic, (Ri) and its cost. A function node may be divided into several sub-function nodes, and a function node may have several kinds of design.]
Fig.3-12 The computation trees for integer programming
Step1, 2: the same as Step1 and Step2 in minimum routing overhead, to group nodes into several sets.
Step3: To find the optimal result using integer programming, we set
• Variables: $R_i$ for each node $i$
• Constraints from children nodes
• Objective: minimize $\sum_{i} c_i R_i$ over all nodes $i$, where $c_i$ is the cost of node $i$
Step4: reduce all Reqs from leaves to roots and get all constraints for integer programming.
Fig.3-13 An example for Reqs reducing
Step5: apply integer programming and get the result with minimum cost.
The result of optimal area-time is shown below:
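For a very small tree, the minimization can be illustrated by exhaustive search instead of an integer programming solver. The sketch below assumes 0/1 variables (a node is either chosen or not) and takes the covering rule of Section 3.3 as the only constraint; the constraint form, the node names and the costs are all hypothetical.

```python
from itertools import product


def covered(node, chosen, children):
    """A node is covered if chosen, or if it has children and all are covered."""
    if node in chosen:
        return True
    kids = children.get(node, [])
    return bool(kids) and all(covered(k, chosen, children) for k in kids)


def solve(children, costs, roots):
    """Try every 0/1 assignment and keep the cheapest one covering all roots."""
    names = list(costs)
    best_cost, best_choice = None, None
    for bits in product((0, 1), repeat=len(names)):
        chosen = {n for n, b in zip(names, bits) if b}
        if all(covered(r, chosen, children) for r in roots):
            total = sum(costs[n] for n in chosen)
            if best_cost is None or total < best_cost:
                best_cost, best_choice = total, chosen
    return best_cost, best_choice


# Hypothetical tree: one root that can be built from two sub-function nodes.
tree = {"root": ["left", "right"]}
cheap_children = solve(tree, {"root": 10.0, "left": 3.0, "right": 4.0}, ["root"])
cheap_root = solve(tree, {"root": 6.0, "left": 3.0, "right": 4.0}, ["root"])
```

The search grows exponentially with the number of nodes, which is why a real integer programming solver is the practical choice beyond toy examples.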
Fig.3-14 The result of optimal area-time algorithm
Then, we compare the results of the three algorithms.
3.3.4 Comparison within algorithms
Firstly, we show the comparison of the three algorithms.
 | Minimum routing overhead | Maximum sharing logic | Optimal area-time
Average area requirement (um2) | 985,793.5938 | 986,095.4688 | 985,793.5938
Maximum area requirement (um2) | 2,000,547.5 | 2,105,722.5 | 2,000,547.5
Table.3-5 Average and maximum area requirement of three algorithms
We find that the result of the optimal area-time algorithm is the same as that of minimum routing overhead because there are too few possible solutions.
Finally, we choose the result of minimum routing overhead/optimal area-time because of the smallest average and maximum area requirements. Comparing the two kinds of result in detail, we find they only differ in the choice of how to reconfigure 2in-fpSUM32s to 3in-fpSUM32 or 4in-fpSUM32. In addition, we analyze the sharing logic between 2in-fpSUM32s and 4in-fpSUM32 and find that the failure of result 2 is in the routing overhead.
The critical path of minimum routing overhead/optimal area-time is:
Delay time = CMP&SWAP → ALIGN+INV → 2in-add24 → normalize25 → MUX32 → CMP&SWAP → ALIGN+INV → 2in-add24 → normalize25
= 3.93 + 6.14 + 4.47 + 4.72 + 0.76 + 3.87 + 6.01 + 2.54 + 7.56 (ns)
= 40ns with time overhead 0.76ns (1.9%)
The area requirement of minimum routing overhead/optimal area-time is:
Area = 2in-fpSUM32 * 3 + MUX32 * 2
= 313425 + 8837.5 (um2)
= 332027.5um2 with area overhead 8837.5um2 (2.66%)
The critical path of maximum sharing logic is:
Delay time = MUX32 → CMP&SWAP → MUX32 → CMP&SWAP → MUX32 → CMP&SWAP → ALIGN+INV → 2in-add24 → MUX32 → 2in-add24 → MUX32 → 2in-add25 → normalize26
= 0.6 + 3.9 + 0.75 + 3.59 + 0.77 + 5.08 + 6.24 + 1.15 + 0.86 + 1.42 + 0.85 + 9.31 + 6.63 (ns)
= 43.86ns with time overhead 2.12ns (4.83%)
The area requirement of maximum sharing logic is:
Area = 2in-fpSUM32 * 3 + MUX32 * 6
= 108972.5 + 106872.5 + 107345.0 + 18882.5 (um2)
= 332307.5um2 with area overhead 18882.5um2 (5.68%)
Because the time overhead and area overhead of maximum sharing logic are much more than those of minimum routing overhead/optimal area-time, we finally choose the result of minimum routing overhead/optimal area-time for the computation unit.
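The critical path delay quoted for minimum routing overhead/optimal area-time can be checked by summing the stage delays. The sketch below assumes the delays are listed in the same order as the stages of the path; the variable names are chosen here for illustration.

```python
# (stage, delay in ns) along the critical path, in path order.
stages = [
    ("CMP&SWAP", 3.93), ("ALIGN+INV", 6.14), ("2in-add24", 4.47),
    ("normalize25", 4.72), ("MUX32", 0.76), ("CMP&SWAP", 3.87),
    ("ALIGN+INV", 6.01), ("2in-add24", 2.54), ("normalize25", 7.56),
]

total_ns = sum(delay for _, delay in stages)
# Time overhead: the delay spent in routing multiplexers only.
overhead_ns = sum(delay for name, delay in stages if name.startswith("MUX"))
overhead_percent = 100.0 * overhead_ns / total_ns

print(f"{total_ns:.2f} ns total, {overhead_ns:.2f} ns overhead ({overhead_percent:.1f}%)")
```

Only one multiplexer lies on this path, which is why the time overhead stays below 2% of the total delay.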
3.4
The architecture of DR-shader is below:
Fig.3-15 The architecture of DR-shader
in the sharable computation unit to support all vertex and pixel instructions with routing overhead
• Context memory to store the configuration of each instruction
• One more instruction slot to store vertex and pixel shader codes simultaneously
Therefore, the area of DR-shader may be larger than the area of vertex shader or pixel shader