
Chapter 3 Design

3.3 Design of computation unit


[Figure: dataflow graph of the computation unit, with blocks 4in-adder₂₄, 2in-adder₂₄, CMP&SWAP, normalize₂₅, normalize₂₆, adder₈, Multiply₂₄, Compare to 0₃₂, 2in-fpSUM₃₂, 3in-fpSUM₃₂ and 4in-fpSUM₃₂]

Fig.3-14 The result of optimal area-time algorithm

Next, we compare the results of the three algorithms.

3.3.4 Comparison within algorithms

First, we compare the three algorithms with each other.

We find that the result of the optimal area-time algorithm is the same as that of the minimum routing overhead algorithm, because there are too few possible solutions.

Algorithm                  Average area requirement (um²)   Maximum area requirement (um²)
Maximum sharing logic      986,095.4688                     2,105,722.5
Minimum routing overhead   985,793.5938                     2,000,547.5
Optimal area-time          985,793.5938                     2,000,547.5

Table.3-5 Average and maximum area requirement of three algorithms

We finally choose the result of minimum routing overhead/optimal area-time because it has the smallest average and maximum area requirements. Comparing the two kinds of result in detail, we find that they differ only in how the 2in-fpSUM₃₂s are reconfigured into a 3in-fpSUM₃₂ or a 4in-fpSUM₃₂. In addition, we analyze the sharing logic between the 2in-fpSUM₃₂s and the 4in-fpSUM₃₂ and find that the failure of result 2 lies in the routing overhead.

The critical path of minimum routing overhead/optimal area-time is:

Delay time = CMP&SWAP → ALIGN+INV → 2in-add₂₄ → normalize₂₅ → MUX₃₂ → CMP&SWAP → ALIGN+INV → 2in-add₂₄ → normalize₂₅

= 3.93 + 6.14 + 4.47 + 4.72 + 0.76 + 3.87 + 6.01 + 2.54 + 7.56 (ns)

= 40 ns with time overhead 0.76 ns (1.9%)

The area requirement of minimum routing overhead/optimal area-time is:

Area = 2in-fpSUM₃₂ * 3 + MUX₃₂ * 2

= 313425 + 8837.5 (um²)

= 332027.5 um² with area overhead 8837.5 um² (2.66%)

The critical path of maximum sharing logic is:

Delay time = MUX₃₂ → CMP&SWAP → MUX₃₂ → CMP&SWAP → MUX₃₂ → CMP&SWAP → ALIGN+INV → 2in-add₂₄ → MUX₃₂ → 2in-add₂₄ → MUX₃₂ → 2in-add₂₅ → normalize₂₆

= 0.6 + 3.9 + 0.75 + 3.59 + 0.77 + 5.08 + 6.24 + 1.15 + 0.86 + 1.42 + 0.85 + 9.31 + 6.63 (ns)

= 43.86 ns with time overhead 2.12 ns (4.83%)

The area requirement of maximum sharing logic is:

Area = 2in-fpSUM₃₂ * 3 + MUX₃₂ * 6

= 108972.5 + 106872.5 + 107345.0 + 18882.5 (um²)

= 332307.5 um² with area overhead 18882.5 um² (5.68%)

Because the time overhead and area overhead of maximum sharing logic are much larger than those of minimum routing overhead/optimal area-time, we finally choose the result of minimum routing overhead/optimal area-time as our design of the computation unit.

3.4 Architecture of DR-shader

After finishing the computation unit, we can build the DR-shader. The architecture of the DR-shader is shown below:

[Figure: block diagram with Instruction Decoder, Computation Unit, Instruction slot (vertex shader), Instruction slot (pixel shader), Source register modifier, Destination register modifier, Context memory, Storage Logic and the configuration signal]

Fig.3-15 The architecture of DR-shader

In the architecture, there are some necessary hardware overheads:

• More logic in the sharable computation unit, with routing overhead, to support all vertex and pixel instructions

• Context memory to store the configuration of each instruction

• One more instruction slot to store vertex and pixel shader codes simultaneously

Therefore, the area of a DR-shader may be larger than the area of a vertex shader or a pixel shader, in exchange for the ability to reconfigure between vertex shader type and pixel shader type. Whether this flexibility is worth adding to improve shader utilization is examined in our simulations.
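The delay figures quoted in Section 3.3.4 for the critical path of minimum routing overhead/optimal area-time can be checked with a few lines of arithmetic. The following is a minimal sketch only: the function name critical_path_report and the convention of treating every stage whose name starts with "MUX" as routing overhead are assumptions made for illustration, not part of the design flow.

```python
# Stage delays (ns) of the minimum routing overhead / optimal area-time
# critical path, copied from Section 3.3.4.
STAGES = [
    ("CMP&SWAP", 3.93), ("ALIGN+INV", 6.14), ("2in-add24", 4.47),
    ("normalize25", 4.72), ("MUX32", 0.76), ("CMP&SWAP", 3.87),
    ("ALIGN+INV", 6.01), ("2in-add24", 2.54), ("normalize25", 7.56),
]


def critical_path_report(stages, overhead_prefix="MUX"):
    """Return (total delay, routing delay, routing share) for one path.

    Stages whose name starts with `overhead_prefix` are counted as routing
    overhead (an assumed convention for this sketch).
    """
    total = sum(delay for _, delay in stages)
    overhead = sum(d for name, d in stages if name.startswith(overhead_prefix))
    return total, overhead, overhead / total


total, overhead, share = critical_path_report(STAGES)
print(f"{total:.2f} ns, overhead {overhead:.2f} ns ({share:.1%})")
# prints: 40.00 ns, overhead 0.76 ns (1.9%)
```

The same function could be applied to any other stage list, provided the multiplexer stages are named consistently.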

3.5 Design of workloads monitor logic

In this section, we first describe the properties of the DR-shader and its hardware overhead. Then, we describe the design of the vertex/pixel workloads monitor logic. There are two assumptions about the reconfiguration of DR-shaders:

1. Order of processing: In the beginning, all DR-shaders are reconfigured to vertex shader type because there is no pixel workload yet. Then, DR-shaders are frequently reconfigured to vertex shader type or pixel shader type according to the variation in the workloads between vertices and pixels, until all vertices have been processed. Finally, all DR-shaders are reconfigured to pixel shader type for the remaining pixels.

2. Reconfiguring timing: The configuration of each DR-shader can be changed only after it finishes a vertex/pixel, to avoid needing one more register file for temporary results.

The purpose of the workloads monitor logic is to control the number of DR-shaders with pixel shader type in the DR-shader unit and to keep the stall cycles of all shaders as few as possible. To achieve this goal, we control the number of DR-shaders with pixel shader type based on three kinds of information:

• The expected number of DR-shaders with pixel shader type, which is equal to the number of used intervals in the pixel queue (the size of the intervals will be determined later).

• The current number of DR-shaders with pixel shader type, which is recorded in the workload monitor logic.

• The job end signal sent by each DR-shader, which tells the workload monitor logic which DR-shaders have finished their jobs.

At every cycle, we compute the difference between the expected and current numbers of DR-shaders with pixel shader type. If the expected number is larger than the current number, we change that many finishing DR-shaders with vertex shader type to pixel shader type. Otherwise, we change that many finishing DR-shaders with pixel shader type to vertex shader type.

[Flowchart:
EDR_P < CDR_P ?
  yes: find and mark finishing DR_P(s), then change (CDR_P - EDR_P) finishing DR_P(s) to DR_V(s)
  otherwise: find and mark finishing DR_V(s), then change (EDR_P - CDR_P) finishing DR_V(s) to DR_P(s)]

DR_P: DR-shader with pixel shader type

DR_V: DR-shader with vertex shader type

EDR_P: expected number of DR-shaders with pixel shader type

CDR_P: current number of DR-shaders with pixel shader type

Fig.3-16 Flowchart of workloads monitor logic
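The per-cycle decision of the workloads monitor logic described above can be sketched in software as follows. This is an illustrative model under stated assumptions, not the hardware design: the class DRShader, the functions expected_pixel_shaders and monitor_step, rounding a partly filled interval up to one used interval, and clamping the expected number to the number of DR-shaders are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class DRShader:
    kind: str        # "V" (vertex shader type) or "P" (pixel shader type)
    finishing: bool  # True when the job end signal is raised this cycle


def expected_pixel_shaders(pixels_in_queue, interval_size=32):
    """EDR_P: number of used intervals in the pixel queue (assumed: ceil)."""
    return ceil(pixels_in_queue / interval_size)


def monitor_step(shaders, pixels_in_queue, interval_size=32):
    """Run one monitor cycle in place; return how many shaders changed type."""
    # Assumption: EDR_P cannot exceed the number of DR-shaders in the unit.
    edr_p = min(expected_pixel_shaders(pixels_in_queue, interval_size),
                len(shaders))
    cdr_p = sum(1 for s in shaders if s.kind == "P")
    if edr_p < cdr_p:
        src, dst, budget = "P", "V", cdr_p - edr_p
    else:
        src, dst, budget = "V", "P", edr_p - cdr_p
    switched = 0
    for s in shaders:
        if switched == budget:
            break
        # Only a shader that has just finished a vertex/pixel may change type.
        if s.kind == src and s.finishing:
            s.kind = dst
            switched += 1
    return switched


# Example: 70 queued pixels use 3 intervals of 32, so two finishing
# vertex-type shaders are switched to pixel type.
unit = [DRShader("V", True), DRShader("V", False),
        DRShader("V", True), DRShader("P", False)]
changed = monitor_step(unit, pixels_in_queue=70)
print(changed, [s.kind for s in unit])  # prints: 2 ['P', 'V', 'P', 'P']
```

The sketch does not model the order-of-processing assumption (all vertex type at the start, all pixel type at the end); it only covers the steady-state comparison shown in the flowchart.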

Chapter 4 Simulation

4.1 Simulator of DR-shader

For this thesis, we build a cycle-based simulator with reference to the one from SiS. The input of the simulator is 3Dmark05, from which we consider the information listed below:

• Whether a primitive is clipped (culled) or passes

• Number of tiles produced from each primitive

• Whether a tile is blocked by PreZ or not

• Number of pixels that can be produced from each passing tile

• Vertex shader codes and pixel shader codes

The output of the simulator is the execution time of a frame, from vertex processing to pixel processing, together with information about shader utilization. There are also some parameters that we can set for the different environments we want, listed below:

• Clip information

  - Throughput of the clipping unit

• PreZ information

  - Throughput of the PreZ unit

• Shader information

  - Throughput of the vertex input

  - Size of pixel queue

  - Numbers of DR-shaders, vertex shaders and pixel shaders

  - Number of batches in each shader

  - Latencies of each instruction

• Texture information

  - Texture unit access cycles

  - Miss rate of the texture memory

  - Miss penalty

  - Throughput of texture units

[Figure: block diagram of the cycle-based simulation based on the SiS C-model, with Input, Triangle Setup, Clip, V.S. Prog., P.S. Prog., Vertex Shader, Pixel Shader, PreZ, Pixel Queue, Texture Memory and Output]

Fig.4-1 The cycle based simulator base on SiS

4.2 Simulation1

In this section, we decide a proper proportion between vertex shaders and pixel shaders, and the size of the pixel queue. For this goal, we assume that the number of vertex shaders is three, and the other parameter settings are listed below:

• Clip information

  - Throughput of the clipping unit = unlimited

• PreZ information

  - Throughput of the PreZ unit = unlimited

• Shader information

  - Throughput of the vertex input = unlimited

  - Number of vertex shaders = 3

  - Number of batches in each shader = 8

  - Latencies of each instruction = 8

Then, we gather workload statistics of the pixels in every cycle. The workload in each cycle is counted as the number of pixels in the cycle multiplied by their execution time. We display the pie chart of the pixels' workload:

[Pie chart: distribution of the per-cycle pixel workload: <10: 99.37%; 10~99: 0.28%; 100~999: 0.24%; 1000~9999: 0.7%; >10000: 0.3%; Other: 0.63%]

Average = 36.353

Standard deviation = 279.114

Fig.4-2 The pie chart of the pixels' workload in every cycle

We choose the average workload as the number of pixel shaders when there are three vertex shaders. With 3 vertex shaders and 37 pixel shaders, we simulate the relation between the size of the pixel queue and the execution time:

[Chart: execution time (cycles) versus size of pixel queue (pixels, 0 to 2500), with the points for a queue of 1024 pixels and for an unlimited queue marked]

Fig.4-3 The relation between the size of pixel queue and execution time

From the graph, we choose a pixel queue size of 1024 (pixels).
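The step from per-cycle workload statistics to the number of pixel shaders in Section 4.2 can be illustrated with a short sketch. It assumes that each sample is a pair (number of pixels in the cycle, their execution time) and that the average is rounded up to a whole shader, which matches the step from 36.353 to 37 in the text; the function name pixel_shader_count and the sample values are hypothetical and are not simulator data.

```python
from math import ceil
from statistics import mean, pstdev


def pixel_shader_count(samples):
    """Return (average workload, standard deviation, shader count).

    Each sample is (pixels_in_cycle, execution_time); the workload of a
    cycle is their product, as defined in Section 4.2.
    """
    workloads = [pixels * exec_time for pixels, exec_time in samples]
    avg = mean(workloads)
    return avg, pstdev(workloads), ceil(avg)


# Hypothetical samples for four cycles (not taken from the simulator).
samples = [(0, 0), (2, 8), (1, 8), (5, 16)]
avg, std, count = pixel_shader_count(samples)
print(avg, count)          # prints: 26 26
print(ceil(36.353))        # the average reported in the thesis gives 37
```

Whether the standard deviation is computed over the population (pstdev) or a sample is not stated in the text; the population form is used here as an assumption.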

4.3 Simulation2

In this section, we decide the size of the intervals in the pixel queue and the number of vertex shaders and pixel shaders to be replaced by DR-shaders. We use the parameters decided above, listed below:

• Clip information

  - Throughput of the clipping unit = unlimited

• PreZ information

  - Throughput of the PreZ unit = unlimited

• Shader information

  - Throughput of the vertex input = unlimited

  - Latencies of each instruction = 8

  - Number of batches in each shader = 8

  - Number of vertex shaders = 3

  - Size of pixel queue = 1024

  - Total number of shaders = 40 (3 + 37)

• Texture information

  - Texture unit access cycles = 8

  - Miss rate of the texture memory = 0

  - Throughput of texture unit = unlimited

First, we simulate the relation between the size of the intervals and the execution time, and obtain the graph below:

[Chart: execution time (cycles) versus number of DR-shaders (0 to 35), with one curve each for interval size = 32, 64 and 128]

Fig.4-4 The relation within the size of intervals, number of DR-shaders, and execution time

It is apparent that the size of the intervals does not have a great influence on the execution time. Therefore, we choose a size of intervals equal to 32 (pixels) for flexibility.

Second, we simulate the relation between the number of DR-shaders and the time-area product with the size of intervals equal to 32:

[Chart: Time*Area versus number of DR-shaders, with the points "3 VS and 37 PS" and "16 DR-shaders" marked]

Fig.4-5 The relation between the number of DR-shaders and area-time product

The time-area product has its minimum value when the number of DR-shaders is equal to 16. For a detailed analysis, we list the time, area, and utilization of each shader type below:

[Table: number of each kind of shader, time (cycles) [speed up], area, and area-time product; surviving entries: 480,609,124 [1.073] and 8,879,126,204,482,140 [0.653]]

Table.4-1 The time, area, and area-time product

[Table: number of each kind of shader and stall cycles [utilization] of the vertex shaders, pixel shaders and DR-shaders; surviving entries: 78,431,817 [0.879348], 35,664,243, 544,974,67 and [0.50552]]

Table.4-2 The utilization of each shader type

We choose 24 DR-shaders with 16 pixel shaders in the DR-shader unit and an interval size of 32 pixels in the pixel queue as our final result. This design greatly improves shader utilization and execution time with only a small hardware overhead, and the area-time product is reduced to 65.3%.
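The selection by time-area product used in this section can be sketched as follows. The function name best_configuration is hypothetical, and the numbers in the example are placeholders in arbitrary units chosen only to exercise the code; they are not the simulation results of this thesis.

```python
def best_configuration(results, baseline):
    """Pick the configuration with the smallest time-area product.

    results:  {number_of_dr_shaders: (time_in_cycles, area)}
    baseline: key of the reference configuration (e.g. 0 DR-shaders)
    Returns (best key, its product relative to the baseline).
    """
    products = {n: time * area for n, (time, area) in results.items()}
    best = min(products, key=products.get)
    return best, products[best] / products[baseline]


# Placeholder values in arbitrary units, NOT measured data.
results = {
    0: (30_000_000, 400.0),
    8: (24_000_000, 430.0),
    16: (18_000_000, 450.0),
    24: (17_500_000, 480.0),
}
best, ratio = best_configuration(results, baseline=0)
print(best, round(ratio, 3))  # prints: 16 0.675
```

A ratio below 1 means that the extra area of the DR-shaders is more than paid back by the shorter execution time.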

Chapter 5 Conclusion

5.1 Discussion

To design hardware with a reconfigurable architecture, we need to consider the sharable logic, the hardware overhead of the routing paths, the sharing time, the opportunities for use, and so on. However, this kind of problem can be very complex, and we cannot consider all factors at once. The priorities of these factors must be decided carefully, to keep the computation time acceptable and to obtain a better result.

There may be a trade-off between sharable logic and routing overhead. Therefore, deciding whether a piece of logic should be shared is one of the most important problems in a reconfigurable architecture.

5.2 Future work

Utilization loss in texture load misses:

In our observation, a long texture load miss penalty causes a great loss of shader utilization. Although DR-shaders can be reconfigured when they finish a job, they stall for a long time on a load miss. We may reconfigure those load-miss DR-shaders from pixel shader type to vertex shader type to reduce the utilization loss caused by texture load misses. To do so, we may need one more register file to buffer the temporary results and one more program counter to hold the current state, as hardware overhead. The reconfiguring timing may change from the end of a vertex or pixel to any cycle, and the workload monitor logic may need to change the configuration of load-miss DR-shaders from pixel shader type to vertex shader type. The proposed architecture is shown below:

[Figure: block diagram with Instruction Decoder, Computation Unit, Instruction slot (vertex shader), Instruction slot (pixel shader), Source register modifier, Destination register modifier, Register file, PC (vertex shader), PC (pixel shader), Storage Logic, Context memory and the configuration signal]

Fig.5-1 The proposed architecture to reduce utilization loss in texture load misses

5.3 Conclusion

In this thesis, we have shown that, besides reducing the hardware cost by sharing logic, the flexibility of a reconfigurable architecture can be used to adapt to varying workloads and to improve the utilization of the whole system. In our design, the execution time has been greatly shortened with limited hardware overhead.

Reconfigurability can be applied at any level, and at different levels at the same time. In our design, besides the DR-shader being reconfigurable between vertex shader type and pixel shader type, the computation unit of the DR-shader can also be reconfigured to execute all vertex and pixel instructions, which saves area.


Appendix A. Reducing for second order Taylor formula (reference from SiS)
