Chapter 3 Design
3.3 Design of computation unit
3.3.4 Comparison within algorithms…
[Figure: block diagram of the computation unit; surviving labels include 4in-adder24, 2in-adder24, adder8, Multiply24, normalize25, normalize26, CMP&SWAP, Compare to 0, 2in-fpSUM32, 3in-fpSUM32 and 4in-fpSUM32]
Fig.3-14 The result of optimal area-time algorithm
Then, we compare the results of the three algorithms.
3.3.4 Comparison within algorithms
First, we show the comparison among the three algorithms.
We find that the result of the optimal area-time algorithm is the same as that of the minimum routing overhead algorithm because there are too few possible solutions.
Algorithm                   Average area requirement (um2)   Maximum area requirement (um2)
Minimum routing overhead    985,793.5938                     2,000,547.5
Optimal area-time           985,793.5938                     2,000,547.5
Maximum sharing logic       986,095.4688                     2,105,722.5
Table.3-5 Average and maximum area requirement of three algorithms
Finally, we choose the result of minimum routing overhead/optimal area-time because it has the smallest average and maximum area requirements. Comparing the two kinds of result in detail, we find they differ only in how the 2in-fpSUM32s are reconfigured to 3in-fpSUM32 or 4in-fpSUM32. In addition, we analyze the sharing logic between the 2in-fpSUM32s and the 4in-fpSUM32 and find that the failure of result 2 for 2in-fpSUM lies in the routing overhead.
The critical path of minimum routing overhead/optimal area-time is:
Delay time = CMP&SWAP → ALIGN+INV → 2in-add24 → normalize25 → MUX32 → CMP&SWAP → ALIGN+INV → 2in-add24 → normalize25
= 3.93 + 6.14 + 4.47 + 4.72 + 0.76 + 3.87 + 6.01 + 2.54 + 7.56 (ns)
= 40ns with time overhead 0.76ns (1.9%)
The area requirement of minimum routing overhead/optimal area-time is:
Area = 2in-fpSUM32 * 3 + MUX32 * 2
= 313425 + 8837.5 (um2)
= 332027.5um2 with area overhead 8837.5um2 (2.66%)
The critical path of maximum sharing logic is:
Delay time = MUX32 → CMP&SWAP → MUX32 → CMP&SWAP → MUX32 → CMP&SWAP → ALIGN+INV → 2in-add24 → MUX32 → 2in-add24 → MUX32 → 2in-add25 → normalize26
= 0.6 + 3.9 + 0.75 + 3.59 + 0.77 + 5.08 + 6.24 + 1.15 + 0.86 + 1.42 + 0.85 + 9.31 + 6.63 (ns)
= 43.86ns with time overhead 2.12ns (4.83%)
The area requirement of maximum sharing logic is:
Area = 2in-fpSUM32 * 3 + MUX32 * 6
= 108972.5 + 106872.5 + 107345.0 + 18882.5 (um2)
= 332307.5um2 with area overhead 18882.5um2 (5.68%)
Because the time overhead and area overhead of maximum sharing logic are much larger than those of minimum routing overhead/optimal area-time, we finally choose the result of minimum routing overhead/optimal area-time as our design of the computation unit.
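The overhead percentages quoted above for both results can be re-derived with a few lines of Python. This is only a sketch: it assumes that the nine delay values of the minimum routing overhead/optimal area-time critical path map to the nine stages in the order listed, and the helper name overhead_percent is ours, not part of the design.

```python
# Stage delays (ns) of the minimum routing overhead / optimal area-time
# critical path. Assumption: values are paired with stages in listed order.
stage_delays_ns = [
    ("CMP&SWAP", 3.93), ("ALIGN+INV", 6.14), ("2in-add24", 4.47),
    ("normalize25", 4.72), ("MUX32", 0.76), ("CMP&SWAP", 3.87),
    ("ALIGN+INV", 6.01), ("2in-add24", 2.54), ("normalize25", 7.56),
]


def overhead_percent(overhead, total):
    """Share of `total` spent on reconfiguration overhead, in percent."""
    return 100.0 * overhead / total


# Total critical-path delay and the part contributed by multiplexers.
total_delay_ns = sum(delay for _, delay in stage_delays_ns)
mux_delay_ns = sum(d for name, d in stage_delays_ns if name.startswith("MUX"))

print(f"critical path: {total_delay_ns:.2f} ns")
print(f"time overhead: {overhead_percent(mux_delay_ns, total_delay_ns):.1f} %")

# Overhead shares computed from the totals and overheads quoted in the text.
print(f"area overhead: {overhead_percent(8837.5, 332027.5):.2f} %")
print(f"max sharing, time overhead: {overhead_percent(2.12, 43.86):.2f} %")
print(f"max sharing, area overhead: {overhead_percent(18882.5, 332307.5):.2f} %")
```

In this reading, the only reconfiguration cost on the chosen critical path is the single MUX32 stage, which is why the time overhead stays below 2%.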
3.4 Architecture of DR-shader
After finishing the computation unit, we can build the DR-shader. The architecture of the DR-shader is shown below:
[Figure: block diagram; labels: Instruction Decoder, Computation Unit, Instruction slot (vertex shader), Instruction slot (pixel shader), Context memory, PC, Register file, Source register modifier, Destination register modifier, Configuration signal; legend: Storage, Logic]
Fig.3-15 The architecture of DR-shader
In the architecture, there are some necessary hardware overheads:
➢ More logic in the sharable computation unit to support all vertex and pixel instructions, with routing overhead
➢ Context memory to store the configuration of each instruction
➢ One more instruction slot to store vertex and pixel shader codes simultaneously
Therefore, the area of a DR-shader may be larger than the area of a vertex shader or a pixel shader because of the ability to reconfigure between vertex shader type and pixel shader type. However, our simulations will show whether this added flexibility is worth its cost in improving shader utilization.
3.5 Design of workloads monitor logic
In this section, we first describe the properties of the DR-shader and the hardware overhead. Then, we describe the design of the vertex/pixel workload monitor logic.
There are two assumptions about the reconfiguration property of DR-shaders:
1. Order of processing: In the beginning, all DR-shaders are reconfigured to vertex shader type because there is no pixel workload. Then, DR-shaders are frequently reconfigured to vertex shader type or pixel shader type according to the variation in workload between vertices and pixels until all vertices have been processed. Finally, all DR-shaders are reconfigured to pixel shader type for the remaining pixels.
2. Reconfiguring timing: The configuration of each DR-shader can only be changed after it finishes a vertex/pixel, to avoid needing one more register file for temporary results.
The purpose of the workload monitor logic is to control the number of DR-shaders with pixel shader type in the DR-shader unit and to keep the stall cycles of all shaders as few as possible. To achieve this goal, we use three kinds of information to control the number of DR-shaders with pixel shader type:
➢ The expected number of DR-shaders with pixel shader type is equal to the number of used intervals in the pixel queue. (The size of the intervals will be determined later.)
➢ The current number of DR-shaders with pixel shader type is recorded in the workload monitor logic.
➢ A job end signal is sent by each DR-shader, telling the workload monitor logic which DR-shaders have finished their job.
At every cycle, we compute the difference between the expected and current numbers of DR-shaders with pixel shader type. If the expected number is bigger than the current number, we change finishing DR-shaders with vertex shader type to pixel shader type, as many as the difference. Otherwise, we change finishing DR-shaders with pixel shader type to vertex shader type, as many as the difference.
[Flowchart: test EDRP < CDRP? If yes, find and mark finishing DRP(s), then change (CDRP - EDRP) finishing DRP(s) to DRV(s). If no, find and mark finishing DRV(s), then change (EDRP - CDRP) finishing DRV(s) to DRP(s).]
DRP: DR-shader with pixel shader type
DRV: DR-shader with vertex shader type
EDRP: expected number of DR-shaders with pixel shader type
CDRP: current number of DR-shaders with pixel shader type
Fig.3-16 Flowchart of workloads monitor logic
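The per-cycle decision of the workload monitor logic described above can be sketched in Python as follows. This is an illustrative model only, not the hardware design: the names DRShader, expected_pixel_shaders and monitor_step are hypothetical. Counting a partially filled interval as "used" and capping the expected number at the number of DR-shaders are both our assumptions.

```python
import math
from dataclasses import dataclass

VERTEX, PIXEL = "vertex", "pixel"


@dataclass
class DRShader:
    mode: str = VERTEX      # all DR-shaders start as vertex shader type
    job_end: bool = False   # job end signal: finished its current vertex/pixel


def expected_pixel_shaders(pixels_in_queue, interval_size):
    """EDRP: number of used intervals in the pixel queue.

    Assumption: a partially filled interval counts as used.
    """
    return math.ceil(pixels_in_queue / interval_size)


def monitor_step(shaders, pixels_in_queue, interval_size=32):
    """One cycle of the monitor; returns the new CDRP."""
    # Assumption: EDRP cannot exceed the number of DR-shaders available.
    edrp = min(expected_pixel_shaders(pixels_in_queue, interval_size),
               len(shaders))
    cdrp = sum(s.mode == PIXEL for s in shaders)
    if edrp < cdrp:
        src, dst, count = PIXEL, VERTEX, cdrp - edrp
    else:
        src, dst, count = VERTEX, PIXEL, edrp - cdrp
    # Only shaders that have just finished a job may be reconfigured.
    finishing = [s for s in shaders if s.mode == src and s.job_end]
    for shader in finishing[:count]:
        shader.mode = dst
        shader.job_end = False  # the shader starts a new job in its new type
    return sum(s.mode == PIXEL for s in shaders)


# Usage: four idle DR-shaders and 70 queued pixels with 32-pixel intervals.
shaders = [DRShader(job_end=True) for _ in range(4)]
print(monitor_step(shaders, pixels_in_queue=70))
```

Because only finishing shaders are candidates, the current number may lag behind the expected number for several cycles, which matches the second assumption on reconfiguring timing.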
Chapter 4 Simulation
4.1 Simulator of DR-shader
For this thesis, we build a cycle-based simulator with reference to SiS. The input of the simulator is 3Dmark05, from which we consider the information listed below:
➢ Whether a primitive is clipped (culled) or passes
➢ Number of tiles produced from each primitive
➢ Whether a tile is blocked by PreZ or not
➢ Number of pixels that can be produced from each passing tile
➢ Vertex shader codes and pixel shader codes
The output of the simulator is the execution time from vertex processing to pixel processing of a frame, together with information about shader utilization. There are also some parameters we can set for the different environments we want, listed below:
➢ Clip information
   Throughput of the clipping unit
➢ PreZ information
   Throughput of the PreZ unit
➢ Shader information
   Throughput of the vertex input
   Size of pixel queue
   Numbers of DR-shaders, vertex shaders and pixel shaders
   Number of batches in each shader
   Latencies of each instruction
➢ Texture information
   Texture unit access cycles
   Miss rate of the texture memory
   Miss penalty
   Throughput of texture units
[Figure: cycle-based simulation based on the SiS C-model; labels: Input, Triangle Setup, Clip, PreZ, V.S. Prog., P.S. Prog., Vertex Shader, Pixel Shader, Pixel Queue, Texture Memory, Output]
Fig.4-1 The cycle based simulator based on SiS
4.2 Simulation1
In this section, we decide a proper proportion between vertex shaders and pixel shaders and the size of the pixel queue. For this goal, we assume the number of vertex shaders is three, and the other parameter settings are listed below:
➢ Clip information
   Throughput of the clipping unit = unlimited
➢ PreZ information
   Throughput of the PreZ unit = unlimited
➢ Shader information
   Throughput of the vertex input = unlimited
   Number of vertex shaders = 3
   Number of batches in each shader = 8
   Latencies of each instruction = 8
Then, we gather workload statistics of pixels in every cycle. The workload in each cycle is counted as the number of pixels in the cycle multiplied by their execution time. We display the pie chart of the pixels' workload:
[Pie chart of workload per cycle: <10: 99.37%; 10~99: 0.28%; 100~999: 0.24%; 1000~9999: 0.7%; >10000: 0.3%; Other: 0.63%]
Average = 36.353
Standard deviation = 279.114
Fig.4-2 The pie chart of the pixels' workload in every cycle
We choose the average workload as the number of pixel shaders when there are three vertex shaders. With 3 vertex shaders and 37 pixel shaders, we simulate the relation between the size of the pixel queue and execution time:
[Figure: execution time (cycles) plotted against size of pixel queue (pixels); marked points at 1024 and unlimited]
Fig.4-3 The relation between the size of pixel queue and execution time
From the graph, we choose a pixel queue size of 1024 (pixels).
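The step from per-cycle workload statistics to a number of pixel shaders can be illustrated with a short Python sketch. The function names and the sample data are hypothetical. Rounding the average up to the next integer is our reading of how an average of 36.353 leads to 37 pixel shaders, not something the text states explicitly.

```python
import math
import statistics


def cycle_workload(num_pixels, exec_time):
    """Workload of one cycle: number of pixels times their execution time."""
    return num_pixels * exec_time


def pixel_shaders_from_workload(workloads):
    """Return (average, standard deviation, suggested number of pixel shaders).

    Assumption: the average workload is rounded up to a whole shader count.
    """
    average = statistics.fmean(workloads)
    deviation = statistics.pstdev(workloads)
    return average, deviation, math.ceil(average)


# Hypothetical per-cycle samples as (pixels in the cycle, execution time).
samples = [(0, 8), (1, 8), (2, 8), (15, 8)]
workloads = [cycle_workload(n, t) for n, t in samples]
average, deviation, shaders = pixel_shaders_from_workload(workloads)
print(average, deviation, shaders)

# With the average reported in the text, rounding up gives 37 pixel shaders.
print(math.ceil(36.353))
```

The large standard deviation reported in the text (279.114 against an average of 36.353) suggests why a fixed split between vertex and pixel shaders leaves shaders idle in many cycles.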
4.3 Simulation2
In this section, we decide the size of the intervals in the pixel queue and the number of vertex shaders and pixel shaders to be changed to DR-shaders. We use the parameters decided above, listed below:
➢ Clip information
   Throughput of the clipping unit = unlimited
➢ PreZ information
   Throughput of the PreZ unit = unlimited
➢ Shader information
   Throughput of the vertex input = unlimited
   Latencies of each instruction = 8
   Number of batches in each shader = 8
   Number of vertex shaders = 3
   Size of pixel queue = 1024
   Total number of shaders = 40 (3 + 37)
➢ Texture information
   Texture unit access cycles = 8
   Miss rate of the texture memory = 0
   Throughput of texture unit = unlimited
First, we simulate the relation between the size of the intervals and execution time and get the graph below:
[Figure: execution time (cycles) plotted against number of DR-shaders (0 to 35) for three series: Size = 32, Size = 64, Size = 128]
Fig.4-4 The relation among the size of intervals, number of DR-shaders, and execution time
It is apparent that the size of the intervals does not have a great influence on the execution time. Therefore, we choose an interval size of 32 (pixels) for flexibility.
Second, we simulate the relation between the number of DR-shaders and the time-area product with the size of the intervals equal to 32:
[Figure: Time*Area plotted against number of DR-shaders; annotated points: 3 VS and 37 PS, 16 DR-shaders]
Fig.4-5 The relation between the number of DR-shaders and area-time product
The time-area product has a minimum value when the number of DR-shaders is equal to 16. For detailed analysis, we list the time, area, and utilization of each shader type below:
[Table, partially recoverable; headers: Number of each kind of shader, Time (cycles) [Speed up]; surviving values: 480,609,124 [1.073] and 8,879,126,204,482,140 [0.653]]
Table.4-1 The time, area, and area-time product
[Table, partially recoverable; headers: Number of each kind of shader, Stall cycles [Utilization] (Vertex shaders), Stall cycles [Utilization] (Pixel shaders), Stall cycles [Utilization] (DR-shaders); surviving values: 78,431,817 [0.879348], 35,664,243, 544,974,67, [0.50552]]
Table.4-2 The utilization of each shader type
We choose 24 DR-shaders with 16 pixel shaders in the DR-shader unit and an interval size of 32 pixels in the pixel queue as our final result. This design greatly improves shader utilization and execution time with little hardware overhead, and the area-time product is reduced to 65.3%.
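The selection by time-area product used in this section can be written as a small search over candidate configurations. The sketch below is in Python with hypothetical function names, and the candidate values are made-up placeholders chosen only to show the mechanics. They are not the measured results of the simulation.

```python
def area_time_product(time_cycles, area_um2):
    """Figure of merit used for the comparison: execution time times area."""
    return time_cycles * area_um2


def best_configuration(candidates):
    """Return the number of DR-shaders whose time-area product is smallest.

    `candidates` maps a number of DR-shaders to (time in cycles, area in um2).
    """
    return min(candidates, key=lambda n: area_time_product(*candidates[n]))


# Placeholder data (NOT simulation results): more DR-shaders shorten the
# execution time but enlarge the area.
candidates = {
    0: (30_000_000, 450),
    8: (24_000_000, 460),
    16: (18_000_000, 480),
    24: (17_900_000, 500),
}
baseline = area_time_product(*candidates[0])
best = best_configuration(candidates)
print(best, area_time_product(*candidates[best]) / baseline)
```

Dividing each product by the product of the all-fixed baseline gives a relative figure, which is the form in which the 65.3% reduction is stated.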
Chapter 5 Conclusion
5.1 Discussion
To design hardware with a reconfigurable architecture, we need to consider sharable logic, hardware overhead from routing paths, sharing time, usable opportunity, etc. However, this kind of problem may be very complex, and we could not consider all factors at once. The priorities of those factors must be carefully decided to balance computation time against the quality of the result.
There may be a trade-off between sharable logic and routing overhead. So, deciding whether a piece of logic should be shared or not is one of the most important problems in reconfigurable architecture.
5.2 Future work
Utilization loss in texture load misses:
In our observation, a long texture load miss penalty causes a great loss of shader utilization. Although DR-shaders can be reconfigured when they finish a job, they stall for a long time on a load miss. We may reconfigure those load-miss DR-shaders from pixel shader type to vertex shader type to reduce the utilization loss in texture load misses. To do so, we may need one more register file to buffer the temporary results and one more program counter for the current state as hardware overhead. The reconfiguration timing may change from the end of a vertex or pixel to any cycle. The workload monitor logic may need to change the configuration of load-miss DR-shaders from pixel shader type to vertex shader type. The proposed architecture is below:
[Figure: block diagram; labels: Instruction Decoder, Computation Unit, Instruction slot (vertex shader), Instruction slot (pixel shader), Register file (vertex shader), Register file (pixel shader), PC (vertex shader), PC (pixel shader), Context memory, Source register modifier, Destination register modifier, Configuration signal; legend: Storage, Logic]
Fig.5-1 The proposed architecture to reduce utilization loss in texture load misses
5.3 Conclusion
In this thesis, we have shown that, besides reducing the hardware cost by sharing logic, the flexibility of reconfigurable architecture can be used to adapt to various workloads and to improve the utilization of the whole system. In our design, the execution time has been greatly shortened with limited hardware overhead.
Reconfigurable architecture can be applied at any level, and at different levels at the same time. In our design, besides the DR-shader being reconfigurable between vertex shader type and pixel shader type, the computation unit of the DR-shader can also be reconfigured to execute all vertex and pixel instructions for area saving.