National Chiao Tung University
Institute of Computer Science and Engineering
Master's Thesis
A dynamically reconfigurable shader unit for vertex and pixel processing
Student: Yi-Chi Chen
Advisor: Dr. Chung-Ping Chung
A dynamically reconfigurable shader unit for vertex and pixel processing
Student: Yi-Chi Chen
Advisor: Prof. Chung-Ping Chung
National Chiao Tung University
Department of Computer Science and Information Engineering
Master's Thesis
A Thesis
Submitted to Institute of Computer Science and Engineering
College of Computer Science
National Chiao Tung University
in Partial Fulfillment of the Requirements
for the Degree of
Master
In
Computer Science and Information Engineering
September 2006
Hsinchu, Taiwan, Republic of China
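A dynamically reconfigurable shader unit for vertex and pixel processing
Student: Yi-Chi Chen Advisor: Dr. Chung-Ping Chung
Institute of Computer Science and Engineering, National Chiao Tung University (Master's Program)
Abstract (Chinese)
In vertex and pixel processing, the vertex and pixel workloads change greatly during execution. Under a fixed allocation of hardware resources, however, either the vertex processors or the pixel processors are often idle while the other side runs short of resources. We therefore propose a new shader unit, the DR-shader unit, which dynamically allocates the number of processors assigned to vertex or pixel processing according to the changes in workload, raising the utilization of hardware resources and shortening execution time.
In this thesis, we first analyze the processor architecture and decide, for each component of the dynamically reconfigurable processor, whether it can be shared by the two configurations. We use three algorithms (minimum routing overhead, maximum sharing logic, and optimal area-time) to help decide whether computation logic should be designed for sharing when composing the computation unit. We also design workload monitor logic which controls the configuration of each dynamically reconfigurable processor according to the changes in workload. In the end we obtain a 60% improvement in speed and a 30% improvement in utilization.
A dynamically reconfigurable shader unit for vertex and pixel processing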
Student:Yi-Chi Chen Advisor:Dr. Chung-Ping Chung
Institute of Computer Science and Engineering College of Computer Science
Abstract
In vertex and pixel processing, the workloads of vertices and pixels vary greatly at run time. With a fixed allocation of resources between vertex shaders and pixel shaders, however, many shaders of one type may sit idle while shaders of the other type are in short supply. We therefore propose a dynamically reconfigurable shader unit (DR-shader unit) which can distribute shaders between vertex and pixel processing according to the varying workloads at run time. In this way, shader utilization is improved and execution time is shortened.
In this thesis, we first analyze the architecture of shaders and determine which units can be shared between the vertex and pixel shader types in the DR-shader. We use three algorithms (minimum routing overhead, maximum sharing logic, and optimal area-time) to determine how logic should be shared, and so complete a sharable computation unit. In addition, we design workload monitor logic to control the configuration of each DR-shader according to the workloads. Finally, we gain a 60% improvement in speed and a 30% improvement in utilization.
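Acknowledgments
First, I thank my advisor, Prof. Chung-Ping Chung. Under his patient teaching, diligent guidance, and encouragement, I was able to complete this thesis and pass my oral defense. I also thank my committee members, Prof. 單智君, Prof. 蕭勝夫, and Prof. 周賜福; their guidance and suggestions made this thesis more complete and sound.
In addition, I thank the senior and junior members of the laboratory, who often offered different opinions on all kinds of problems, kept encouraging me so that I could keep working, and gave me an outlet for my feelings. Special thanks go to my only classmate, 汪威定, who has supported me continuously since our sophomore year, helped me at every turn, encouraged me to enroll early, and in the end graduated early together with me.
Here I must apologize to my girlfriend. When I was busy, I did not have much time to spend with her. On her birthday, I must first wish her a happy birthday and also say sorry, because it has been too long since I properly spent time with her.
Finally, I thank my family for supporting me wholeheartedly from behind, which made my path in research smoother, let me study without worries, and let me pursue my ideals.
To all the teachers, relatives, and friends who supported and encouraged me, I offer my most sincere blessings. Thank you.
Yi-Chi Chen, 2006. 9. 7
Table of contents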
Abstract (Chinese)
Abstract
Acknowledgments
Table of contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Vertex and pixel shaders
1.2 Dynamically reconfigurable system
1.3 Motivation
1.4 Objective
1.5 Organization of this thesis
Chapter 2 Background
2.1 Graphics pipeline
2.2 Vertex processing
2.2.1 Model-view transformation
2.2.2 Projection transformation
2.2.3 Clipping
2.2.4 Perspective division
2.2.5 Viewport matrix
2.3 Programmable graphics pipeline
Chapter 3 Design
3.1 Analysis of shaders
3.2 Analysis of Computation requirements
3.3 Design of computation unit
3.3.1 Sharing all units within nin-fpSUM
3.3.2 Algorithm1 & 2 to choose nodes
3.3.3 Algorithm3-optimal area-time
3.3.4 Comparison within algorithms
3.4 Architecture of DR-shader
Chapter 4 Simulation
4.1 Simulator of DR-shader
4.2 Simulation1
4.3 Simulation2
Chapter 5 Conclusion
5.1 Discussion
5.2 Future Work
5.3 Conclusion
Reference
List of Figures
Fig. 1-1 The architecture of DR-shader unit
Fig. 2-1 Four steps of graphics pipeline
Fig. 2-2 The steps of vertex processing
Fig. 2-3 Formula1
Fig. 2-4 The relations between (u, v, n)
Fig. 2-5 Formula2
Fig. 2-6 Formula3
Fig. 2-7 Formula4
Fig. 2-8 Formula5
Fig. 2-9 Formula6
Fig. 2-10 Formula7
Fig. 2-11 Formula8
Fig. 3-1 The architecture of vertex/pixel shader
Fig. 3-2 Trees of computation requirements
Fig. 3-3 How three 2in-fpSUM32s be reconfigured to one 4in-fpSUM32
Fig. 3-4 How three 2in-fpSUM32s be reconfigured to one 4in-fpSUM32
Fig. 3-5 The solution of problem1
Fig. 3-6 The solution of problem2
Fig. 3-7 The solution of problem3
Fig. 3-8 Different sets of computation trees
Fig. 3-9 The result of minimum routing overhead
Fig. 3-10 The result of maximum sharing logic
Fig. 3-11 Cost function of search by integer programming
Fig. 3-13 An example for Reqs reducing
Fig. 3-14 The result of optimal area-time algorithm
Fig. 3-15 The architecture of DR-shader
Fig. 3-16 Flowchart of workloads monitor logic
Fig. 4-1 The cycle based simulator base on SiS
Fig. 4-2 The pie chart of the pixels’ workload in every cycle
Fig. 4-3 The relation between the size of pixel queue and execution time
Fig. 4-4 The relation within the size of intervals, number of DR-shaders, and execution time
Fig. 4-5 The relation between the number of DR-shaders and area-time product
List of Tables
Table 2-1 Input, output and operation in each step of graphics pipeline
Table 3-1 Requirements for vector type instructions
Table 3-2 Requirements for scalar type instructions
Table 3-3 Requirements for non-computation type instructions
Table 3-4 Maximum requirement of each computation
Table 3-5 Average and maximum area requirement of three algorithms
Table 4-1 The time, area, and area-time product
Chapter 1 Introduction
The programmable graphics pipeline is the most popular type of graphics hardware nowadays. The program lengths and execution times of vertex and pixel processing may vary from scene to scene, and this variation in execution time lowers the utilization of graphics hardware. In this thesis, we propose a dynamically reconfigurable shader unit (DR-shader unit) for vertex and pixel processing. The DR-shader unit can dynamically allocate its hardware resources to match the computation requirements of vertices and pixels at run time. With this flexibility, we can increase the utilization of graphics hardware and shorten the execution time of scenes.
1.1 Vertex and pixel shaders
In vertex and pixel processing, there are a number of vertex and pixel shaders, which take the form of programmable processors. The function of the two kinds of shaders is to execute the entire vertex or pixel shader code on each individual vertex or pixel, respectively, and the shader codes vary from pass to pass. However, the workloads of vertices and pixels for the two kinds of shaders may vary widely at run time. The number of pixels produced by each primitive (composed of three vertices) may range from zero to all the pixels in a scene, depending on its position. In different situations, the execution time of each vertex and pixel may also differ greatly.
Traditionally, the numbers of vertex and pixel shaders are fixed, and the varying workloads are partially absorbed by the pixel queue, a buffer in front of pixel processing which stores pixels as the inputs of pixel shaders. The problem is that the degree of variation at run time often exceeds the adaptability of the pixel queue. When the pixel queue is full, there is no space to store the results of vertex processing, and all stages of vertex processing are idle, including the vertex shaders. When the pixel queue is empty, there is no input for pixel processing, and all stages of pixel processing are idle, including the pixel shaders. When either of these two situations happens too frequently, the utilization of the graphics processing unit will be very low.
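The stall behavior described above can be illustrated with a minimal cycle-level sketch. Everything in it is an assumption made for illustration: the function name `simulate`, the one-cycle cost per vertex and per pixel, and the rule that a vertex shader stalls when the pixels it would emit do not fit into the queue. It is not the simulator used in this thesis.

```python
from collections import deque

def simulate(pixels_per_vertex, n_vs, n_ps, queue_size):
    """Count idle shader-cycles for a fixed VS/PS split with a bounded pixel queue.

    pixels_per_vertex is a hypothetical workload trace: the number of pixels
    emitted into the pixel queue when each vertex finishes processing.
    Returns (cycles, vs_idle, ps_idle).
    """
    if any(p > queue_size for p in pixels_per_vertex):
        raise ValueError("a single vertex must not emit more pixels than the queue holds")
    pending = deque(pixels_per_vertex)
    queue = 0                      # pixels currently waiting in the pixel queue
    cycles = vs_idle = ps_idle = 0
    while pending or queue:
        cycles += 1
        # Each pixel shader consumes at most one queued pixel per cycle.
        consumed = min(n_ps, queue)
        ps_idle += n_ps - consumed
        queue -= consumed
        # Each vertex shader finishes one vertex per cycle if its pixels fit.
        busy = 0
        for _ in range(n_vs):
            if pending and queue + pending[0] <= queue_size:
                queue += pending.popleft()
                busy += 1
            else:
                break              # queue full (or no work): remaining VSs stall
        vs_idle += n_vs - busy
    return cycles, vs_idle, ps_idle

# Pixel-heavy trace: the vertex shaders spend most cycles stalled on a full queue.
print(simulate([8, 8, 8, 8], n_vs=2, n_ps=2, queue_size=8))
```

With a pixel-heavy trace like the one above, almost all idle cycles fall on the vertex shaders, which is the situation a reconfigurable shader could exploit by switching to pixel work.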
1.2 Dynamically reconfigurable system
We can classify reconfigurable systems into two categories: dynamically reconfigurable systems and statically reconfigurable systems. The most important difference between the two is that a dynamically reconfigurable system can change its configuration at run time. A dynamically reconfigurable system can be used not only to reduce the hardware requirement of a design, but also to specialize circuits based on information known only at run time. This feature exists in neither statically reconfigurable systems nor ASIC designs. Moreover, by means of dynamic reconfiguration, we can optimize the allocation of hardware resources to meet the computation requirements at run time.
1.3 Motivation
The workloads of vertices and pixels for vertex and pixel shaders may vary widely at run time. It is difficult for any architecture with a fixed allocation of resources between vertex shaders and pixel shaders to deal with such a large variation. If there are multi-function shaders which can switch between the vertex shader function and the pixel shader function, we can easily distribute hardware resources according to the workloads of vertices and pixels. In addition, the architectures of the vertex shader and the pixel shader are very similar, and many of the hardware resources can be shared between them. This gives us a very good opportunity to design them with a reconfigurable architecture.
1.4 Objective
Design a dynamically reconfigurable shader unit (DR-shader unit) that adapts to the varying workloads of vertices and pixels.
[Figure: the DR-shader unit, consisting of a vertex shader farm (VSs, vertex shaders), PSs (pixel shaders), a DR-shader farm (DRSs, DR-shaders), interconnection and routing paths, and vertex/pixel workloads monitor logic. Inputs: vertices and/or pixels. Outputs: coordinated vertices and/or colored pixels.]
Fig.1-1 The architecture of DR-shader unit
1.5 Organization of this thesis
The organization of this thesis is as follows:
In Chapter 2, the background on the graphics pipeline is presented.
In Chapter 3, we analyze the architecture of vertex and pixel shaders together with their computation requirements, and design the DR-shader with its workload monitor logic.
In Chapter 4, we show our simulation environment and results, and decide a proper proportion among vertex, pixel, and dynamically reconfigurable shaders.
In Chapter 5, we give the discussion, future work, and conclusion.
Chapter 2 Background
2.1 Graphics pipeline
The graphics pipeline can be viewed as four distinct, sequential steps: vertex processing, rasterization, pixel processing, and writeback. Table 2-1 lists the inputs, outputs, and operations of these four steps.
[Figure: Vertex Processing -> Rasterization -> Pixel Processing -> Writeback, with Memory and Scene attached to the pipeline]
Fig.2-1 Four steps of graphics pipeline
Step | Input | Output | Operation
Vertex processing | Vertices with 3D coordinates | Vertices positioned in the 2D scene | Transforms each 3D vertex in world space to a 2D vertex on the scene
Rasterization | Primitives (triangles) assembled from vertices | Fragments | Interpolates each primitive into a number of fragments
Pixel processing | Fragments | ‘Finalized’ pixels with final color values | Colors each fragment according to its information
Writeback | ‘Finalized’ pixels with final color values | Image composed of ‘finalized’ pixels | Uses the frame buffer storing pixels to assemble a frame
Table.2-1 Input, output and operation in each step of graphics pipeline
2.2 Vertex processing
At the input of the vertex processing step, each primitive consists of three vertex coordinates, vertex normal values, and other information such as lighting and texture coordinates. At the beginning, all vertices are represented in 3D coordinates with three values {x, y, z}. In order to represent affine transformations with a uniform matrix representation, we convert the Cartesian (3D) coordinates to homogeneous coordinates, which are quadruples of the form {X, Y, Z, W}, where {X, Y, Z, W} = {xW, yW, zW, W} and in most cases W is 1. After the conversion, we can easily transform the coordinates of vertices with a sequence of matrix operations. Fig. 2-2 shows the steps of vertex processing in a typical graphics pipeline, which consists of the following stages:
[Figure: vertex information in homogeneous (world) coordinates -> Model-view Transformation -> eye coordinates -> Projection Transformation -> clip coordinates -> Clipping -> Perspective Division (dehomogenize) -> normalized device coordinates -> Viewport Mapping -> window coordinates -> vertex information passed to Rasterization]
Fig.2-2 The steps of vertex processing
2.2.1 Model-view transformation
The modeling transformation may reshape and move primitives with respect to the position of the viewer (eye position: {eye_x, eye_y, eye_z}), because the viewer is often not located at the origin of the world coordinates. Therefore, we must move the position of the viewer to the origin and move all the vertices along with it. Formula 1 shows the matrix that we use to transform the position of the viewer to the origin.
\[
T = \begin{bmatrix} 1 & 0 & 0 & \mathrm{eye}_x \\ 0 & 1 & 0 & \mathrm{eye}_y \\ 0 & 0 & 1 & \mathrm{eye}_z \\ 0 & 0 & 0 & 1 \end{bmatrix}
\Rightarrow
T^{-1} = \begin{bmatrix} 1 & 0 & 0 & -\mathrm{eye}_x \\ 0 & 1 & 0 & -\mathrm{eye}_y \\ 0 & 0 & 1 & -\mathrm{eye}_z \\ 0 & 0 & 0 & 1 \end{bmatrix}
\tag{1}
\]
Fig.2-3 Formula1
Besides moving the position, we must change the directions of the x-axis, y-axis, and z-axis with respect to the orthogonal direction (u), the up-direction vector (v), and the viewing direction (n) of the viewer. Fig. 2-4 shows the relations of u, v and n. Formula 2 shows the matrix that we use for this transformation.
Fig.2-4 The relations between (u, v, n)
\[
B = \begin{bmatrix} u_x & v_x & n_x & 0 \\ u_y & v_y & n_y & 0 \\ u_z & v_z & n_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\Rightarrow
B^{-1} = \begin{bmatrix} u_x & u_y & u_z & 0 \\ v_x & v_y & v_z & 0 \\ n_x & n_y & n_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\tag{2}
\]
Fig.2-5 Formula2
This new orthogonal coordinate system is usually called the viewing-coordinate system or the u-v-n system. Because these two transformations are both multiplications by a 4×4 matrix in homogeneous coordinates, they can be combined into a single multiplication (Formula 3), which is implemented with 16 floating-point multiplications and 12 floating-point additions. As a result, the model-view transformation carries us to eye coordinates, where the viewer is at the origin and the directions of the x-axis, y-axis, and z-axis have changed. In the model-view transformation, we transform all vertices from world coordinates to eye coordinates. Then we need to project all vertices onto the view plane.
\[
B^{-1}T^{-1} =
\begin{bmatrix} u_x & u_y & u_z & 0 \\ v_x & v_y & v_z & 0 \\ n_x & n_y & n_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & -\mathrm{eye}_x \\ 0 & 1 & 0 & -\mathrm{eye}_y \\ 0 & 0 & 1 & -\mathrm{eye}_z \\ 0 & 0 & 0 & 1 \end{bmatrix}
\tag{3}
\]
Fig.2-6 Formula3
2.2.2 Projection transformation
Like a real camera, once we decide the position and the directions of the viewer, all the objects (which consist of vertices) are projected onto a plane (the view plane, which is defined by the six numbers {xmax, xmin, ymax, ymin, zmax, zmin}) to show what we see. There are also a near plane and a far plane which limit the space we can see; we usually call this limited space the view volume. Here we have two types of projection: orthogonal (orthographic) projection and perspective projection.
The orthogonal projection is a simple projection in which the projectors are perpendicular to the view plane. In this projection, the z values of objects just define their depth. The only thing we must do is normalize the view volume into a cube ranging from -1 to 1 on each axis (the canonical view volume). The projection transformation is given by Formula 4.
\[
P = \begin{bmatrix}
\frac{2}{x_{max}-x_{min}} & 0 & 0 & -\frac{x_{max}+x_{min}}{x_{max}-x_{min}} \\
0 & \frac{2}{y_{max}-y_{min}} & 0 & -\frac{y_{max}+y_{min}}{y_{max}-y_{min}} \\
0 & 0 & \frac{2}{z_{max}-z_{min}} & -\frac{z_{max}+z_{min}}{z_{max}-z_{min}} \\
0 & 0 & 0 & 1
\end{bmatrix}
\tag{4}
\]
Fig.2-7 Formula4
The perspective projection is a more complicated transformation than the orthogonal projection, but it produces more realistic images by changing the sizes of objects according to their distances: an object far away appears smaller than the same object nearby. In this projection, the x and y values of an object are divided by its z value. In homogeneous coordinates, this kind of division can be implemented just by changing the w value. Formula 5 shows the perspective projection matrix. This matrix also transforms the view volume into the canonical view volume.
\[
P = \begin{bmatrix}
\frac{2 z_{min}}{x_{max}-x_{min}} & 0 & \frac{x_{max}+x_{min}}{x_{max}-x_{min}} & 0 \\
0 & \frac{2 z_{min}}{y_{max}-y_{min}} & \frac{y_{max}+y_{min}}{y_{max}-y_{min}} & 0 \\
0 & 0 & -\frac{z_{max}+z_{min}}{z_{max}-z_{min}} & -\frac{2 z_{max} z_{min}}{z_{max}-z_{min}} \\
0 & 0 & -1 & 0
\end{bmatrix}
\tag{5}
\]
Fig.2-8 Formula5
Each of these projection transformations consists of a 4×4 matrix multiplication. Therefore, we can also combine the projection transformation with the model-view transformation. At this point we have a canonical view volume (clip coordinates), and we can easily check whether objects are within the sight of the viewer.
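The combination of the model-view and projection transformations into a single 4×4 matrix can be sketched as follows. This is an illustrative sketch only: the helper names (`matmul`, `matvec`, `inverse_translation`, `inverse_basis`, `orthographic`) and the sample eye position, basis vectors, and view volume are assumptions, and the orthographic matrix is used for the projection step because its entries are simple to check by hand.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def matvec(m, v):
    """Multiply a 4x4 matrix by a homogeneous column vector [X, Y, Z, W]."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def inverse_translation(eye):
    """Matrix that moves the viewer at `eye` to the origin."""
    ex, ey, ez = eye
    return [[1, 0, 0, -ex],
            [0, 1, 0, -ey],
            [0, 0, 1, -ez],
            [0, 0, 0, 1]]

def inverse_basis(u, v, n):
    """Matrix whose rows are u, v, n: re-expresses a point in the u-v-n system."""
    return [[u[0], u[1], u[2], 0],
            [v[0], v[1], v[2], 0],
            [n[0], n[1], n[2], 0],
            [0, 0, 0, 1]]

def orthographic(xmin, xmax, ymin, ymax, zmin, zmax):
    """Orthographic projection onto the canonical view volume [-1, 1]^3."""
    def scale(lo, hi):
        return 2 / (hi - lo)
    def shift(lo, hi):
        return -(hi + lo) / (hi - lo)
    return [[scale(xmin, xmax), 0, 0, shift(xmin, xmax)],
            [0, scale(ymin, ymax), 0, shift(ymin, ymax)],
            [0, 0, scale(zmin, zmax), shift(zmin, zmax)],
            [0, 0, 0, 1]]

# Hypothetical viewer: positioned at (1, 2, 3) with an orthonormal u-v-n basis.
eye = (1, 2, 3)
u, v, n = (0, 1, 0), (0, 0, 1), (1, 0, 0)

model_view = matmul(inverse_basis(u, v, n), inverse_translation(eye))
projection = orthographic(-2, 2, -2, 2, -2, 2)
combined = matmul(projection, model_view)   # one matrix for both steps

vertex = [2, 2, 3, 1]                        # world coordinates, W = 1
eye_coords = matvec(model_view, vertex)
clip_coords = matvec(combined, vertex)
print(eye_coords, clip_coords)
```

Applying `combined` once per vertex costs the same 16 multiplications and 12 additions as applying the model-view matrix alone, which is why the matrices are multiplied together ahead of time.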
2.2.3 Clipping
Although we transform all objects from world coordinates to clip coordinates, many objects lie outside the canonical view volume and will not be shown on the scene. Therefore, we must clip those objects to reduce the workloads of the later stages. Clipping in homogeneous coordinates isn’t strictly necessary, but it makes the clipping clean, fast, and simple. Besides, after dehomogenizing, (x, y, z) = (X/W, Y/W, Z/W), the signs of the x value, y value, z value and w value are lost. Therefore, we can no longer know whether objects are in front of or behind the viewer.
We first ignore the objects with w values smaller than zero because they are behind the viewer. Then, we can apply Cyrus-Beck clipping to test whether a vertex V lies in the canonical view volume. Formula 6 shows the test. With this test, we discard the vertices that are out of sight, and the others continue to the next steps.
\[
\frac{a_i}{a_w} > -1 \Rightarrow (a_w + a_i) > 0, \qquad
\frac{a_i}{a_w} < 1 \Rightarrow (a_w - a_i) > 0, \qquad
i \in \{x, y, z\}
\tag{6}
\]
Fig.2-9 Formula6
2.2.4 Perspective division
At this point, all vertices have been transformed from world coordinates to clip coordinates, and the vertices that are out of sight have been discarded. In this step, we transform objects from 3D coordinates to 2D coordinates and decide the position of each vertex on the scene. In the projection transformation, we defined how vertices are projected onto 2D coordinates, and that information has been stored in the w value. Therefore, the function of perspective division is just to divide (x, y, z) by the w value and discard the w value. So we dehomogenize each vertex using Formula 7.
\[
\begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix}
\Rightarrow
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
=
\begin{bmatrix} X/W \\ Y/W \\ Z/W \end{bmatrix},
\qquad W \neq 0
\tag{7}
\]
Fig.2-10 Formula7
2.2.5 Viewport matrix
Finally, we decide the position of each vertex on the scene, scaled by the resolution of the scene. We therefore transform the normalized (x, y) position of each vertex to a scene position. Assume that the resolution of the scene is w × h; then (x, y) = (-1, -1) is mapped to (0, 0) and (1, 1) is mapped to (w, h). We use Formula 8 to do this transformation.
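The last three steps of vertex processing (the clip test, the perspective division, and the viewport mapping just described) can be sketched as below. The function names are hypothetical, and rejecting w = 0 together with w < 0 is an assumption made here so that the later division is always defined.

```python
def inside_canonical_volume(vertex):
    """Clip test in homogeneous coordinates: -1 < a/w < 1 for a in (x, y, z)."""
    x, y, z, w = vertex
    if w <= 0:                      # behind the viewer (w == 0 rejected by assumption)
        return False
    return all(w + a > 0 and w - a > 0 for a in (x, y, z))

def perspective_divide(vertex):
    """Dehomogenize: divide (X, Y, Z) by W and drop W."""
    x, y, z, w = vertex
    if w == 0:
        raise ZeroDivisionError("W must be non-zero")
    return (x / w, y / w, z / w)

def viewport(x, y, width, height):
    """Map normalized (x, y) in [-1, 1] to scene coordinates in [0, width] x [0, height]."""
    return (width / 2 * (x + 1), height / 2 * (y + 1))

# Example with a hypothetical 640 x 480 scene.
v = (1, -1, 1, 2)
if inside_canonical_volume(v):
    nx, ny, nz = perspective_divide(v)
    print(viewport(nx, ny, 640, 480))
```

Note that the clip test uses only additions and comparisons on the homogeneous values, so no division is spent on vertices that are discarded anyway.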
\[
\begin{bmatrix} x_w \\ y_w \end{bmatrix}
=
\begin{bmatrix} \frac{w}{2}(x + 1) \\ \frac{h}{2}(y + 1) \end{bmatrix}
\tag{8}
\]
Fig.2-11 Formula8
2.3 Programmable graphics pipeline
The programmable graphics pipeline is the most popular solution today for meeting the requirements of both performance and flexibility in computer graphics. With the rapid development of computer graphics applications, such as 3D games, virtual reality, and digital life, the requirements on effects and performance keep rising. To meet all kinds of users’ requirements, the programmable graphics pipeline has been introduced into graphics hardware, and many complicated function units have been added. Unlike the fixed-function (non-programmable) graphics pipeline, the programmable graphics pipeline has new graphics processing units: the vertex shader unit and the pixel shader unit. These two new processing units give the graphics pipeline the flexibility to deal with all kinds of computation requirements while retaining the capability of complicated computation.
Chapter 3 Design
3.1 Analysis of shaders
The architecture of the vertex/pixel shader in DirectX (the specification followed by GPUs) is shown below:
[Figure: block diagram of a shader containing the PC, instruction slot, instruction decoder, register file, source register modifier, computation unit, and destination register modifier; each block is marked as storage or logic.]
Fig.3-1 The architecture of vertex/pixel shader
There are several units in both the vertex shader and the pixel shader:
1. Program counter: a register which stores the address of the instruction being executed.
2. Instruction slot: a storage unit which stores all shader codes for the vertex/pixel shader(s).
3. Instruction decoder: a combinational circuit which translates an instruction into the control signals of the data path.
4. Register file: a storage unit which contains all inputs and outputs of any computations for each vertex or pixel.
5. Source register modifier: a simple computation unit which can swizzle or negate source data.
6. Computation unit: the main computation unit, which processes complex operations (e.g. add, mul, mad …).
7. Destination register modifier: a simple computation unit which is similar to the source register modifier, but its target is the destination register.
The DR-shader must support all functions of both shader types to be a multi-function shader. Therefore, it also contains those units, and it must duplicate the units which can’t be shared between the vertex shader type and the pixel shader type.
Firstly, we consider which units in the DR-shader can be shared between the vertex shader type and the pixel shader type, to reduce the hardware overhead of the DR-shader. The sharing policies are:
1. If and only if a storage unit must store data (states, instructions, or temporary results) for the vertex shader and the pixel shader simultaneously, it can’t be shared.
2. All logic units are sharable.
Under these policies, the source register modifier, computation unit, and destination register modifier are sharable units because all of them are logic units. The instruction slot is a non-sharable unit, for it must store vertex shader codes and pixel shader codes at the same time. At this point we can’t decide whether the program counter and register file can be shared. We will make that decision when we discuss the architecture and flexibility of the DR-shader.
Secondly, we deliberate upon how to design those sharable units for both shader types. Among the sharable units, the source modifier and destination modifier are the same in the vertex shader type and the pixel shader type. Therefore, we focus on how to design a sharable computation unit in the following sections. We make the following assumptions about the architecture of vertex and pixel shaders:
Single issue and single execution: because shaders expose the parallelism of data better than the parallelism of instructions, each shader is single-issue, and multiple shaders execute separately instead of one shader using multi-execution.
The widths of all operations in the computation unit are i*v bits in vector form, … requirement described in DirectX.
3.2 Analysis of Computation requirements
Before designing a sharable computation unit for the vertex shader type and the pixel shader type, we need to understand the data used, the function units, and the processing flow of every vertex and pixel instruction. We divide all vertex and pixel shader instructions in DirectX into three types by the data they use and their processing flows:
1. Vector type: separately computes the four fields (x, y, z, w) of the source registers and produces four results.
2. Scalar type: only does a computation on one field of a source register and produces one result. For this type of instruction we use a modified second-order Taylor formula to reduce the complexity of the computations. (See Appendix A)
3. Non-computation type: only sends the data of the source register to a bus without any computation.
Below, we show which instructions belong to the three types, with their operations and computation requirements. In the vector type table, c stands for each of the fields x, y, z, w.
Instruction | Belong | Operations | Requirements
add, sub | VS, PS | Dst.c = Src0.c + Src1.c | 2in-fpSUM32 ×4
cmp | PS | Dst.c = (Src0.c >= 0)? Src1.c : Src2.c | 2in-MUX32 ×4
dp2add | VS | Dst = Src0.x * Src1.x + Src0.y * Src1.y + Src2.w | fpMUL32 ×2, 3in-fpSUM32 ×1
dp3 (vs) | VS | Dst = Src0.x * Src1.x + Src0.y * Src1.y + Src0.z * Src1.z | fpMUL32 ×3, 3in-fpSUM32 ×1
dp3 (ps) | PS | … | …
dp4 | VS, PS | Dst = Src0.x * Src1.x + Src0.y * Src1.y + Src0.z * Src1.z + Src0.w * Src1.w | fpMUL32 ×4, 4in-fpSUM32 ×1
max | VS, PS | Dst.c = (Src0.c > Src1.c)? Src0.c : Src1.c | 2in-fpSUM32 ×4, 2in-MUX32 ×4
min | VS, PS | Dst.c = (Src0.c < Src1.c)? Src0.c : Src1.c | 2in-fpSUM32 ×4, 2in-MUX32 ×4
mul | VS, PS | Dst.c = Src0.c * Src1.c | fpMUL32 ×4
mad | VS, PS | Dst.c = Src0.c * Src1.c + Src2.c | fpMUL32 ×4
sge | VS | Dst.c = (Src0.c >= Src1.c)? 1.0f : 0.0f | 2in-fpSUM32 ×4, 2in-MUX32 ×4
slt | VS | Dst.c = (Src0.c < Src1.c)? 1.0f : 0.0f | 2in-fpSUM32 ×4, 2in-MUX32 ×4
sgn | VS | Dst.c = (Src0.c > 0)? 1.0f : (Src0.c = 0)? 0.0f : -1.0f | Compare to 0 (32-bit) ×4, 2in-MUX32 ×8
Table.3-1 Requirements for vector type instructions
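To make the vector type operations concrete, the sketch below writes a few of them as plain functions on 4-tuples (x, y, z, w). The `op_` names are hypothetical, the functions model only the arithmetic and not the hardware, and `op_dp3` is assumed to follow the usual three-component (x, y, z) dot product.

```python
def op_add(s0, s1):
    """add: field-wise sum, one 2in-fpSUM per field."""
    return tuple(a + b for a, b in zip(s0, s1))

def op_mul(s0, s1):
    """mul: field-wise product, one fpMUL per field."""
    return tuple(a * b for a, b in zip(s0, s1))

def op_mad(s0, s1, s2):
    """mad: field-wise multiply-add."""
    return tuple(a * b + c for a, b, c in zip(s0, s1, s2))

def op_dp3(s0, s1):
    """dp3: three products reduced by one 3-input sum (x, y, z assumed)."""
    return s0[0] * s1[0] + s0[1] * s1[1] + s0[2] * s1[2]

def op_dp4(s0, s1):
    """dp4: four products reduced by one 4-input sum."""
    return sum(a * b for a, b in zip(s0, s1))

def op_cmp(s0, s1, s2):
    """cmp: per field, select Src1 if Src0 >= 0, otherwise Src2."""
    return tuple(b if a >= 0 else c for a, b, c in zip(s0, s1, s2))

def op_sge(s0, s1):
    """sge: per field, 1.0 if Src0 >= Src1, otherwise 0.0."""
    return tuple(1.0 if a >= b else 0.0 for a, b in zip(s0, s1))

def op_sgn(s0):
    """sgn: per field, the sign of Src0 as 1.0, 0.0, or -1.0."""
    return tuple(1.0 if a > 0 else (0.0 if a == 0 else -1.0) for a in s0)

print(op_mad((1, 2, 3, 4), (2, 2, 2, 2), (1, 1, 1, 1)))
print(op_dp4((1, 2, 3, 4), (1, 1, 1, 1)))
```

The field-wise operations need four copies of a unit, while the dot products need several multipliers feeding a single multi-input sum, which is where the 3in- and 4in-fpSUM requirements come from.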
Instruction | Belong | Operations | Requirements
Table.3-2 Requirements for scalar type instructions
Instruction | Belong | Operations | Requirements
branch | VS | PC = (Src0 != 0)? PC+1 : Src1 | Bus to program counter
texld | PS | Dst = Mem#Src1(Src0) | Bus to texture memory
Table.3-3 Requirements for non-computation type instructions
From the above tables, we conclude that there are only five kinds of computations, each with a maximum requirement, needed to execute any instruction, as shown in the following table:
Computation | fpMUL | 2in-fpSUM | 3in-fpSUM | 4in-fpSUM | Compare to 0
Maximum requirement | 4 | 4 | 1 | 1 | 4
Table.3-4 Maximum requirement of each computation
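The maximum requirements can be derived mechanically by taking, for each kind of computation, the largest count that any single instruction needs. The sketch below does this for the vector type rows of Table 3-1 only; the dictionary layout and the function name `maximum_requirements` are assumptions, and multiplexers are left out because Table 3-4 does not list them.

```python
# Per-instruction requirements transcribed from the vector type rows of Table 3-1
# (multiplexers omitted, since Table 3-4 lists only the five computations).
REQUIREMENTS = {
    "add/sub": {"2in-fpSUM": 4},
    "dp2add": {"fpMUL": 2, "3in-fpSUM": 1},
    "dp3": {"fpMUL": 3, "3in-fpSUM": 1},
    "dp4": {"fpMUL": 4, "4in-fpSUM": 1},
    "max": {"2in-fpSUM": 4},
    "min": {"2in-fpSUM": 4},
    "mul": {"fpMUL": 4},
    "mad": {"fpMUL": 4},
    "sge": {"2in-fpSUM": 4},
    "slt": {"2in-fpSUM": 4},
    "sgn": {"Compare to 0": 4},
}

def maximum_requirements(table):
    """For each computation, return the largest count needed by any one instruction."""
    result = {}
    for units in table.values():
        for unit, count in units.items():
            result[unit] = max(result.get(unit, 0), count)
    return result

print(maximum_requirements(REQUIREMENTS))
```

Because only one instruction executes per cycle under the single-issue assumption, the maximum over instructions (rather than the sum) is what the computation unit has to provide.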
3.3 Design of computation unit
In this section, we want to implement the computation unit with its area as small as possible while keeping one-cycle execution. The tradeoff in the design of the computation unit is that the more sharing we want, the more routing overhead we may incur. Therefore, we must carefully decide whether the functions of any computation unit can be shared by others. To solve this problem, we divide each computation into sub-function nodes, each with its own requirement, to discover potential sharing possibilities, and then use an algorithm to choose nodes covering all computations. The divided computations are called the trees of computation requirements.
[Figure: trees of computation requirements for fpMUL32, 2in-fpSUM32, 3in-fpSUM32, 4in-fpSUM32, and Compare to 0 (32-bit), divided into sub-function nodes such as multiply24, adder8, CMP&SWAP, ALIGN+INV, 3in/4in-partial sort, 2in/3in/4in-adder24, and normalize25/26. Each node is labeled with its logic and its maximum requirement; a node can be divided into sub-function nodes, sometimes in more than one way.]
Fig.3-2 Trees of computation requirements
Covering means that if we choose a node in the tree of computation requirements, that node is covered. In addition, if all children of a node are covered, the node is also covered. We compare the average and maximum area requirements over all vertex and pixel instructions, and choose the solution with the smallest average and maximum area requirement.
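The covering rule can be written as a short recursive check. In the sketch below the tree is a hypothetical fragment with a single decomposition per node (the real trees allow alternative decompositions), and the names `CHILDREN` and `is_covered` are assumptions.

```python
# Hypothetical fragment of a computation tree: each node maps to its sub-function nodes.
CHILDREN = {
    "2in-fpSUM32": ["CMP&SWAP", "ALIGN+INV", "2in-adder24", "normalize25"],
}

def is_covered(node, children, chosen):
    """A node is covered if it is chosen, or if it has children and all are covered."""
    if node in chosen:
        return True
    kids = children.get(node, [])
    return bool(kids) and all(is_covered(k, children, chosen) for k in kids)

all_parts = {"CMP&SWAP", "ALIGN+INV", "2in-adder24", "normalize25"}
print(is_covered("2in-fpSUM32", CHILDREN, all_parts))                     # chosen via children
print(is_covered("2in-fpSUM32", CHILDREN, all_parts - {"normalize25"}))   # one child missing
```

Choosing a parent node directly corresponds to building it as one dedicated unit, while choosing all of its children corresponds to assembling it from smaller, potentially shared, pieces.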
3.3.1 Sharing all units within nin-fpSUM
In nin-fpSUM, we find that some logic can be shared when we divide nin-fpSUM into sub-function nodes. There are two possible partitions of nin-fpSUM, which are:
1. Partition the 3in- or 4in-fpSUM into several 2in-fpSUMs.
[Figure: a 4in-fpSUM built from three 2in-fpSUM32 units: Src1 and Src2 feed one 2in-fpSUM32, Src3 and Src4 feed another, and a third 2in-fpSUM32 adds their outputs to produce the result.]
Fig.3-3 How three 2in-fpSUM32s be reconfigured to one 4in-fpSUM32
[Figure: two datapaths from Src1..Src4 to the result, built from CMP&SWAP, ALIGN+INV, 2in-adder24/2in-adder25, and normalize25/normalize26 nodes, with the largest operand selected for fpadd4.]
Fig.3-4 How three 2in-fpSUM32s be reconfigured to one 4in-fpSUM32
Although “CMP&SWAP”, “ALIGN+INV”, and “normalize” can easily be shared within nin-fpSUM, the adders are a problem, especially when we use three 2in-adders to form a 4in-adder. Of these three 2in-adders, two act as carry-save adders, and the last one acts as a normal adder which adds the carry and sum of the second carry-save adder. However, there are three problems we need to overcome in this kind of design:
1. We need to add four 24-bit numbers with two 24-bit carry-save adders and one 24-bit normal adder. Do any of the adders need to be extended?
2. After “ALIGN + INV”, there may be three carry-ins from the inverters. How do we add the three carry-ins with the existing adders?
3. The result has 24+2 bits, and the sources may be negative after the inverters. How do we handle the sign extension of the negative sources?
In problem 1, the first carry-save adder takes its 24-bit inputs directly from “ALIGN+INV”, so it does not need to be extended. Besides, the normal adder must give a 26-bit result, so it must be extended to a 25-bit adder. However, does the second carry-save adder also need to be extended to 25 bits? The answer is no, because we do not need to process the highest bit of the carry from the first carry-save adder. The figure below illustrates the idea:
[Figure: carry-save adder 1 (24 bits) adds Src1, Src2 and Src3 into Sum1 and Carry1; carry-save adder 2 (24 bits) adds Sum1, Carry1 and Src4 into Sum2 and Carry2; the normal adder (24+1 bits) adds Sum2 and Carry2 into Result (24+2 bits).]
Fig.3-5 The solution of problem 1
In problem 2, “ALIGN+INV” may send a 1-bit carry-in to the adder for the negation in 2’s complement. How can we add the carry-ins, of which there are at most three, to the summands without any additional logic? To solve this problem, we use the vacant positions in the carries of the two carry-save adders to add two carry-ins. Then, the last carry-in is added as a normal carry-in by the normal adder.
[Figure: the same datapath as in Fig.3-5, with Carryin1 and Carryin2 placed in the vacant positions of the carry-save adders' carries and Carryin3 used as the normal carry-in of the normal adder.]
Fig.3-6 The solution of problem 2
In problem 3, the final result has 26 bits and the summands may be negative. Therefore, we must add a compensation, which we call sign-compensation, to the temporary result to get the correct result. Each negative 24-bit summand would need the two sign-extension bits 11, so the compensation depends only on the number of negative summands:
Number of minuses | Sign-compensation
0 | 00
1 | 11
2 | 10
3 | 01
[Figure: Src1 to Src4 (24 bits each) are added into a temporary result (24+2 bits). In the example, three negative sources would each need the sign extension 11, so the sign-compensation 01 is added to the temporary result to obtain the real result (24+2 bits).]
Fig.3-7 The solution of problem 3
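The three solutions can be checked together with a small behavioural sketch. It is only an illustration under stated assumptions, not the thesis's hardware: the names `carry_save`, `add4` and `to_signed26` are hypothetical, Python integers stand in for the 24/25-bit adder widths, at most three operands are negative, and the true sum is assumed to fit in 26-bit two's complement. The sign-compensation values 00, 11, 10, 01 for 0 to 3 negative operands equal 3k mod 4, which is what two sign-extension bits "11" per negative operand add up to.

```python
MASK24 = (1 << 24) - 1
MASK26 = (1 << 26) - 1

# Sign-compensation (bits 25:24) indexed by the number of negative operands.
SIGN_COMPENSATION = {0: 0b00, 1: 0b11, 2: 0b10, 3: 0b01}


def carry_save(a, b, c):
    """3:2 compressor: returns (s, carry) with a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1  # bit 0 is always vacant
    return s, carry


def add4(magnitudes, negative):
    """Add four sign/magnitude 24-bit operands; return the 26-bit result."""
    assert len(magnitudes) == 4 and len(negative) == 4
    k = sum(negative)
    assert k <= 3, "assumption: at most three operands are negative"
    # ALIGN+INV: negative operands are inverted and request a carry-in of 1.
    srcs = [(~m & MASK24) if neg else m for m, neg in zip(magnitudes, negative)]
    carry_ins = [1] * k + [0] * (3 - k)

    sum1, carry1 = carry_save(srcs[0], srcs[1], srcs[2])
    carry1 |= carry_ins[0]             # problem 2: use the vacant carry bit
    sum2, carry2 = carry_save(sum1, carry1, srcs[3])
    carry2 |= carry_ins[1]             # problem 2: second vacant carry bit
    temp = sum2 + carry2 + carry_ins[2]  # normal adder with a normal carry-in
    # Problem 3: add the sign-compensation to the two extra result bits.
    return (temp + (SIGN_COMPENSATION[k] << 24)) & MASK26


def to_signed26(x):
    """Interpret a 26-bit pattern as a two's complement value."""
    return x - (1 << 26) if x & (1 << 25) else x


if __name__ == "__main__":
    # 10 - 4 - 4 - 4 = -2
    print(to_signed26(add4([10, 4, 4, 4], [False, True, True, True])))
```

The sketch keeps the operand order of the figures (Src1 to Src3 into the first carry-save adder, Src4 into the second), but it does not model where the highest carry bit is dropped in the 24-bit second carry-save adder.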
3.3.2 Algorithm1 & 2 to choose nodes
The first two algorithms choose nodes that cover all computations:
➢ Algorithm1-Minimum routing overhead: use the fewest choices to cover all computation requirements. The advantage of this algorithm is that it may incur less routing overhead while still sharing enough logic. However, the disadvantage is that it may lose some sharing opportunities that would give a smaller area. The steps of the algorithm are described below:
Step1: collect the nodes with the same logic (sharable nodes) and mark the maximum requirement among them.
Step2: group the nodes into several sets such that there are no links or identical nodes between different sets, and ignore the sets that contain only one computation. See below:
[Figure: computation trees of 32-bit fpMUL, fpSUM2, fpSUM3 and fpSUM4, each decomposed into sub-nodes (Compare to 0, 24-bit multiply, 8-bit adder, CMP&SWAP, 3in/4in partial sort, ALIGN+INV, 2in/3in/4in 24-bit adder, normalize) with requirement counts on the links, grouped into sets.]
Fig.3-8 Different sets of computation trees
Step3: For each set, we first choose a sharable node at the highest level and check whether all computations have been covered. If some computations have not been covered, we delete the chosen nodes together with all their children and choose another sharable node at the highest level. This is repeated until all computation requirements have been covered or no sharable node remains.
Fig.3-9 The result of minimum routing overhead
➢ Algorithm2-Maximum sharing logic: find more sharing choices to cover all computation requirements. The advantage of this algorithm is that it yields the most sharing logic. However, the routing overhead may become more serious.
Step1, 2: the same as Step1 and Step2 of minimum routing overhead, which group the nodes into several sets.
Step3: For each set, we first choose a sharable node at the lowest level and check whether all computations have been covered. If some computations have not been covered, we delete the chosen nodes together with all their children and choose another sharable node at the lowest level. This is repeated until all computation requirements have been covered or no sharable node remains.
[Figure: computation trees of fpMUL32, 2in-fpSUM32, 3in-fpSUM32 and 4in-fpSUM32 with their sub-nodes and requirement counts, showing the chosen shared nodes.]
The following figures show, as an example, how three 2in-fpSUM32s are reconfigured into one 4in-fpSUM32 under these two algorithms:
3.3.3 Algorithm3-optimal area-time:
In minimum routing overhead and maximum sharing logic, some factors of sharing logic have not been considered. These two algorithms choose nodes as the basic unit, but they do not consider the different proportions within nodes. Besides, the silicon area is not the same for all nodes. Therefore, we use a new algorithm.
Fig.3-10 The result of maximum sharing logic
➢ Search by integer programming: weight each sharable node with its hardware cost and use integer programming to minimize the total cost. Here, we estimate the hardware cost of each node as the number of multiplexer areas it may need. The cost function has two cases. If a sharable node has the maximum requirement, the logic of the node will be shared with the other nodes that need the same logic. Therefore, its cost is its implementation area without routing overhead divided by the area of a multiplexer. All other sharable nodes have a cost equal to three, which represents the routing overhead of two input multiplexers and one output multiplexer.
\[
\text{cost} =
\begin{cases}
\dfrac{\text{area of an implementation}}{\text{area of a multiplexer}} & \text{for one randomly chosen node among those with the maximum requirement} \\[2ex]
3 \ \text{(meaning routing overhead)} & \text{otherwise}
\end{cases}
\]
Fig.3-11 Cost function of search by integer programming
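As a minimal sketch of this two-case cost function, the snippet below expresses it in Python. The function name `node_cost`, its parameters, and the area values in the usage line are hypothetical and chosen only for illustration; the constant 3 is the routing overhead of two input multiplexers and one output multiplexer stated in the text.

```python
ROUTING_OVERHEAD_COST = 3  # two input multiplexers + one output multiplexer


def node_cost(implementation_area, multiplexer_area, has_max_requirement):
    """Cost of a sharable node, measured in multiplexer areas.

    has_max_requirement marks the single node chosen among those with the
    maximum requirement; it pays for its own implementation. Every other
    sharable node reuses that logic and pays only the routing overhead.
    """
    if has_max_requirement:
        return implementation_area / multiplexer_area
    return ROUTING_OVERHEAD_COST


if __name__ == "__main__":
    # Hypothetical areas (um2), for illustration only.
    print(node_cost(9000.0, 4500.0, True), node_cost(9000.0, 4500.0, False))
```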
The advantage of this algorithm is that it considers both sharing logic and routing overhead. However, the disadvantage is that the quality of the result depends on the precision of the cost. For integer programming, we change the presentation of the computation trees and add more information.
[Figure: computation trees annotated for integer programming. Each node is labeled "Logic (Ri) cost". The legend distinguishes two kinds of links: a function node divided into several sub-function nodes, and a function node that has several kinds of design. Node labels: fpMUL32 (1) 50.7; Compare to 0 32 (1) 1.2; Multiply24 (1) 43.7; adder8 (2) 3.5; normalize25 3.4; normalize26 5.3; 2in-fpSUM32 (1) 48; CMP&SWAP (1) 2.3; ALIGN+INV (1) 3.6; 2in-adder24 (1) 2.6; normalize25 (1) 3.4; 3in-fpSUM32 (1) 52.7; 3in-partial sort (1) 18.4; 3in-adder24 (1) 21.6; normalize26 (1) 3; CMP&SWAP (2) 3; ALIGN+INV (2) 3; 2in-adder24 (2) 3; 2in-fpSUM32 (2) 3; 4in-fpSUM32 (1) 76.1; 4in-partial sort (1) 27.6; 4in-adder24 (1) 32.4; normalize26 (1) 3; CMP&SWAP (3) 3; ALIGN+INV (3) 3; 2in-adder24 (3) 3; 2in-fpSUM32 (3) 3.]
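Step1, 2: the same as Step1 and Step2 of minimum routing overhead, which group the nodes into several sets.
Step3: To find the optimal result using integer programming, we set
● Variables
\[
I_i = \#\ \text{of implementations of } node_i, \qquad I_i \ge 0 \quad \forall i
\]
● Constraints
If $node_i$ has real lines linking to its children:
\[
I_i + \frac{1}{R_c} * Req_c \ge Req_i \qquad \text{for each child } node_c \text{ of } node_i
\]
If $node_i$ has dot lines linking to its children:
\[
I_i + \sum_{c \,\in\, \text{children of } node_i} \frac{1}{R_c} * Req_c \ge Req_i \qquad \text{for all children } node_c \text{ of } node_i
\]
● Objective
\[
\text{minimize} \sum_{\forall\, node_i} (\text{cost of } node_i) * I_i
\]
Step4: reduce all Reqs from leaves to roots and get all constraints for integer programming.
[Figure: example tree. Node A has a requirement of 4 and links to A1 and A2; A1 links to B (2) and C (4); A2 links to D (4); D links to D1 and D2. Each node carries its Req value.]
The constraints of the example are:
\[
\begin{aligned}
node_A &: \; I_A + Req_{A1} + Req_{A2} \ge 4 \\
node_{A1} &: \; \begin{cases} I_{A1} + \frac{1}{2} Req_B \ge Req_{A1} \\ I_{A1} + \frac{1}{4} Req_C \ge Req_{A1} \end{cases} \\
node_{A2} &: \; I_{A2} + \frac{1}{4} Req_D \ge Req_{A2} \\
node_D &: \; I_D + Req_{D1} + Req_{D2} \ge Req_D
\end{aligned}
\]
Reducing the Reqs of the leaves gives:
\[
\begin{cases}
I_A + Req_{A1} + Req_{A2} \ge 4 \\
I_{A1} + \frac{1}{2} I_B \ge Req_{A1} \\
I_{A1} + \frac{1}{4} I_C \ge Req_{A1} \\
I_{A2} + \frac{1}{4} \left( I_D + I_{D1} + I_{D2} \right) \ge Req_{A2}
\end{cases}
\]
Reducing up to the root gives:
\[
\begin{cases}
I_A + I_{A1} + \frac{1}{2} I_B + I_{A2} + \frac{1}{4} \left( I_D + I_{D1} + I_{D2} \right) \ge 4 \\
I_A + I_{A1} + \frac{1}{4} I_C + I_{A2} + \frac{1}{4} \left( I_D + I_{D1} + I_{D2} \right) \ge 4
\end{cases}
\]
Step5: apply integer programming and get the result with minimum cost.
The result of optimal area-time is shown below: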
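The reduction of Reqs from leaves to roots in Step4 can be sketched as a recursive feasibility check. This is only an illustration under assumptions: the dictionary `TREE`, the link labels "real"/"dot", and the functions `capacity` and `feasible` are hypothetical names; the example tree (A, A1, A2, B, C, D, D1, D2) and the divisors 2 and 4 follow the worked example, and all other divisors are taken as 1. A node with real lines needs every child, so its children contribute the minimum; a node with dot lines can use any alternative, so its children contribute the sum. The check does not perform the cost minimization itself.

```python
from fractions import Fraction

# node -> (link kind, [(child, R_child), ...]); leaves are not listed.
TREE = {
    "A": ("dot", [("A1", 1), ("A2", 1)]),
    "A1": ("real", [("B", 2), ("C", 4)]),
    "A2": ("real", [("D", 4)]),
    "D": ("dot", [("D1", 1), ("D2", 1)]),
}


def capacity(node, impl):
    """Largest Req value of `node` that the constraints allow.

    impl maps node names to I values (number of implementations).
    """
    own = Fraction(impl.get(node, 0))
    if node not in TREE:
        return own  # leaf: Req is bounded by its own implementations
    kind, children = TREE[node]
    parts = [capacity(child, impl) / r for child, r in children]
    if kind == "real":
        return own + min(parts)  # one constraint per child must hold
    return own + sum(parts)      # a single constraint over all children


def feasible(impl, root="A", requirement=4):
    """True if the chosen implementations satisfy the root requirement."""
    return capacity(root, impl) >= requirement


if __name__ == "__main__":
    print(feasible({"B": 8, "C": 16}))  # A1 built 4 times from B and C
    print(feasible({"B": 8, "C": 12}))  # C limits A1 to 3 copies
```

For this tree, `capacity("A", impl) >= 4` holds exactly when both fully reduced inequalities of the example hold, because a minimum is at least 4 only if each of its arguments is.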
[Figure: the annotated computation trees of fpMUL32, 2in-fpSUM32, 3in-fpSUM32 and 4in-fpSUM32, showing the nodes chosen by integer programming.]
Fig.3-14 The result of optimal area-time algorithm
Then, we compare the results of the three algorithms.
3.3.4 Comparison within algorithms
First, we show the comparison of the three algorithms.
We find that the result of the optimal area-time algorithm is the same as that of minimum routing overhead because there are too few possible solutions.
 | Minimum routing overhead | Maximum sharing logic | Optimal area-time
Average area requirement (um2) | 985,793.5938 | 986,095.4688 | 985,793.5938
Maximum area requirement (um2) | 2,000,547.5 | 2,105,722.5 | 2,000,547.5
Finally, we choose the result of minimum routing overhead/optimal area-time because it has the smallest average and maximum area requirements. Comparing the two kinds of result in detail, we find that they differ only in how 2in-fpSUM32s are reconfigured into a 3in-fpSUM32 or 4in-fpSUM32. In addition, we analyze the sharing logic between 2in-fpSUM32s and 4in-fpSUM32 and find that the weakness of result 2 lies in the routing overhead.
The critical path of minimum routing overhead/optimal area-time is:
Delay time = CMP&SWAP → ALIGN+INV → 2in-add24 → normalize25 → MUX32 → CMP&SWAP → ALIGN+INV → 2in-add24 → normalize25
= 3.93 + 6.14 + 4.47 + 4.72 + 0.76 + 3.87 + 6.01 + 2.54 + 7.56 (ns)
= 40ns with time overhead 0.76ns (1.9%)
The area requirement of minimum routing overhead/optimal area-time is:
Area = 2in-fpSUM32*3 + MUX32*2
= 313425 + 8837.5 (um2)
= 332027.5um2 with area overhead 8837.5um2 (2.66%)
The critical path of maximum sharing logic is:
Delay time = MUX32 → CMP&SWAP → MUX32 → CMP&SWAP → MUX32 → CMP&SWAP → ALIGN+INV → 2in-add24 → MUX32 → 2in-add24 → MUX32 → 2in-add25 → normalize26
= 0.6 + 3.9 + 0.75 + 3.59 + 0.77 + 5.08 + 6.24 + 1.15 + 0.86 + 1.42 + 0.85 + 9.31 + 6.63 (ns)
= 43.86ns with time overhead 2.12ns (4.83%)
The area requirement of maximum sharing logic is:
Area = 2in-fpSUM32*3 + MUX32*6
= 108972.5 + 106872.5 + 107345.0 + 18882.5 (um2)
= 332307.5um2 with area overhead 18882.5um2 (5.68%)
Because the time overhead and area overhead of maximum sharing logic are much larger than those of minimum routing overhead/optimal area-time, we finally choose the result of minimum routing overhead/optimal area-time as our design of the computation unit.
3.4 Architecture of DR-shader
After finishing the computation unit, we can build the DR-shader. The architecture of the DR-shader is shown below:
[Figure: block diagram with PC, instruction slot (vertex shader), instruction slot (pixel shader), instruction decoder, context memory, source register modifier, computation unit, destination register modifier and register file, driven by a configuration signal.]
Fig.3-15 The architecture of DR-shader
In the architecture, there are some necessary hardware overheads:
➢ More logic in the sharable computation unit to support all vertex and pixel instructions, with routing overhead
➢ Context memory to store the configuration of each instruction
➢ One more instruction slot to store vertex and pixel shader codes simultaneously
Therefore, the area of a DR-shader may be larger than the area of a vertex shader or pixel shader, in exchange for the ability to reconfigure between vertex shader type and pixel shader type. However, our simulations will show whether this flexibility is worth adding to improve shader utilization.
3.5 Design of workloads monitor logic
In this section, we first describe the properties of the DR-shader and its hardware overhead. Then, we describe the design of the vertex/pixel workload monitor logic. There are two assumptions about the reconfiguration properties of the DR-shader:
1. Order of processing: In the beginning, all DR-shaders are configured to vertex shader type because there is no pixel workload. Then, DR-shaders are frequently reconfigured to vertex shader type or pixel shader type according to the variation in workloads between vertices and pixels, until all vertices have been processed. Finally, all DR-shaders are reconfigured to pixel shader type for the remaining pixels.
2. Reconfiguring timing: The configuration of each DR-shader can only be changed after it finishes a vertex/pixel, to avoid needing one more register file for temporary results.
The purpose of the workload monitor logic is to control the number of DR-shaders with pixel shader type in the DR-shader unit so that the stall cycles of all shaders are as few as possible. To achieve this goal, we control the number of DR-shaders with pixel shader type based on three kinds of information, which are:
➢ Expected number of DR-shaders with pixel shader type, which is equal to the number of used intervals in the pixel queue (the size of intervals will be determined later).
➢ Current number of DR-shaders with pixel shader type, which is recorded in the workload monitor logic.
➢ [...] DR-shaders finish their job.
At every cycle, we compute the difference between the expected and current numbers of DR-shaders with pixel shader type. If the expected number is bigger than the current number, we change as many finishing DR-shaders with vertex shader type to the other type as the difference. Otherwise, we change as many finishing DR-shaders with pixel shader type as the difference.
[Figure: flowchart. Decision: EDRP < CDRP? If yes, find and mark finishing DRP(s), then change (CDRP - EDRP) finishing DRP(s) to DRV(s). If no, find and mark finishing DRV(s), then change (EDRP - CDRP) finishing DRV(s) to DRP(s).]
DRV: DR-shader with vertex shader type
DRP: DR-shader with pixel shader type
EDRP: expected number of DR-shaders with pixel shader type
CDRP: current number of DR-shaders with pixel shader type
Fig.3-16 Flowchart of workloads monitor logic
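One cycle of the workload monitor flowchart can be sketched in software as follows. This is a hedged illustration, not the thesis's logic design: the class `DRShader`, the functions `expected_pixel_shaders` and `monitor_step`, and the default interval size of 32 are hypothetical. It also assumes that a partially filled interval counts as a used interval and that the expected number is capped at the number of DR-shaders.

```python
from dataclasses import dataclass


@dataclass
class DRShader:
    mode: str        # "vertex" or "pixel"
    finishing: bool  # True if it finishes its current vertex/pixel this cycle


def expected_pixel_shaders(pixels_in_queue, interval_size):
    """EDRP: number of used intervals in the pixel queue (ceiling division)."""
    return -(-pixels_in_queue // interval_size)


def monitor_step(shaders, pixels_in_queue, interval_size=32):
    """Apply one cycle of the flowchart; return how many shaders switched."""
    edrp = min(expected_pixel_shaders(pixels_in_queue, interval_size),
               len(shaders))
    cdrp = sum(s.mode == "pixel" for s in shaders)
    if edrp < cdrp:
        source, target, budget = "pixel", "vertex", cdrp - edrp
    else:
        source, target, budget = "vertex", "pixel", edrp - cdrp
    changed = 0
    for shader in shaders:
        if changed == budget:
            break
        # Only a finishing shader may be reconfigured (reconfiguring timing).
        if shader.mode == source and shader.finishing:
            shader.mode = target
            changed += 1
    return changed


if __name__ == "__main__":
    unit = [DRShader("vertex", True), DRShader("vertex", False),
            DRShader("vertex", True), DRShader("vertex", True)]
    print(monitor_step(unit, pixels_in_queue=70), [s.mode for s in unit])
```

When fewer shaders are finishing than the difference requires, the sketch simply switches those that are available and leaves the rest for later cycles.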
Chapter 4 Simulation
4.1 Simulator of DR-shader
For this thesis, we build a cycle-based simulator with reference to the one from SiS. The input of the simulator is 3Dmark05, from which we consider the information listed below:
➢ Whether a primitive is clipped (culled) or passes
➢ Number of tiles produced from each primitive
➢ Whether a tile is blocked by PreZ or not
➢ Number of pixels that can be produced from each passing tile
➢ Vertex shader codes and pixel shader codes
The output of the simulator is the execution time of a frame from vertex processing to pixel processing, together with information about shader utilization. There are also some parameters we can set for different environments, listed below:
➢ Clip information
    Throughput of the clipping unit
➢ PreZ information
    Throughput of the PreZ unit
➢ Shader information
    Throughput of the vertex input
    Size of pixel queue
    Numbers of DR-shaders, vertex shaders and pixel shaders
    Number of batches in each shader
    Latencies of each instruction
➢ Texture information
    Texture unit access cycles
    Miss rate of the texture memory
    Miss penalty
    Throughput of texture units
[Figure: cycle-based simulation based on the SiS C-model. The input (with V.S. Prog. and P.S. Prog.) passes through Vertex Shader, Clip, Triangle Setup, PreZ, Pixel Queue and Pixel Shader (with Texture Memory) to the output.]
Fig.4-1 The cycle-based simulator based on SiS
4.2 Simulation1
In this section, we decide a proper proportion between vertex shaders and pixel shaders, and the size of the pixel queue. For this goal, we assume the number of vertex shaders is three, with the other parameter settings listed below:
➢ Clip information
    Throughput of the clipping unit = unlimited
➢ PreZ information
    Throughput of the PreZ unit = unlimited
➢ Shader information
    Throughput of the vertex input = unlimited
    Number of vertex shaders = 3
    Latencies of each instruction = 8
    Number of batches in each shader = 8
Then, we gather workload statistics of pixels in every cycle. The workload in each cycle is counted as the number of pixels in the cycle multiplied by their execution time. We display the pie chart of the pixels’ workload:
[Figure: pie chart of the per-cycle pixel workload. <10: 99.37%; Others: 0.63%, with 10~99: 0.28%, 100~999: 0.24%, 1000~9999: 0.7%, >10000: 0.3%.]
Average = 36.353
Standard deviation = 279.114
Fig.4-2 The pie chart of the pixels’ workload in every cycle
We choose the average workload as the number of pixel shaders when there are three vertex shaders. With 3 vertex shaders and 37 pixel shaders, we simulate the relation between the size of the pixel queue and execution time:
[Figure: execution time (cycles), from 30,000,000 to 31,200,000, plotted against the size of pixel queue (pixels), from 0 to 2500; the points for a size of 1024 and for an unlimited queue are marked.]
Fig.4-3 The relation between the size of pixel queue and execution time
From the graph, we choose a pixel queue size of 1024 pixels.
4.3 Simulation2
In this section, we decide the size of intervals in the pixel queue and the number of vertex shaders and pixel shaders to be replaced by DR-shaders. We use the parameters decided above, listed below:
➢ Clip information
    Throughput of the clipping unit = unlimited
➢ PreZ information
    Throughput of the PreZ unit = unlimited
➢ Shader information
    Throughput of the vertex input = unlimited
    Latencies of each instruction = 8
    Number of batches in each shader = 8
    Number of vertex shaders = 3
    Size of pixel queue = 1024
    Total number of shaders = 40 (3 + 37)
➢ Texture information
    Texture unit access cycles = 8
    Miss rate of the texture memory = 0
    Throughput of texture unit = unlimited
First, we simulate the relation between the size of intervals and execution time and get the graph below:
[Figure: execution time (cycles), from 15,000,000 to 31,000,000, plotted against the number of DR-shaders, from 0 to 35, with one curve each for interval size = 32, 64 and 128.]
Fig.4-4 The relation between the size of intervals, number of DR-shaders, and execution time
It is apparent that the size of intervals does not have a great influence on the execution time. Therefore, we choose a size of intervals equal to 32 (pixels) for flexibility.
Second, we simulate the relation between the number of DR-shaders and the time-area product with this size of intervals:
[Figure: time*area product, from 8.5E+15 to 1.45E+16, plotted against the number of DR-shaders, from 0 to 35; the configuration with 3 VS and 37 PS and the point at 16 DR-shaders are marked.]
Fig.4-5 The relation between the number of DR-shaders and area-time product
The time-area product has a minimum value when the number of DR-shaders equals 16. For a detailed analysis, we list the time, area, and utilization of each shader type below:
Number of each kind of shader | Area (um2) [Ratio] | Time (cycles) [Speed up] | Time*Area [Ratio]
VS = 3, PS = 37, DR = 0 | 447,827,831.4 [1] | 30,373,118 [1] | 13,601,927,566,796,306.56 [1]
VS = 0, PS = 16, DR = 24 | 480,609,124 [1.073] | 18,474,735 [1.644] | 8,879,126,204,482,140 [0.653]
Table.4-1 The time, area, and area-time product
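The bracketed ratios in Table 4-1 follow directly from the area and time columns. The short sketch below recomputes them; the dictionary keys and the function name `compare` are hypothetical, and the numbers are the ones printed in the table for the fixed configuration (3 VS + 37 PS) and the DR-shader configuration.

```python
baseline = {"area_um2": 447_827_831.4, "time_cycles": 30_373_118}
dr_unit = {"area_um2": 480_609_124.0, "time_cycles": 18_474_735}


def compare(base, new):
    """Return (area ratio, speed-up, time*area ratio) of new versus base."""
    area_ratio = new["area_um2"] / base["area_um2"]
    speed_up = base["time_cycles"] / new["time_cycles"]
    product_ratio = (new["area_um2"] * new["time_cycles"]) / (
        base["area_um2"] * base["time_cycles"])
    return area_ratio, speed_up, product_ratio


if __name__ == "__main__":
    area_ratio, speed_up, product_ratio = compare(baseline, dr_unit)
    print(f"area x{area_ratio:.3f}, speed-up x{speed_up:.3f}, "
          f"time*area x{product_ratio:.3f}")
```

The time*area ratio equals the area ratio divided by the speed-up, so a 7.3% larger area is more than offset by the shorter execution time.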
Number of each kind of shader | Stall cycles [Utilization] (Vertex shaders) | Stall cycles [Utilization] (Pixel shaders) | Stall cycles [Utilization] (DR-shaders)
VS = 3, PS = 37, DR = 0 | 45,056,703 [0.50552] | 544,974,677 [0.515063] | -
VS = 0, PS = 16, DR = 24 | - | 35,664,243 [0.879348] | 78,431,817 [0.82311]
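The utilization figures are consistent with one minus the fraction of shader-cycles spent stalled. The sketch below assumes that definition; the function name `utilization` is hypothetical and the formula is inferred from the reported numbers rather than stated in the text. It uses the execution times 30,373,118 cycles (3 VS + 37 PS) and 18,474,735 cycles (24 DR + 16 PS).

```python
def utilization(stall_cycles, num_shaders, execution_cycles):
    """Fraction of shader-cycles that were not stalled (assumed definition)."""
    return 1 - stall_cycles / (num_shaders * execution_cycles)


if __name__ == "__main__":
    # Vertex shaders in the fixed configuration.
    print(round(utilization(45_056_703, 3, 30_373_118), 5))
    # Pixel shaders and DR-shaders in the DR-shader configuration.
    print(round(utilization(35_664_243, 16, 18_474_735), 6))
    print(round(utilization(78_431_817, 24, 18_474_735), 5))
```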
We choose 24 DR-shaders with 16 pixel shaders in the DR-shader unit, with an interval size of 32 pixels in the pixel queue, as our final result. This design greatly improves shader utilization and execution time at a small hardware overhead, and the area-time product is reduced to 65.3% of that of the fixed configuration.