A Dynamically Reconfigurable Shader Unit for Vertex and Pixel Processing

Student: Yi-Chi Chen (陳逸麒)

Advisor: Dr. Chung-Ping Chung (鍾崇斌)

National Chiao Tung University

Department of Computer Science and Information Engineering

Master's Thesis

A Thesis

Submitted to Institute of Computer Science and Engineering

College of Computer Science

National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of

Master

in

Computer Science and Information Engineering

September 2006

Hsinchu, Taiwan, Republic of China

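A Dynamically Reconfigurable Shader Unit for Vertex and Pixel Processing

Student: Yi-Chi Chen  Advisor: Dr. Chung-Ping Chung

Institute of Computer Science and Engineering, National Chiao Tung University (Master's Program)

Abstract (Chinese)

In vertex and pixel processing, the vertex and pixel workloads vary greatly during execution. Under a fixed allocation of hardware resources, however, either the vertex processors or the pixel processors are often idle while the other side is short of resources. We therefore propose a new shader unit, the DR-shader unit, which dynamically allocates the number of processors assigned to vertex or pixel processing according to the changes in workload, raising the utilization of hardware resources and shortening execution time. In this thesis, we first analyze the processor architecture and decide, for each component of the dynamically reconfigurable processor, whether it can be shared by the two configurations. We use three algorithms (minimum routing overhead, maximum sharing logic, and optimal area-time) to help decide whether the computation logic should be shared when composing the computation unit. We also design workload monitor logic that controls the configuration of each dynamically reconfigurable processor according to the changes in workload. The final result is a 60% improvement in speed and a 30% improvement in utilization.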

A Dynamically Reconfigurable Shader Unit for Vertex and Pixel Processing

Student: Yi-Chi Chen  Advisor: Dr. Chung-Ping Chung

Institute of Computer Science and Engineering, College of Computer Science

Abstract

In vertex and pixel processing, the workloads of vertices and pixels vary greatly at run time. With a fixed allocation of resources between vertex shaders and pixel shaders, many shaders of one type may be idle while shaders of the other type are in short supply. We therefore propose a dynamically reconfigurable shader unit (DR-shader unit), which distributes shaders between vertex and pixel processing according to the varying workloads at run time. In this way, shader utilization is improved and execution time is shortened.

In this thesis, we first analyze the architecture of shaders and determine which units can be shared between the vertex and pixel shader types in a DR-shader. We use three algorithms (minimum routing overhead, maximum sharing logic, and optimal area-time) to determine how logic should be shared to form a sharable computation unit. In addition, we design workload monitor logic that controls the configuration of each DR-shader according to the workloads. As a result, we gain a 60% improvement in speed and a 30% improvement in utilization.

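Acknowledgements

First, I thank my advisor, Professor Chung-Ping Chung. With his patient teaching, diligent guidance and encouragement, I was able to complete this thesis and pass my oral defense. I also thank my committee members, Professor 單智君, Professor 蕭勝夫 and Professor 周賜福, whose guidance and suggestions made this thesis more complete and sound. In addition, I thank the senior and junior members of the laboratory, who often gave me different opinions on all kinds of problems, kept encouraging me so that I could keep working, and helped me unwind. Special thanks go to my only classmate, 汪威定, who has supported me continuously since our sophomore year, helped me on every occasion, encouraged me to enroll early, and in the end graduated early together with me. Here I must also apologize to my girlfriend. When I was busy I did not have much time to spend with her. On this day, her birthday, I must first wish her a happy birthday and also say sorry, because it has been too long since I spent proper time with her. Finally, I thank my family for supporting me wholeheartedly, which made this research path smoother and let me study without worries and pursue my own ideals. To all the teachers, relatives and friends who supported and encouraged me, I offer my most sincere wishes. Thank you.

Yi-Chi Chen, 2006. 9. 7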

Contents

Chapter 1 Introduction
1.1 Vertex and pixel shaders
1.2 Dynamically reconfigurable system
1.3 Motivation
1.4 Objective
1.5 Organization of this thesis

Chapter 2 Background
2.1 Graphics pipeline
2.2 Vertex processing
2.2.1 Model-view transformation
2.2.2 Projection transformation
2.2.3 Clipping
2.2.4 Perspective division
2.2.5 Viewport matrix
2.3 Programmable graphics pipeline

Chapter 3 Design
3.1 Analysis of shaders
3.2 Analysis of Computation requirements
3.3 Design of computation unit
3.3.1 Sharing all units within nin-fpSUM
3.3.2 Algorithm1 & 2 to choose nodes
3.3.3 Algorithm3-optimal area-time
3.3.4 Comparison within algorithms
3.4 Architecture of DR-shader

Chapter 4 Simulation
4.1 Simulator of DR-shader
4.2 Simulation1
4.3 Simulation2

Chapter 5 Conclusion
5.1 Discussion
5.2 Future Work
5.3 Conclusion

Reference

List of Figures

Fig. 1-1 The architecture of DR-shader unit
Fig. 2-1 Four steps of graphics pipeline
Fig. 2-2 The steps of vertex processing
Fig. 2-3 Formula 1
Fig. 2-4 The relations between (u, v, n)
Fig. 2-5 Formula 2
Fig. 2-6 Formula 3
Fig. 2-7 Formula 4
Fig. 2-8 Formula 5
Fig. 2-9 Formula 6
Fig. 2-10 Formula 7
Fig. 2-11 Formula 8
Fig. 3-1 The architecture of vertex/pixel shader
Fig. 3-2 Trees of computation requirements
Fig. 3-3 How three 2in-fpSUM32s be reconfigured to one 4in-fpSUM32
Fig. 3-4 How three 2in-fpSUM32s be reconfigured to one 4in-fpSUM32
Fig. 3-5 The solution of problem 1
Fig. 3-8 Different sets of computation trees
Fig. 3-9 The result of minimum routing overhead
Fig. 3-10 The result of maximum sharing logic
Fig. 3-11 Cost function of search by integer programming
Fig. 3-13 An example for Reqs reducing
Fig. 3-14 The result of optimal area-time algorithm
Fig. 3-15 The architecture of DR-shader
Fig. 3-16 Flowchart of workloads monitor logic
Fig. 4-1 The cycle based simulator based on SiS
Fig. 4-2 The pie chart of the pixels' workload in every cycle
Fig. 4-3 The relation between the size of pixel queue and execution time
Fig. 4-4 The relation within the size of intervals, number of DR-shaders, and execution time
Fig. 4-5 The relation between the number of DR-shaders and area-time product

List of Tables

Table 2-1 Input, output and operation in each step of graphics pipeline
Table 3-1 Requirements for vector type instructions
Table 3-2 Requirements for scalar type instructions
Table 3-3 Requirements for non-computation type instructions
Table 3-4 Maximum requirement of each computation
Table 3-5 Average and maximum area requirement of three algorithms
Table 4-1 The time, area, and area-time product

Chapter 1 Introduction

The programmable graphics pipeline is the most popular type of graphics hardware nowadays. The program lengths and execution times of vertex and pixel processing may vary from scene to scene, and this variation lowers the utilization of graphics hardware. In this thesis, we propose a dynamically reconfigurable shader unit (DR-shader unit) for vertex and pixel processing. The DR-shader unit can dynamically allocate its hardware resources to match the computation requirements of vertices and pixels at run time. With this flexibility, we can increase the utilization of graphics hardware and shorten the execution time of scenes.

1.1 Vertex and pixel shaders

In vertex and pixel processing, there are a number of vertex and pixel shaders in the form of programmable processors. Each shader executes the entire vertex or pixel shader code on each individual vertex or pixel, and the shader codes vary from pass to pass. However, the workloads of vertices and pixels for the two kinds of shaders may vary greatly at run time. The number of pixels produced by each primitive (composed of three vertices) may range from zero to all the pixels in a scene, depending on its position. The execution time of each vertex and pixel may also differ widely from one situation to another.

Traditionally, the numbers of vertex and pixel shaders are fixed, and the varying workloads are partially absorbed by the pixel queue, a buffer in front of pixel processing that stores pixels as inputs for the pixel shaders. The problem is that the degree of variation at run time often exceeds what the pixel queue can absorb. When the pixel queue is full, there is no space to store the results of vertex processing, and all stages in vertex processing are idle, including the vertex shaders. When the pixel queue is empty, there is no input for pixel processing, and all stages in pixel processing are idle, including the pixel shaders. When these two situations happen too frequently, the utilization of the graphics processing unit will be very low.

1.2 Dynamically reconfigurable system

We can classify reconfigurable systems into two categories: dynamically reconfigurable systems and statically reconfigurable systems. The most important difference between the two is that a dynamically reconfigurable system can change its configuration at run time. A dynamically reconfigurable system can be used not only to reduce the hardware requirement of a design, but also for circuit specialization based on information known only at run time. This feature exists neither in statically reconfigurable systems nor in ASIC designs. Moreover, by means of dynamic reconfiguration, we can optimize the resource allocation in hardware to meet the computation requirements at run time.

1.3 Motivation

The workloads of vertices and pixels for vertex and pixel shaders may vary greatly at run time. It is difficult for any architecture with a fixed resource allocation between vertex shaders and pixel shaders to deal with such a large variation. If there are some multi-function shaders which can switch their function between vertex shader and pixel shader, we can easily distribute hardware resources according to the workloads of vertices and pixels. Besides, the architectures of the vertex shader and the pixel shader are very similar, and many of the hardware resources can be shared between them. This gives us a very good chance to design them with a reconfigurable architecture.

1.4 Objective

Design a dynamically reconfigurable shader unit (DR-shader unit) that adapts to the varying workloads between vertices and pixels.

[Figure: vertices and/or pixels enter the DR-shader unit, which contains the vertex shader farm with VSs (vertex shaders), the PSs (pixel shaders), the DR-shader farm with DRSs (DR-shaders), the interconnection and routing path, and the vertex/pixel workloads monitor logic; the outputs are coordinated vertices and/or colored pixels.]

Fig.1-1 The architecture of DR-shader unit

1.5 Organization of this thesis

The organization of this thesis is as follows:

In Chapter 2, the background on the graphics pipeline is presented.

In Chapter 3, we analyze the architecture of vertex and pixel shaders together with their computation requirements, and design the DR-shader with its workloads monitor logic.

In Chapter 4, we show our simulation results and environment, and decide a proper proportion among vertex, pixel and dynamically reconfigurable shaders.

In Chapter 5, we give the discussion, future work and conclusion.

Chapter 2 Background

2.1 Graphics pipeline

We can view the graphics pipeline as four distinct, sequential steps: vertex processing, rasterization, pixel processing, and writeback. The table below lists the inputs and outputs of these four steps and briefly explains their operations.

[Figure: Vertex Processing, Rasterization, Pixel Processing, Writeback; labels: Memory, Scene]

Fig.2-1 Four steps of graphics pipeline

| Step | Input | Output | Operation |
|---|---|---|---|
| Vertex processing | Vertices with 3D coordinates | Vertices positioned in the 2D scene | Transforms each 3D vertex in world space to a 2D vertex on the scene |
| Rasterization | Primitives (triangles) assembled from vertices | Fragments | Interpolates each primitive into a number of fragments |
| Pixel processing | Fragments | 'Finalized' pixels with final color values | Colors each fragment according to its information |
| Writeback | 'Finalized' pixels with final color values | Image composed of 'finalized' pixels | Uses the frame buffer, which stores pixels, to assemble a frame |

Table.2-1 Input, output and operation in each step of graphics pipeline

2.2 Vertex processing

At the input of the vertex processing step, each primitive consists of three vertex coordinates, vertex normal values and other information, such as lighting and texture coordinates. At the beginning, all vertices are represented in 3D coordinates with three values {x, y, z}. In order to use a uniform matrix representation for affine transformations, we convert the Cartesian coordinates (3D coordinates) to homogeneous coordinates, which are quadruples of the form {X, Y, Z, W}, where {X, Y, Z, W} = {xW, yW, zW, W} and in most cases W is 1. After the conversion, we can easily use a sequence of matrix operations to transform the coordinates of vertices. Fig. 2-2 shows the steps of vertex processing in a typical graphics pipeline, which consists of the following stages:

[Figure: vertex information (homogeneous coordinates, world coordinates) passes through Model-view Transformation (eye coordinates, view coordinates), Projection Transformation (clip coordinates), Clipping, Perspective Division (dehomogenize, normalized device coordinates) and Viewport Mapping (window coordinates); the resulting vertex information goes to Rasterization.]

Fig.2-2 The steps of vertex processing

2.2.1 Model-view transformation

The modeling transformation may reshape and move primitives with respect to the position of the viewer (eye position: {eye_x, eye_y, eye_z}), because the viewer is often not located at the origin of the world coordinates. Therefore, we must move the position of the viewer to the origin and move all the vertices along with it. Formula 1 shows the matrix that we use to transform the position of the viewer to the origin.

$$
T = \begin{bmatrix} 1 & 0 & 0 & eye_x \\ 0 & 1 & 0 & eye_y \\ 0 & 0 & 1 & eye_z \\ 0 & 0 & 0 & 1 \end{bmatrix}
\;\Rightarrow\;
T^{-1} = \begin{bmatrix} 1 & 0 & 0 & -eye_x \\ 0 & 1 & 0 & -eye_y \\ 0 & 0 & 1 & -eye_z \\ 0 & 0 & 0 & 1 \end{bmatrix}
\qquad (1)
$$

Fig.2-3 Formula1

Besides moving the position, we must change the directions of the x-axis, y-axis and z-axis with respect to the orthogonal direction (u), the up-direction vector (v), and the viewing direction (n) of the viewer. Fig. 2-4 shows the relations between u, v and n. Formula 2 shows the matrix that we use for this transformation.

Fig.2-4 The relations between (u, v, n)

$$
B = \begin{bmatrix} u_x & v_x & n_x & 0 \\ u_y & v_y & n_y & 0 \\ u_z & v_z & n_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\;\Rightarrow\;
B^{-1} = \begin{bmatrix} u_x & u_y & u_z & 0 \\ v_x & v_y & v_z & 0 \\ n_x & n_y & n_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\qquad (2)
$$

Fig.2-5 Formula2

This new orthogonal coordinate system is usually called the viewing-coordinate system or the u-v-n system. Because these two transformations are both multiplications by a 4×4 matrix in homogeneous coordinates, they can be combined into a single matrix (Formula 3). Applying it to a vertex takes 16 floating-point multiplications and 12 floating-point additions. As a result, the model-view transformation carries us to eye coordinates, where the viewer is at the origin and the directions of the x-axis, y-axis, and z-axis have changed. In the model-view transformation, we thus translate all vertices from world coordinates to eye coordinates. Then, we need to project all vertices onto the view plane.

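$$
B^{-1}T^{-1} = \begin{bmatrix} u_x & u_y & u_z & 0 \\ v_x & v_y & v_z & 0 \\ n_x & n_y & n_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & -eye_x \\ 0 & 1 & 0 & -eye_y \\ 0 & 0 & 1 & -eye_z \\ 0 & 0 & 0 & 1 \end{bmatrix}
\qquad (3)
$$

Fig.2-6 Formula3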
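As an illustration only, the following Python sketch builds the combined model-view matrix from an inverse translation and an inverse rotation and applies it to homogeneous vertices. The function names (`translation_inverse`, `rotation_inverse`, `model_view`) and the sample eye position and basis are assumptions made for this example; they do not come from the thesis.

```python
def mat_mul(a, b):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def mat_vec(m, v):
    """Multiply a 4x4 matrix by a homogeneous column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


def translation_inverse(eye):
    """Matrix that moves the viewer position to the origin."""
    ex, ey, ez = eye
    return [[1, 0, 0, -ex],
            [0, 1, 0, -ey],
            [0, 0, 1, -ez],
            [0, 0, 0, 1]]


def rotation_inverse(u, v, n):
    """Matrix whose rows are the orthonormal viewer axes u, v, n."""
    return [[u[0], u[1], u[2], 0],
            [v[0], v[1], v[2], 0],
            [n[0], n[1], n[2], 0],
            [0, 0, 0, 1]]


def model_view(eye, u, v, n):
    """Combined matrix: translate first, then rotate into the u-v-n system."""
    return mat_mul(rotation_inverse(u, v, n), translation_inverse(eye))


# Hypothetical viewer at (1, 2, 3) looking along the world x-axis.
eye = (1.0, 2.0, 3.0)
u, v, n = (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)
m = model_view(eye, u, v, n)

viewer_in_eye_space = mat_vec(m, [1.0, 2.0, 3.0, 1.0])
point_ahead_in_eye_space = mat_vec(m, [2.0, 2.0, 3.0, 1.0])
```

With these sample values the viewer itself lands on the origin, and a point one unit ahead of the viewer lands on the n axis of the eye coordinates.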

2.2.2 Projection transformation

Like a real camera, once we decide the position and the directions of the viewer, all the objects (which consist of vertices) are projected onto a plane (the view plane, defined by the six numbers {x_max, x_min, y_max, y_min, z_max, z_min}) to show what we see. There are also a near plane and a far plane that limit the space we can see, and we usually call this limited space the view volume. There are two types of projection: orthogonal (orthographic) projection and perspective projection.

The orthogonal projection is a simple projection in which the projectors are perpendicular to the view plane. In this projection, the z values of objects only define their depth. The only thing we must do is normalize the view volume so that it becomes a cube ranging from -1 to 1 in each dimension (the canonical view volume). The projection transformation is shown in Formula 4.

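$$
P = \begin{bmatrix}
\dfrac{2}{x_{max}-x_{min}} & 0 & 0 & -\dfrac{x_{max}+x_{min}}{x_{max}-x_{min}} \\
0 & \dfrac{2}{y_{max}-y_{min}} & 0 & -\dfrac{y_{max}+y_{min}}{y_{max}-y_{min}} \\
0 & 0 & \dfrac{2}{z_{max}-z_{min}} & -\dfrac{z_{max}+z_{min}}{z_{max}-z_{min}} \\
0 & 0 & 0 & 1
\end{bmatrix}
\qquad (4)
$$

Fig.2-7 Formula4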
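A minimal sketch of the orthogonal projection, assuming the matrix is the usual normalization that maps each range [min, max] onto [-1, 1]. The function name `orthographic` and the sample view volume are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix by a homogeneous column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


def orthographic(xmin, xmax, ymin, ymax, zmin, zmax):
    """Scale and shift the view volume onto the canonical cube [-1, 1]^3."""
    return [
        [2 / (xmax - xmin), 0, 0, -(xmax + xmin) / (xmax - xmin)],
        [0, 2 / (ymax - ymin), 0, -(ymax + ymin) / (ymax - ymin)],
        [0, 0, 2 / (zmax - zmin), -(zmax + zmin) / (zmax - zmin)],
        [0, 0, 0, 1],
    ]


# Hypothetical view volume: x in [0, 4], y in [-2, 2], z in [1, 5].
p = orthographic(0, 4, -2, 2, 1, 5)
low_corner = mat_vec(p, [0, -2, 1, 1])
high_corner = mat_vec(p, [4, 2, 5, 1])
center = mat_vec(p, [2, 0, 3, 1])
```

The two opposite corners of the view volume map to the corners of the canonical view volume, and its center maps to the origin.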

The perspective projection is a more complicated transformation than the orthogonal projection, but it produces more realistic images by changing the sizes of objects according to their distances. Therefore, an object far away appears smaller than the same object nearby. In this projection, the x value and y value of an object are divided by its z value. In homogeneous coordinates, this kind of division can be implemented by just changing the w value. Formula 5 shows the perspective projection matrix. This matrix also transforms the view volume to the canonical view volume.

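$$
P = \begin{bmatrix}
\dfrac{2 z_{min}}{x_{max}-x_{min}} & 0 & -\dfrac{x_{max}+x_{min}}{x_{max}-x_{min}} & 0 \\
0 & \dfrac{2 z_{min}}{y_{max}-y_{min}} & -\dfrac{y_{max}+y_{min}}{y_{max}-y_{min}} & 0 \\
0 & 0 & \dfrac{z_{max}+z_{min}}{z_{max}-z_{min}} & -\dfrac{2 z_{max} z_{min}}{z_{max}-z_{min}} \\
0 & 0 & 1 & 0
\end{bmatrix}
\qquad (5)
$$

Fig.2-8 Formula5

The matrix is written here for a viewer looking along the positive z-axis (the n direction), so that the last row copies the z value into w.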

Each of these projection transformations consists of a 4×4 matrix multiplication. Therefore, we can also combine the projection transformation with the model-view transformation. At this point, we have a canonical view volume (clip coordinates), and we can easily check whether objects are within the sight of the viewer.

2.2.3 Clipping

Although we transform all objects from world coordinates to clip coordinates, many objects are outside the canonical view volume and will not be shown on the scene. Therefore, we must clip those objects to reduce the workload of the later stages.

Clipping in homogeneous coordinates is not strictly necessary, but it makes the clipping clean, fast, and simple. Besides, after dehomogenizing, (x, y, z) = (X/W, Y/W, Z/W), the signs of the x, y, z and w values are lost. Therefore, we could no longer tell whether objects are in front of or behind the viewer.

We first ignore the objects with w values smaller than zero, because they are behind the viewer. Then, we can apply Cyrus-Beck clipping to test whether a vertex V is in the canonical view volume. Formula 6 shows the test. By this test, we remove the vertices that are out of sight, and the others continue to the next steps.

$$
-1 < \frac{i}{w} < 1
\;\Rightarrow\;
(w + i) > 0 \ \text{ and } \ (w - i) > 0,
\qquad i \in \{x, y, z\}
\qquad (6)
$$

Fig.2-9 Formula6
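The test can be sketched in a few lines of Python. This is only an illustration under the assumption that a vertex is kept when w is not negative and every coordinate lies strictly between -w and w; the function name `inside_view_volume` is hypothetical.

```python
def inside_view_volume(x, y, z, w):
    """Return True if the homogeneous vertex lies inside the canonical view volume."""
    if w < 0:
        # Vertices with negative w are behind the viewer and are ignored first.
        return False
    # For each coordinate i, -w < i < w is checked as (w + i) > 0 and (w - i) > 0.
    return all((w + i) > 0 and (w - i) > 0 for i in (x, y, z))


visible = inside_view_volume(0.5, -0.5, 0.25, 1.0)
too_far_right = inside_view_volume(2.0, 0.0, 0.0, 1.0)
behind_viewer = inside_view_volume(0.0, 0.0, 0.0, -1.0)
```

Only additions, subtractions and sign checks are needed, so no division by w has to be done before clipping.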

2.2.4 Perspective division

By now, all vertices have been transformed from world coordinates to eye coordinates, and the vertices that are out of sight have been removed. At this step, we transform objects from 3D coordinates to 2D coordinates and decide the position of each vertex on the scene. In the projection transformation, we defined how vertices are projected onto 2D coordinates, and this information has been stored in the w value. Therefore, the function of perspective division is just to divide (x, y, z) by the w value and discard the w value. We dehomogenize each vertex using Formula 7.

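$$
\begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
=
\begin{bmatrix} X/W \\ Y/W \\ Z/W \end{bmatrix},
\qquad W \neq 0
\qquad (7)
$$

Fig.2-10 Formula7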

2.2.5 Viewport matrix

Finally, we decide the position of each vertex on the scene, scaled by the resolution of the scene. We therefore transform the normalized (x, y) position of each vertex to a scene position. Assume that the resolution of the scene is w × h; then (-1, -1) is mapped to (0, 0) and (1, 1) is mapped to (w, h). We use Formula 8 for this transformation.

$$
\begin{bmatrix} x' \\ y' \end{bmatrix}
=
\begin{bmatrix} \dfrac{w}{2}\,(x + 1) \\ \dfrac{h}{2}\,(y + 1) \end{bmatrix}
\qquad (8)
$$

Fig.2-11 Formula8
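The last two steps of vertex processing can be sketched together as follows. The function names `perspective_divide` and `viewport` and the 640 × 480 resolution are assumptions made for this example.

```python
def perspective_divide(X, Y, Z, W):
    """Dehomogenize a vertex: divide (X, Y, Z) by W and discard W."""
    if W == 0:
        raise ValueError("W must be non-zero")
    return (X / W, Y / W, Z / W)


def viewport(x, y, width, height):
    """Map normalized (x, y) in [-1, 1] to scene coordinates in [0, width] x [0, height]."""
    return (width / 2 * (x + 1), height / 2 * (y + 1))


# Hypothetical clip-space vertex and a 640 x 480 scene.
x, y, z = perspective_divide(1.0, -1.0, 0.5, 2.0)
scene_position = viewport(x, y, 640, 480)
```

Here the vertex (1, -1, 0.5, 2) becomes (0.5, -0.5, 0.25) after the division, and its scene position is (480, 120).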

2.3 Programmable graphics pipeline

The programmable graphics pipeline is nowadays the most popular solution to the requirements of both performance and flexibility in computer graphics. With the rapid development of computer graphics applications, such as 3D games, virtual reality and digital life, the requirements on effects and performance have become higher. To meet all kinds of user requirements, the programmable graphics pipeline has been introduced into graphics hardware, and many complicated function units have been added. Unlike the fixed-functionality (non-programmable) graphics pipeline, the programmable graphics pipeline has new graphics processing units: the vertex shader unit and the pixel shader unit. These two processing units give the graphics pipeline the flexibility to deal with all kinds of computation requirements while retaining the capability for complicated computation.

Chapter 3 Design

3.1 Analysis of shaders

The architecture of the vertex/pixel shader in DirectX (the GPU specification) is shown below:

[Figure: PC, Instruction slot, Instruction Decoder, Register file, Source register modifier, Computation Unit, Destination register modifier; legend: Storage, Logic]

Fig.3-1 The architecture of vertex/pixel shader

There are several units in both the vertex shader and the pixel shader:

1. Program counter: a register which stores the address of the instruction being executed.
2. Instruction slot: a storage unit which stores all shader codes for the vertex/pixel shader(s).
3. Instruction decoder: a combinational circuit which translates an instruction into the control signals of the data path.
4. Register file: a storage unit which contains all inputs and outputs of any computation for each vertex or pixel.
5. Source register modifier: a simple computation unit which can swizzle or negate source data.
6. Computation unit: the main computation unit, which processes complex operations (e.g. add, mul, mad, ...).
7. Destination register modifier: a simple computation unit which is similar to the source register modifier, but whose target is the destination register.

A DR-shader must support all functions of both shader types to be a multi-function shader. Therefore, it also contains these units, and it must duplicate the units which cannot be shared between the vertex shader type and the pixel shader type.

Firstly, we consider which units in the DR-shader can be shared between the vertex shader type and the pixel shader type, to reduce the hardware overhead of the DR-shader. The sharing policies are:

1. A storage unit cannot be shared if and only if it must store data (states, instructions or temporary results) for the vertex shader and the pixel shader simultaneously.
2. All logic units are sharable.

Under these policies, we decide that the source register modifier, the computation unit, and the destination register modifier are sharable units, because all of them are logic units. The instruction slot is a non-sharable unit, for it must store vertex shader codes and pixel shader codes at the same time. We cannot yet decide whether the program counter and the register file can be shared. We will make that decision when we discuss the architecture and flexibility of the DR-shader.

Secondly, we consider how to design those sharable units for both shader types. Among the sharable units, the source modifier and the destination modifier are the same in the vertex shader type and the pixel shader type. Therefore, we focus on how to design a sharable computation unit in the following sections. We make some assumptions about the architecture of vertex and pixel shaders in order to design a sharable computation unit, listed below:

1. Single issue and single execution: shaders expose data parallelism better than instruction parallelism, so we use single issue, and multiple shaders executing separately instead of multi-execution.
2. The widths of all operations in the computation unit are i*v bits in vector form, [...] requirement described in DirectX.

3.2 Analysis of Computation requirements

Before designing a sharable computation unit for the vertex shader type and the pixel shader type, we need to understand the data used, the function units and the processing flow of every vertex and pixel instruction, in order to decide how to design the computation unit in the DR-shader. We divide all vertex and pixel shader instructions in DirectX into three types by the data they use and their processing flows:

1. Vector type: separately computes the four fields (x, y, z, w) of the source registers and produces four results.
2. Scalar type: only does a computation on one field of a source register and produces one result. For this type of instruction we use a modified second-order Taylor formula to reduce the complexity of the computation. (See Appendix A)
3. Non-computation type: only sends the data of the source register to a bus without any computation.

Below, we show which instructions belong to the three types, with their operations and computation requirements. In the operations, c stands for each of the four fields x, y, z and w.

| Instruction | Belong | Operations | Requirements |
|---|---|---|---|
| add, sub | VS, PS | Dst.c = Src0.c + Src1.c | 2in-fpSUM32 ×4 |
| cmp | PS | Dst.c = (Src0.c >= 0)? Src1.c : Src2.c | 2in-MUX32 ×4 |
| dp2add | VS | Dst = Src0.x * Src1.x + Src0.y * Src1.y + Src2.w | fpMUL32 ×2, 3in-fpSUM32 ×1 |
| dp3 (vs), dp3 (ps) | VS, PS | Dst = Src0.x * Src1.x + Src0.y * Src1.y + Src0.z * Src1.z | fpMUL32 ×3, 3in-fpSUM32 ×1 |
| dp4 | VS, PS | Dst = Src0.x * Src1.x + Src0.y * Src1.y + Src0.z * Src1.z + Src0.w * Src1.w | fpMUL32 ×4, 4in-fpSUM32 ×1 |
| max | VS, PS | Dst.c = (Src0.c > Src1.c)? Src0.c : Src1.c | 2in-fpSUM32 ×4, 2in-MUX32 ×4 |
| min | VS, PS | Dst.c = (Src0.c < Src1.c)? Src0.c : Src1.c | 2in-fpSUM32 ×4, 2in-MUX32 ×4 |
| mul | VS, PS | Dst.c = Src0.c * Src1.c | fpMUL32 ×4 |
| mad | VS, PS | Dst.c = Src0.c * Src1.c + Src2.c | fpMUL32 ×4 |
| sge | VS | Dst.c = (Src0.c >= Src1.c)? 1.0f : 0.0f | 2in-fpSUM32 ×4, 2in-MUX32 ×4 |
| slt | VS | Dst.c = (Src0.c < Src1.c)? 1.0f : 0.0f | 2in-fpSUM32 ×4, 2in-MUX32 ×4 |
| sgn | VS | Dst.c = (Src0.c > 0)? 1.0f : (Src0.c = 0)? 0.0f : -1.0f | Compare to 0₃₂ ×4, 2in-MUX32 ×8 |

Table.3-1 Requirements for vector type instructions
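To make the per-field operations concrete, a few of the vector type instructions can be written as a small reference model in Python. This is only a sketch: registers are modeled as 4-tuples (x, y, z, w), the function names simply reuse the instruction mnemonics, and `dp3` is assumed to sum the x, y and z products, as a three-component dot product does.

```python
def add(s0, s1):
    """Field-wise sum of two registers."""
    return tuple(a + b for a, b in zip(s0, s1))


def mad(s0, s1, s2):
    """Field-wise multiply-add: s0 * s1 + s2."""
    return tuple(a * b + c for a, b, c in zip(s0, s1, s2))


def dp3(s0, s1):
    """Dot product over the x, y and z fields."""
    return sum(a * b for a, b in zip(s0[:3], s1[:3]))


def dp4(s0, s1):
    """Dot product over all four fields."""
    return sum(a * b for a, b in zip(s0, s1))


def cmp(s0, s1, s2):
    """Field-wise select: s1 where s0 >= 0, otherwise s2."""
    return tuple(b if a >= 0 else c for a, b, c in zip(s0, s1, s2))


def sge(s0, s1):
    """Field-wise 'set on greater or equal'."""
    return tuple(1.0 if a >= b else 0.0 for a, b in zip(s0, s1))


def sgn(s0):
    """Field-wise sign: 1.0, 0.0 or -1.0."""
    return tuple(1.0 if a > 0 else (0.0 if a == 0 else -1.0) for a in s0)


r0 = (1.0, 2.0, 3.0, 4.0)
r1 = (2.0, 2.0, 2.0, 2.0)
dot3 = dp3(r0, r1)
dot4 = dp4(r0, r1)
```

The model shows why dp3 needs three multipliers and one three-input sum while dp4 needs four multipliers and one four-input sum.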

Table.3-2 Requirements for scalar type instructions

| Instruction | Belong | Operations | Requirements |
|---|---|---|---|
| branch | VS | PC = (Src0 != 0)? PC+1 : Src1 | Bus to program counter |
| texld | PS | Dst = Mem#Src1(Src0) | Bus to texture memory |

Table.3-3 Requirements for non-computation type instructions

From the above tables, we conclude that only five kinds of computation are needed to execute any instruction. The maximum requirement of each kind is shown in the following table:

| Computation | fpMUL | 2in-fpSUM | 3in-fpSUM | 4in-fpSUM | Compare to 0 |
|---|---|---|---|---|---|
| Maximum requirement | 4 | 4 | 1 | 1 | 4 |

Table.3-4 Maximum requirement of each computation
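The maximum requirement is simply the largest count of each kind of computation over all instructions. The sketch below recomputes it from the per-instruction requirements of the vector type instructions; the dictionary layout and the name `maximum_requirements` are assumptions for this example, and the multiplexers are left out because they are not among the five kinds of computation.

```python
# Per-instruction requirements, transcribed from the vector type table.
REQUIREMENTS = {
    "add/sub": {"2in-fpSUM": 4},
    "cmp": {"2in-MUX": 4},
    "dp2add": {"fpMUL": 2, "3in-fpSUM": 1},
    "dp3": {"fpMUL": 3, "3in-fpSUM": 1},
    "dp4": {"fpMUL": 4, "4in-fpSUM": 1},
    "max": {"2in-fpSUM": 4, "2in-MUX": 4},
    "min": {"2in-fpSUM": 4, "2in-MUX": 4},
    "mul": {"fpMUL": 4},
    "mad": {"fpMUL": 4},
    "sge": {"2in-fpSUM": 4, "2in-MUX": 4},
    "slt": {"2in-fpSUM": 4, "2in-MUX": 4},
    "sgn": {"Compare to 0": 4, "2in-MUX": 8},
}

# The five kinds of computation considered for sharing.
KINDS = ["fpMUL", "2in-fpSUM", "3in-fpSUM", "4in-fpSUM", "Compare to 0"]


def maximum_requirements(requirements, kinds):
    """For each kind, take the largest count needed by any single instruction."""
    return {kind: max(req.get(kind, 0) for req in requirements.values())
            for kind in kinds}


maximum = maximum_requirements(REQUIREMENTS, KINDS)
```

Because the shader is single issue, only one instruction runs at a time, so the maximum over instructions (rather than the sum) bounds the hardware that must be present.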

3.3 Design of computation unit

In this section, we want to implement the computation unit with an area as small as possible while keeping one-cycle execution. The tradeoff in the design of the computation unit is that the more sharing we want, the more routing overhead we may have. Therefore, we must carefully decide whether the functions of any computation unit can be shared by others. To solve this problem, we divide each computation into sub-function nodes, each with its own requirement, to discover potential sharing possibilities, and then use an algorithm to choose nodes covering all computations. The divided computations are called the trees of computation requirements.

[Figure: trees that divide fpMUL32, 2in-fpSUM32, 3in-fpSUM32, 4in-fpSUM32 and Compare to 0₃₂ into sub-function nodes such as multiply24, adder8, CMP&SWAP, ALIGN+INV, partial sort, 2in-adder24, 3in-adder24 and normalize, each annotated with its maximum requirement. Legend: a node can be divided into sub-function nodes; another way to be divided; logic; maximum requirement.]

Fig.3-2 Trees of computation requirements

Covering means that if we choose a node in the tree of computation requirements, that node is covered. Besides, if all children of a node are covered, the node is also covered. We will compare the average and maximum area requirements over all vertex and pixel instructions for each algorithm, and choose the one with the smallest average and maximum area requirement.
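The covering rule can be sketched as a short recursive check. The tree below is a simplified, hypothetical excerpt loosely based on the node names of the computation trees, and the handling of alternative partitions (a node is covered if all children of any one partition are covered) is an assumption of this sketch.

```python
# Each node maps to a list of alternative partitions; each partition is a list of children.
# Simplified, hypothetical excerpt of the trees of computation requirements.
TREE = {
    "2in-fpSUM32": [["CMP&SWAP", "ALIGN+INV", "2in-adder24", "normalize25"]],
    "3in-fpSUM32": [
        ["3in-partial sort", "ALIGN+INV", "3in-adder24", "normalize26"],
        ["2in-fpSUM32", "2in-fpSUM32"],  # another way to be divided
    ],
}


def is_covered(node, chosen, tree):
    """A node is covered if it is chosen, or if all children of one partition are covered."""
    if node in chosen:
        return True
    partitions = tree.get(node, [])  # leaf nodes have no partitions
    return any(all(is_covered(child, chosen, tree) for child in partition)
               for partition in partitions)


chosen_nodes = {"CMP&SWAP", "ALIGN+INV", "2in-adder24", "normalize25"}
two_input_covered = is_covered("2in-fpSUM32", chosen_nodes, TREE)
three_input_covered = is_covered("3in-fpSUM32", chosen_nodes, TREE)
```

In this example, choosing the four sub-function nodes of the two-input sum covers the two-input sum itself, and through the second partition it also covers the three-input sum.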

3.3.1 Sharing all units within nin-fpSUM

In nin-fpSUM, we find that some logic can be shared when we divide nin-fpSUM into sub-function nodes. There are two possible partitions of nin-fpSUM, which are:

1. partition 3 or 4in-fpSUM to several 2in-fpSUMs

[Figure: Src1, Src2, Src3 and Src4 feed three 2in-fpSUM32 units, which together act as one 4in-fpSUM and produce the result.]

Fig.3-3 How three 2in-fpSUM32s be reconfigured to one 4in-fpSUM32

[Figure: two block diagrams with inputs Src1 to Src4, CMP&SWAP stages (the largest operand is selected for fpadd4), ALIGN+INV stages, 2in-adder24 and 2in-adder25 stages, normalize25 and normalize26 stages, and the result.]

Fig.3-4 How three 2in-fpSUM32s be reconfigured to one 4in-fpSUM32

Although “CMP&SWAP”, “ALIGN+INV” and “normalize” can easily be shared within nin-fpSUM, the adders are a problem, especially when we use three 2in-adders to form a 4in-adder. Of these three 2in-adders, two act as carry-save adders, and the last one acts as a normal adder that adds the carry and the sum of the second carry-save adder. However, there are three problems we need to overcome in this kind of design:

1. We need to add four 24-bit numbers with two 24-bit carry-save adders and one 24-bit normal adder. Do the adders need any extension?

2. After “ALIGN + INV”, there may be three carry-ins from the inverters. How do we add the three carry-ins with the existing adders?

3. The result has 24+2 bits, and the sources may be negative because of the inverters. How do we handle the sign extension of the negative sources?

(36)

In problem 1, the inputs of the first carry-save adder come directly from “ALIGN + INV”, so it does not need to be extended. The normal adder must give a 26-bit result, so it must be extended to a 25-bit adder. Does the second carry-save adder also need to be extended to a 25-bit adder? The answer is no, because it does not need to process the highest bit of the carry from the first carry-save adder. The figure below illustrates the idea:

[Figure: carry-save adder 1 adds Src1, Src2, and Src3 (24 bits) into Sum1 and Carry1; carry-save adder 2 adds Sum1, Carry1, and Src4 (24 bits) into Sum2 and Carry2; the normal adder (24+1 bits) adds Sum2 and Carry2 into the result (24+2 bits).]

Fig.3-5 The solution of problem 1

In problem 2, “ALIGN + INV” may send a 1-bit carry-in to the adder for the two's-complement negation. How can we add these carry-ins, of which there are at most three, to the summands without any additional logic? To solve this problem, we use the vacant positions in the carries of the two carry-save adders to add two of the carry-ins. The last carry-in is then added as the normal carry-in of the normal adder.

[Figure: Carryin1 and Carryin2 fill the vacant positions in the carries of the two carry-save adders; Carryin3 is the normal carry-in of the normal adder.]

Fig.3-6 The solution of problem 2

In problem 3, the final result has 26 bits and the summands may be negative. Therefore, we must add a compensation, which we call sign-compensation, to the temporary result to get the correct result.

Number of minuses    Sign-compensation
0                    00
1                    11
2                    10
3                    01

[Figure: each negative 24-bit source would be sign-extended with the two bits 11; instead, the 24-bit sources are added into a temporary result (24+2 bits), and the sign-compensation (01 for three minuses) is added to obtain the real result (24+2 bits).]

Fig.3-7 The solution of problem 3
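The three solutions can be checked together with a small bit-level model. The following Python sketch is only an illustration under stated assumptions: the function names (`carry_save`, `add4`, `reference`) are hypothetical, Python's unbounded integers stand in for the 24-, 25-, and 26-bit datapaths, and the result is compared modulo 2^26 without modelling how the sign of the result is interpreted afterwards.

```python
WIDTH = 24
MASK24 = (1 << WIDTH) - 1
MASK26 = (1 << (WIDTH + 2)) - 1

# Sign-compensation indexed by the number of minuses (table above).
SIGN_COMPENSATION = {0: 0b00, 1: 0b11, 2: 0b10, 3: 0b01}


def carry_save(a, b, c):
    """3:2 compression: a + b + c == partial_sum + carry."""
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1  # bit 0 is always vacant
    return partial_sum, carry


def add4(magnitudes, negative):
    """Add four 24-bit magnitudes, at most three of them negated.

    Assumption: negation is modelled as bitwise inversion in ALIGN+INV
    plus a 1-bit carry-in, as in problem 2.
    """
    minuses = sum(negative)
    assert len(magnitudes) == 4 and minuses <= 3
    srcs = [(~m & MASK24) if neg else m for m, neg in zip(magnitudes, negative)]
    carry_ins = [1] * minuses + [0] * (3 - minuses)

    sum1, carry1 = carry_save(srcs[0], srcs[1], srcs[2])
    carry1 |= carry_ins[0]               # first carry-in uses vacant bit 0
    sum2, carry2 = carry_save(sum1, carry1, srcs[3])
    carry2 |= carry_ins[1]               # second carry-in uses vacant bit 0
    temp = sum2 + carry2 + carry_ins[2]  # normal adder with normal carry-in

    # Add the sign-compensation to the two extra bits and keep 24+2 bits.
    return (temp + (SIGN_COMPENSATION[minuses] << WIDTH)) & MASK26


def reference(magnitudes, negative):
    """Plain signed sum, reduced to 26 bits for comparison."""
    total = sum(-m if neg else m for m, neg in zip(magnitudes, negative))
    return total & MASK26


if __name__ == "__main__":
    mags, signs = [5, 4, 4, 4], [False, True, True, True]
    print(add4(mags, signs) == reference(mags, signs))
```

The table values follow from the fact that each negative source would contribute the bits 11 (that is, 3) to the two extended positions, so k minuses contribute 3k modulo 4.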

3.3.2 Algorithm1 & 2 to choose nodes

We use two algorithms to choose nodes covering all computations:

• Algorithm1 - minimum routing overhead: use fewer choices to cover all computation requirements. The advantage of this algorithm is that it may incur less routing overhead while still sharing enough logic. The disadvantage is that it may lose some sharing opportunities that would give a smaller area. The steps of the algorithm are described below:

Step1: collect nodes with the same logic (sharable nodes) and indicate the maximum requirement.

Step2: group nodes into several sets such that there are no links or identical nodes between different sets, and ignore the sets that have only one computation. See below:


[Figure: computation trees of the 32-bit fpMUL, fpSUM2, fpSUM3, and fpSUM4, divided into sub-nodes such as CMP&SWAP, ALIGN+INV, partial sort, 24-bit adders, and normalize, each annotated with its requirement.]

Fig.3-8 Different sets of computation trees

Step3: For each set, we first choose a sharable node in the highest level and check whether all computations have been covered. If some computations have not been covered, we delete the chosen nodes together with all their children and choose another sharable node in the highest level. We repeat this until all computation requirements have been covered or no sharable node remains.


Fig.3-9 The result of minimum routing overhead

• Algorithm2 - maximum sharing logic: find more sharing choices to cover all computation requirements. The advantage of this algorithm is that it yields the most shared logic. However, the routing overhead may become more serious.

Step1, 2: the same as Step1 and Step2 in minimum routing overhead, to group nodes into several sets.

Step3: For each set, we first choose a sharable node in the lowest level and check whether all computations have been covered. If some computations have not been covered, we delete the chosen nodes together with all their children and choose another sharable node in the lowest level. We repeat this until all computation requirements have been covered or no sharable node remains.
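The difference between the two selection orders can be sketched on a toy tree. This is a loose reading of Step3 rather than the exact procedure: the function `choose_nodes`, the tuple-based tree encoding, and the toy set of sharable nodes are all assumptions made for illustration, and requirement counts and the coverage check are left out.

```python
def choose_nodes(tree, sharable, highest_first):
    """Pick sharable nodes from a computation tree.

    tree is (logic_name, [child trees]). With highest_first=True the topmost
    sharable node on each path is chosen and its children are dropped
    (minimum routing overhead). With highest_first=False the deepest sharable
    nodes are chosen instead (maximum sharing logic).
    """
    name, children = tree
    picked_below = []
    for child in children:
        picked_below += choose_nodes(child, sharable, highest_first)
    if name in sharable and (highest_first or not picked_below):
        return [name]  # this node stands for its whole subtree
    return picked_below


# Toy tree (assumption): a 4in-fpSUM32 built from 2in-fpSUM32 units.
FP_SUM4 = ("4in-fpSUM32", [
    ("2in-fpSUM32", [
        ("CMP&SWAP", []),
        ("ALIGN+INV", []),
        ("2in-adder24", []),
        ("normalize25", []),
    ]),
])
SHARABLE = {"2in-fpSUM32", "CMP&SWAP", "ALIGN+INV", "2in-adder24"}

if __name__ == "__main__":
    print(choose_nodes(FP_SUM4, SHARABLE, highest_first=True))
    print(choose_nodes(FP_SUM4, SHARABLE, highest_first=False))
```

Choosing at the highest level shares whole 2in-fpSUM32 units, while choosing at the lowest level shares their internal blocks, which needs more multiplexers.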

Fig.3-10 The result of maximum sharing logic

The following figures show, as an example, how three 2in-fpSUM32s are reconfigured into one 4in-fpSUM32 under these two algorithms.

3.3.3 Algorithm3-optimal area-time:

In minimum routing overhead and maximum sharing logic, some factors for sharing logic have not been considered. These two algorithms choose nodes as the basic unit but do not consider the different proportions within nodes. Besides, the silicon area is not the same for all nodes. Therefore, we use a new algorithm.

• Search by integer programming: weight each sharable node with its hardware cost and use integer programming to minimize the total cost. Here, we estimate the hardware cost of each node by the number of multiplexer areas it may need. The cost function has two cases. If a sharable node has the maximum requirement, its logic will be shared with the other nodes that need the same logic. Therefore, its cost is its implementation area without routing overhead divided by the area of a multiplexer. All other sharable nodes have a cost equal to three, representing the routing overhead of two input multiplexers and one output multiplexer.

\[
\text{cost} =
\begin{cases}
\dfrac{\text{area of an implementation}}{\text{area of a multiplexer}} & \text{for one randomly chosen node with the maximum requirement} \\[2ex]
3 \ (\text{meaning routing overhead}) & \text{otherwise}
\end{cases}
\]

Fig.3-11 Cost function of search by integer programming

The advantage of this algorithm is that it considers both sharing logic and routing overhead. However, the disadvantage is that the quality of the result depends on the precision of the cost. For the integer programming, we change the display of the computation trees and give more information.

[Figure: computation trees in which each node is labeled "Logic (Ri) cost". One kind of line means a function node is divided into several sub-function nodes; the other kind means a function node has several kinds of design. Node labels: fpMUL32 (1) 50.7; Compare to 0 32 (1) 1.2; adder8 (2) 3.5; Multiply24 (1) 43.7; 2in-fpSUM32 (1) 48; CMP&SWAP (1) 2.3; ALIGN+INV (1) 3.6; 2in-adder24 (1) 2.6; normalize25 (1) 3.4; 3in-fpSUM32 (1) 52.7; 3in-partial sort (1) 18.4; ALIGN+INV (2) 3; 3in-adder24 (1) 21.6; normalize26 (1) 3; CMP&SWAP (2) 3; 2in-adder24 (2) 3; 2in-fpSUM32 (2) 3; 4in-fpSUM32 (1) 76.1; 4in-partial sort (1) 27.6; ALIGN+INV (3) 3; 4in-adder24 (1) 32.4; CMP&SWAP (3) 3; 2in-adder24 (3) 3; 2in-fpSUM32 (3) 3; normalize26 5.3; normalize25 3.4.]


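Step1, 2: the same as Step1 and Step2 in minimum routing overhead, to group nodes into several sets.

Step3: To find the optimal result using integer programming, we set

• Variables

\[ I_i = \#\ \text{of implementations from node}_i, \qquad I_i \ge 0 \quad \forall i \]

• Constraints

\[ I_i + \frac{1}{R_c} \ast Req_c \ge Req_i \quad \text{for each node}_c \in \text{children of node}_i,\ \text{if node}_i \text{ has real lines linking to children} \]

\[ I_i + \sum_{c} \frac{1}{R_c} \ast Req_c \ge Req_i \quad \text{for all node}_c \in \text{children of node}_i,\ \text{if node}_i \text{ has dot lines linking to children} \]

• Objective

\[ \text{minimize} \sum_{\forall i} (\text{cost of node}_i) \ast I_i \]

Step4: reduce all Reqs from leaves to roots and get all constraints for integer programming (for a leaf node c, Req_c reduces to I_c).

[Figure: example tree. Node A has a requirement of 4 and two children A1 and A2; A1 has children B (2) and C (4); A2 has the child D (4); D has two children D1 and D2. Each node is annotated with its Req.]

The constraints of each node are:

\[
\begin{aligned}
\text{node}_A &: \ I_A + Req_{A1} + Req_{A2} \ge 4 \\
\text{node}_{A1} &: \ \begin{cases} I_{A1} + \frac{1}{2} Req_B \ge Req_{A1} \\ I_{A1} + \frac{1}{4} Req_C \ge Req_{A1} \end{cases} \\
\text{node}_{A2} &: \ I_{A2} + \frac{1}{4} Req_D \ge Req_{A2} \\
\text{node}_D &: \ I_D + Req_{D1} + Req_{D2} \ge Req_D
\end{aligned}
\]

Reducing the Reqs of the lower levels gives:

\[
\begin{cases}
I_A + Req_{A1} + Req_{A2} \ge 4 \\
I_{A1} + \frac{1}{2} I_B \ge Req_{A1} \\
I_{A1} + \frac{1}{4} I_C \ge Req_{A1} \\
I_{A2} + \frac{1}{4} \left( I_D + I_{D1} + I_{D2} \right) \ge Req_{A2}
\end{cases}
\]

Finally, reducing up to the root gives:

\[
\begin{cases}
I_A + I_{A1} + \frac{1}{2} I_B + I_{A2} + \frac{1}{4} \left( I_D + I_{D1} + I_{D2} \right) \ge 4 \\
I_A + I_{A1} + \frac{1}{4} I_C + I_{A2} + \frac{1}{4} \left( I_D + I_{D1} + I_{D2} \right) \ge 4
\end{cases}
\]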


Step5: apply integer programming and get the result with minimum cost. The result of optimal area-time is shown below:

Fig.3-14 The result of optimal area-time algorithm

Then, we compare the results of the three algorithms.

3.3.4 Comparison within algorithms

Firstly, we show the comparison among the three algorithms:

Algorithm                   Average area requirement (um2)   Maximum area requirement (um2)
Minimum routing overhead    985,793.5938                     2,000,547.5
Maximum sharing logic       986,095.4688                     2,105,722.5
Optimal area-time           985,793.5938                     2,000,547.5

We find that the result of the optimal area-time algorithm is the same as that of minimum routing overhead because there are too few possible solutions.


Finally, we choose the result of minimum routing overhead/optimal area-time because it has the smallest average and maximum area requirements. Comparing the two kinds of result in detail, we find that they only differ in the choice of how to reconfigure 2in-fpSUM32s into a 3in-fpSUM32 or a 4in-fpSUM32. In addition, we analyze the sharing logic between the 2in-fpSUM32s and the 4in-fpSUM32 and find that the failure of result 2 lies in the routing overhead.

The critical path of minimum routing overhead/optimal area-time is:

Delay time = CMP&SWAP → ALIGN+INV → 2in-add24 → normalize25 → MUX32 → CMP&SWAP → ALIGN+INV → 2in-add24 → normalize25

= 3.93 + 6.14 + 4.47 + 4.72 + 0.76 + 3.87 + 6.01 + 2.54 + 7.56 (ns)

= 40ns with time overhead 0.76ns (1.9%)

The area requirement of minimum routing overhead/optimal area-time is:

Area = 2in-fpSUM32 * 3 + MUX32 * 2

= 313425 + 8837.5 (um2)

= 332027.5um2 with area overhead 8837.5um2 (2.66%)

The critical path of maximum sharing logic is:

Delay time = MUX32 → CMP&SWAP → MUX32 → CMP&SWAP → MUX32 → CMP&SWAP → ALIGN+INV → 2in-add24 → MUX32 → 2in-add24 → MUX32 → 2in-add25 → normalize26

= 0.6 + 3.9 + 0.75 + 3.59 + 0.77 + 5.08 + 6.24 + 1.15 + 0.86 + 1.42 + 0.85 + 9.31 + 6.63 (ns)

= 43.86ns with time overhead 2.12ns (4.83%)

The area requirement of maximum sharing logic is:

Area = 2in-fpSUM32 * 3 + MUX32 * 6

= 108972.5 + 106872.5 + 107345.0 + 18882.5 (um2)

= 332307.5um2 with area overhead 18882.5um2 (5.68%)


Because the time overhead and area overhead of maximum sharing logic are much larger than those of minimum routing overhead/optimal area-time, we finally choose the result of minimum routing overhead/optimal area-time as our design of the computation unit.

3.4 Architecture of DR-shader

After finishing the computation unit, we can build the DR-shader. The architecture of the DR-shader is shown below:

[Figure: blocks PC, Instruction slot (vertex shader), Instruction slot (pixel shader), Instruction Decoder, Context memory, Source register modifier, Computation Unit, Destination register modifier, and Register file, with a configuration signal.]

Fig.3-15 The architecture of DR-shader

In the architecture, there are some necessary hardware overheads:

• More logic in the sharable computation unit, with routing overhead, to support all vertex and pixel instructions
• Context memory to store the configuration of each instruction
• One more instruction slot to store vertex and pixel shader codes simultaneously

Therefore, the area of a DR-shader may be larger than the area of a vertex shader or a pixel shader, in exchange for the ability to reconfigure between the vertex shader type and the pixel shader type. In our simulations, we will examine whether this flexibility is worth adding to improve shader utilization.

3.5 Design of workloads monitor logic

In this section, we first describe the properties of the DR-shader and the hardware overhead. Then, we describe the design of the vertex/pixel workloads monitor logic. There are two assumptions about the reconfiguration property of the DR-shader:

1. Order of processing: In the beginning, all DR-shaders are configured to the vertex shader type because there is no pixel workload. Then, DR-shaders are frequently reconfigured to the vertex shader type or the pixel shader type according to the variation in the workloads between vertices and pixels, until all vertices have been processed. Finally, all DR-shaders are reconfigured to the pixel shader type for the remaining pixels.

2. Reconfiguring timing: The configuration of each DR-shader can only be changed after it finishes a vertex/pixel, to avoid needing one more register file for temporary results.

The purpose of the workloads monitor logic is to control the number of DR-shaders with pixel shader type in the DR-shader unit and to keep the stall cycles of all shaders as few as possible. To achieve this goal, we control the number of DR-shaders with pixel shader type based on three kinds of information:

• The expected number of DR-shaders with pixel shader type, which is equal to the number of used intervals in the pixel queue (the size of the intervals will be determined later).
• The current number of DR-shaders with pixel shader type, which is recorded in the workload monitor logic.
• Which DR-shaders finish their job.

At every cycle, we compute the difference between the expected and current numbers of DR-shaders with pixel shader type. If the expected number is larger than the current number, we change as many finishing DR-shaders with vertex shader type to the pixel shader type as the difference. Otherwise, we change as many finishing DR-shaders with pixel shader type to the vertex shader type as the difference.

[Figure: flowchart. If EDRP < CDRP, find and mark finishing DRP(s), then change (CDRP - EDRP) finishing DRP(s) to DRV(s). Otherwise, find and mark finishing DRV(s), then change (EDRP - CDRP) finishing DRV(s) to DRP(s).
DRP: DR-shader with pixel shader type
DRV: DR-shader with vertex shader type
EDRP: expected number of DR-shaders with pixel shader type
CDRP: current number of DR-shaders with pixel shader type]

Fig.3-16 Flowchart of workloads monitor logic
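One cycle of the flowchart can be sketched in software as follows. This is a minimal sketch, not the hardware design: the names `expected_pixel_shaders` and `monitor_step`, the list-based shader state, rounding the used pixels up to whole intervals, and capping the expected number at the number of DR-shaders are all assumptions.

```python
def expected_pixel_shaders(used_pixels, interval_size, total_dr):
    """EDRP: number of used intervals in the pixel queue.

    Assumptions: a partly filled interval counts as used, and the result
    is capped at the number of DR-shaders.
    """
    used_intervals = (used_pixels + interval_size - 1) // interval_size
    return min(total_dr, used_intervals)


def monitor_step(types, finishing, used_pixels, interval_size=32):
    """Return the shader types after one cycle of the monitor logic.

    types:     list of "V" (vertex type) or "P" (pixel type), one per DR-shader
    finishing: list of bool, True if that DR-shader finishes its job this cycle
    """
    edr_p = expected_pixel_shaders(used_pixels, interval_size, len(types))
    cdr_p = types.count("P")  # CDRP: current number with pixel shader type

    if edr_p < cdr_p:
        source, target, budget = "P", "V", cdr_p - edr_p
    else:
        source, target, budget = "V", "P", edr_p - cdr_p

    new_types = list(types)
    for index, (kind, done) in enumerate(zip(types, finishing)):
        if budget == 0:
            break
        if kind == source and done:  # only finishing shaders may change type
            new_types[index] = target
            budget -= 1
    return new_types


if __name__ == "__main__":
    print(monitor_step(["V", "V", "V", "P"], [True, False, True, True], used_pixels=96))
```

Because only finishing shaders are reconfigured, the current number may take several cycles to reach the expected number.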


Chapter 4 Simulation

4.1 Simulator of DR-shader

For this thesis, we build a cycle-based simulator referenced from SiS. The input of the simulator is 3Dmark05, from which we consider the information listed below:

• Whether a primitive is clipped (culled) or passes
• Number of tiles produced from each primitive
• Whether a tile is blocked by PreZ or not
• Number of pixels that can be produced from each passing tile
• Vertex shader codes and pixel shader codes

The output of the simulator is the execution time from vertex processing to pixel processing of a frame, together with information about shader utilization. There are also some parameters that we can set for different environments, listed below:

• Clip information
  - Throughput of the clipping unit
• PreZ information
  - Throughput of the PreZ unit
• Shader information
  - Throughput of the vertex input
  - Size of pixel queue
  - Numbers of DR-shaders, vertex shaders, and pixel shaders
  - Number of batches in each shader
  - Latencies of each instruction
• Texture information


  - Texture unit access cycles
  - Miss rate of the texture memory
  - Miss penalty
  - Throughput of texture units

[Figure: blocks Input, Triangle Setup, Clip, PreZ, Vertex Shader (V.S. Prog.), Pixel Queue, Pixel Shader (P.S. Prog.), Texture Memory, and Output; cycle-based simulation based on the SiS C-model.]

Fig.4-1 The cycle-based simulator based on SiS

4.2 Simulation1

In this section, we decide a proper proportion between vertex shaders and pixel shaders, and the size of the pixel queue. For this goal, we assume that the number of vertex shaders is three, with the other parameter settings listed below:

• Clip information
  - Throughput of the clipping unit = unlimited
• PreZ information
  - Throughput of the PreZ unit = unlimited
• Shader information
  - Throughput of the vertex input = unlimited


  - Number of vertex shaders = 3
  - Number of batches in each shader = 8
  - Latencies of each instruction = 8

Then, we gather the workload statistics of pixels in every cycle. The workload in each cycle is counted as the number of pixels in the cycle multiplied by their execution time. We display the pie chart of the pixels' workload:

[Figure: pie chart of the workload per cycle. <10: 99.37%; others: 0.63% (10~99: 0.28%, 100~999: 0.24%, 1000~9999: 0.7%, >10000: 0.3%). Average = 36.353; standard deviation = 279.114.]

Fig.4-2 The pie chart of the pixels’ workload in every cycle

We choose the average workload as the number of pixel shaders when there are three vertex shaders. Under 3 vertex shaders with 37 pixel shaders, we simulate the relation between the size of the pixel queue and the execution time:


[Figure: execution time (cycles, from 30,000,000 to 31,200,000) versus size of pixel queue (pixels, from 500 to 2500), with the points 1024 and unlimited marked.]

Fig.4-3 The relation between the size of pixel queue and execution time

From the graph, we choose a pixel queue size of 1024 pixels.

4.3 Simulation2

In this section, we decide the size of the intervals in the pixel queue and the number of vertex shaders and pixel shaders to be changed to DR-shaders. We use the parameters decided above, listed below:

• Clip information
  - Throughput of the clipping unit = unlimited
• PreZ information
  - Throughput of the PreZ unit = unlimited
• Shader information
  - Throughput of the vertex input = unlimited
  - Latencies of each instruction = 8
  - Number of batches in each shader = 8
  - Number of vertex shaders = 3
  - Size of pixel queue = 1024
  - Total number of shaders = 40 (3 + 37)
• Texture information
  - Texture unit access cycles = 8
  - Miss rate of the texture memory = 0
  - Throughput of texture unit = unlimited

Firstly, we simulate the relation between the size of the intervals and the execution time and get the graph below:

[Figure: execution time (cycles, from 15,000,000 to 31,000,000) versus number of DR-shaders (0 to 35) for interval sizes 32, 64, and 128.]

Fig.4-4 The relation between the size of intervals, number of DR-shaders, and execution time

It is apparent that the size of the intervals does not have a great influence on the execution time. Therefore, we choose an interval size of 32 pixels for flexibility. Secondly, we simulate the relation between the number of DR-shaders and the time-area product with this interval size:


[Figure: Time*Area (from 8.5E+15 to 1.45E+16) versus number of DR-shaders (0 to 35); the configuration with 3 VS and 37 PS and the point labeled 16 DR-shaders are marked.]

Fig.4-5 The relation between the number of DR-shaders and area-time product

The time-area product has a minimum value when the number of DR-shaders is equal to 16. For a detailed analysis, we list the time, area, and utilization of each shader type below:

Number of each kind of shader    Area (um2) [Ratio]        Time (cycles) [Speed up]    Time*Area [Ratio]
3 VS, 37 PS, 0 DR                447,827,831.4 [1]         30,373,118 [1]              13,601,927,566,796,306.56 [1]
0 VS, 16 PS, 24 DR               480,609,124 [1.073]       18,474,735 [1.644]          8,879,126,204,482,140 [0.653]

Table.4-1 The time, area, and area-time product
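The bracketed ratios in Table.4-1 can be reproduced from the area and time columns. The short Python check below is only a worked calculation; the dictionary and function names are illustrative.

```python
# Values copied from Table.4-1.
BASELINE = {"area_um2": 447_827_831.4, "time_cycles": 30_373_118}  # 3 VS, 37 PS
PROPOSED = {"area_um2": 480_609_124.0, "time_cycles": 18_474_735}  # 16 PS, 24 DR


def ratios(baseline, proposed):
    """Return (area ratio, speed-up, time*area ratio) of proposed vs. baseline."""
    area_ratio = proposed["area_um2"] / baseline["area_um2"]
    speed_up = baseline["time_cycles"] / proposed["time_cycles"]
    product_ratio = (proposed["area_um2"] * proposed["time_cycles"]) / (
        baseline["area_um2"] * baseline["time_cycles"]
    )
    return area_ratio, speed_up, product_ratio


if __name__ == "__main__":
    area_ratio, speed_up, product_ratio = ratios(BASELINE, PROPOSED)
    print(f"area ratio {area_ratio:.3f}, speed-up {speed_up:.3f}, "
          f"time*area ratio {product_ratio:.3f}")
```

The time*area ratio is the area ratio divided by the speed-up, so a 7.3% larger area is outweighed by the shorter execution time.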

Stall cycles [Utilization] of each kind of shader:

Number of each kind of shader    Vertex shaders           Pixel shaders              DR-shaders
3 VS, 37 PS, 0 DR                45,056,703 [0.50552]     544,974,677 [0.515063]     -
0 VS, 16 PS, 24 DR               -                        35,664,243 [0.879348]      78,431,817 [0.82311]
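The tabulated utilizations appear consistent with one minus the stall cycles divided by the total shader-cycles (number of shaders times execution time). The thesis does not state this formula explicitly, so the Python sketch below, including the function name `utilization`, is an assumption checked only against the numbers above.

```python
def utilization(stall_cycles, shader_count, execution_cycles):
    """Fraction of shader-cycles that are not stall cycles (assumed definition)."""
    return 1.0 - stall_cycles / (shader_count * execution_cycles)


# (stall cycles, number of shaders, execution time in cycles, tabulated value)
ROWS = {
    "VS, 3 VS + 37 PS": (45_056_703, 3, 30_373_118, 0.50552),
    "PS, 3 VS + 37 PS": (544_974_677, 37, 30_373_118, 0.515063),
    "PS, 16 PS + 24 DR": (35_664_243, 16, 18_474_735, 0.879348),
    "DR, 16 PS + 24 DR": (78_431_817, 24, 18_474_735, 0.82311),
}

if __name__ == "__main__":
    for label, (stall, count, cycles, tabulated) in ROWS.items():
        print(label, round(utilization(stall, count, cycles), 6), tabulated)
```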


As our final result, we choose 24 DR-shaders with 16 pixel shaders in the DR-shader unit and an interval size of 32 pixels in the pixel queue. This design greatly improves shader utilization and execution time with a small hardware overhead, and the area-time product is reduced to 65.3%.
