Z strides and full cover bound blocks

Chapter 3 Design

3.2 Primitive blocking

3.2.1 Block-based edge walk

3.2.1.3 Z strides and full cover bound blocks

In the previous section, we calculated the two edge-blocks of every blocked-row so that a later stage, the blocked row-span iterator, can generate the block coordinates of all block³s between them. We also need to calculate the nearest and farthest Z values of each block³ and decide its full cover bit. For this reason, we have to calculate the Z strides and the full cover bound blocks of every blocked-row.

In every blocked-row, we have to calculate two Z strides: one for the bottom scan-line and one for the top scan-line, as shown in figure 3-2-1-3-1. First, we find the edge-points (X_LT, X_RT, X_LB, and X_RB) on the top and bottom scan-lines by using the edge functions and the y coordinates of the top and bottom scan-lines. Then we calculate the Z values of these four edge-points from the Z values of the three vertices. Finally, we use the two edge-points on the same scan-line and their Z values to calculate the Z stride of that scan-line. For example, the Z stride of the top scan-line is given by the following formula:

Z stride of top scan-line = (Z value of X_LT – Z value of X_RT) / (X_LT – X_RT)
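The formula above can be sketched in a few lines of Python. This is only an illustration: the function names `z_stride` and `z_at`, the handling of a zero-width span, and the sample numbers are our assumptions and do not come from the original design.

```python
def z_stride(x_left, z_left, x_right, z_right):
    """Change of Z per unit step in x along one scan-line."""
    if x_left == x_right:
        # Assumption: a zero-width span has no stride; the text does not cover this case.
        return 0.0
    return (z_left - z_right) / (x_left - x_right)


def z_at(x, x_left, z_left, stride):
    """Z value at position x on the scan-line, stepping from the left edge-point."""
    return z_left + (x - x_left) * stride


# Hypothetical edge-points on the top scan-line: X_LT = 2 with Z = 0.25, X_RT = 10 with Z = 0.75.
stride_top = z_stride(2.0, 0.25, 10.0, 0.75)
z_middle = z_at(6.0, 2.0, 0.25, stride_top)
```

The same function would be called a second time with X_LB and X_RB to obtain the Z stride of the bottom scan-line.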

Figure 3-2-1-3-1 The sketch of calculating Z strides and full cover bound blocks

The full cover bound blocks indicate which block³s in a blocked-row are full cover. To find them, we first find the block²s to which the edge-points of the top and bottom scan-lines of the blocked-row belong; in figure 3-2-1-3-1, these are the four colored blocks. Then we compare these blocks and pick the two inner ones. These two inner blocks, block A and block B in figure 3-2-1-3-1, are the full cover bound blocks. Only the blocks strictly between the full cover bound blocks are full cover block³s.
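A minimal sketch of this selection is given below. It assumes non-negative screen coordinates, square blocks indexed from zero, and that the left edge-points lie to the left of the right edge-points; the function names and the sample coordinates are hypothetical.

```python
def full_cover_bound_blocks(x_lt, x_rt, x_lb, x_rb, block_width):
    """Return the X block coordinates of the two inner blocks (A, B)."""
    # Block that each of the four edge-points belongs to.
    left_top = int(x_lt // block_width)
    left_bottom = int(x_lb // block_width)
    right_top = int(x_rt // block_width)
    right_bottom = int(x_rb // block_width)
    # The inner block on the left side is the one further right,
    # the inner block on the right side is the one further left.
    return max(left_top, left_bottom), min(right_top, right_bottom)


def is_full_cover(block_x, bound_left, bound_right):
    """A block is full cover only if it lies strictly between the bound blocks."""
    return bound_left < block_x < bound_right


# Hypothetical edge-points for a blocked-row with 8-pixel-wide blocks.
bounds = full_cover_bound_blocks(x_lt=5, x_rt=50, x_lb=13, x_rb=44, block_width=8)
full_blocks = [x for x in range(0, 7) if is_full_cover(x, *bounds)]
```

With these sample values, the bound blocks are blocks 1 and 5, so blocks 2, 3, and 4 are treated as full cover.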

3.2.2 Blocked row-span iterator

In the blocked row-span iterator, we calculate all block³s between the two edge-blocks of the same blocked-row. There are three major tasks. First, we need to find the block coordinate to which each block³ belongs. Since all block³s are in the same blocked-row, they share the same Y block coordinate. And since the block³s are consecutive, we can calculate the X block coordinate by adding one to the X block coordinate of the previous block³.

Second, we need to calculate the nearest and farthest Z values of the block³. Since the part of the primitive inside a block³ lies on a plane, the nearest and farthest Z values must lie at the block’s vertices. We directly calculate the Z values at the block’s four vertices and compare them to get the nearest and farthest Z values. Both full and partial cover block³s use this method. Although the four vertices of a partial cover block³ are not all inside the primitive, we still use their Z values to decide the nearest and farthest Z values, which simplifies the computation: the intersection of a primitive and a block³ can take many different shapes, so calculating the Z values of all intersection points would be time-consuming. Moreover, the nearest and farthest Z values obtained from the four vertices do not differ much from the precise ones.
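The four-vertex comparison can be sketched as follows. For brevity the sketch evaluates Z from a plane equation z = z_origin + dz_dx * x + dz_dy * y instead of stepping with the Z strides of the previous section; this substitution, the parameter names, the choice of block corners, and the convention that a smaller Z is nearer are all assumptions made for illustration.

```python
def block_z_range(block_x, block_y, block_size, z_origin, dz_dx, dz_dy):
    """Return (nearest Z, farthest Z) of a block from its four corner vertices."""
    x0 = block_x * block_size
    y0 = block_y * block_size
    x1 = x0 + block_size
    y1 = y0 + block_size
    corners = [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
    # Z of the primitive's plane at each corner, even if the corner is outside the primitive.
    zs = [z_origin + dz_dx * x + dz_dy * y for x, y in corners]
    # Assumption: smaller Z means nearer to the viewer.
    return min(zs), max(zs)


# Hypothetical plane and block: block (1, 2) with 8x8 pixels.
z_nearest, z_farthest = block_z_range(1, 2, 8, z_origin=0.0, dz_dx=0.5, dz_dy=0.25)
```

For a partial cover block³ the returned range is conservative: it is at least as wide as the range of the part actually covered by the primitive.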

Finally, we need to decide whether the block³ is a full or partial cover block³. Since the full cover bound blocks are already known from the previous stage, we only need to determine whether the block³ lies between them. If it does, the block³ is a full cover block³; otherwise, it is a partial cover block³.

Once a block³ is generated, it is transferred to the blocked-Z test stage, which tests its Z values to decide whether the block³ is occluded.
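Putting the coordinate stepping and the cover decision together, the iterator could be sketched as a Python generator that hands over one block³ at a time. The record layout (a dictionary with `x`, `y`, and `full_cover`) is a hypothetical choice, and the Z range is omitted here to keep the sketch short.

```python
def blocked_row_span(row_y, left_edge_block, right_edge_block, bound_left, bound_right):
    """Yield every block of one blocked-row, from the left to the right edge-block."""
    block_x = left_edge_block
    while block_x <= right_edge_block:
        yield {
            "x": block_x,
            "y": row_y,  # all blocks of a blocked-row share the Y block coordinate
            "full_cover": bound_left < block_x < bound_right,
        }
        block_x += 1  # consecutive blocks: previous X block coordinate plus one


# Hypothetical blocked-row 3 spanning blocks 0..6 with bound blocks 1 and 5.
blocks = list(blocked_row_span(3, 0, 6, 1, 5))
```

Because the generator yields each block³ as soon as it is computed, the next stage does not have to wait for the whole blocked-row.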

3.3 Blocked-Z test

Blocked-Z test filters out block³s that are surely occluded by other block³s, since an occluded block³ would not be displayed on the screen. This reduces the workload of later stages, such as rasterization and the per-fragment early-Z test. Because the blocked-Z test is performed per block rather than per fragment, it can filter out many fragments within a block in one compare operation. To achieve this, we add a two-dimensional block-Z buffer that records the current nearest Z value of each block coordinate, as shown in figure 3-3-1. The width of the block-Z buffer is the screen width divided by the block width, and its height is the screen height divided by the block height. Each entry of the block-Z buffer is 4 bytes, the common size of a Z value.

Figure 3-3-1 The configuration of block-Z buffer
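As a rough sketch, the block-Z buffer can be modelled as a two-dimensional array with one entry per block coordinate. Rounding up when the screen size is not a multiple of the block size and initialising every entry to a farthest depth of 1.0 are our assumptions; the text does not specify either.

```python
def make_block_z_buffer(screen_w, screen_h, block_w, block_h, far_z=1.0):
    """Create a block-Z buffer indexed as buffer[block_y][block_x]."""
    # Assumption: round up so that partially filled border blocks get an entry too.
    cols = (screen_w + block_w - 1) // block_w
    rows = (screen_h + block_h - 1) // block_h
    return [[far_z] * cols for _ in range(rows)]


block_z_buffer = make_block_z_buffer(640, 480, 8, 8)
entry_count = len(block_z_buffer) * len(block_z_buffer[0])
size_in_bytes = entry_count * 4  # 4 bytes per entry, as stated above
```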

Figure 3-3-2 shows the flow chart of the blocked-Z test. After primitive blocking generates a block³, the nearest Z value of the block³ is compared with the corresponding block-Z value in the block-Z buffer. If the nearest Z value of the block³ is larger than the corresponding block-Z value, the block³ is filtered out. Otherwise, the block³ is passed to block-based rasterization. Whenever a block³ passes the blocked-Z test, we also check whether the block-Z buffer needs to be updated. If the block³ is full cover and its farthest Z value is smaller than the corresponding block-Z value, the block-Z value is updated with the farthest Z value of the block³. Why are only full cover block³s allowed to update the block-Z buffer? Because only a full cover block³ is guaranteed to occlude every block³ behind it at the same block coordinate. If the block-Z value were updated from a partial cover block³, later block³s might not be totally occluded by that partial cover block³, and the test might filter out block³s that should be displayed on the screen. Therefore, only full cover block³s may update the block-Z value.
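The test and the update rule described above can be sketched as one function. The dictionary keys (`x`, `y`, `z_near`, `z_far`, `full_cover`) are hypothetical names for the block³ data, and a smaller Z is taken to be nearer, as in the comparison described in the text.

```python
def blocked_z_test(block_z_buffer, block):
    """Return True if the block passes; update the buffer for full cover blocks."""
    bx, by = block["x"], block["y"]
    current = block_z_buffer[by][bx]
    if block["z_near"] > current:
        return False  # even the nearest point lies behind the stored value: filtered out
    # Only a full cover block may tighten the stored block-Z value.
    if block["full_cover"] and block["z_far"] < current:
        block_z_buffer[by][bx] = block["z_far"]
    return True


buffer = [[1.0]]  # a single block coordinate, initialised to the farthest depth
results = [
    blocked_z_test(buffer, {"x": 0, "y": 0, "z_near": 0.2, "z_far": 0.4, "full_cover": True}),
    blocked_z_test(buffer, {"x": 0, "y": 0, "z_near": 0.1, "z_far": 0.2, "full_cover": False}),
    blocked_z_test(buffer, {"x": 0, "y": 0, "z_near": 0.5, "z_far": 0.6, "full_cover": True}),
]
```

In this sample sequence the first block³ passes and lowers the stored value to 0.4, the partial cover block³ passes without changing it, and the third block³ is filtered out because its nearest Z value 0.5 is larger than 0.4.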

Figure 3-3-2 The flow chart of blocked-Z test

3.4 Block-based rasterization

Block-based rasterization generates fragments for the block³s that pass the blocked-Z test. It can run in parallel with the block-Z buffer update of a passing block³, because the two operations use different data: updating the block-Z buffer needs the farthest Z value and the full cover bit, whereas block-based rasterization needs the block coordinate of the block³ and the primitive ID. Therefore, they can be processed in parallel without conflict.

The operation of block-based rasterization is similar to general rasterization. The only difference is that block-based rasterization needs to find the block³ range and generate all fragments within it. First, it calculates the two edge-fragments on the primitive’s edges for every scan-line of the blocked-row. If the block size is 8x8, as in figure 3-4-1, we need to calculate sixteen edge-fragments, the red squares, on the primitive’s edges. Then, the two edge-fragments on the same scan-line are used to calculate the attribute strides of that scan-line, such as RGBA, Z, and texture coordinates. Finally, the attribute strides of every scan-line are used to interpolate all fragments in the block³ range.
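For one scan-line, the stride calculation and the interpolation inside the block³ range might look like the sketch below. Integer pixel positions, attributes stored as plain lists, and the function name are assumptions made for illustration only.

```python
def fragments_in_block_range(y, x_left, attrs_left, x_right, attrs_right,
                             block_x, block_width):
    """Interpolate the fragments of one scan-line that fall inside one block."""
    span = x_right - x_left
    # One stride per attribute (for example R, G, B, A, Z, texture coordinates).
    strides = [(r - l) / span if span else 0.0
               for l, r in zip(attrs_left, attrs_right)]
    # Block range: overlap of the primitive's span with the block's columns.
    start = max(x_left, block_x * block_width)
    end = min(x_right, (block_x + 1) * block_width - 1)
    return [
        (x, y, [l + (x - x_left) * s for l, s in zip(attrs_left, strides)])
        for x in range(start, end + 1)
    ]


# Hypothetical scan-line: edge-fragments at x = 2 and x = 10 carrying one attribute.
frags = fragments_in_block_range(0, 2, [0.0], 10, [1.0], block_x=1, block_width=8)
```

Here only the fragments at x = 8, 9, and 10 are generated, because the block with X block coordinate 1 covers columns 8 to 15 and the primitive ends at x = 10.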

Figure 3-4-1 The sketch of block-based rasterization

One problem is that consecutive block³s in the same blocked-row that pass the blocked-Z test, as in figure 3-4-1, would require calculating the same edge-fragments on the primitive’s edges and the same attribute strides again. So, we use an edge-fragment buffer to record the edge-fragments on the major edge and the attribute strides of the current blocked-row. Moreover, the edge-fragment buffer must record the primitive ID and row number to indicate which blocked-row’s edge-fragment data it currently holds. Figure 3-4-2 shows the configuration of the edge-fragment buffer.

When a block³ that passed the blocked-Z test arrives, we check whether it is on the same blocked-row as the edge-fragment data in the edge-fragment buffer. If the primitive ID and row number in the edge-fragment buffer are identical to those of the current block³, we use the edge-fragment data in the buffer to generate fragments rather than calculating the same edge-fragments and attribute strides again. Figure 3-4-3 shows the flow chart of block-based rasterization. The blocked-row number of a Z-tested block³ is the y coordinate value of the block³. The block-ranged edge walk calculates the edge-fragments and Z strides over the block height. The block range generator calculates which range of each scan-line needs fragments to be interpolated. The fragment generator interpolates all fragments in the block range by using the edge-fragments and attribute strides.
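The reuse check can be sketched as a one-entry cache keyed by primitive ID and row number. The class name, the `compute_row_data` callback that stands in for the block-ranged edge walk, and the `recomputed` counter are hypothetical and serve only to show when the edge walk is repeated.

```python
class EdgeFragmentBuffer:
    """Holds the edge-fragment data of one blocked-row of one primitive."""

    def __init__(self):
        self.primitive_id = None
        self.row = None
        self.data = None
        self.recomputed = 0  # counts how often the edge walk had to run

    def lookup(self, primitive_id, row, compute_row_data):
        # Recompute only when the buffer holds a different primitive or blocked-row.
        if self.primitive_id != primitive_id or self.row != row:
            self.data = compute_row_data(primitive_id, row)
            self.primitive_id, self.row = primitive_id, row
            self.recomputed += 1
        return self.data


def fake_edge_walk(primitive_id, row):
    """Placeholder for the block-ranged edge walk (returns dummy data)."""
    return ("edge data", primitive_id, row)


buf = EdgeFragmentBuffer()
for prim, row in [(7, 3), (7, 3), (7, 3), (7, 4)]:
    buf.lookup(prim, row, fake_edge_walk)
```

Three consecutive block³s of primitive 7 in blocked-row 3 trigger the edge walk once; the block³ in blocked-row 4 triggers it a second time.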

Figure 3-4-2 The edge-fragment buffer for a blocked-row

Figure 3-4-3 The flow chart of block-based rasterization

3.5 Data flow of our proposed method

In this section, we briefly introduce the data flow of our proposed method, the blocked-Z test. Figure 3-5-1 shows this data flow. Triangle setup assembles the vertices into a primitive and also calculates the slopes of the three primitive edges. Primitive blocking then generates all block³ data and delivers them to the blocked-Z test stage. After the blocked-Z test, Z_nearest, Z_farthest, and the full cover flag are no longer used; only the block coordinate and primitive ID of each block³ are delivered to block-based rasterization. Block-based rasterization turns each block³ into fragment data and delivers these fragment data to the per-fragment early-Z test.

Figure 3-5-1 The data flow of our proposed method, blocked-Z test

3.6 The rendering pipeline with blocked-Z test

In this section, we explain how the pipeline operates smoothly. When a primitive is delivered from triangle setup to primitive blocking, primitive blocking divides the primitive into block³s while triangle setup can generate another primitive. Each block³ is delivered to the blocked-Z test unit as soon as primitive blocking generates it. Primitive blocking does not wait until all block³s of a primitive have been generated before delivering them; if it did, the blocked-Z test unit would often be idle waiting for primitive blocking. By the same principle, block-based rasterization gets the block³ data immediately when a block³ passes the blocked-Z test. Since block-based rasterization has a longer processing time than the blocked-Z test, it stays busy rather than idle. So, the pipeline components added by our proposed method can operate smoothly.

Moreover, the rendering pipeline with blocked-Z test is deeper than the general graphics pipeline since we add two new pipeline units, primitive blocking and blocked-Z test. The computation time of one frame may become longer, which at first sight might prevent rendering thirty frames per second. However, since the pipeline can process different frames in parallel, only the first frame takes longer to process. Figure 3-6-1 shows the pipeline operation for multiple frames. Although the pipeline becomes deeper, the total throughput can remain the same as that of the traditional graphics pipeline, or even increase, since our method relieves some bottlenecks.

(Pipeline stages labelled in the figure: V.S., T.S., P.B., Block-Z T., B.B.R., E.Z T., P.S.)

Figure 3-6-1 The rendering pipeline with blocked-Z test

Chapter 4 Experiment and results

4.1 Experiment goal and environment

We want to know what percentage of occluded fragments our blocked-Z test can filter out, and how much extra workload the blocked-Z test introduces. Finally, we weigh the reduced workload against the extra workload to evaluate how much total workload the blocked-Z test can save. Moreover, we compare our blocked-Z test with the tile-based early-Z test [6] mentioned in the related work.

We trace the Atila simulator and dump the primitive data from Atila as the input data for our experiment. The Atila simulator is proposed in [7]. Its architecture is based on the design of ATI GPUs, and it supports OpenGL-based benchmarks such as Doom3 [8], Quake4 [9], and other 3-D computer games.

The primitive data that we dump from the Atila simulator come from the Doom3 and Quake4 benchmarks at 320*240, 640*480, 1280*1024, and 1600*1200 screen resolutions.

With the input data in hand, we implement a simulator of our blocked-Z test method. From this simulator we obtain the filtering ratio of the blocked-Z test, and we use its output to evaluate the workload reduction. The filtering ratio is the percentage of fragments that are filtered out by a given kind of early-Z test: filtering ratio = filtered-out occluded fragments / original fragments generated by rasterization.
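Expressed as code, the filtering ratio is a single division. The fragment counts below are made-up numbers for illustration, not measurements from the experiment, and returning 0.0 for an empty frame is our assumption.

```python
def filtering_ratio(filtered_fragments, original_fragments):
    """Fraction of rasterized fragments removed by an early-Z test."""
    if original_fragments == 0:
        return 0.0  # assumption: an empty frame counts as nothing filtered
    return filtered_fragments / original_fragments


# Hypothetical counts: 800 of 1000 fragments are filtered out.
ratio = filtering_ratio(800, 1000)
```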

4.2 Experiment results

In section 4.2.1, we show the filtering ratio of the blocked-Z test with various block sizes: 4x4, 8x8, 16x16, 32x32, and 64x64 pixels. In section 4.2.2, we show the extra overhead of the blocked-Z test, including extra hardware requirements and extra workload. We also evaluate the workload reduction of rasterization and the per-fragment early-Z test with our blocked-Z test.

4.2.1 Filtering ratio of blocked-Z test

Figure 4-2-1-1 and figure 4-2-1-2 show the filtering ratio of the blocked-Z test with various block sizes. The filtering ratio of the blocked-Z test is the percentage of fragments that the blocked-Z test filters out, and it reflects the effectiveness of the test. The last bar of each set is the filtering ratio of the per-fragment early-Z test, which serves as a comparison for the blocked-Z test with various block sizes.

The filtering ratio is clearly higher when the block size is smaller. The filtering ratio also varies more across block sizes at low screen resolutions: as the block size increases, the filtering ratio decreases more at a low screen resolution than at a high one. This is because, at a low resolution, a change in block size changes the fraction of the screen covered by one block more than at a high resolution. So, at a high screen resolution such as 1600*1200, the filtering ratio varies less across block sizes. In figure 4-2-1-1, the average filtering ratio over the various block sizes is about 80%; in figure 4-2-1-2, it is about 60%.

Figure 4-2-1-1 The filtering ratio of Doom3 with various block sizes

Figure 4-2-1-2 The filtering ratio of Quake4 with various block sizes

4.2.2 Extra overhead of blocked-Z test

In this section, we show the extra overhead produced by the blocked-Z test. In section 4.2.2.1, we show the extra workload of the blocked-Z test, which is measured by the block count. In section 4.2.2.2, we show the extra storage and hardware that the blocked-Z test requires. The storage requirement varies greatly with the block size.

4.2.2.1 Extra workload of blocked-Z test

Although the blocked-Z test reduces the workload of rasterization and the per-fragment early-Z test, it adds the extra workload of primitive blocking and the blocked-Z test itself. In this section, we discuss this extra workload for various block sizes. Figure 4-2-2-1-1 and figure 4-2-2-1-2 show the extra workload of primitive blocking and blocked-Z test with various block sizes in Doom3 and Quake4. We use the block count to evaluate the extra workload, because primitive blocking needs to generate every block and the blocked-Z test needs to perform a depth test for every block. Clearly, the smaller the block size, the higher the extra workload. In these two figures, the workload increases about three times when the block size shrinks by a factor of four.

Figure 4-2-2-1-1 Extra workload of blocked-Z test with various block sizes

Figure 4-2-2-1-2 Extra workload of blocked-Z test with various block sizes

4.2.2.2 Extra storage requirement and hardware

In our method, we have three extra storage buffers: the primitive buffer, the edge-fragment buffer, and the block-Z buffer. The storage requirement varies with the block size and screen resolution. Table 4-2-2-2-1 shows the storage requirement for different block sizes and screen resolutions. The major extra storage requirement is the block-Z buffer, since it needs to store the current nearest Z value for every block coordinate. The block-Z buffer size is (screen width * screen height / block size) * 4 bytes, where the block size is the number of pixels in one block. When the screen resolution is high and the block size is small, the block-Z buffer size may be greater than 100 Kbytes; with a 4x4 block size and a 1600x1200 screen resolution, it is even above 400 Kbytes. The edge-fragment buffer size depends only on the block size. It needs to store edge-fragment data for twice the block height, so its size is (block height * 2) * 20 bytes. The primitive buffer has a fixed size that depends on how many primitives need to be stored in it. In this table, we assume ten primitives are stored in the primitive buffer and each entry is 60 bytes. The maximum total extra storage is about 470 Kbytes, and the minimum is only about 2 Kbytes.

Table 4-2-2-2-1 Extra storage requirement with various block sizes and resolutions

Block size  Screen resolution  Block-Z buffer size(KB)  Edge-fragment buffer size(KB)  Primitive buffer size(KB)  Total extra buffer size(KB)
4x4         320*240            18.75                    0.16                           0.60                       19.51
4x4         640*480            75.00                    0.16                           0.60                       75.76
4x4         1280*1024          320.00                   0.16                           0.60                       320.76
4x4         1600*1200          468.75                   0.16                           0.60                       469.51
8x8         320*240            4.69                     0.31                           0.60                       5.60
8x8         640*480            18.75                    0.31                           0.60                       19.66
8x8         1280*1024          80.00                    0.31                           0.60                       80.91
8x8         1600*1200          117.19                   0.31                           0.60                       118.10
16x16       320*240            1.17                     0.63                           0.60                       2.40
16x16       640*480            4.69                     0.63                           0.60                       5.91
16x16       1280*1024          20.00                    0.63                           0.60                       21.23
16x16       1600*1200          29.30                    0.63                           0.60                       30.52
32x32       320*240            0.29                     1.25                           0.60                       2.14
32x32       640*480            1.17                     1.25                           0.60                       3.02
32x32       1280*1024          5.00                     1.25                           0.60                       6.85
32x32       1600*1200          7.32                     1.25                           0.60                       9.17
64x64       320*240            0.07                     2.50                           0.60                       3.17
64x64       640*480            0.29                     2.50                           0.60                       3.39
64x64       1280*1024          1.25                     2.50                           0.60                       4.35
64x64       1600*1200          1.83                     2.50                           0.60                       4.93
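The block-Z buffer and edge-fragment buffer entries of the table can be reproduced from the two formulas given in the text. The sketch below assumes that 1 KB is 1024 bytes, which matches those two rows; the function names are ours. The primitive buffer is ten entries of 60 bytes, that is 600 bytes, which the table lists as 0.60.

```python
def block_z_buffer_bytes(screen_w, screen_h, block_w, block_h):
    """(screen width * screen height / block size) * 4 bytes."""
    return screen_w * screen_h / (block_w * block_h) * 4


def edge_fragment_buffer_bytes(block_h):
    """(block height * 2) * 20 bytes."""
    return block_h * 2 * 20


def primitive_buffer_bytes(entries=10, entry_bytes=60):
    """Fixed size: number of stored primitives times the entry size."""
    return entries * entry_bytes


def to_kb(size_in_bytes):
    return size_in_bytes / 1024  # assumption: 1 KB = 1024 bytes


largest_block_z_kb = to_kb(block_z_buffer_bytes(1600, 1200, 4, 4))
smallest_edge_fragment_kb = to_kb(edge_fragment_buffer_bytes(4))
```

For the 4x4 block size at 1600*1200, this gives 468.75 KB for the block-Z buffer, which is the dominant term of the largest total in the table.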

In our method, the primitive blocking and blocked-Z test units are the extra hardware. Table 4-2-2-2-2 shows the extra hardware of the primitive blocking and blocked-Z test units.

Table 4-2-2-2-2 Extra hardware of primitive blocking and blocked-Z test

4.2.3 Total workload reduction with blocked-Z test

Although our method reduces the workload of rasterization and the per-fragment early-Z test, primitive blocking and the blocked-Z test produce extra workload. In this section, we evaluate the total workload reduction by considering both the reduced and the extra workload.

Primitive blocking and rasterization are similar in operation. Primitive blocking can be regarded as rasterization at a coarse granularity; it only spends additional time calculating the nearest and farthest Z values of each block. The blocked-Z test and the per-fragment early-Z test both perform a depth test, so the execution time of these two units can be regarded as the same. Table 4-2-3-1 shows the time complexity of these four execution units: primitive blocking, general rasterization, blocked-Z test, and per-fragment early-Z test. The latency of each computation unit, such as an adder or a multiplier, is shown in the leftmost column; these cycle times are our assumptions. The table also shows how many operations of each computation unit an execution unit needs to perform. We then calculate each execution unit’s latency by multiplying each computation unit’s latency by the number of times it is used and summing the results, as shown in the last row of table 4-2-3-1.

Table 4-2-3-1 The time complexity of four execution units

Once we know the latency of these execution units, we can derive equation (2) from them. Equation (2) calculates the total change in workload of our blocked-Z test method, that is, how much of the workload of rasterization and the per-fragment early-Z test can be reduced.

Using equation (2), we can get the total reduced workload of the blocked-Z test.
