Chapter 3 Design
3.2 Primitive blocking
3.2.1 Block-based edge walk
3.2.1.3 Z strides and full cover bound blocks
In the previous section, we calculated the two edge-blocks of every blocked-row so that a later stage, the blocked row-span iterator, can generate the block coordinates of all block3s between them. We also need to calculate the nearest and farthest Z values of each block3 and to decide its full cover bit. For this reason, we calculate the Z strides and the full cover bound blocks of every blocked-row.
In every blocked-row, we calculate two Z strides: one for the bottom scan-line and one for the top scan-line, as shown in figure 3-2-1-3-1. First, we find the edge-points (X_LT, X_RT, X_LB, and X_RB) on the top and bottom scan-lines by using the edge functions and the y coordinates of the two scan-lines. Then we calculate the Z values of these four edge-points from the Z values of the three vertices. Finally, from each pair of edge-points on the same scan-line and their Z values, we calculate the Z stride of that scan-line. For example, the Z stride of the top scan-line is given by the following formula:
Z stride of top scan-line = (Z value of X_LT – Z value of X_RT) / (X_LT – X_RT)
Figure 3-2-1-3-1 The sketch of calculating Z strides and full cover bound blocks
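The Z stride formula above can be illustrated with a minimal sketch. The function name z_stride, its parameter names, and the sample coordinates are hypothetical and chosen only for illustration; the handling of a zero-width span is an assumption, since the text does not discuss that case.

```python
def z_stride(x_left, z_left, x_right, z_right):
    """Return the change in Z per unit X along one scan-line.

    Implements: stride = (Z at left edge-point - Z at right edge-point)
                         / (X of left edge-point - X of right edge-point)
    """
    if x_left == x_right:
        # Assumption: a span with no horizontal extent has no usable slope.
        return 0.0
    return (z_left - z_right) / (x_left - x_right)


# Hypothetical edge-points of a top scan-line: X_LT = 2, X_RT = 10.
top_stride = z_stride(2.0, 0.30, 10.0, 0.70)
print(top_stride)  # roughly 0.05 per pixel in X
```

The same function would be called once more with the bottom scan-line's edge-points (X_LB, X_RB) to obtain the second stride of the blocked-row.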
The full cover bound blocks indicate which block3s in a blocked-row are fully covered. To find them, we first find the block2s that contain the edge-points of the top and bottom scan-lines of the blocked-row. In figure 3-2-1-3-1, these are the four colored blocks. Then we compare these blocks and pick the two inner ones. These two inner blocks are the full cover bound blocks; in figure 3-2-1-3-1 they are block A and block B. Only the blocks that lie between the full cover bound blocks are full cover block3s.
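One possible way to pick the two inner blocks is sketched below. It assumes that a block's X index is the floor of the pixel X coordinate divided by the block width, and that the left edge-points are X_LT and X_LB while the right edge-points are X_RT and X_RB; the function names and sample values are hypothetical.

```python
def block_index(x, block_width):
    """Map a pixel X coordinate to the X index of the block containing it."""
    return int(x // block_width)


def full_cover_bounds(x_lt, x_rt, x_lb, x_rb, block_width):
    """Return (left_bound, right_bound), the two inner blocks of the four
    blocks that contain the edge-points of the top and bottom scan-lines."""
    # Of the two blocks on the left edge, the inner one is the rightmost.
    left_bound = max(block_index(x_lt, block_width),
                     block_index(x_lb, block_width))
    # Of the two blocks on the right edge, the inner one is the leftmost.
    right_bound = min(block_index(x_rt, block_width),
                      block_index(x_rb, block_width))
    return left_bound, right_bound


# Hypothetical edge-points in a blocked-row of 8-pixel-wide blocks.
bounds = full_cover_bounds(x_lt=10, x_rt=50, x_lb=3, x_rb=61, block_width=8)
print(bounds)
```

With these sample values the edge-points fall in blocks 1, 6, 0, and 7, so the bound blocks are 1 and 6, and blocks 2 to 5 would be treated as full cover.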
3.2.2 Blocked row-span iterator
The blocked row-span iterator generates all block3s between the two edge-blocks of the same blocked-row. It has three major tasks. First, we need to find the block coordinate of each block3. Since all block3s are in the same blocked-row, they share the same Y block coordinate. And since the block3s are consecutive, we obtain the X block coordinate by adding one to the X block coordinate of the previous block3.
Second, we need to calculate the nearest and farthest Z values of each block3. Since the primitive is planar within the block3, the nearest and farthest Z values must lie on the vertices of the block. We therefore calculate the Z values of the block's four vertices directly and compare them to get the nearest and farthest Z values. Both full and partial cover block3s use this method. Although the four vertices of a partial cover block3 are not all inside the primitive, we still use their Z values to decide the nearest and farthest Z values. This simplifies the computation: the primitive can intersect a block3 in many different ways, so calculating the Z values of all intersection points would be time-consuming. Moreover, the nearest and farthest Z values obtained from the four vertices do not differ much from the precise ones.
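The four-vertex comparison can be sketched as follows. The sketch assumes that the primitive's depth is described by a plane z = z_origin + dz_dx * x + dz_dy * y, that a smaller Z value is nearer to the viewer, and that the block's vertices lie at multiples of the block size; the function name, parameters, and sample values are hypothetical.

```python
def block_z_range(z_origin, dz_dx, dz_dy, block_x, block_y, block_size):
    """Return (z_nearest, z_farthest) over the four vertices of one block."""
    x0 = block_x * block_size
    y0 = block_y * block_size
    x1 = x0 + block_size
    y1 = y0 + block_size
    # Evaluate the depth plane at the four block vertices only.
    corner_z = [z_origin + dz_dx * x + dz_dy * y
                for x in (x0, x1)
                for y in (y0, y1)]
    # Assumption: smaller Z is nearer, larger Z is farther.
    return min(corner_z), max(corner_z)


# Hypothetical depth plane and block coordinate (2, 1) with 8x8 blocks.
z_near, z_far = block_z_range(0.1, 0.01, -0.005, 2, 1, 8)
print(z_near, z_far)
```

Because the depth is linear over the block, the extreme values over the whole block area are always reached at vertices, which is why four evaluations are enough.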
Finally, we need to decide whether the block3 is a full or partial cover block3. Since the full cover bound blocks are already known from the previous stage, we only need to check whether the block3 lies between them. If it does, the block3 is a full cover block3. Otherwise, it is a partial cover block3.
As soon as a block3 is generated, it is transferred to the blocked-Z test stage, which tests its Z values and decides whether the block3 is occluded.
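The iteration over one blocked-row can be sketched as a generator that emits one block3 at a time, so that each block3 can be handed to the next stage immediately. All names are hypothetical, and the Z range calculation is left out to keep the sketch short; "between" is read here as strictly between the two bound blocks.

```python
def iterate_blocked_row(left_edge_block, right_edge_block, block_y,
                        left_bound, right_bound):
    """Yield (block_x, block_y, full_cover) for every block3 in one blocked-row."""
    for block_x in range(left_edge_block, right_edge_block + 1):
        # The X coordinate advances by one per block3; Y stays the same.
        # Only blocks strictly between the bound blocks are full cover.
        full_cover = left_bound < block_x < right_bound
        yield block_x, block_y, full_cover


# Hypothetical blocked-row 3 spanning blocks 0..7 with bound blocks 1 and 6.
for block3 in iterate_blocked_row(0, 7, 3, 1, 6):
    print(block3)
```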
3.3 Blocked-Z test
The blocked-Z test filters out block3s that are surely occluded by other block3s, since an occluded block3 would not be displayed on the screen. This reduces the workload of later stages, such as rasterization and the per-fragment early-Z test. Because the blocked-Z test operates on a block rather than on a fragment, it can filter out many fragments within a block in one compare operation. To achieve this, we add a two-dimensional block-Z buffer that records the current nearest Z value of each block coordinate, as shown in figure 3-3-1. The width of the block-Z buffer is the screen width divided by the block width, and its height is the screen height divided by the block height. Each entry of the block-Z buffer is 4 bytes, the common size of a Z value.
Figure 3-3-1 The configuration of block-Z buffer
Figure 3-3-2 shows the flow chart of the blocked-Z test. After primitive blocking generates a block3, the nearest Z value of the block3 is compared with the corresponding block-Z value in the block-Z buffer. If the nearest Z value of the block3 is larger than the corresponding block-Z value, the block3 is filtered out. Otherwise, the block3 is passed to block-based rasterization. Whenever a block3 passes the blocked-Z test, we also have to check whether the block-Z buffer needs to be updated. When the block3 is full cover and its farthest Z value is smaller than the corresponding block-Z value, the block-Z value must be updated with the farthest Z value of the block3. Why are only full cover block3s allowed to update the block-Z buffer? Because only a full cover block3 is guaranteed to occlude every block3 that lies behind it. If the block-Z value were updated from a partial cover block3, other block3s might not be totally occluded by this partial cover block3, and the test could filter out block3s that should be displayed on the screen. So only a full cover block3 is allowed to update the block-Z value.
Figure 3-3-2 The flow chart of blocked-Z test
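The test-and-update rule described in this section can be sketched as below. The block-Z buffer is modeled as a list of rows, and its initial value of 1.0 (the farthest depth) is an assumption; the function and variable names are hypothetical.

```python
def blocked_z_test(block_z_buffer, block_x, block_y,
                   z_nearest, z_farthest, full_cover):
    """Return True if the block3 passes the blocked-Z test, else False.

    The block-Z buffer entry is updated only by a passing full cover block3
    whose farthest Z is smaller than the stored block-Z value.
    """
    block_z = block_z_buffer[block_y][block_x]
    if z_nearest > block_z:
        # Even the nearest point of the block3 is behind the stored value.
        return False
    if full_cover and z_farthest < block_z:
        block_z_buffer[block_y][block_x] = z_farthest
    return True


# Hypothetical 2x2 block-Z buffer, initialized to the farthest depth.
z_buffer = [[1.0, 1.0], [1.0, 1.0]]
print(blocked_z_test(z_buffer, 0, 0, 0.2, 0.4, True))   # passes and updates
print(blocked_z_test(z_buffer, 0, 0, 0.5, 0.6, False))  # filtered out
```

In this example the first block3 lowers the stored value at block (0, 0) to 0.4, so the second block3, whose nearest Z is 0.5, is filtered out.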
3.4 Block-based rasterization
Block-based rasterization generates the fragments of the block3s that pass the blocked-Z test. It can run in parallel with the update of the block-Z buffer, because the two operations use different data. Updating the block-Z buffer needs the farthest Z value and the full cover bit, whereas block-based rasterization needs the block coordinate of the block3 and the primitive ID. Therefore, the two can be processed in parallel without conflict.
The operation of block-based rasterization is similar to general rasterization. The only difference is that block-based rasterization needs to find the block3 range and generates only the fragments inside the block3. First, it calculates the two edge-fragments on the primitive's edges for every scan-line of the blocked-row. If the block size is 8x8, as in figure 3-4-1, we need to calculate sixteen edge-fragments (the red squares) on the primitive's edges. Then, from the two edge-fragments on the same scan-line, it calculates the strides of all attributes of that scan-line, such as RGBA, Z, and texture coordinates. Finally, it uses the attribute strides of every scan-line to interpolate all fragments in the block3 range.
Figure 3-4-1 The sketch of block-based rasterization
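The last step, interpolating one attribute over the block3 range of a single scan-line, can be sketched as follows. The sketch assumes linear interpolation from the left edge-fragment with a per-pixel stride; the function name, parameters, and sample values are hypothetical.

```python
def interpolate_span(x_edge, attr_edge, stride, x_start, x_end):
    """Return the attribute value of every fragment from x_start to x_end
    (inclusive) on one scan-line.

    x_edge, attr_edge: position and attribute value of the edge-fragment.
    stride: change of the attribute per pixel along the scan-line.
    """
    return [attr_edge + stride * (x - x_edge)
            for x in range(x_start, x_end + 1)]


# Hypothetical scan-line: edge-fragment at x = 2 with value 10.0,
# stride 0.5, and a block3 range covering x = 8..11.
values = interpolate_span(2, 10.0, 0.5, 8, 11)
print(values)
```

The same call would be repeated for each attribute (for example Z or a color channel) and for each scan-line of the block height.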
A problem arises when consecutive block3s in the same blocked-row pass the blocked-Z test, as in figure 3-4-1: each of them would calculate the same edge-fragments on the primitive's edges and the same attribute strides. So we use an edge-fragment buffer to record the edge-fragments on the major edge and the attribute strides of the current blocked-row. The edge-fragment buffer must also record the primitive ID and the row number to indicate which blocked-row's edge-fragment data it currently holds. Figure 3-4-2 shows the configuration of the edge-fragment buffer.
When a block3 that passed the blocked-Z test arrives, we check whether it is on the same blocked-row as the edge-fragment data in the edge-fragment buffer. If the primitive ID and row number in the edge-fragment buffer are identical to those of the current block3, we use the buffered edge-fragment data to generate fragments rather than calculating the same edge-fragments and attribute strides again. Figure 3-4-3 shows the flow chart of block-based rasterization. The blocked-row number of a Z-tested block3 is the y coordinate value of the block3. The block-ranged edge walk calculates the edge-fragments and Z strides over the block height. The block range generator calculates which range of each scan-line needs fragments to be interpolated. The fragment generator interpolates all fragments in the block range by using the edge-fragments and attribute strides.
Figure 3-4-2 The edge-fragment buffer for a blocked-row
Figure 3-4-3 The flow chart of block-based rasterization
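The reuse check on the edge-fragment buffer behaves like a one-entry cache keyed by primitive ID and row number. A minimal sketch is given below; the class name, the get method, and the compute_row_data callback are hypothetical, and the actual contents of the buffered data are not modeled.

```python
class EdgeFragmentBuffer:
    """One-entry cache of edge-fragment data for a single blocked-row."""

    def __init__(self):
        self.valid = False
        self.primitive_id = None
        self.row = None
        self.data = None

    def get(self, primitive_id, row, compute_row_data):
        """Return the edge-fragment data for (primitive_id, row).

        compute_row_data is called only when the buffer currently holds
        data for a different primitive or a different blocked-row.
        """
        hit = (self.valid
               and self.primitive_id == primitive_id
               and self.row == row)
        if not hit:
            self.data = compute_row_data(primitive_id, row)
            self.primitive_id = primitive_id
            self.row = row
            self.valid = True
        return self.data


def example_edge_walk(primitive_id, row):
    # Stand-in for the block-ranged edge walk; returns placeholder data.
    return {"primitive_id": primitive_id, "row": row}


buffer = EdgeFragmentBuffer()
first = buffer.get(7, 3, example_edge_walk)   # computed
second = buffer.get(7, 3, example_edge_walk)  # reused from the buffer
```

Two consecutive block3s of primitive 7 in blocked-row 3 trigger only one edge walk; a block3 from another row or primitive replaces the buffered entry.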
3.5 Data flow of our proposed method
In this section, we briefly introduce the data flow of our proposed method, the blocked-Z test, which is shown in figure 3-5-1. The triangle setup assembles the vertices into a primitive and also calculates the edge slopes of the three primitive edges. Then primitive blocking generates all block3 data and delivers it to the blocked-Z test stage. After the blocked-Z test, the Znearest, Zfarthest, and full cover flag are no longer used. Only the block coordinate and primitive ID of the block3 are delivered to block-based rasterization. Block-based rasterization turns them into fragment data and delivers the fragment data to the per-fragment early-Z test.
Figure 3-5-1 The data flow of our proposed method, blocked-Z test
3.6 The rendering pipeline with blocked-Z test
In this section, we explain how the pipeline operates smoothly. When a primitive is delivered from triangle setup to primitive blocking, primitive blocking divides the primitive into block3s while triangle setup can already generate another primitive. Each block3 is delivered to the blocked-Z test unit as soon as primitive blocking generates it. Primitive blocking does not wait until all block3s of a primitive have been generated before delivering them to the blocked-Z test unit. If it did, the blocked-Z test unit would often be idle, waiting for primitive blocking. By the same principle, block-based rasterization receives the block3 data immediately when a block3 passes the blocked-Z test. Since block-based rasterization has a longer processing time than the blocked-Z test, it stays busy rather than idle. So the extra pipeline components of our proposed method can operate smoothly.
Moreover, the rendering pipeline with the blocked-Z test is deeper than the general graphics pipeline, since we add two new pipeline units, primitive blocking and the blocked-Z test. The computation time of one frame may become longer, and the pipeline may seem unable to render thirty frames per second. However, since the pipeline can process different frames in parallel, only the first frame takes longer to process. Figure 3-6-1 shows the pipeline operation over multiple frames. Although the pipeline becomes deeper, the total throughput can remain the same as that of the traditional graphics pipeline, or even increase, since our method relieves some bottlenecks.
Figure 3-6-1 The rendering pipeline with blocked-Z test (stage labels in the figure: V.S., T.S., P.B., Block-Z T., B.B.R., E.Z T., P.S.)
Chapter 4 Experiment and results
4.1 Experiment goal and environment
We want to know what percentage of occluded fragments our blocked-Z test can filter out, and how much extra workload the blocked-Z test brings. We then weigh the reduced workload against the extra workload produced by our blocked-Z test to evaluate how much total workload it can reduce. Moreover, we compare our blocked-Z test with the tile-based early-Z test [6] mentioned in the related work.
We trace the Atila simulator and dump the primitive data from Atila as the input data of our experiment. The Atila simulator was proposed in [7]. Its architecture is based on the design of ATI GPU architectures, and it supports OpenGL-based benchmarks such as Doom3 [8], Quake4 [9], and other 3-D computer games.
The primitive data that we dump from the Atila simulator come from the Doom3 and Quake4 benchmarks at screen resolutions of 320*240, 640*480, 1280*1024, and 1600*1200.
With this input data, we implement a simulator of our blocked-Z test method. From this simulator we obtain the filtering ratio of the blocked-Z test, and we use the information it reports to evaluate the workload reduction. The filtering ratio is the percentage of fragments that can be filtered out by a given kind of early-Z test. It is defined as: filtering ratio = filtered-out occluded fragments / original fragments generated by rasterization.
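As a small worked example of this definition, the sketch below computes the filtering ratio from two fragment counts. The function name and the sample counts are hypothetical and are not measurements from our experiment; returning 0.0 for an empty frame is an assumption.

```python
def filtering_ratio(filtered_fragments, original_fragments):
    """Return filtered-out occluded fragments / original fragments."""
    if original_fragments == 0:
        # Assumption: with no fragments at all, nothing was filtered.
        return 0.0
    return filtered_fragments / original_fragments


# Hypothetical counts: 800 of 1000 rasterized fragments are filtered out.
ratio = filtering_ratio(800, 1000)
print(f"{ratio:.0%}")
```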
4.2 Experiment results
In section 4.2.1, we show the filtering ratio of the blocked-Z test with various block sizes: 4x4, 8x8, 16x16, 32x32, and 64x64 pixels. In section 4.2.2, we show the extra overhead of the blocked-Z test, including the extra hardware requirements and the extra workload. In section 4.2.3, we evaluate the workload reduction of rasterization and the per-fragment early-Z test with our blocked-Z test.
4.2.1 Filtering ratio of blocked-Z test
Figure 4-2-1-1 and figure 4-2-1-2 show the filtering ratio of the blocked-Z test with various block sizes. The filtering ratio of the blocked-Z test is the percentage of fragments that the blocked-Z test filters out, and it reflects the effectiveness of the test. The last bar of each set is the filtering ratio of the per-fragment early-Z test, which serves as a comparison for the blocked-Z test with various block sizes. We can see that the filtering ratio is higher when the block size is smaller. We can also see that the filtering ratio varies more across block sizes at low screen resolutions: as the block size increases, the filtering ratio decreases more at a low screen resolution than at a high one. This is because a given change in block size changes the fraction of the screen covered by one block more at a low resolution than at a high resolution. So at a high screen resolution such as 1600*1200, the filtering ratio varies less across block sizes. In figure 4-2-1-1, the average filtering ratio over the various block sizes is about 80%. In figure 4-2-1-2, it is about 60%.
Figure 4-2-1-1 The filtering ratio of Doom3 with various block sizes
Figure 4-2-1-2 Filtering ratio of Quake4 with various block sizes
4.2.2 Extra overhead of blocked-Z test
In this section, we show the extra overhead produced by the blocked-Z test. In section 4.2.2.1, we show the extra workload of the blocked-Z test, which is reflected by the block count. In section 4.2.2.2, we show the extra storage and hardware that the blocked-Z test requires. The storage requirement varies widely with the block size.
4.2.2.1 Extra workload of blocked-Z test
Although the blocked-Z test can reduce the workload of rasterization and the per-fragment early-Z test, it adds the workload of primitive blocking and of the blocked-Z test itself. In this section, we discuss this extra workload for various block sizes. Figure 4-2-2-1-1 and figure 4-2-2-1-2 show the extra workload of primitive blocking and the blocked-Z test with various block sizes in Doom3 and Quake4. We use the block count to evaluate the extra workload, because primitive blocking has to generate every block and the blocked-Z test has to perform a depth test for every block. We can see that the smaller the block size, the higher the extra workload. In these two figures, the workload increases about three times when the block size becomes four times smaller.
Figure 4-2-2-1-1 Extra workload of blocked-Z test with various block sizes
Figure 4-2-2-1-2 Extra workload of blocked-Z test with various block sizes
4.2.2.2 Extra storage requirement and hardware
Our method uses three extra storage buffers: the primitive buffer, the edge-fragment buffer, and the block-Z buffer. The storage requirement varies with the block size and the screen resolution. Table 4-2-2-2-1 shows the storage requirement for different block sizes and screen resolutions. The major extra storage requirement is the block-Z buffer, since it has to store the current nearest Z value of every block coordinate. The block-Z buffer size is (screen width * screen height / block size) * 4 bytes. When the screen resolution is high and the block size is small, the block-Z buffer may be larger than 100 Kbytes. When the block size is 4x4 and the screen resolution is 1600x1200, the block-Z buffer is even above 400 Kbytes. The edge-fragment buffer size depends on the block size only. It stores twice as many edge-fragment data entries as the block height, so its size is (block height * 2) * 20 bytes. The primitive buffer has a fixed size, which depends on how many primitives need to be stored in it. In this table, we assume that ten primitives are stored in the primitive buffer and that each entry is 60 bytes. The maximum total extra storage is about 470 Kbytes, and the minimum is only about 2 Kbytes.
Table 4-2-2-2-1 Extra storage requirement with various block sizes and resolutions
Block size  Resolution  Block-Z buffer (KB)  Edge-fragment buffer (KB)  Primitive buffer (KB)  Total extra buffer (KB)
4x4         320*240     18.75                0.16                       0.60                   19.51
4x4         640*480     75.00                0.16                       0.60                   75.76
4x4         1280*1024   320.00               0.16                       0.60                   320.76
4x4         1600*1200   468.75               0.16                       0.60                   469.51
8x8         320*240     4.69                 0.31                       0.60                   5.60
8x8         640*480     18.75                0.31                       0.60                   19.66
8x8         1280*1024   80.00                0.31                       0.60                   80.91
8x8         1600*1200   117.19               0.31                       0.60                   118.10
16x16       320*240     1.17                 0.63                       0.60                   2.40
16x16       640*480     4.69                 0.63                       0.60                   5.91
16x16       1280*1024   20.00                0.63                       0.60                   21.23
16x16       1600*1200   29.30                0.63                       0.60                   30.52
32x32       320*240     0.29                 1.25                       0.60                   2.14
32x32       640*480     1.17                 1.25                       0.60                   3.02
32x32       1280*1024   5.00                 1.25                       0.60                   6.85
32x32       1600*1200   7.32                 1.25                       0.60                   9.17
64x64       320*240     0.07                 2.50                       0.60                   3.17
64x64       640*480     0.29                 2.50                       0.60                   3.39
64x64       1280*1024   1.25                 2.50                       0.60                   4.35
64x64       1600*1200   1.83                 2.50                       0.60                   4.93
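The buffer size formulas given in this section can be checked with a short sketch. The function names are hypothetical; the sketch follows the formulas as written, so the block count is not rounded up when the screen size is not a multiple of the block size, and 1 KB is taken as 1024 bytes.

```python
def block_z_buffer_bytes(screen_width, screen_height,
                         block_width, block_height, entry_bytes=4):
    """(screen width * screen height / block size) * 4 bytes."""
    blocks = (screen_width * screen_height) / (block_width * block_height)
    return blocks * entry_bytes


def edge_fragment_buffer_bytes(block_height, entry_bytes=20):
    """(block height * 2) * 20 bytes."""
    return block_height * 2 * entry_bytes


def primitive_buffer_bytes(entries=10, entry_bytes=60):
    """Ten primitive entries of 60 bytes each, as assumed for the table."""
    return entries * entry_bytes


# Largest block-Z buffer in the table: 4x4 blocks at 1600*1200.
largest_kb = block_z_buffer_bytes(1600, 1200, 4, 4) / 1024
print(largest_kb)  # 468.75
```

The printed value matches the block-Z buffer entry for a 4x4 block size at 1600*1200, and edge_fragment_buffer_bytes(64) gives 2560 bytes, which corresponds to the 2.50 KB entry for 64x64 blocks.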
In our method, the primitive blocking unit and the blocked-Z test unit are the extra hardware. Table 4-2-2-2-2 shows the extra hardware of these two units.
Table 4-2-2-2-2 Extra hardware of primitive blocking and blocked-Z test
4.2.3 Total workload reduction with blocked-Z test
Although our method can reduce the workload of rasterization and the per-fragment early-Z test, primitive blocking and the blocked-Z test produce extra workload. In this section, we evaluate the total workload reduction by considering both the reduced and the extra workload.
Primitive blocking and rasterization are similar in operation: primitive blocking can be considered rasterization at a larger granularity. It only spends extra time calculating the nearest and farthest Z values of each block. Likewise, the blocked-Z test and the per-fragment early-Z test both perform a depth test, so the execution times of these two units can be considered the same. Table 4-2-3-1 shows the time complexity of these four execution units: primitive blocking, general rasterization, the blocked-Z test, and the per-fragment early-Z test. The latency of each computation unit, such as an adder or a multiplier, is shown in the leftmost column. The cycle times of the computation units are our assumptions. The table also shows how many computation units each execution unit needs. We then calculate the latency of each execution unit by multiplying each computation unit's latency by the number of such units, and the result is shown in the last row of table 4-2-3-1.
Table 4-2-3-1 The time complexity of four execution units
Once we know the latency of these execution units, we can derive equation (2) from them. Equation (2) calculates the total change in workload under our blocked-Z test method, that is, how much of the workload of rasterization and the per-fragment early-Z test can be reduced.
Using the equation (2), we can get the total reduced workload of blocked-Z test.