The architecture we propose is a modification of the design in [4], and its outcome is the same as that of [4] (i.e., our module chooses the same modes as [4] for a block). Therefore, the proposed architecture also keeps the same image quality as [4]. Our comparison focuses on the hardware design and the hardware cost; here we discuss the main modified parts in detail.
The algorithm on which our design is based reduces the total number of RDO computations from 592 to 198 or 132, a decrease of at least about 66%, with negligible quality loss. Figure 4.1 and Figure 4.2 show the RD curves of the video sequences ‘stefan’ and ‘parkrun’ (QP = 24, 28, 32, 36), which contain fast motion and highly detailed content. Results for the other test sequences are listed in Table 4.1.
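The reduction quoted above can be checked with a few lines of arithmetic. The sketch below is only illustrative; the function name `rdo_reduction` is ours and does not come from [4] or [5].

```python
def rdo_reduction(original: int, reduced: int) -> float:
    """Return the percentage decrease from `original` to `reduced`."""
    return 100.0 * (original - reduced) / original


FULL_SEARCH = 592  # RDO computations without fast mode decision (from the text)

for remaining in (198, 132):
    saving = rdo_reduction(FULL_SEARCH, remaining)
    print(f"{FULL_SEARCH} -> {remaining}: {saving:.1f}% fewer RDO computations")
```

With these numbers the decrease is about 66.6% when 198 computations remain and about 77.7% when 132 remain.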
Figure 4.1: RD curve of stefan (CIF, 150 frames)
Figure 4.2: RD curve of parkrun (1280x720p, 150 frames)
Table 4.1: Performance of the architecture on various video sequences
Sequences ΔBitrate (%) ΔPSNR (dB)
Bus* 1.80 -0.08
Coastguard* 1.57 -0.06
Container* 0.57 -0.02
Football* 5.57 -0.28
Husky* 1.00 -0.08
Mobile* 2.29 -0.11
Panzoom* 1.97 -0.06
Paris* 2.52 -0.11
Silent* 2.86 -0.10
Stockholm* 2.21 -0.09
Table* 2.28 -0.09
Tempete* 2.29 -0.10
City** 0.54 -0.01
Crew** 2.23 -0.05
Harbour** 1.98 -0.07
Knightshields** 1.37 -0.03
parkrun** 0.97 -0.05
*: CIF, **: 1280x720p
Table 4.1 lists the simulation results extracted from [5] for various video sequences. As Table 4.1 shows, the maximum bit-rate overhead (5.57%) and the maximum PSNR drop (0.28 dB) both occur for the sequence ‘football’. In the other cases, only a small bit-rate overhead (less than 3%) and an almost negligible PSNR loss (less than 0.2 dB) are observed.
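The two claims about Table 4.1 can be verified mechanically. The following sketch copies the values from the table; the helper names `worst_case` and `others_within` are ours and purely illustrative.

```python
# (bit-rate change in %, PSNR change) per sequence, copied from Table 4.1
TABLE_4_1 = {
    "Bus": (1.80, -0.08), "Coastguard": (1.57, -0.06), "Container": (0.57, -0.02),
    "Football": (5.57, -0.28), "Husky": (1.00, -0.08), "Mobile": (2.29, -0.11),
    "Panzoom": (1.97, -0.06), "Paris": (2.52, -0.11), "Silent": (2.86, -0.10),
    "Stockholm": (2.21, -0.09), "Table": (2.28, -0.09), "Tempete": (2.29, -0.10),
    "City": (0.54, -0.01), "Crew": (2.23, -0.05), "Harbour": (1.98, -0.07),
    "Knightshields": (1.37, -0.03), "parkrun": (0.97, -0.05),
}


def worst_case(table):
    """Return the sequences with the largest bit-rate overhead and PSNR drop."""
    max_rate = max(table, key=lambda name: table[name][0])
    max_drop = min(table, key=lambda name: table[name][1])
    return max_rate, max_drop


def others_within(table, exclude, rate_limit, drop_limit):
    """Check that every sequence except `exclude` stays below both limits."""
    return all(
        rate < rate_limit and -psnr < drop_limit
        for name, (rate, psnr) in table.items()
        if name != exclude
    )


print(worst_case(TABLE_4_1))
print(others_within(TABLE_4_1, "Football", rate_limit=3.0, drop_limit=0.2))
```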
Our design has two main modifications, both of which contribute to lower hardware cost and higher processing efficiency. The original design and the modified scheme are discussed in detail below.
Figure 4.3: Architecture of the gradient vector calculator
Figure 4.4: Pixel access rule
The original architecture of the gradient vector calculator is shown in Figure 4.3, and its access scheme is illustrated in Figure 4.4. It needs nine cycles to calculate all the gradient vectors of a 4 x 4 block. In each cycle, four pixels are loaded from the 4 x 4 block buffer and three subtractions are processed simultaneously. In the first four cycles, pixels are fetched row by row and the Gx components are calculated. In the later four cycles, the Gy components are processed. Consequently, a register array is needed to store the partial results.
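The access pattern of the original calculator can be modeled in software as follows. This is a behavioral sketch under stated assumptions, not the RTL of [4]: we assume that the three subtractions in a cycle are the differences of neighboring pixels in the fetched line, and that the Gy pass fetches column by column. The text specifies neither of these, nor what happens in the ninth cycle, so the model covers only the eight fetch cycles. The function name `original_gradient_pass` is hypothetical.

```python
def original_gradient_pass(block):
    """Model the eight fetch cycles described in the text for a 4x4 block.

    ASSUMPTION: the three subtractions per cycle are differences of
    neighboring pixels in the fetched line; the exact operands are defined
    by Figure 4.4, which is not reproduced here.
    """
    assert len(block) == 4 and all(len(row) == 4 for row in block)
    gx_regs, gy_regs = [], []  # stand-in for the register array of partial results
    cycles = 0

    for r in range(4):  # cycles 1-4: fetch one row per cycle
        line = block[r]
        gx_regs.append([line[c + 1] - line[c] for c in range(3)])
        cycles += 1

    for c in range(4):  # cycles 5-8: fetch one column per cycle (assumed)
        line = [block[r][c] for r in range(4)]
        gy_regs.append([line[r + 1] - line[r] for r in range(3)])
        cycles += 1

    return gx_regs, gy_regs, cycles


example = [[1, 2, 4, 7], [0, 0, 0, 0], [5, 5, 5, 5], [3, 2, 1, 0]]
gx, gy, cycles = original_gradient_pass(example)
print(gx[0], gy[0], cycles)
```

In this model all partial differences have to be held until both passes are finished, which illustrates why the original scheme needs a register array.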
Our design uses a concept similar to that of the second related work discussed in Chapter 2, i.e., arranging the input data in an order that is efficient for processing. In each cycle, one virtual pixel (Gx and Gy) is produced; therefore, the register array can be removed and the number of adders is reduced as well.
Figure 4.5: Histogram cell update control
To process the Gx and Gy components, the original design uses the algorithm listed in Eq. 2.12. Implementing this algorithm requires at least eight adders operating simultaneously and extra hardware to deal with the sign bits. The generated values (i.e., err0, err1, err3, … and err8) are processed by the histogram cell update control shown in Figure 4.5. Since the mode with the second-minimum error must be adjacent to the mode with the minimum error, the minimum is first searched among err0, err1, err3 and err4, which correspond to mode 0, mode 1, mode 3 and mode 4. The results are then used to find the second-minimum error.
The proposed architecture uses the sign bits to determine the direction, and the Gx and Gy components (which have the same data size as err) are processed directly. Therefore, only three comparators are needed to perform this procedure, and the adders and the histogram cell update control can be replaced. Table 4.2 compares the hardware cost of our design with that of Li’s in the main module of the 4 x 4 block processing core. As the table shows, the hardware cost of our proposal is very low.
Table 4.2: Comparison of hardware cost in the main module
Li’s Proposed design
Our design has another modification in the mode decision parts; we take the 4 x 4 block mode decision as the example. In the 4 x 4 block processing core, the histogram unit holds the amplitudes of each mode. If mode decision is performed only after the amplitudes of all virtual pixels have been processed, an eight-cell row must be re-ranked in decreasing order, which takes 28 comparisons and is too expensive to implement.
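The figure of 28 is consistent with comparing every pair of cells once, i.e., n(n-1)/2 comparisons for n cells. That the ranking is done by such an all-pairs comparison is our assumption; the sketch below simply enumerates the pairs, and the name `all_pairs_comparisons` is ours.

```python
from itertools import combinations


def all_pairs_comparisons(n: int) -> int:
    """Count the comparisons needed when every pair of n cells is compared once."""
    return sum(1 for _ in combinations(range(n), 2))


print(all_pairs_comparisons(8))  # eight histogram cells
```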
In our proposal, mode decision is executed each time a virtual pixel is produced. Only five components then need to be re-ranked, and the number of comparisons is reduced to 10. Moreover, we design this module in three stages; the information generated in the first stage is reused in the second stage, i.e., the sorter, so three more comparing operations are saved. Finally, the hardware cost of the 4 x 4 block mode decision in our proposal is six 3-bit equality comparators and seven 13-bit comparators, instead of ten 16-bit comparators.
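The comparator counts in this paragraph can be traced with simple arithmetic. The sketch below assumes, as a crude proxy that is ours and not the paper's, that the cost of a comparator scales with its input width and that equality and magnitude comparators of the same width cost the same; real gate counts will differ.

```python
def pairwise(n: int) -> int:
    """Comparisons needed to compare every pair of n values once."""
    return n * (n - 1) // 2


comparisons = pairwise(5)             # five components to re-rank
after_reuse = comparisons - 3         # three comparisons saved by reusing stage-1 results

# ASSUMPTION: total comparator width in bits as a rough size proxy.
proposed_bits = 6 * 3 + 7 * 13        # six 3-bit equality + seven 13-bit comparators
baseline_bits = 10 * 16               # ten 16-bit comparators

print(comparisons, after_reuse, proposed_bits, baseline_bits)
```

Under this proxy the proposed mode decision uses 109 comparator bits against 160 for the ten 16-bit comparators.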
Table 4.3 compares our proposal with the related works. All three works are implemented in 0.18 μm technology, and our proposal has the lowest hardware cost (45.2% lower than Li’s and 15.8% lower than Wang’s at the maximum operating frequency), the highest operating frequency and the shortest processing time for one macroblock. The power dissipation of our design is also lower than Wang’s. These advantages make our proposal more favorable for H.264/AVC real-time systems as the resolution increases. The percentage of hardware cost of each module of our proposal is listed in Tables 4.4 and 4.5.
Table 4.3: Comparison of the implementations
Wang’s Li’s Proposed Proposed
(1 MacroBlock) 416 cycles 210 cycles 183 cycles 183 cycles
Processing time (1 MacroBlock) 6240 ns 1050 ns 732 ns 3660 ns
Gate counts 10.3k
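The clock frequency behind each column of Table 4.3 is not legible in the table as it stands, but it can be inferred from the cycle count and the processing time, since time = cycles / frequency. The sketch below performs this inference; the function name `implied_clock_mhz` is ours, and the resulting frequencies are derived values, not figures quoted from the related works.

```python
def implied_clock_mhz(cycles: int, time_ns: float) -> float:
    """Clock frequency in MHz implied by a cycle count and a processing time."""
    return cycles * 1000.0 / time_ns


# (cycles, processing time in ns) per column of Table 4.3
COLUMNS = {
    "Wang's": (416, 6240),
    "Li's": (210, 1050),
    "Proposed (first column)": (183, 732),
    "Proposed (second column)": (183, 3660),
}

for name, (cycles, time_ns) in COLUMNS.items():
    print(f"{name}: {implied_clock_mhz(cycles, time_ns):.1f} MHz")
```

The two values obtained for the proposed design, 250 MHz and 50 MHz, agree with the operating points named in the captions of Tables 4.4 and 4.5.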
Table 4.4: Percentage of hardware cost of each module (250 MHz)
Gate counts 21.1% 19.5% 7.5% 3.9% 28.3% 19.7%
Power dissipation 31.6% 19.8% 4% 4% 23.7% 16.9%
Table 4.5: Percentage of hardware cost of each module (50 MHz)
Module
Power dissipation 19% 25% 6% 6% 23% 21%