
Chapter 2 High Performance H.264 Motion Estimator for HDTV

2.6 Integrated Design

2.6.1 Integrated video quality analysis

TABLE 2-15 presents the simulation results for different combinations of the proposed algorithms: PMRME, mode filtering, and SIFME. These results also include the bit truncation used in this design to reduce the hardware cost. The simulation environment is as follows: rate-distortion optimization (RDO) is off, only the first frame is an I-frame, and the search range is [-128, 127]. All simulation results are compared with those of the default full search algorithm in JM9.0 [27]. The table reports only the average performance at each QP. The test sequences are all 720p: Stockholm, parkrun, mobile calendar, and shields. The frame rate is 30 frames per second and 300 frames are coded.
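As a rough illustration of how per-sequence differences against the full search baseline might be averaged into one table entry, consider the following minimal sketch. It assumes the usual definitions (PSNR difference in dB, relative bit-rate difference in percent); the function names, sequence names, and numbers are hypothetical and are not measurements from this work.

```python
def psnr_increase_db(psnr_proposed, psnr_full_search):
    """PSNR change in dB; a negative value is a quality loss versus full search."""
    return psnr_proposed - psnr_full_search


def bitrate_increase_pct(bitrate_proposed, bitrate_full_search):
    """Relative bit-rate change in percent; a negative value is a bit-rate saving."""
    return 100.0 * (bitrate_proposed - bitrate_full_search) / bitrate_full_search


def average(values):
    return sum(values) / len(values)


# Hypothetical (made-up) results per sequence: (PSNR in dB, bit rate in kbps).
full_search = {"seq_a": (36.00, 8000.0), "seq_b": (34.50, 12000.0)}
proposed = {"seq_a": (35.95, 8080.0), "seq_b": (34.40, 12240.0)}

psnr_deltas = [
    psnr_increase_db(proposed[s][0], full_search[s][0]) for s in full_search
]
rate_deltas = [
    bitrate_increase_pct(proposed[s][1], full_search[s][1]) for s in full_search
]

# One averaged entry per metric, as a table reporting averages would list it.
avg_psnr_inc = average(psnr_deltas)
avg_rate_inc = average(rate_deltas)
print(f"PSNR inc. {avg_psnr_inc:.3f} dB, bit rate inc. {avg_rate_inc:.2f} %")
```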

TABLE 2-15 shows the PSNR change, the bit-rate increase, and the “Sharing Rate of L0 (Level 0) Buffer”. The sharing rate of level 0 denotes the percentage of MBs for which FME can directly reuse the level 0 search range buffer for its computation, which saves memory bandwidth.

This sharing occurs whenever the final MV is within the level 0 search range. In our design, the sharing rate is at least 90%, and a higher QP gives a higher sharing rate and thus saves more power and bandwidth.
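The sharing rate defined above can be sketched as a simple count over the final MVs. This is only an illustration: the level 0 range used here, `ASSUMED_L0_RANGE`, is a hypothetical value chosen for the example (this section does not state the actual range), and the MV list is made up.

```python
def mv_in_level0_range(mv, l0_range):
    """True if both MV components fall inside the level 0 search range."""
    mvx, mvy = mv
    lo, hi = l0_range
    return lo <= mvx <= hi and lo <= mvy <= hi


def l0_sharing_rate(final_mvs, l0_range):
    """Percentage of MBs whose final MV lets FME reuse the level 0 buffer."""
    shared = sum(1 for mv in final_mvs if mv_in_level0_range(mv, l0_range))
    return 100.0 * shared / len(final_mvs)


# Hypothetical level 0 range and made-up final MVs, one per MB.
ASSUMED_L0_RANGE = (-16, 15)
final_mvs = [(0, 0), (3, -7), (-16, 15), (40, 2), (-5, -100)]

rate = l0_sharing_rate(final_mvs, ASSUMED_L0_RANGE)
print(f"L0 sharing rate: {rate:.2f} %")
```

MBs whose final MV falls outside the range are the ones for which FME cannot reuse the level 0 buffer, so a higher rate means fewer extra reference fetches.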

The table shows that the performance of PMRME is almost the same as that of full search. The average PSNR drop is only 0.005 dB, and the bit-rate even decreases slightly compared with full search. This is because PMRME ignores the smaller blocks in level 1 and level 2 and prefers larger blocks, which lowers the bit-rate. Mode filtering also prefers larger block sizes, so its bit-rate decrease is more pronounced. In contrast, its PSNR drop is slightly larger than with PMRME alone, but the worst-case quality drop is only 0.095 dB. While [...] techniques, the PSNR is almost the same and the average bit-rate increase rises slightly to 2.11%.

TABLE 2-16 shows the performance of our proposed algorithms for 1080p video sequences. The performance is not as good as for 720p video, especially in terms of bit-rate increase: the average bit-rate increase reaches 3.07%. This is because 1080p sequences prefer larger block sizes than 720p or smaller sequences do, which agrees with the tendency of our proposed algorithms. Therefore, our algorithms do not provide as much bit-rate reduction as they do for the smaller sequences shown in TABLE 2-1, TABLE 2-2, TABLE 2-9, and TABLE 2-10. However, the quality loss is still acceptable, and the average sharing rate is also higher than 90% for 1080p sequences.

TABLE 2-15 PSNR and bit-rate change of the proposed algorithms compared with full search for 720p sequences

Frame size: 720p

[...] | Sharing rate of L0 buffer (%) | n.a. | 96.95 | 96.08 | 96.10
QP20 | PSNR inc. (dB) | 0 | -0.095 | -0.0825 | -0.117
QP20 | Bit rate inc. (%) | -0.49 | -1.24 | -0.013 | 1.80
QP20 | Sharing rate of L0 buffer (%) | n.a. | 97.84 | 96.86 | 96.87
[...] | Sharing rate of L0 buffer (%) | n.a. | 98.78 | 98.21 | 98.19
QP32 | PSNR inc. (dB) | -0.01 | -0.0525 | -0.045 | -0.09
QP32 | Bit rate inc. (%) | 1.56 | 1.14 | 2.18 | 2.90
QP32 | Sharing rate of L0 buffer (%) | n.a. | 99.00 | 98.30 | 98.31
Avg | PSNR inc. (dB) | -0.005 | -0.075 | -0.0645 | -0.1
Avg | Bit rate inc. (%) | -0.017 | -0.69 | 0.52 | 2.11
Avg | Sharing rate of L0 buffer (%) | n.a. | 98.18 | 97.44 | 97.44

TABLE 2-16 PSNR and bit-rate change of the proposed algorithms compared with full search for 1080p sequences

Frame size: 1080p

[...] | Sharing rate of L0 buffer (%) | n.a. | 93.61 | 90.64 | 92.73
QP20 | PSNR inc. (dB) | -0.01 | -0.04 | -0.04 | -0.08
QP20 | Bit rate inc. (%) | -0.44 | -0.22 | 2.65 | 5.04
QP20 | Sharing rate of L0 buffer (%) | n.a. | 95.6 | 94.12 | 94.48
[...] | Sharing rate of L0 buffer (%) | n.a. | 95.97 | 95.17 | 95.19
QP32 | PSNR inc. (dB) | 1.68 | -0.09 | -0.1 | -0.08
QP32 | Bit rate inc. (%) | 1.56 | 1.59 | 3.06 | 3.44
QP32 | Sharing rate of L0 buffer (%) | n.a. | 95.65 | 94.79 | 95.05
Avg | PSNR inc. (dB) | -0.04 | -0.07 | -0.07 | -0.08
Avg | Bit rate inc. (%) | 0.15 | -0.1 | 2.04 | 3.07
Avg | Sharing rate of L0 buffer (%) | n.a. | 95.36 | 93.94 | 94.49

2.6.2 Integrated architecture

Fig. 2-21 shows the overall block diagram of the full ME module. It contains the IME, the FME, several memory buffers, and an external data access interface. The overall flow is as described in Fig. 2-5(b).

To enable data reuse between IME and FME, the IME module has three internal SRAMs for reference pixel storage. When the IME search of an MB is completed, its macroblock information is sent to FME, and the reference pixels in the level 0 SRAM must also be made available to FME. However, instead of moving the data, we use three SRAMs as level 0 buffers and swap their roles using a ping-pong buffer concept. The three level 0 buffers comprise one for the IME level 0 reference, one for FME, and one for loading new data from external memory. When the IME stage completes the coding of the first MB, the buffer holding the level 0 reference for the first MB becomes the FME reference in the next stage. At the same time, the buffer currently serving as the FME reference is switched to load the data of the third MB from external memory for later use. The buffer that is by then filled with the reference data for the second MB is switched to serve as the IME level 0 reference.
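The role rotation of the three level 0 buffers described above can be modeled in a few lines. This is a behavioral sketch only, not the hardware implementation: the class name, the role labels, and the SRAM indices 0 to 2 are hypothetical names introduced for illustration.

```python
class L0BufferRotation:
    """Tracks which physical SRAM (0, 1, or 2) currently plays which role."""

    def __init__(self):
        # role -> physical SRAM index at the start of the pipeline
        self.roles = {"ime": 0, "fme": 1, "load": 2}

    def advance(self):
        """Swap roles when the IME stage finishes the current MB.

        - the SRAM used as IME reference becomes the FME reference,
        - the SRAM used as FME reference starts loading a later MB,
        - the SRAM that was loading (now filled) becomes the IME reference.
        No pixel data is copied; only the role assignment changes.
        """
        old = self.roles
        self.roles = {"ime": old["load"], "fme": old["ime"], "load": old["fme"]}
        return dict(self.roles)


rotation = L0BufferRotation()
history = [dict(rotation.roles)] + [rotation.advance() for _ in range(3)]
for step, roles in enumerate(history):
    print(step, roles)
```

In this model the assignment returns to its initial state after three MBs, which is what lets three SRAMs serve an arbitrarily long MB pipeline.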

With the above ping-pong buffers, the level 0 data of IME can be shared with FME without additional memory access time. Moreover, the data in the level 0 buffers can be reused by FME for more than 90% of MBs, according to the sharing rates in TABLE 2-15. With this arrangement, the data are reused as much as possible, which substantially reduces the bandwidth.

2.6.3 Implementation results and comparisons

TABLE 2-17 shows the total hardware cost of our ME design and a comparison with the integrated designs [18][25]. Compared with [25], our design saves at least 30% of the area cost and 50% of the memory cost in the IME part. In the FME part, we save 82.8% of the area cost, owing to the smaller number of PUs, and 81.2% of the memory. In summary, the total area and memory savings are 60% and 65.78%, respectively. As for throughput, our design is sufficient for HD video applications and improves throughput by 75% compared with [25]. Compared with the other integrated design [11], which uses fast algorithms in IME, our design still saves 12.3% of the area.
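To put the throughput requirement in concrete terms, the following sketch works out the macroblock rate of 720p video and the per-MB cycle budget at a given clock. It assumes 16x16 macroblocks, 30 frames per second (the frame rate used in the simulations), and the 28.5 MHz clock quoted in this section for 720p; the function names are hypothetical. The result is only an upper bound on the cycles available per MB, not the design's actual cycle count.

```python
MB_SIZE = 16  # H.264 macroblock width and height in pixels


def macroblocks_per_frame(width, height):
    """Number of 16x16 MBs per frame, rounding partial MBs up."""
    mbs_wide = (width + MB_SIZE - 1) // MB_SIZE
    mbs_high = (height + MB_SIZE - 1) // MB_SIZE
    return mbs_wide * mbs_high


def cycle_budget_per_mb(clock_hz, width, height, fps):
    """Clock cycles available per MB for real-time operation."""
    mb_rate = macroblocks_per_frame(width, height) * fps
    return clock_hz / mb_rate


mbs_720p = macroblocks_per_frame(1280, 720)            # 80 * 45 MBs
mb_rate_720p = mbs_720p * 30                           # MBs per second at 30 fps
budget_720p = cycle_budget_per_mb(28.5e6, 1280, 720, 30)
print(mbs_720p, mb_rate_720p, round(budget_720p, 1))
```

Under these assumptions a 720p frame holds 3600 MBs, so the ME must process 108000 MBs per second, which leaves roughly 264 cycles per MB at 28.5 MHz.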

In terms of cycle count, our design also achieves a 75.5% throughput improvement over [18]. Owing to this high throughput, only 28.5 MHz is enough for 720p sequences with 30

Fig. 2-21. The block diagram of IME and FME.
