Organization - MPEG-4/H.264視訊壓縮標準於嵌入式系統之最佳化研究

Chapter 1. Introduction

1.3 Organization

The rest of this dissertation is organized as follows. Chapter 2 presents the efficient hierarchical motion estimation algorithm and its VLSI architecture. The register-based platform-independent MPEG-4 co-processor and its system-level design will be proposed in Chapter 3. In Chapter 4, the adaptive motion estimation algorithm for vehicle surveillance videos will be addressed. The real-time driving assist and surveillance system will be described in Chapter 5. Finally, some conclusions and future research perspectives will be stated in Chapter 6.

Chapter 2. Efficient Hierarchical Motion Estimation Algorithm and Its VLSI Architecture

This chapter addresses the development and hardware implementation HMEA using multi-resolution frames to reduce the computational complexity. Excellent estimation performance is ensured using an averaging filter to downsample the original image. At the smallest resolution, the least two MV candidates are selected using FSBMA. At the middle level, these two candidate MVs are employed as the center points for small range local searches. Then, at the original resolution, the final MV is obtained by performing a local search around the single candidate from the middle level. HMEA exhibits regular data flow and is suitable for hardware implementation. An efficient VLSI architecture that includes an averaging filter to down-sample the image and two 2D semi-systolic PE arrays to determine the SAD in pipeline is also presented. Simulation results indicate that HMEA is more area-efficient and faster than many full-search and multi-resolution architectures while maintaining high video quality. This architecture with 59K gates and 1,393 bytes of RAM is implemented for a search range of [-16.0, +15.5].

2.1. Introduction

The most complex part of popular video compression standards, including MPEG-4, MPEG-2, and MPEG-1, is ME [1][3][4]. The goal of ME is to remove the

temporal redundancies existing in adjacent frames, and the block-matching algorithm is used as a method for most of the video coding systems. It is used to find a block which is most similar to a current block within a pre-defined search area in a reference frame, and it dominates the encoded image quality, the compression ratio, and the computation time. The reference frame is a previously-encoded frame from the sequence and is before the current frame in the display order. The straight forward method to perform the operation is FSBMA, but it requires lots of manipulations due to its high complexity. Usually, FSBMA spends about 70% of the total encoding time, and this heavy computational load limits the performance of the encoder in terms of encoding speed and power consumption. Therefore, many VLSI architectures for FSBMA have been proposed for fast hardware implementation [21]-[29]. In these architectures, a result is observed that although FSMBA is easy to be implemented and can provide better compression quality, it has either large chip area or low speed.

Traditionally, frameworks of FSBMA are block-level pipelined, where one reference block is considered at a time and the parameters are reset before starting another reference block. Compared with them, the frame-level pipelined FSBMA implementations can achieve nearly 100% fully pipelined computation by exploiting the explicit frame-level parallelism [30]. He et al. proposed a new two-level nested Do-loop FSBMA and a novel 2D array ME architecture [31]. However, its PE array size is fixed to N , and will limit the capability. Therefore, they extend their design ² [32], and develop a scalable improved frame-level pipelined architecture, which reduce the internal FIFOs and increase the speed of [31]. It contains 1,024 PEs and can manipulate an MV in 256 cycles within a search range of [-16, +15].

To reduce the number of search steps of FSBMA in order to increase the overall

(SEA) [33]-[35], partial distortion elimination (PDE) [36], the winner-update algorithm [37], and the advanced diamond search algorithm (DSA) [38] are proposed to reduce the computational heavy load of FSBMA while maintaining its quality.

However, the irregular data flow makes these algorithms suitable only for software implementation owing to their inability to determine exactly how many of SAD operations are required to calculate a single MV. Huang et al. proposed a new block matching algorithm called the global elimination algorithm (GEA), which is modified from SEA [39][40]. GEA has a more regular data flow than SEA. Moreover, the processing cycles are fixed, no initial guess is needed, and the conditional branch that applies when a candidate block cannot satisfy the criterion for early termination is removed. Although GEA is easily implemented and capable of providing good quality, it requires an operating frequency of 19.42 MHz to manipulate the MVs of CIF image in real-time.

Besides, in order to refine the accuracy of DSA, several new algorithms, such as motion vector field adaptive search technique (MVFAST) [41], predictive MVFAST (PMVFAST) [42], and enhanced predictive zonal search (EPZS) are proposed [43].

MVFAST improve DSA in both terms of visual quality and speed up by initially considering a small set of predictors. Unlike DSA where only a large moving diamond pattern was considered, MVFAST also introduced a smaller moving diamond. PMVFAST uses basically the same architecture and patterns as MVFAST does, but a significant difference of PMVFAST compared to MVFAST is the way the small versus the large diamond is selected. Dissimilar to MVFAST where motion was characterized as low, medium, or high by considering the largest motion vector candidate, in PMVFAST a different selection strategy, which can improve the overall speed of the algorithm by using the large diamond less often, is used. Furthermore,

EPZS that improves upon PMVFAST by considering several other additional predictors in the generalized predictor selection phase of PMVFAST. EPZS also selects a more robust and efficient adaptive threshold calculation where as, due to the high efficiency of prediction stage, the pattern of the search can be considerably simplified. However, the disorderly early termination of the search procedure still leads to the poor performance. An architecture, which combines PMVFAST and EPZS, is developed, and it can be configured to support different search patterns, and independent SAD computations [44]. The implementation results show that it requires 1,042 cycles to manipulate an MV, and it does not entirely complete the PMVFAST and EPZS due to their high complexity.

Another ME algorithm that can significantly reduce the computational complexity by decreasing the number of computations is the hierarchical motion vector search algorithms (HMVSA), including three-step search (3SS) [45], new three-step search [46], and four-step search [47], which separate the estimation process into several levels, and the numbers of levels is fixed. HMVSA has regular data flow, and the total execution time is constant, so HMVSA is suitable for hardware implementation. However, HMVSA suffers from a considerably lower PSNR than FSBMA, especially when the motion field is large and complex.

A particular HMVSA is developed to solve this problem, MMEA, whose basic idea is to make an initial coarse estimate and then refine it. Conventional MMEAs are usually implemented in two ways. One is to use a variable search area at each level [48]-[51], and the other is to apply a constant search area [52]-[54]. In the former, an MV candidate is obtained from a large search area at the coarse level and the candidate becomes the search center of the next level, which has a smaller search area.

A larger search area corresponds to a more accurate MV, but the extent of motion may increase with the search area. Therefore, the first MV candidate may not be a good estimate, and will yield an incorrect result at the next levels. Although the latter approach can partially solve this problem since the search area is constant at all levels, the MVs may be less robust against noise.

The above MMEAs that choose only one MV candidate fall easily into the local minimum, so numerous algorithms that combine the scheme with a multiple MV candidate search have been proposed [55]-[59]. However, these methods have a high computational cost to get the prediction performance close to that of FSBMA, because multiple MV candidates are required for local searches at each level. In these algorithms, the method for down-sampling the image is to select one of four pixels in a block. This method may be inappropriate if the block is the edge of a video object, and will influence the image quality, in terms of PSNR of the image. Accordingly, more MV candidates are required to yield an encoded image quality close to that of FSBMA. If the MV candidates are not only chosen by the basis of minimum SAD, such as by the neighborhood relaxation scheme in [58] or the four candidates, which correspond to four differently superblocks[51], the complexity will be increased.

Many hardware architectures for MMEA have been implemented [49], [60]-[63].

In [49], the framework is at the expense of a chip area because the on-chip memory is large. Each multi-resolution level in [60] and [61] has its own specific systolic array, which cannot commonly be applied among different levels, reducing the performance in terms of logic gate usage. [62] has a small chip area, but the reuse of data and the SAD computations are inefficient. Therefore, the overall speed is reduced, and it will limit its applications that require low operating frequency to save the power

consumption such as mobile phones and portable multimedia recorders. In [63], although the reuse of data is efficient, a large on-chip memory is required.

In this chapter, HMEA and its VLSI architecture are proposed. The main contributions of this chapter are to analyze several downsampling methods, discuss which method is suitable for hardware implementation, and derive a high speed pipeline VLSI architecture. HMEA adopts an averaging filter to downsample the original image, which is the first step of the estimation progress. In these hardware frameworks for MMEA [49], [60]-[63], the downsampling methods are not addressed due to the hardware cost. However, when superior quality is obtained in the anterior part of the ME, the refining procedures can be significantly shorter and the complexity can be reduced accordingly. HMEA can achieve almost the same coding performance as FSBMA in terms of PSNR, but HMEA is faster. The MV will be more credible and the search speed is higher as the search area increases. Furthermore, the proposed HMEA limits the number of MV candidates to two at the coarse level, and sets the total number of levels to three to solve the significant problem of the local minimum. An averaging filter is employed, so a single candidate at the final level suffices to provide the desired performance.

A high-speed pipeline VLSI architecture with a reasonable chip area for HMEA is also addressed. It utilizes an efficient 2D PE array to compute the SADs, and the search range can be doubled without adding any hardware. The architecture is suitable for VLSI implementation because the number of the computations for each macro block (MB) is fixed. HMEA is faster and more area-efficient than numerous existing full-search and multi-resolution architectures.

在文檔中 MPEG-4/H.264視訊壓縮標準於嵌入式系統之最佳化研究 (頁 20-27)