Wide-Range Motion Estimation Architecture with Dual Search Windows for High Resolution Video Coding


IEICE TRANS. FUNDAMENTALS, VOL.E91–A, NO.12 DECEMBER 2008, p. 3638. PAPER. Special Section on VLSI Design and CAD Algorithms.

Lan-Rong DUNG†a), Member and Meng-Chun LIN†, Nonmember

SUMMARY This paper presents a memory-efficient motion estimation (ME) technique for high-resolution video compression. The main objective is to reduce external memory access, especially when local memory resources are limited. Reducing memory access in turn saves power. The reduction relies on center-biased algorithms, which perform the motion vector (MV) search with the minimum amount of search data. To preserve data reusability, the proposed dual-search-windowing (DSW) approaches load the secondary window only when the search requires it. This lightens the loading of search windows and hence reduces the required external memory bandwidth. The proposed techniques save up to 81% of external memory bandwidth and require only 135 MBytes/sec, while the quality degradation is less than 0.2 dB for 720p HDTV clips coded at 8 Mbits/sec.
key words: motion estimation, MPEG, video compression, bandwidth

1. Introduction

Motion estimation (ME) is widely recognized as the most critical part of video compression in standards such as MPEG and H.26x [1]–[5]. It tends to dominate the computational requirements and hence the power requirements. As the demand for high-resolution, high-quality video systems increases, the implementation of motion estimation is becoming more costly and power-consuming. Among the hardware components of motion estimation, the on-chip memory is the one that dominates power consumption and cost. Because the on-chip memory is too small to store a high-resolution frame, there is a tradeoff between the external memory bandwidth and the on-chip memory size.
The less on-chip memory is used in motion estimation, the higher the required external memory bandwidth. Three factors affect the tradeoff: the data reuse mechanism, the size of the search window, and the efficiency of external memory access. The first two factors can be exploited at the architecture level, while the last can be improved in the DRAM controller.

Manuscript received March 24, 2008. Manuscript revised June 24, 2008.
† The authors are with the Department of Electrical and Control Engineering, National Chiao-Tung University, Hsinchu 30010, Taiwan.
a) E-mail: lennon@faculty.nctu.edu.tw
DOI: 10.1093/ietfec/e91–a.12.3638

In the past decade, various algorithms have been proposed to improve the performance of ME in terms of compression ratio and computational cost; however, very few works present solutions for data reusability while analyzing the required external memory bandwidth. Paper [6] is one of them: it defines data reuse levels for a Full-Search Block-Matching (FSBM) ME architecture to minimize the external memory bandwidth. The FSBM algorithm with the Sum of Absolute Difference (SAD) criterion is the most popular choice for motion estimation because of its considerably good quality [23], [35]–[37]. It is particularly attractive to those who require extremely high quality. However, the full search algorithm needs a high computational load and a large memory size, which are major problems in the implementation of motion estimation. To reduce the computational complexity of FSBM, researchers have proposed various fast block-matching algorithms (FBMAs), either by reducing the number of search steps [15]–[21] or by simplifying the calculation of the error criterion [22]–[31]. We categorize the former as center-biased algorithms and the latter as criterion-simplifying algorithms. By combining step reduction and criterion simplification, some researchers have proposed two-phase algorithms to balance complexity and quality [32]–[34].
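To make the FSBM baseline concrete, the following is a minimal sketch of SAD-based full-search block matching. It is illustrative only and is not the authors' implementation: the names `sad` and `full_search`, the list-of-lists frame layout, and the choice to skip candidates that fall outside the frame (rather than pad them) are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # luma samples indexed as frame[row][column]


def sad(cur: Frame, ref: Frame, bx: int, by: int,
        dx: int, dy: int, n: int = 16) -> int:
    """Sum of absolute differences between the n x n block at (bx, by) in
    `cur` and the block displaced by (dx, dy) in `ref`."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur: Frame, ref: Frame, bx: int, by: int,
                search_range: int, n: int = 16) -> Tuple[Tuple[int, int], Optional[int]]:
    """Evaluate every displacement in [-search_range, +search_range]^2 and
    return the (dx, dy) with the smallest SAD together with that SAD."""
    height, width = len(ref), len(ref[0])
    best_mv: Tuple[int, int] = (0, 0)
    best_sad: Optional[int] = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x0, y0 = bx + dx, by + dy
            # Assumption: candidates outside the frame are skipped, not padded.
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dx, dy), cost
    return best_mv, best_sad
```

The two nested displacement loops visit (2R+1)^2 candidates per block for a search range of ±R, which is the computational and memory cost that the fast algorithms discussed here aim to avoid.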
It has been shown that these fast algorithms can significantly reduce the computational load with little quality degradation. The center-biased algorithms are also well suited to reducing the external memory bandwidth: they are motivated by the statistical observation that most motion vectors are centered around (0,0), so only a small portion of the search window needs to be accessed most of the time. For high-resolution applications, this feature helps reduce both the external memory bandwidth and the local memory requirement.

This paper presents a new windowing technique, called dual-search-windowing (DSW), for center-biased ME algorithms. DSW requires less on-chip memory than full search-windowing while maintaining high data reusability, which significantly reduces the external memory bandwidth requirement. DSW consists of a primary windowing and a secondary windowing. The primary windowing is necessary for all MV searches, and the secondary windowing is invoked only when needed. The primary window slides as the macro-block (MB) changes, so each move only requires an update of a single slice. This leads to a high degree of reusability. When the center-biased algorithm moves outside the primary window, the secondary window is loaded. Although the secondary window is not reused because it occurs only occasionally, the center-biased algorithm seldom needs it, so its impact on the external memory bandwidth requirement is low. For 720p HDTV clips, the proposed techniques save up to 81.41% of external memory bandwidth and require only 135.20 MBytes/sec, with less than 0.2 dB quality degradation for video coded at 8 Mbits/sec.

The paper is organized as follows. Section 2 introduces the primary windowing techniques and exploits their reusability. Section 3 describes the proposed DSW algorithms with comparisons. Section 4 describes the experimental results and the performance analysis in terms of external memory bandwidth, local memory size, and visual quality. Finally, Sect. 5 concludes the contributions of this work.

Copyright © 2008 The Institute of Electronics, Information and Communication Engineers

2. The Primary Windowing Techniques and Their Reusability

Center-biased ME algorithms are developed based on the observation that most MVs are located near the center point of the search window. In this paper, we use diamond search (DS) [11]–[13] and small diamond search (SDS) [7] as the target algorithms; however, the proposed approaches are not limited to these two algorithms. Figure 1 shows the search patterns of the DS and SDS algorithms. Based on simulations with D1 videos, Figs. 2 and 3 illustrate that more than 98% of MVs are located within the ±32 search range. Hence, we can use the ±32 search range for the primary window to save external memory accesses.

The primary windowing is used to load a smaller search window for most MV searches. For instance, given a D1 video sequence, the typical search window size is ±64 and we can choose ±32 as the primary window. Note that the MB size in MPEG4/AVC is 16 × 16. Therefore, the local memory size can ideally be reduced by a factor of 81/25. Figure 4(a) shows the windowing technique with a single MB. The bolded box indicates the data in local memory and the centered square is the current MB. For the MB of SW1, shown in the upper-left corner, we first load the three slices labelled 3, 4, and 5, while slices 1 and 2 are padding data generated internally by the ME engine without consuming external memory bandwidth. While the MV search runs for the MB of SW1, the primary windowing simultaneously loads slice 6 for the next MV search. The following steps of windowing show the parallel operation of MV searches and slice updates. Compared with full search windowing, the local memory size is reduced by a factor of 81/30 (or 2.7) and the external bandwidth requirement is reduced by a factor of 9/5.

To increase the degree of reusability, one can process more MBs at a time, because the data of the primary window can then be used more than once for each data loading. The penalty is an increase in local memory size. Figures 4(b)–(d) illustrate the other schemes for primary windowing; the symbol type_p is used to label the schemes. We modify the PMV search for the schemes of Figs. 4(c) and (d): when the MV of the upper-right macroblock has not been determined yet, we temporarily use the motion vector (0,0) as the MV of the upper-right macroblock. The modification might result in quality degradation under constant bit-rate control. Fortunately, the degradation is small because null MVs occur frequently. As shown in Table 1, the quality degradation is up to 0.16 dB. In the video coding community, 0.5 dB is empirically considered the threshold below which subjects cannot perceive a difference in quality.

Fig. 1 (a) Large diamond search pattern. (b) Small diamond search pattern.

Fig. 2 The analysis of accumulated probability versus size of search window for DS algorithm under the rate control for 2 Mbits/sec.

Fig. 3 The analysis of accumulated probability versus size of search window for SDS algorithm under the rate control for 2 Mbits/sec.
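The reduction factors quoted in this section (81/25, 81/30 and 9/5) follow from counting macroblocks in each window. The sketch below reproduces that arithmetic; the helper names `window_mbs` and `sram_kbits` are hypothetical, and the assumption of 8 bits per pixel is mine, chosen because it matches the local memory sizes the paper reports.

```python
from fractions import Fraction

MB = 16  # macroblock edge in pixels (16 x 16 MBs, as stated in the text)


def window_mbs(search_range: int, mb: int = MB) -> int:
    """Number of macroblocks along one edge of a +/-search_range window."""
    return (2 * search_range) // mb + 1


def sram_kbits(n_mbs: int, mb: int = MB, bits_per_pixel: int = 8) -> float:
    """Storage for n_mbs macroblocks in Kbits (assumes 8-bit pixels)."""
    return n_mbs * mb * mb * bits_per_pixel / 1024


full_edge = window_mbs(64)     # full search window: 9 MBs per edge
primary_edge = window_mbs(32)  # primary window: 5 MBs per edge

# Ideal saving: 9 x 9 MBs versus 5 x 5 MBs.
ideal_memory_gain = Fraction(full_edge ** 2, primary_edge ** 2)

# With one extra slice prefetched for the next MB, the buffer holds 5 x 6 MBs.
prefetch_memory_gain = Fraction(full_edge ** 2, primary_edge * (primary_edge + 1))

# Sliding by one MB loads one slice: 9 MBs for the full window, 5 for the primary.
bandwidth_gain = Fraction(full_edge, primary_edge)

print(ideal_memory_gain, float(prefetch_memory_gain), bandwidth_gain)
print(sram_kbits(primary_edge * (primary_edge + 1)), "Kbits for the 5 x 6 buffer")
```

Under these assumptions the 5 × 6 buffer of the single-MB scheme works out to 60 Kbits.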

Table 1 The performance analysis of visual quality versus different PMV accuracy (PSNRY; bit-rate 2 Mbits/s; search range [−64, +64]).

DS algorithm
PMV scheme                                             Night (fast motion)   Football (fast motion)   Crew (moderate motion)
All PMV                                                33 dB                 36.03 dB                 35.97 dB
Primary windowing with two vertical MBs (type_p = 3)   32.9 dB (−0.1 dB)     35.83 dB (−0.2 dB)       35.86 dB (−0.11 dB)
Primary windowing with four MBs (type_p = 4)           32.96 dB (−0.04 dB)   35.99 dB (−0.04 dB)      35.92 dB (−0.05 dB)

SDS algorithm
PMV scheme                                             Night (fast motion)   Football (fast motion)   Crew (moderate motion)
All PMV                                                32.84 dB              35.43 dB                 35.81 dB
Primary windowing with two vertical MBs (type_p = 3)   32.68 dB (−0.16 dB)   35.23 dB (−0.2 dB)       35.65 dB (−0.16 dB)
Primary windowing with four MBs (type_p = 4)           32.78 dB (−0.06 dB)   35.33 dB (−0.1 dB)       35.78 dB (−0.03 dB)

Fig. 4 (a) The primary windowing with single MB (type_p = 1). (b) The primary windowing with two horizontal MBs (type_p = 2). (c) The primary windowing with two vertical MBs (type_p = 3). (d) The primary windowing with four MBs (type_p = 4).

According to the windowing scheme, we can use the following equation to calculate the number of pixel accesses for each frame during the primary windowing and evaluate the external memory bandwidth:

\[ N_{access,p} = N_f \times N_1 - 2 m_x \times N_y \times \sum_{i=1}^{(N_1-1)/2} i, \tag{1} \]

where N_f is the frame pixel count, N_1 is the regular number of accesses of each pixel, m_x is the number of vertical pixels of the target MB set for each primary windowing, and N_y is the number of horizontal pixels of each frame. N_1 and m_x are defined as follows.

\[ N_1 = \begin{cases} 5, & \text{for } type_p = 1, 2 \\ 3, & \text{for } type_p = 3, 4 \end{cases} \tag{2} \]

\[ m_x = \begin{cases} 16, & \text{for } type_p = 1, 2 \\ 32, & \text{for } type_p = 3, 4 \end{cases} \tag{3} \]

3. Dual-Search-Window Motion Estimation Algorithms

Based on the proposed two-step memory access mechanisms, we developed three ME algorithms for SDTV/HDTV applications. The first algorithm, named the fully-expanding dual-search-window algorithm (FEDSW), expands the search range to the full search window when the MV search reaches or moves beyond the boundary of the primary window. FEDSW may have the least quality degradation, but it requires high memory bandwidth for loading the secondary windows into local SRAM. Since center-biased ME seldom goes far from the starting point, the secondary window can be set to a smaller size to save external memory accesses. Hence, we propose

the second algorithm, called the fixed-secondary-window dual-search-window algorithm (FSDSW). FSDSW limits the size of the secondary window to cut redundant external memory accesses and save local SRAM. The range of the secondary window is determined by simulating test cases with a full-sized search window. Given a range that covers most MV results, FSDSW requires low memory bandwidth while the average quality loss is small. Nevertheless, its transient quality loss can be high for some high-motion clips. To deliver a quasi-static video quality, we further propose a third algorithm, called the variable-secondary-window dual-search-window algorithm (VSDSW). VSDSW adaptively adjusts the size of the secondary window to keep the transient quality loss low and to avoid unnecessary memory accesses. The following subsections describe the proposed algorithms.

3.1 The Fully-Expanding Dual-Search-Window Algorithm (FEDSW)

FEDSW defines the primary window and four extra search windows, as shown in Fig. 5, where (2N+1) × (2N+1), (2P+1) × (2P+1) and (2N+1)/2 × (2N+1)/2 indicate the ranges of the total, primary and secondary windows, respectively. In Fig. 5, the primary window is at the center of the full search window and the secondary windows are located in the four quadrants. During the ME process, the predicted MV (PMV) is first calculated to decide the initial search point. If the PMV is located inside the primary window, FEDSW performs the MV search within the primary window. As shown in Fig. 5(a), when both the PMV and the MV are within the primary window, the secondary window is not needed. When the searching point reaches the boundary of the primary window, the secondary window is loaded to expand the search range for the right MV, as shown in Fig. 5(b).
The secondary window is selected according to the quadrant in which the searching point reaches the boundary. If the PMV is outside the primary window at the beginning, the MV search starts in the secondary window, as shown in Fig. 5(c).

Fig. 5 (a) The windowing strategy for the case when both PMV and MV are in the primary window. (b) The windowing strategy for the case, given the PMV in the primary window, when the MV search reaches the boundary of the primary window. (c) The windowing strategy for the case when the PMV is out of the primary window.

Although FEDSW can efficiently decide, for each MB, whether a secondary window is needed according to the direction of the PMV or the position of the searching point, the secondary window is still large for high-resolution video sequences. For example, when the original search window is [−64, +64] in both the horizontal and vertical directions and the primary window is [−32, +32] in both, the secondary window is a quarter of the original search window, namely [−32, +32]. The secondary window thus has the same range as the primary window. However, the statistics for six D1 test sequences show that, on average, the candidate motion vectors of 98.5% of MBs with the DS algorithm and 99.3% of MBs with the SDS algorithm can be found within the primary window ([−32, +32]). Reducing the range of the secondary window is therefore necessary to save memory accesses from DRAM to SRAM efficiently. To this end, we propose two further methods for finding a suitable secondary window: one uses a fixed range obtained through statistical analysis, and the other adaptively adjusts the range by curve fitting for video sequences with different degrees of motion. This paper therefore presents two more algorithms for the secondary window, FSDSW and VSDSW.
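The FEDSW window decision of Sect. 3.1 can be sketched as a small classification function. This is a hedged illustration rather than the authors' logic: the function names, the case labels "a", "b" and "c" (borrowed from the panels of Fig. 5), and the tie-breaking rule that treats a zero component as non-negative when choosing the quadrant are all assumptions.

```python
from typing import Optional, Tuple

Vec = Tuple[int, int]


def inside_primary(v: Vec, p: int) -> bool:
    """True when vector v lies within the +/-p primary window."""
    return abs(v[0]) <= p and abs(v[1]) <= p


def reached_boundary(v: Vec, p: int) -> bool:
    """True when the searching point is on or beyond the primary boundary."""
    return max(abs(v[0]), abs(v[1])) >= p


def quadrant(v: Vec) -> Vec:
    """Sign pair selecting one of the four secondary windows.
    Assumption: a zero component counts as non-negative."""
    return (1 if v[0] >= 0 else -1, 1 if v[1] >= 0 else -1)


def fedsw_case(pmv: Vec, point: Vec, p: int = 32) -> Tuple[str, Optional[Vec]]:
    """Classify the situation for one MB.

    "a": PMV and searching point stay inside the primary window.
    "b": PMV inside, searching point reached the primary boundary.
    "c": PMV already outside the primary window.
    The second element is the quadrant of the secondary window to load.
    """
    if not inside_primary(pmv, p):
        return ("c", quadrant(pmv))
    if reached_boundary(point, p):
        return ("b", quadrant(point))
    return ("a", None)


print(fedsw_case((0, 0), (3, -2)))       # primary window only
print(fedsw_case((5, 5), (32, -7)))      # secondary window needed
print(fedsw_case((-40, 10), (-40, 10)))  # search starts in secondary window
```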
3.2 The Fixed-Secondary-Window Dual-Search-Window Algorithm (FSDSW)

Figure 6 shows the schemes for FSDSW and VSDSW, where (2S+1) × (2S+1) is the range of the secondary window. There are four cases for the secondary windowing. Figure 6(a) shows the first case, where the PMV is in the primary window and the motion vector can be reached within the primary window. In this case, the secondary windowing is not needed. However, if the searching point touches the boundary of the primary window, the secondary windowing is invoked and the search scheme becomes the second case, as shown in Fig. 6(b). Note that the motion vector is searched until the searching point reaches the boundary of the secondary window. The third and fourth cases occur when the PMV is outside the primary window. In the third case, we perform the secondary windowing at the beginning, while the primary window is loaded. Since the primary and secondary windows do not overlap, the MV search runs within the secondary window only, as shown in Fig. 6(c). If the two windows overlap, we go

with the fourth case, as shown in Fig. 6(d). In this last case, the MV search is performed within the range covered by the primary and secondary windows.

In FSDSW, the range of the secondary window is deterministic and fixed based on statistical results. We simulated full-range ME on six D1 video sequences in order to set the size of the secondary window so that it covers most motion vectors. Table 2, for instance, gives the statistical results for the D1 clip "Night" with the small diamond search algorithm under rate control at 2 Mbits/sec. In Table 2, we counted the occurrences of the four cases and calculated the coverage for each window size. As the result shows, when the size of the secondary window is 4, the total coverage is 99.90%. This means that one can set ±4, instead of ±32, as the size of the secondary window and still obtain a coverage as high as 99.90%.

Table 3 shows the degradation without the secondary windowing. For fast-motion clips, the quality degradation without the secondary window might be greater than 0.5 dB, which is perceptible to human viewers. The degradation of the "Football" clip, for instance, is as high as 1.12 dB, because its coverage without the secondary windowing is 94.5%. To determine the size of the secondary windows, we use 0.3 dB as the threshold of quality degradation: we look for the minimum size of the secondary window that keeps the quality degradation below 0.3 dB. Table 4 lists the resulting sizes, averaged over the video sequences of each motion degree and over all test sequences, for the different bit-rates and motion estimation algorithms.

Fig. 6 Four cases for sizing of secondary window. (a) Case 1: Both PMV and motion vector are within the primary window and the secondary is not needed. (b) Case 2: The PMV is located in the primary window and the tracking of motion vector reaches the boundary of primary window. (c) Case 3: The PMV is out of the primary window and the motion vector search is not returning into the primary window. (d) Case 4: The PMV is out of the primary window and the motion vector search can go into the primary window when two sub-windows are overlapped.

Table 2 Size of secondary window versus coverage for Night D1 sequence.

Size of secondary window   Case 1   Case 2   Case 3   Case 4   Total
0                          100%     21%      13%      0%       98.88%
1                          100%     43%      61%      23%      99.47%
2                          100%     53%      80%      33%      99.72%
3                          100%     61%      89%      57%      99.84%
4                          100%     65%      93%      70%      99.90%
5                          100%     67%      96%      77%      99.93%
6                          100%     71%      97%      80%      99.95%
7                          100%     76%      98%      80%      99.97%
18                         100%     100%     100%     100%     100%
. . .                      . . .    . . .    . . .    . . .    . . .
127                        100%     100%     100%     100%     100%
128                        100%     100%     100%     100%     100%

Table 3 The quality degradation without the secondary windowing (2M bit-rate control).

Video sequence   FSBM PSNRY (dB)   FSDSW with 0-size secondary window, ΔPSNRY (dB)
Night            33.05             −0.36
Football         36.27             −1.12
Crew             35.87             −0.31
Character        31.41             −0.6
Akiyo            41.31             −0.1
Coastguard       34.91             −0.22

Table 4 Sizing of secondary window with DS/SDS algorithms in different bit-rates and motion types.

Algorithm       Bit-rate    Fast motion   Moderate motion   Slow motion   Average for all video sequences
DS algorithm    2 Mbits/s   10            6                 1             8
DS algorithm    4 Mbits/s   14            10                6             12
DS algorithm    6 Mbits/s   16            11                9             14
SDS algorithm   2 Mbits/s   5             1                 1             4
SDS algorithm   4 Mbits/s   8             3                 2             7
SDS algorithm   6 Mbits/s   9             3                 4             8

3.3 The Variable-Secondary-Window Dual-Search-Window Algorithm (VSDSW)

Instead of applying a fixed size for the secondary window, we developed VSDSW to adaptively adjust the size of the secondary window based on the SAD value of the PMV for a specific MB. As shown in Fig. 6, the motion estimation process starts after the PMV stage, because the PMV can efficiently predict a good starting point for each MB.
Hence, the range of the MV search is limited and depends on the SAD values of the PMVs. To formulate the relation between the SAD value of the PMV and the required size of the secondary window, we collected the SAD-size data shown in Fig. 7, where the search algorithm is the DS algorithm and the bit-rate is 2 Mbits/s. From Fig. 7(a), we observed that the larger the SAD value, the larger the required size of the secondary window. A secondary window smaller than 32 covers 97.9% of MBs for correct MV searches. We therefore first applied the least mean square (LMS) method to the region containing these 97.9% of MBs to find the optimally fitting first-order curve, which is the line shown in Fig. 7(b). Since the maximum size of the secondary window is 32, we made the transfer curve, Size_secondary_window = f(SAD_PMV), a piecewise-linear curve, as shown in Fig. 7(c). There are two boundary conditions: the size is zero when the SAD value of the PMV is less than 1280.7, and it is 32 when the SAD value of the PMV is greater than 9929.4. Finally, we summarized the parameters of the optimal curve and the boundary conditions for the different bit-rates and motion estimation algorithms over the six D1 test sequences in Table 5.

4. Experimental Results

The proposed architecture has been fabricated in an MPEG4 CODEC chip, named AVS-1008, using UMC 0.13 µm 1P8M technology. The core size of the chip is 5490 × 4950 µm² and the chip photograph is shown in Fig. 8. The memories of the primary and secondary windows occupy 0.506 mm² and 0.272 mm², respectively. They cost 2.87% of the whole chip. The specifications of AVS-1008 are summarized in Table 6. In AVS-1008, the motion estimator is not idle while data is being loaded into the secondary window: data loading and motion estimation are performed in parallel and their executions are pipelined. Figure 9 illustrates the pipelined schedule of the data loading and motion estimation operations. The data loading operations are performed by the direct memory access (DMA) unit and the motion estimation is performed by the motion estimator (ME).
The DMA loads image data from DRAM to four SRAM blocks. The four

Fig. 7 (a) Distribution of Size_secondary_window versus SAD_PMV. (b) The optimal fitting line using LMS method. (c) The final piece-wise linear curve for VSDSW sizing. (Note: the search algorithm is DS algorithm and bit rate is 2M bits/s)

Fig. 8 Chip photograph of AVS-1008.

Table 5 Parameters of optimal curve for VSDSW sizing with D1 video sequences. Size of secondary window = a × (SAD_PMV) + b.

Algorithm       Bit-rate     Range of SAD_PMV               a        b
DS algorithm    2M bits/s    1280.7 ≤ SAD_PMV ≤ 9929.4      0.0037   −4.7836
DS algorithm    4M bits/s    1535.8 ≤ SAD_PMV ≤ 10184       0.0037   −5.6826
DS algorithm    6M bits/s    1857.7 ≤ SAD_PMV ≤ 8814.2      0.0046   −8.5453
SDS algorithm   2M bits/s    6.8235 ≤ SAD_PMV ≤ 18830.3     0.0017   −0.0116
SDS algorithm   4M bits/s    1534.8 ≤ SAD_PMV ≤ 10947       0.0034   −5.2183
SDS algorithm   6M bits/s    922.233 ≤ SAD_PMV ≤ 11589      0.003    −2.7667
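The piecewise-linear sizing rule of Fig. 7(c) can be written out as a short function. The default parameters below are the Table 5 values for the DS algorithm at 2M bits/s; the function name `vsdsw_size` and the decision to round the fitted value up to the next integer are assumptions of this sketch, since the text does not say how a fractional size is quantized.

```python
import math


def vsdsw_size(sad_pmv: float,
               a: float = 0.0037,
               b: float = -4.7836,
               sad_low: float = 1280.7,
               sad_high: float = 9929.4,
               max_size: int = 32) -> int:
    """Secondary-window size for one MB from the SAD of its PMV.

    Below sad_low no secondary window is used; above sad_high the maximum
    size is used; in between the fitted line a * SAD + b applies.
    Assumption: the fitted value is rounded up and clamped to [0, max_size].
    """
    if sad_pmv < sad_low:
        return 0
    if sad_pmv > sad_high:
        return max_size
    fitted = a * sad_pmv + b
    return max(0, min(max_size, math.ceil(fitted)))


for sad_value in (1000, 5000, 9929.4, 20000):
    print(sad_value, vsdsw_size(sad_value))
```

Passing another row of Table 5 as `a`, `b`, `sad_low` and `sad_high` gives the rule for the other bit-rates or for the SDS algorithm.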

Table 6 Specifications of AVS-1008.

Function            MPEG4 ASP Real Time CODEC
Video input         NTSC: 720×480 @ 30 fps; PAL: 720×576 @ 25 fps
Video resolution    QCIF, CIF, VGA, 16x, D1 (max)
CPU                 32-bit RISC ARM926EJ, embedded 16-KB I-cache and 16-KB D-cache
ME search area      primary windowing (max): H: [−32∼+32], V: [−32∼+32]
                    secondary windowing (max): H: [−16∼+16], V: [−16∼+16]
External memory     64-MB DDR / 16-MB NOR-type flash
Internal memory     868k bits SRAM
Technology          UMC 0.13 µm 1P8M
Supply voltage      1.2 V
Clock               external clock input: 27 MHz
                    internal AHB clock: 75.6, 81, 87.75, 90, 94.5, 100 MHz
                    internal APB clock: 37.8, 40.5, 43.9, 45, 47.25, 50 MHz
Power consumption   234 mW
Core size           5490 × 4950 µm²
Chip size           17 × 17 mm²
Transistor count    6.4 million
Package             256-pin BGA

SRAM blocks are mb1, mb2, sw_p, and sw_sec. The blocks mb1 and mb2 store two consecutive macro-blocks. In Fig. 9, mb1(r1) and mb2(r2) stand for the loading of macro-blocks r1 and r2 of the current frame. The SRAM blocks mb1 and mb2 behave as a ping-pong buffer; that is, while the SRAM mb1 is being processed by the ME, the DMA is loading data into the SRAM mb2, and vice versa. The SRAM sw_p stores the primary search window shown in Fig. 4. As mentioned in Sect. 2, the primary search window only needs to update one or more slices for each new motion vector search. Thus, the operations sw_p(SW1) and sw_p(SW2) express the incremental loading of the primary search windows of macroblocks r1 and r2. For type_p = 1 and 3, the incremental loading is the update of slice 6. For type_p = 2 and 4, the incremental loading is the update of slices 7 and 8. As shown in Fig. 9, there are two cases. When the secondary search window is not required, the pipeline of DMA and ME runs regularly. When the secondary search window is required, the loading of the secondary search window is added to the DMA operations and the pipeline becomes irregular.
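The regular case of this pipeline, in which no secondary window is required, can be sketched as a slot-by-slot schedule. This is only an illustration of the ping-pong idea described in the text: the function name, the slot granularity and the tuple encoding are assumptions, and the irregular case with secondary-window loading is deliberately left out because its exact timing depends on Fig. 9.

```python
from typing import List, Optional, Tuple

Job = Optional[Tuple[str, int]]  # (SRAM buffer name, macroblock index) or idle


def ping_pong_schedule(n_mbs: int) -> List[Tuple[Job, Job]]:
    """Return one (dma_job, me_job) pair per pipeline slot.

    In every slot the DMA fills one of the two macroblock buffers while the
    ME processes the macroblock that was loaded in the previous slot.
    """
    buffers = ("mb1", "mb2")
    slots: List[Tuple[Job, Job]] = []
    for slot in range(n_mbs + 1):
        # DMA fetches macroblock r_(slot+1) into the buffer the ME is not using.
        dma: Job = (buffers[slot % 2], slot + 1) if slot < n_mbs else None
        # ME works on the macroblock fetched during the previous slot.
        me: Job = (buffers[(slot - 1) % 2], slot) if slot >= 1 else None
        slots.append((dma, me))
    return slots


for dma_job, me_job in ping_pong_schedule(3):
    print("DMA:", dma_job, "| ME:", me_job)
```

In this model the two units never touch the same buffer in the same slot, which is the property that lets loading and searching overlap.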
The MV search in the secondary search window is activated right after the loading; meanwhile, the data loading for the following macroblock is performed. Note that the operation ME(SW_p1, r1) stands for the motion estimation of macroblock r1 in the primary search window and the operation ME(SW_s, r1) for the motion estimation of macroblock r1 in the secondary search window.

In this work, we use the six D1-sized (720 × 480) video sequences listed in Table 7 to validate the design and to analyze the memory bandwidth, the size of local SRAM and the visual quality of the proposed algorithms. The picture quality evaluation is based on the H.264/AVC software model. Figure 10 illustrates the first frames of the D1 clips. In the simulation, the full search range is ±64 and the MB size is 16-by-16. We use ±32 as the size of the primary window and three bit-rates (2 Mbits/sec, 4 Mbits/sec and 6 Mbits/sec) to measure the quality degradation.

Fig. 9 The pipelined schedule of DMA and ME operations.

Table 7 Video sequences for D1 at 30 fps. Motion classes: Fast Motion, Moderate Motion, Slow Motion.

Video sequence   Number of frames
Night            230
Football         260
Crew             300
Character        260
Akiyo            30
Coastguard       300

Fig. 10 D1 test clips: (a) Night, (b) Football, (c) Crew, (d) Character, (e) Akiyo, and (f) Coastguard.

We use DSW(type_p, type_s) to denote the dual-search-window methods, where type_p is the type of primary windowing (1: single MB, 2: two horizontal MBs, 3: two vertical MBs, and 4: four MBs) and type_s is the type of secondary windowing (1:

FEDSW, 2: FSDSW, 3: VSDSW). The following equations are used to evaluate the external memory bandwidth:

\[ BW_{DSW(type_p,1)} = \left[ N_{access,p} + (2P+1)^2 \times N_{s,discont} + (2P+1) \times m_y \times N_{s,cont} \right] \times fps \tag{4} \]

\[ BW_{DSW(type_p,2)} = \left[ N_{access,p} + (2S+1)^2 \times N_{secondary} \right] \times fps \tag{5} \]

\[ BW_{DSW(type_p,3)} = \left( N_{access,p} + \sum_{i=1}^{N_{secondary}} f(s_i) \right) \times fps \tag{6} \]

where fps is the frame rate, N_{s,discont} is the number of discontinuous secondary windowing times, N_{s,cont} is the number of continuous secondary windowing times, m_y is the number of horizontal pixels of the target MB set for each primary windowing, N_{secondary} is the number of secondary windowing times, and f(s_i) is the number of pixels loaded in the i-th secondary windowing.

First of all, we estimate the required external memory bandwidth for D1 clips with the DS and SDS algorithms. Tables 8 and 9 show that single-MB windowing gives the same result as two-horizontal-MB windowing, and that the other two windowing techniques also share the same bandwidth requirement. The DSW methods with type_p = 3 and type_p = 4 are better than the other two and can save up to 40.36% of the external memory bandwidth.

Table 8 Bandwidth requirements (MB/s) for D1 clips with DS algorithm under 2M bit-rate control.

DSW method   Night   Football   Crew   Character   Akiyo   Coastguard
DSW(1,1)     51.7    55.8       50.8   49.8        50.1    49.8
DSW(1,2)     50.7    52.4       49.9   49.8        49.8    49.8
DSW(1,3)     50.9    51.3       49.9   49.8        50      49.8
DSW(2,1)     51.7    55.8       50.8   49.8        50.1    49.8
DSW(2,2)     50.7    52.4       49.9   49.8        49.8    49.8
DSW(2,3)     50.9    51.3       49.9   49.8        50      49.8
DSW(3,1)     31.7    35.8       30.4   29.7        30      29.7
DSW(3,2)     30.7    32.4       29.9   29.7        29.8    29.7
DSW(3,3)     30.8    31.3       29.8   29.7        30      29.7
DSW(4,1)     31.7    35.8       30.4   29.7        30      29.7
DSW(4,2)     30.7    32.4       29.9   29.7        29.8    29.7
DSW(4,3)     30.8    31.3       29.8   29.7        30      29.7

Table 9 Bandwidth requirements (MB/s) for D1 clips with SDS algorithm under 2M bit-rate control.

DSW method   Night   Football   Crew   Character   Akiyo   Coastguard
DSW(1,1)     50.6    52.3       50     49.8        49.9    49.8
DSW(1,2)     50      50.3       49.8   49.8        49.8    49.8
DSW(1,3)     50.8    50.3       49.8   49.8        49.8    49.8
DSW(2,1)     50.6    52.3       50     49.8        49.9    49.8
DSW(2,2)     50      50.3       49.8   49.8        49.8    49.8
DSW(2,3)     50.8    50.3       49.8   49.8        49.8    49.8
DSW(3,1)     30.5    32.2       30     29.7        29.8    29.7
DSW(3,2)     30      30.2       29.7   29.7        29.7    29.7
DSW(3,3)     30.1    30.3       29.7   29.7        29.8    29.7
DSW(4,1)     30.5    32.2       30     29.7        29.8    29.7
DSW(4,2)     30      30.2       29.7   29.7        29.7    29.7
DSW(4,3)     30.1    30.3       29.7   29.7        29.8    29.7

Next, we calculate the size of local memory, as shown in Table 10. In Table 10, the required sizes of local memory for primary windowing are 60 Kbits ((5×6)×(16×16)×8/1024), 80 Kbits ((5×8)×(16×16)×8/1024), 72 Kbits ((6×6)×(16×16)×8/1024) and 96 Kbits ((6×8)×(16×16)×8/1024) for the single-MB, two-horizontal-MB, two-vertical-MB, and four-MB windowing techniques, respectively.
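Returning to the bandwidth model, Eq. (1) and Eqs. (4)–(6) can be evaluated numerically for a 720 × 480 frame at 30 fps. The sketch below is a hedged check rather than the authors' tool: the function names are hypothetical, one byte per pixel is assumed, and the secondary-windowing counts are taken to be per-frame counts. With no secondary windowing it gives about 49.8 MB/s for type_p = 1, 2 and about 29.7 MB/s for type_p = 3, 4, which agrees with the lowest entries of Tables 8 and 9.

```python
from typing import Iterable


def n_access_primary(type_p: int, width: int = 720, height: int = 480) -> int:
    """Eq. (1): pixel accesses per frame for the primary windowing."""
    n1 = 5 if type_p in (1, 2) else 3        # Eq. (2)
    m_x = 16 if type_p in (1, 2) else 32     # Eq. (3)
    boundary = sum(range(1, (n1 - 1) // 2 + 1))  # sum of i for i = 1..(N1-1)/2
    return width * height * n1 - 2 * m_x * width * boundary


def bandwidth_fedsw(n_access: int, p: int, m_y: int,
                    n_discont: int, n_cont: int, fps: int = 30) -> int:
    """Eq. (4): FEDSW bandwidth in bytes/s (assumes 1 byte per pixel)."""
    side = 2 * p + 1
    return (n_access + side * side * n_discont + side * m_y * n_cont) * fps


def bandwidth_fsdsw(n_access: int, s: int, n_secondary: int, fps: int = 30) -> int:
    """Eq. (5): FSDSW bandwidth in bytes/s (assumes 1 byte per pixel)."""
    return (n_access + (2 * s + 1) ** 2 * n_secondary) * fps


def bandwidth_vsdsw(n_access: int, loads: Iterable[int], fps: int = 30) -> int:
    """Eq. (6): VSDSW bandwidth; `loads` holds f(s_i) for each secondary load."""
    return (n_access + sum(loads)) * fps


base_12 = bandwidth_fsdsw(n_access_primary(1), s=8, n_secondary=0)
base_34 = bandwidth_fsdsw(n_access_primary(3), s=8, n_secondary=0)
print(base_12 / 1e6, "MB/s for type_p = 1, 2")
print(base_34 / 1e6, "MB/s for type_p = 3, 4")
print(round(100 * (1 - base_34 / base_12), 2), "% saved before rounding")
```

The saving computed from the unrounded values is slightly lower than the 40.36% quoted in the text, which appears to be derived from the rounded table entries (49.8 and 29.7 MB/s).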
For the secondary windowing, with the four-MB primary windowing, FEDSW requires 96 Kbits ((6×8)×(16×16)×8/1024) of local memory and VSDSW requires 50 Kbits ((5×5)×(16×16)×8/1024). Compared with the others, FSDSW requires the least local memory: 8 Kbits ((8 + 8 + 16)^2 × 8/1024) with the DS algorithm and 4.5 Kbits ((4 + 4 + 16)^2 × 8/1024) with the SDS algorithm when the bit-rate control is set to 2 Mbits/sec. Table 10 sums up the total SRAM size from the memory requirements of the primary and secondary windows and also gives the memory increasing ratio normalized to the single-MB windowing technique. From the analysis of memory bandwidth and local memory requirements, DSWs with type_p = 1 and type_p = 2 have the same bandwidth requirement, while with the DS algorithm the latter requires 33.3%, 29.41% and 18.18% more local memory than the former for FEDSW, FSDSW, and VSDSW, respectively. With the SDS algorithm, the latter requires 33.3%, 31.01% and 18.18% more local memory than the former. Also, as shown

Table 10  Results of local SRAM sizes.

| Motion estimation algorithm | DSW method | SRAM size of primary window | SRAM size of secondary window | Total SRAM size | Increasing ratio |
|---|---|---|---|---|---|
| DS | DSW(1,1) | 60 Kbits | 60 Kbits | 120 Kbits | - |
| DS | DSW(1,2) | 60 Kbits | 8 Kbits | 68 Kbits | - |
| DS | DSW(1,3) | 60 Kbits | 50 Kbits | 110 Kbits | - |
| DS | DSW(2,1) | 80 Kbits | 80 Kbits | 160 Kbits | 33.3% |
| DS | DSW(2,2) | 80 Kbits | 8 Kbits | 88 Kbits | 29.41% |
| DS | DSW(2,3) | 80 Kbits | 50 Kbits | 130 Kbits | 18.18% |
| DS | DSW(3,1) | 72 Kbits | 72 Kbits | 144 Kbits | 20% |
| DS | DSW(3,2) | 72 Kbits | 8 Kbits | 80 Kbits | 17.65% |
| DS | DSW(3,3) | 72 Kbits | 50 Kbits | 122 Kbits | 10.91% |
| DS | DSW(4,1) | 96 Kbits | 96 Kbits | 192 Kbits | 60% |
| DS | DSW(4,2) | 96 Kbits | 8 Kbits | 104 Kbits | 52.94% |
| DS | DSW(4,3) | 96 Kbits | 50 Kbits | 146 Kbits | 32.73% |
| SDS | DSW(1,1) | 60 Kbits | 60 Kbits | 120 Kbits | - |
| SDS | DSW(1,2) | 60 Kbits | 4.5 Kbits | 64.5 Kbits | - |
| SDS | DSW(1,3) | 60 Kbits | 50 Kbits | 110 Kbits | - |
| SDS | DSW(2,1) | 80 Kbits | 80 Kbits | 160 Kbits | 33.3% |
| SDS | DSW(2,2) | 80 Kbits | 4.5 Kbits | 84.5 Kbits | 31.01% |
| SDS | DSW(2,3) | 80 Kbits | 50 Kbits | 130 Kbits | 18.18% |
| SDS | DSW(3,1) | 72 Kbits | 72 Kbits | 144 Kbits | 20% |
| SDS | DSW(3,2) | 72 Kbits | 4.5 Kbits | 76.5 Kbits | 18.6% |
| SDS | DSW(3,3) | 72 Kbits | 50 Kbits | 122 Kbits | 10.91% |
| SDS | DSW(4,1) | 96 Kbits | 96 Kbits | 192 Kbits | 60% |
| SDS | DSW(4,2) | 96 Kbits | 4.5 Kbits | 100.5 Kbits | 55.81% |
| SDS | DSW(4,3) | 96 Kbits | 50 Kbits | 146 Kbits | 32.73% |

As the results also show, DSWs with $type_p = 3$ and $type_p = 4$ can save 40.36% of the external memory bandwidth compared with DSW with $type_p = 1$. Although DSWs with $type_p = 3$ and $type_p = 4$ have the same memory bandwidth requirement, the latter needs more local memory than the former. The memory increasing ratios are 33.3% (((192 − 144)/144) × 100%), 30% (((104 − 80)/80) × 100%), and 19.67% (((146 − 122)/122) × 100%) with the DS algorithm and 33.3% (((192 − 144)/144) × 100%), 31.37% (((100.5 − 76.5)/76.5) × 100%), and 19.67% (((146 − 122)/122) × 100%) with the SDS algorithm for FEDSW, FSDSW, and VSDSW, respectively.

Table 11  Comparisons of visual quality for the DS algorithm in PSNR_Y.

| Bit-rate control | Video sequence | FSBM PSNR_Y (dB) | DS ΔPSNR_Y (dB) | FEDSW ΔPSNR_Y (dB) | FSDSW ΔPSNR_Y (dB) | VSDSW ΔPSNR_Y (dB) |
|---|---|---|---|---|---|---|
| 2M | Night | 33.05 | −0.05 | −0.07 | −0.07 | −0.07 |
| 2M | Football | 36.27 | −0.24 | −0.25 | −0.26 | −0.25 |
| 2M | Crew | 35.87 | 0.1 | 0.07 | 0.09 | 0.1 |
| 2M | Character | 31.41 | 0.01 | 0.01 | 0.01 | 0.01 |
| 2M | Akiyo | 41.31 | 0.01 | 0.01 | 0.01 | 0.01 |
| 2M | Coastguard | 34.91 | −0.04 | −0.04 | −0.04 | −0.04 |
| 4M | Night | 36.34 | −0.03 | −0.03 | −0.03 | −0.01 |
| 4M | Football | 39.69 | −0.1 | −0.1 | −0.1 | −0.1 |
| 4M | Crew | 38.48 | −0.05 | −0.05 | −0.06 | −0.05 |
| 4M | Character | 35.13 | 0.03 | 0.03 | 0.03 | 0.03 |
| 4M | Akiyo | 41.31 | 0.01 | 0.01 | 0.01 | 0.01 |
| 4M | Coastguard | 37.95 | −0.01 | −0.01 | −0.02 | −0.01 |
| 6M | Night | 38.18 | 0 | 0 | −0.02 | 0 |
| 6M | Football | 41.57 | −0.04 | −0.04 | −0.05 | −0.04 |
| 6M | Crew | 39.88 | 0.04 | 0.05 | 0.05 | 0.05 |
| 6M | Character | 37.33 | −0.02 | −0.02 | −0.02 | −0.02 |
| 6M | Akiyo | 41.31 | 0 | 0 | 0 | 0 |
| 6M | Coastguard | 39.88 | 0 | 0 | 0 | 0 |

Table 12  Comparisons of visual quality for the SDS algorithm in PSNR_Y.

| Bit-rate control | Video sequence | FSBM PSNR_Y (dB) | SDS ΔPSNR_Y (dB) | FEDSW ΔPSNR_Y (dB) | FSDSW ΔPSNR_Y (dB) | VSDSW ΔPSNR_Y (dB) |
|---|---|---|---|---|---|---|
| 2M | Night | 33.05 | −0.21 | −0.21 | −0.23 | −0.21 |
| 2M | Football | 36.27 | −0.84 | −0.84 | −0.86 | −0.84 |
| 2M | Crew | 35.87 | −0.06 | −0.06 | −0.08 | −0.06 |
| 2M | Character | 31.41 | −0.41 | −0.41 | −0.41 | −0.41 |
| 2M | Akiyo | 41.31 | 0 | 0 | 0 | 0 |
| 2M | Coastguard | 34.91 | −0.07 | −0.07 | −0.07 | −0.07 |
| 4M | Night | 36.34 | −0.06 | −0.06 | −0.08 | −0.06 |
| 4M | Football | 39.69 | −0.26 | −0.26 | −0.34 | −0.27 |
| 4M | Crew | 38.48 | −0.02 | −0.03 | −0.06 | −0.04 |
| 4M | Character | 35.13 | −0.1 | −0.12 | −0.13 | −0.12 |
| 4M | Akiyo | 41.31 | 0 | 0 | 0 | 0 |
| 4M | Coastguard | 37.95 | −0.02 | −0.02 | −0.02 | −0.02 |
| 6M | Night | 38.18 | −0.02 | −0.03 | −0.04 | −0.04 |
| 6M | Football | 41.57 | −0.07 | −0.07 | −0.14 | −0.07 |
| 6M | Crew | 39.88 | 0.02 | 0.02 | 0.02 | 0.02 |
| 6M | Character | 37.33 | 0.03 | 0.03 | 0.03 | 0.03 |
| 6M | Akiyo | 41.31 | 0 | 0 | 0 | 0 |
| 6M | Coastguard | 39.88 | 0 | 0 | 0 | 0 |

Fig. 11  The dynamic quality performance of D1 clip “Football” with the DS algorithm and a bit rate of 2 Mbits/sec.

Fig. 12  The dynamic quality performance of D1 clip “Football” with the SDS algorithm and a bit rate of 2 Mbits/sec.

Table 11 shows the visual quality for the DS algorithm. For high-motion video clips, such as “Night” and “Football,” the quality degradation (PSNR_Y) is less than 0.26 dB. This means that the quality degradation of the proposed algorithms stays below 0.5 dB, which is considered a minor degradation in the video compression community. Note that the degradation is even smaller (about 0.02 dB) when compared with full-range windowing. We also applied the proposed windowing to the SDS algorithm. From Table 12, the quality degradation is worse than with the DS algorithm; however, when compared with full-range windowing, the degradation is less than 0.02 dB.
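The ΔPSNR_Y figures in Tables 11 and 12 are differences in luma PSNR relative to FSBM. The paper does not spell out the formula, so the sketch below assumes the conventional definition for 8-bit video, PSNR = 10·log10(255²/MSE), computed over the luma plane; the function names and the nested-list frame representation are illustrative only.

```python
import math
from typing import Sequence

Frame = Sequence[Sequence[int]]  # luma plane as rows of 8-bit samples


def mse(reference: Frame, decoded: Frame) -> float:
    """Mean squared error between two luma planes of equal size."""
    total = 0
    count = 0
    for ref_row, dec_row in zip(reference, decoded):
        for r, d in zip(ref_row, dec_row):
            total += (r - d) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty frame")
    return total / count


def psnr_y(reference: Frame, decoded: Frame, peak: int = 255) -> float:
    """Luma PSNR in dB; identical frames give infinity."""
    err = mse(reference, decoded)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / err)


def delta_psnr_y(method_psnr: float, fsbm_psnr: float) -> float:
    """Quality change of a method relative to FSBM (negative = degradation)."""
    return method_psnr - fsbm_psnr


if __name__ == "__main__":
    # A sequence coded at 36.02 dB against an FSBM result of 36.27 dB.
    print(f"{delta_psnr_y(36.02, 36.27):+.2f} dB")
```

Applying `psnr_y` frame by frame and subtracting the FSBM value for the same frame would give a per-frame degradation trace of the kind discussed for the “Football” clip.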
To observe the dynamic degradation, we estimate the quality degradation of each frame for all D1 video clips at 2M bit-rate control. After exhaustive simulation, we conclude that VSDSW gives the best visual quality among the proposed algorithms. Because of space limitations, this paper does not present all results; instead, we use the “Football” sequence as an example. Figures 11 and 12 illustrate the variation of quality degradation. As shown in the results, FSDSW has worse transient degradation than FEDSW and VSDSW, while VSDSW is better than FEDSW.

To show that the proposed algorithms can actually save memory for high-resolution clips, we also use high-definition (HD) clips [38] to demonstrate the DSW techniques. The test cases are listed in Table 13, and their video format is 720p (1280 × 720). Figure 13 illustrates the first frames of the four HDTV clips. We consider a full search range of ±128 and an MB size of 16×16. From Figs. 14 and 15, we can also use ±32 as the size of the primary window. One may note that the motion vectors of the tested HD clips are shorter on average than those of the D1 clips. This is because fast-motion HD video clips are hard to find, and the clips under test are all slow-motion ones. However, we can still use these HD clips to illustrate the capability of the proposed architecture for HD video. Using VSDSW with the parameters of Table 14, Tables 15 and 16 show that, compared with FSBM, the quality degradation is as low as 0.16 dB. Its quality is very close to that of the full-range windowing technique, with less than 0.07 dB of degradation.

Table 13  Video sequences for HDTV at 50 fps (the table classes the clips as moderate motion or slow motion).

| Video sequence | Number of frames |
|---|---|
| Mobcal | 504 |
| Shields | 504 |
| Stockholm | 604 |
| Parkrun | 504 |

Fig. 13  The first frames of the HDTV video sequences: (a) Mobcal, (b) Parkrun, (c) Shields, and (d) Stockholm.

Fig. 14  The analysis of accumulated probability versus size of search window for the DS algorithm under rate control at 8 Mbits/sec.

Fig. 15  The analysis of accumulated probability versus size of search window for the SDS algorithm under rate control at 8 Mbits/sec.

Table 14  Parameters of the optimal curve for VSDSW sizing with HD video sequences. Size of secondary search window = a × SAD_PMV + b.

| Motion estimation algorithm | Bit rate | Range of SAD_PMV | a | b |
|---|---|---|---|---|
| DS | 8 Mbits/s | 0 ≤ SAD_PMV ≤ 18060 | 0.00016 | 3.1034 |
| SDS | 8 Mbits/s | 0 ≤ SAD_PMV ≤ 23596 | 0.0013 | 1.3246 |

Table 15  Comparisons of visual quality for HDTV clips with the DS algorithm in PSNR_Y (8M bit-rate control).

| Video sequence | FSBM PSNR_Y | DS ΔPSNR_Y | VSDSW ΔPSNR_Y |
|---|---|---|---|
| Mobcal | 32.97 dB | −0.06 dB | −0.06 dB |
| Shields | 32.83 dB | 0.16 dB | 0.16 dB |
| Stockholm | 33.85 dB | 0 dB | 0 dB |
| Parkrun | 26.06 dB | 0 dB | −0.05 dB |

Table 16  Comparisons of visual quality for HDTV clips with the SDS algorithm in PSNR_Y (8M bit-rate control).

| Video sequence | FSBM PSNR_Y | SDS ΔPSNR_Y | VSDSW ΔPSNR_Y |
|---|---|---|---|
| Mobcal | 32.97 dB | −0.04 dB | −0.05 dB |
| Shields | 32.83 dB | 0.16 dB | 0.16 dB |
| Stockholm | 33.85 dB | 0 dB | 0 dB |
| Parkrun | 26.06 dB | −0.02 dB | −0.09 dB |

Table 17  Results of local SRAM sizes for HDTV clips.

| Motion estimation algorithm | DSW method | SRAM size of primary search window | SRAM size of secondary search window | Total SRAM size |
|---|---|---|---|---|
| DS | DSW(4,3) | 96 Kbits | 50 Kbits | 146 Kbits |
| SDS | DSW(4,3) | 96 Kbits | 50 Kbits | 146 Kbits |
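The linear sizing rule of Table 14 (size of secondary search window = a × SAD_PMV + b) can be sketched in Python as follows. The coefficients and SAD_PMV ranges are the ones listed for 8 Mbits/s; everything else is an assumption made for illustration, including the class and function names, the clamping of SAD_PMV to the listed range, and the rounding up to an integer size. The paper does not state the unit or rounding of the resulting size.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SizingCurve:
    """Coefficients of the linear VSDSW sizing rule for one ME algorithm."""
    a: float
    b: float
    sad_max: int  # upper end of the SAD_PMV range the fit covers


# Values as listed in Table 14 for the 8 Mbits/s bit rate.
DS_8M = SizingCurve(a=0.00016, b=3.1034, sad_max=18060)
SDS_8M = SizingCurve(a=0.0013, b=1.3246, sad_max=23596)


def secondary_window_size(sad_pmv: float, curve: SizingCurve) -> float:
    """Evaluate a * SAD_PMV + b, clamping SAD_PMV to [0, sad_max].
    The clamping is an assumption of this sketch."""
    clamped = min(max(sad_pmv, 0), curve.sad_max)
    return curve.a * clamped + curve.b


def secondary_window_size_int(sad_pmv: float, curve: SizingCurve) -> int:
    """Round the size up to a whole number (assumed rounding rule)."""
    return math.ceil(secondary_window_size(sad_pmv, curve))


if __name__ == "__main__":
    for sad in (0, 5000, 23596):
        print(sad, secondary_window_size_int(sad, SDS_8M))
```

With the SDS coefficients, the size at the top of the fitted range comes out just under 32, which lines up with the ±32 primary window mentioned in the text.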

Table 18  Bandwidth requirements (in MB/s) for HD clips with the DS and SDS algorithms under 8M bit-rate control.

| Algorithm | Method | Mobcal | Shields | Stockholm | Parkrun |
|---|---|---|---|---|---|
| DS at 8M bit-rate | DSW(4,3) | 136.18 | 135.30 | 135.21 | 135.23 |
| SDS at 8M bit-rate | DSW(4,3) | 135.59 | 135.23 | 135.20 | 135.21 |

Fig. 16  The rate-distortion curve of the “Football” D1 clip.

Fig. 17  The rate-distortion curve of the “Mobcal” HD clip.

In addition, the memory requirements are far lower than those of full-range windowing. Table 17 shows that the total local memory size using DSW(4,3) is 146 Kbits, and Table 18 shows the bandwidth results for the tested HD clips. The results show that the proposed approach can save 76.14% of local memory and 81.41% of external memory bandwidth, whereas the traditional approach may require 612 Kbits of local memory and 727.51 MBytes/sec.

5. Conclusion

As the demand for high-resolution video applications increases, memory requirements have become the most important factors for CODEC performance and quality in solving the notorious power-consumption problem. Given the limited local memory size, this paper mainly focuses on reducing the external memory bandwidth while keeping the compression quality degradation small. The reduction of memory bandwidth implies savings in power consumption. We proposed three windowing algorithms for center-biased motion estimation that take advantage of the minimal data accesses required by center-biased motion estimation. At the same time, we also take data reusability into account. Under the rate-control mechanism, the proposed windowing can significantly reduce the external memory bandwidth. As shown in Figs. 16 and 17, the quality degradation is very small for both D1 and HDTV video clips. For 720p HDTV sequences, the proposed windowing algorithms require an external memory bandwidth as low as 135.20 MBytes/sec, while the quality degradation is less than 0.2 dB.

Acknowledgments

This work was supported by the National Science Council, R.O.C., under grant number NSC 95-2221-E-009337-MY3. The authors would like to acknowledge the National Chip Implementation Center (CIC) for technical support.

References

[1] Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s - Part 2: Video, ISO/IEC 11172-2, 1993.
[2] Information Technology - Generic Coding of Moving Pictures and Associated Audio Information: Video, ISO/IEC 13818-2 and ITU-T Recommendation H.262, 1996.
[3] Information Technology - Coding of Audio-Visual Objects - Part 2: Visual, ISO/IEC 14496-2, 1999.
[4] Joint Video Team, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Recommendation H.264 and ISO/IEC 14496-10 AVC, May 2003.
[5] P. Kuhn, Algorithm, Complexity Analysis and VLSI Architecture for MPEG-4 Motion Estimation, Kluwer Academic Publishers, 1999.
[6] J.-C. Tuan, T.-S. Chang, and C.-W. Jen, “On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture,” IEEE Trans. Circuits Syst. Video Technol., vol.12, no.1, pp.61–72, Jan. 2002.
[7] J.-Y. Kim and S.-B. Yang, “An efficient search algorithm for block motion estimation,” IEEE Workshop on Signal Processing Systems, pp.100–109, Oct. 1999.
[8] J.R. Jain and A.K. Jain, “Displacement measurement and its application in interframe image coding,” IEEE Trans. Commun., vol.29, no.12, pp.1799–1808, Dec. 1981.
[9] M.J. Chen, L.G. Chen, and T.D. Chiueh, “One-dimensional full search motion estimation algorithm for video coding,” IEEE Trans. Circuits Syst. Video Technol., vol.4, no.5, pp.504–509, Oct. 1994.
[10] R. Li, B. Zeng, and M.L. Liou, “A new three-step search algorithm for block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.4, no.4, pp.438–442, Aug. 1994.
[11] S. Zhu and K. Ma, “A new diamond search algorithm for fast block matching motion estimation,” ICICS’97, pp.9–12, Singapore, Sept. 1997.
[12] J.Y. Tham, S. Ranganath, M. Ranganath, and A.A. Kassim, “A novel unrestricted center-biased diamond search algorithm for block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.8, no.4, pp.369–377, Aug. 1998.
[13] S. Zhu and K.-K. Ma, “A new diamond search algorithm for fast block-matching motion estimation,” IEEE Trans. Image Process., vol.9, no.2, pp.287–290, Feb. 2000.
[14] C. Zhu, X. Lin, and L.P. Chau, “Hexagon-based search pattern for fast block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.12, no.5, pp.349–355, May 2002.
[15] L.-M. Po and W.-C. Ma, “A novel four-step search algorithm for fast block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.6, no.3, pp.313–317, June 1996.
[16] S. Zhu and K.-K. Ma, “A new diamond search algorithm for fast block-matching motion estimation,” IEEE Trans. Image Process., vol.9, no.2, pp.287–290, Feb. 2000.
[17] X.-Q. Banh and Y.-P. Tan, “Adaptive dual-cross search algorithm for block-matching motion estimation,” IEEE Trans. Consum. Electron., vol.50, no.2, pp.766–775, May 2004.
[18] Y. Nie and K.-K. Ma, “Adaptive rood pattern search for fast block-matching motion estimation,” IEEE Trans. Image Process., vol.11, no.12, pp.1442–1449, Dec. 2002.
[19] C.-H. Cheung and L.-M. Po, “Novel cross-diamond-hexagonal search algorithms for fast block motion estimation,” IEEE Trans. Multimed., vol.7, no.1, pp.16–22, Feb. 2005.
[20] X. Jing and L.-P. Chau, “An efficient three-step search algorithm for block motion estimation,” IEEE Trans. Multimed., vol.6, no.3, pp.435–438, June 2004.
[21] L.-K. Liu and E. Feig, “A block-based gradient descent search algorithm for block motion estimation in video coding,” IEEE Trans. Circuits Syst. Video Technol., vol.6, no.4, pp.419–422, Aug. 1996.
[22] W. Li and E. Salari, “Successive elimination algorithm for motion estimation,” IEEE Trans. Image Process., vol.4, no.1, pp.105–107, Jan. 1995.
[23] V.L. Do and K.Y. Yun, “A low-power VLSI architecture for full-search block-matching motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.8, no.4, pp.393–398, Aug. 1998.
[24] J.H. Luo, C.N. Wang, and T. Chiang, “A novel all-binary motion estimation (ABME) with optimized hardware architectures,” IEEE Trans. Circuits Syst. Video Technol., vol.12, no.8, pp.700–712, Aug. 2002.
[25] B. Liu and A. Zaccarin, “New fast algorithms for the estimation of block motion vectors,” IEEE Trans. Circuits Syst. Video Technol., vol.3, no.2, pp.148–157, April 1993.
[26] C.K. Cheung and L.M. Po, “A hierarchical block motion estimation algorithm using partial distortion measure,” IEEE ICIP, vol.3, pp.606–609, Oct. 1997.
[27] C.K. Cheung and L.M. Po, “Normalized partial distortion search algorithm for block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.10, no.3, pp.417–422, April 2000.
[28] C.N. Wang, S.W. Yang, C.M. Liu, and T. Chiang, “A hierarchical decimation lattice based on N-queen with an application for motion estimation,” IEEE Signal Process. Lett., vol.10, no.8, pp.228–231, Aug. 2003.
[29] C.N. Wang, S.W. Yang, C.M. Liu, and T. Chiang, “A hierarchical N-queen decimation lattice and hardware architecture for motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.14, no.4, pp.429–440, April 2004.
[30] Y.-L. Chan and W.-C. Siu, “New adaptive pixel decimation for block motion vector estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.6, no.1, pp.113–118, Feb. 1996.
[31] Y.K. Wang, Y.Q. Wang, and H. Kuroda, “A globally adaptive pixel-decimation algorithm for block-motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.10, no.6, pp.1006–1011, Sept. 2000.
[32] S. Lee and S.I. Chae, “Motion estimation algorithm using low-resolution quantization,” Electron. Lett., vol.32, no.7, pp.647–648, March 1996.
[33] H.W. Cheng and L.R. Dung, “EFBLA: A two-phase matching algorithm for fast motion estimation,” Advances in Multimedia Information Processing - PCM, vol.2532, pp.112–119, Dec. 2002.
[34] C.L. Su and C.W. Jen, “Motion estimation using MSD-first processing,” IEE Proc-G Circuits, Devices and Systems, vol.150, no.2, pp.124–133, April 2003.
[35] J.-F. Shen, T.-C. Wang, and L.-G. Chen, “A novel low-power full-search block-matching motion-estimation design for H.263+,” IEEE Trans. Circuits Syst. Video Technol., vol.11, no.7, pp.890–897, July 2001.
[36] M. Brunig and W. Niehsen, “Fast full-search block matching,” IEEE Trans. Circuits Syst. Video Technol., vol.11, no.2, pp.241–247, Feb. 2001.
[37] L. Sousa and N. Roma, “Low-power array architectures for motion estimation,” 1999 IEEE 3rd Workshop on Multimedia Signal Processing, pp.679–684, 1999.
[38] http://www.ldv.ei.tum.de/liquid.php?page=70

Lan-Rong Dung was born in 1966. He received a BSEE and the Best Student Award from Feng Chia University, Taiwan, in 1988, an MS in electronics engineering from National Chiao Tung University, Taiwan, in 1990, and a Ph.D. in electrical and computer engineering from Georgia Institute of Technology in 1997. From 1997 to 1999 he was with Rockwell Science Center, Thousand Oaks, CA, as a Member of the Technical Staff. He joined the faculty of National Chiao Tung University, Taiwan, in 1999, where he is currently an associate professor in the Department of Electrical and Control Engineering. He received the VHDL International Outstanding Dissertation Award in Washington, DC, in October 1997. His current research interests include VLSI design, digital signal processing, hardware-software codesign, and System-on-Chip architecture. He is a member of the Computer and Signal Processing societies of the IEEE.

Meng-Chun Lin received a B.S. degree in electronic engineering from Fu Jen Catholic University, Taipei, Taiwan, in 2001, and a Ph.D. degree in electrical and control engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2007. He is currently working with Avisonic Technology Corp., Taiwan. His research interests are image processing, video processing, VLSI architecture, and memory circuit design.
