A content-based methodology for power-aware motion estimation architecture


A Content-Based Methodology for Power-Aware Motion Estimation Architecture

Hsien-Wen Cheng and Lan-Rong Dung, Member, IEEE

Abstract—This paper presents a novel power-aware motion estimation algorithm, called the adaptive content-based subsample algorithm (ACSA), for battery-powered multimedia devices. As the battery status changes, the architecture adaptively trades off power consumption against compression quality. As the available energy decreases, the algorithm raises the subsample rate to maximize battery lifetime. Unlike existing subsample algorithms, the content-based algorithm first extracts edge pixels from a macro-block and then subsamples the remaining low-frequency part. In this way, we can alleviate the aliasing problem and thus keep the quality degradation low as the subsample rate increases. As shown in the experimental results, the architecture can dynamically operate at different power consumption modes with little quality degradation, according to the remaining capacity of the battery pack, while the power overhead of edge extraction is under 0.8%.

Index Terms—Content-based image processing, motion estimation (ME), power-aware system, subsample algorithm, very large-scale integration (VLSI) architecture, video compression, VLSI image processing.

I. INTRODUCTION

MOTION ESTIMATION (ME) has been widely recognized as the most critical part of many video compression applications, such as the MPEG standards and H.26x [1], and it dominates their computational and, hence, power requirements. With the increasing demand for portable multimedia devices, a power-aware ME that is flexible in both power consumption and compression quality is highly desirable [2]. Fig. 1 illustrates a typical block diagram of a battery-powered portable multimedia system. In practice, a battery has two important nonideal properties: the rate capacity effect and the recovery effect [3]. Fig. 2, as an example, illustrates that the system can extend the battery lifetime by gradually stepping down the power dissipation. Normally, system designers can choose points A, B, C, and D according to the discharging profile provided by the battery manufacturer. When the battery monitor unit detects a drop in battery voltage, the host processor changes the power mode accordingly. By changing the operation mode of all power-aware components simultaneously, the architecture adapts the power dissipation to the battery status and, hence, improves battery performance.

Manuscript received February 24, 2004; revised September 13, 2004 and February 22, 2005. This work was supported in part by Taiwan MOE Program for Promoting Academic Excellent of Universities under Grant 91-E-FA06-4-4 and the National Science Council, Taiwan, R.O.C., under Grant NSC 93-2220-E-009-023. This paper was recommended by Associate Editor S.-F. Chang.

The authors are with the Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu 300, Taiwan, R.O.C. (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TCSII.2005.850771

Fig. 1. System block diagram of a portable, battery-powered multimedia device.

Fig. 2. Diagram representing the nonlinear discharging properties of battery.

This paper, therefore, presents a power-aware ME architecture driven by an adaptive content-based subsample algorithm to lengthen the battery lifetime.

Many published papers have presented efficient algorithms for very large-scale integration (VLSI) implementation of motion estimation, targeting high-performance or low-power design. Yet most of them cannot dynamically adapt the compression quality to different power consumption modes. Among these algorithms, the full-search block-matching (FSBM) algorithm with the sum of absolute difference (SAD) criterion is the most popular approach for ME because of its good quality. It is particularly attractive to those who require extremely high quality. However, the huge number of comparison/difference operations results in a high computation load and power consumption [4], [5]. To reduce the computational complexity of FSBM, researchers have proposed various fast algorithms. They either reduce the search steps [6]–[8] or simplify the calculation of the error criterion [9], [10]. These fast-search algorithms have successfully improved the block-matching speed and, thus, led to low-power implementations. However, a low-power implementation is not necessarily a power-aware system, in which the implementation should adaptively modify its behavior to balance quality against battery life [11]. To be suitable for power-aware design, an ME algorithm requires a high degree of scalability in performance tradeoffs. Unfortunately, the fast algorithms mentioned above do not meet this requirement.
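To give a feeling for the computation load of FSBM mentioned above, the following minimal sketch counts the absolute-difference operations of an exhaustive search. It assumes a search range of -p <= u, v <= p - 1 (an assumption, not stated in the surviving text) and uses the 16-by-16 macro-block, the search parameter of 32, and the CIF frame size (352 x 288) as illustrative inputs; the function names are hypothetical.

```python
def fsbm_abs_diff_count(n: int, p: int) -> int:
    """Absolute-difference operations needed to match one n-by-n macro-block."""
    candidates = (2 * p) ** 2   # ASSUMED search range: -p <= u, v <= p - 1
    return candidates * n * n   # one |difference| per pixel per candidate


def fsbm_ops_per_frame(width: int, height: int, n: int, p: int) -> int:
    """Operation count for every macro-block of one luminance frame."""
    blocks = (width // n) * (height // n)
    return blocks * fsbm_abs_diff_count(n, p)


per_block = fsbm_abs_diff_count(16, 32)
per_frame = fsbm_ops_per_frame(352, 288, 16, 32)   # CIF luminance frame
print(per_block, per_frame)
```

Under these assumptions a single macro-block already needs about one million absolute differences, which is why the number of active processing elements has such a direct effect on power.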


The subsample algorithms presented in [12] and [13] are very suitable for a power-aware ME architecture because they are highly scalable. As the subsample rate increases, the processing elements that correspond to unsampled pixels in a semisystolic array are disabled to save switching activity; that is, the power consumption of the architecture decreases correspondingly. However, applying subsample algorithms to a power-aware architecture may suffer from the aliasing problem in the high-frequency band. The aliasing problem degrades the compression quality rapidly as the subsample rate increases. To alleviate the problem, we extend traditional subsample algorithms to an adaptive content-based subsample algorithm (ACSA). In the ACSA, we first use edge extraction techniques to separate the high-frequency band from a macro-block and then subsample the low-frequency band only. By merging the edge pixels and the subsampled nonedge pixels, the algorithm generates a turn-off mask for the architecture to disable the switching activities of processing elements (PEs) in a semisystolic array. This content-based algorithm keeps the quality degradation low as the subsample rate increases. Because the number of high-frequency pixels varies across video clips, we introduce an adaptive control mechanism that sets the threshold value for edge determination and makes the number of masked pixels stationary for a given power mode. The ACSA can be implemented in most existing ME architectures by turning off PEs according to the subsample mask. In this paper, we present a semisystolic architecture with gated PEs. The simulation results show that the ACSA can dynamically alter the subsample rate as the power consumption mode changes, and that the architecture can work at different power consumption modes with acceptable and smooth quality degradation while the power overhead of edge extraction is under 0.8%.

II. SUBSAMPLE ALGORITHMS

A. Generic Subsample Algorithm

Here, we present a generic subsample algorithm (GSA) in which the subsample rate ranges from 4:1 to 1:1. The GSA uses

$$\mathrm{SSAD}_{SM_{8:m}}(u,v)=\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} SM_{8:m}(i,j)\,\bigl|S(i+u,j+v)-R(i,j)\bigr| \qquad (1)$$

as a matching criterion, called the subsample sum of absolute difference (SSAD), where $SM_{8:m}$ is the subsample mask for the subsample rate 8-to-$m$ as shown in

(2)

where the macro-block size is $N$-by-$N$, $R(i,j)$ is the luminance value at $(i,j)$ of the current macro-block, and $S(i+u,j+v)$ is the luminance value at $(i,j)$ of the reference macro-block which offsets $(u,v)$ from the current macro-block in the searching area. The subsample mask $SM_{8:m}$ is generated from the basic mask

Fig. 3. Procedure of the ACSA.

where

for and for (3)

Due to its flexibility in energy-quality tradeoffs, the GSA is suitable for the implementation of power-aware architectures. The power consumption of the architecture is proportional to the inverse of the subsample rate. However, the algorithm suffers from the aliasing problem, which corrupts the high-frequency band and results in a considerable degradation of ME quality.
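The masked matching criterion described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the regular one-in-four pattern produced by `regular_mask` is an assumed stand-in for the paper's basic mask, the search range -p <= u, v <= p - 1 is assumed, and all function names are hypothetical.

```python
def regular_mask(n, step=2):
    """Illustrative 0/1 mask keeping one pixel per step x step tile (4:1 for step=2)."""
    return [[1 if i % step == 0 and j % step == 0 else 0 for j in range(n)]
            for i in range(n)]


def ssad(cur, ref, top, left, mask):
    """Sum of |ref - cur| over the pixels whose mask bit is 1."""
    n = len(cur)
    return sum(abs(ref[top + i][left + j] - cur[i][j])
               for i in range(n) for j in range(n) if mask[i][j])


def full_search(cur, ref, top, left, p, mask):
    """Best (cost, (u, v)) for row/column offsets -p <= u, v <= p - 1 (assumed range)."""
    n, height, width = len(cur), len(ref), len(ref[0])
    best = None
    for u in range(-p, p):
        for v in range(-p, p):
            y, x = top + u, left + v
            if not (0 <= y <= height - n and 0 <= x <= width - n):
                continue  # candidate block would leave the reference frame
            cost = ssad(cur, ref, y, x, mask)
            if best is None or cost < best[0]:
                best = (cost, (u, v))
    return best


# Toy data: a 12x12 reference frame with distinct pixel values and a 4x4
# current block copied from row 5, column 3 of that frame.
ref = [[16 * y + x for x in range(12)] for y in range(12)]
cur = [row[3:7] for row in ref[5:9]]
mask = regular_mask(4)                         # 4 of the 16 pixels are compared
print(full_search(cur, ref, 4, 4, 2, mask))    # search centred on (4, 4)
```

Only the masked pixels contribute to the cost, so the work per candidate falls in proportion to the number of 1's in the mask, which is the property the power-aware architecture exploits.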

B. Adaptive Content-Based Subsample Algorithm

As mentioned above, the GSA has an aliasing problem at high subsample rates, which leads to considerable quality degradation because the high-frequency band is corrupted. To alleviate the problem, we propose an ACSA that subsamples only the low-frequency band. The procedure of the ACSA is described in Fig. 3. We first use edge extraction to separate high-frequency pixels (or edge pixels) from a macro-block and then subsample the remaining pixels (or low-frequency pixels). The determination of edge pixels starts from gradient filtering. In this paper, we use three popular gradient filters to exercise the content-based algorithm: the high-pass gradient filter, the Sobel gradient filter, and the morphological gradient filter. After obtaining the gradients, we use a floating threshold to determine the edge pixels of the current macro-block. The floating threshold makes the edge extraction more robust to varying video content than a constant threshold does. The following equation describes the calculation of the floating threshold for each macro-block in a frame:


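Following the threshold setting step, the algorithm uses the threshold value to pick the edge pixels and produces the edge mask as shown in

$$\mathrm{EdgeMask}(i,j)=\begin{cases}1, & \text{for } G(i,j) > T\\ 0, & \text{otherwise}\end{cases} \qquad (5)$$

where $G(i,j)$ denotes the gradient at pixel $(i,j)$ and $T$ the floating threshold of the current macro-block.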
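The edge-extraction step can be illustrated with a short sketch. It uses a Sobel gradient with replicated borders and assumes that the floating threshold is placed a fraction t of the way between the smallest and largest gradient of the macro-block; that threshold formula, the strict comparison, and the function names are assumptions made for illustration, since the original threshold equation is not reproduced here.

```python
def sobel_gradient(block):
    """|Gx| + |Gy| Sobel magnitude of a square block, borders replicated."""
    n = len(block)

    def px(i, j):
        return block[min(max(i, 0), n - 1)][min(max(j, 0), n - 1)]

    grad = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            gx = (px(i - 1, j + 1) + 2 * px(i, j + 1) + px(i + 1, j + 1)
                  - px(i - 1, j - 1) - 2 * px(i, j - 1) - px(i + 1, j - 1))
            gy = (px(i + 1, j - 1) + 2 * px(i + 1, j) + px(i + 1, j + 1)
                  - px(i - 1, j - 1) - 2 * px(i - 1, j) - px(i - 1, j + 1))
            grad[i][j] = abs(gx) + abs(gy)
    return grad


def floating_threshold(grad, t):
    """ASSUMED form: a fraction t of the way from the min to the max gradient."""
    flat = [g for row in grad for g in row]
    lo, hi = min(flat), max(flat)
    return lo + t * (hi - lo)


def edge_mask(grad, threshold):
    """1 where the gradient exceeds the threshold, 0 otherwise."""
    return [[1 if g > threshold else 0 for g in row] for row in grad]


block = [[10] * 4 + [200] * 4 for _ in range(8)]   # vertical step edge
grad = sobel_gradient(block)
mask = edge_mask(grad, floating_threshold(grad, t=0.5))
print(sum(map(sum, mask)))                         # number of edge pixels
```

In this toy block only the two columns next to the luminance step are marked as edge pixels; a flat block yields an empty edge mask because no gradient exceeds the threshold.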

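Finally, the adaptive content-based subsample mask (ACSM) is generated by merging the edge mask and the subsample mask, as shown in

$$\mathrm{ACSM}(i,j)=\mathrm{EdgeMask}(i,j)\lor SM(i,j) \qquad (6)$$

The operator $\lor$ means the logic OR operation, and $SM$ is the subsample mask of the GSA. According to the calculation of ACSM, the subsample rate of the ACSA is equal to $N^2$-to-$N_{\mathrm{ACSM}}$, where $N_{\mathrm{ACSM}}$ is the number of 1's in ACSM. Once the ACSM is generated, the algorithm can then determine the motion vector (MV) with the adaptive content-based subsample sum of absolute difference (ACSSAD) criterion. The following equation shows the ACSSAD criterion:

$$\mathrm{ACSSAD}(u,v)=\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\mathrm{ACSM}(i,j)\,\bigl|S(i+u,j+v)-R(i,j)\bigr| \qquad (7)$$

where $R$ and $S$ are the luminance values of the current and reference macro-blocks, respectively, and $(u,v)$ is the candidate offset.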
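The mask merging and the resulting matching criterion can be sketched as below. The tiny 4-by-4 masks and pixel values are made-up toy data, and the function names are hypothetical; the sketch only shows how an edge pixel that plain subsampling would skip still contributes to the cost.

```python
def merge_masks(edge, sub):
    """Pixel-wise logic OR of the edge mask and the subsample mask."""
    return [[e | s for e, s in zip(edge_row, sub_row)]
            for edge_row, sub_row in zip(edge, sub)]


def count_ones(mask):
    """Number of 1's in a 0/1 mask, i.e. the number of compared pixels."""
    return sum(map(sum, mask))


def masked_sad(cur, ref_block, mask):
    """Sum of absolute differences over the pixels selected by the mask."""
    return sum(abs(r - c)
               for cur_row, ref_row, mask_row in zip(cur, ref_block, mask)
               for c, r, m in zip(cur_row, ref_row, mask_row) if m)


# Toy 4x4 example: column 2 holds edge pixels, the subsample mask keeps 1 in 4.
edge = [[0, 0, 1, 0] for _ in range(4)]
sub = [[1 if i % 2 == 0 and j % 2 == 0 else 0 for j in range(4)] for i in range(4)]
acsm = merge_masks(edge, sub)

cur = [[10] * 4 for _ in range(4)]
ref_block = [row[:] for row in cur]
ref_block[1][2] = 15    # mismatch on an edge pixel that the subsample mask skips
ref_block[1][1] = 30    # mismatch on a pixel that neither mask selects

print(count_ones(acsm))                  # pixels compared with the merged mask
print(masked_sad(cur, ref_block, sub))   # subsample mask alone
print(masked_sad(cur, ref_block, acsm))  # merged mask
```

The ratio of the block size to `count_ones(acsm)` gives the effective subsample rate of the merged mask, which is why the rate depends on the content of each macro-block.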

Although the content-based algorithm offers high power scalability and alleviates the corruption of the high-frequency band, a nonstationary problem arises if the designer uses a constant threshold parameter to statically derive the floating threshold for edge extraction. Since different macro-blocks with the same threshold parameter yield different numbers of edge pixels, setting the threshold parameter without considering the content variation of macro-blocks makes the subsample rate nonstationary; that is, the power consumption does not converge within a narrow range for a given power mode. For example, the subsample rate of the Weather clip can vary between 256:67 and 256:224 in the first frame when the threshold parameter is set to 0.1. To solve this problem, this paper introduces an adaptive control mechanism that adjusts the threshold parameter so that the subsample rate becomes stationary. The adaptive control mechanism is a run-time process that adjusts the threshold parameter according to the difference between the current subsample rate and the desired subsample rate (or target subsample rate).

Fig. 4 shows the block diagram of the adaptive control mechanism. Given the battery status, the host processor sets the power mode and the target subsample rate. The target subsample rate is $N^2$-to-$N_{\mathrm{target}}$, where $N_{\mathrm{target}}$ is the target number of 1's in the ACSM. Then, the controller recursively updates the threshold parameter based on its current value and the difference between the current and target numbers of 1's in the ACSM, as shown in Fig. 3 under the line “//update threshold parameter for the next frame.”
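A feedback loop of this kind can be sketched as a simple proportional controller. The update rule below is an assumption made for illustration (the exact rule is given in Fig. 3, which is not reproduced here): the threshold parameter is raised when the mask keeps more pixels than the target and lowered otherwise, scaled by a control parameter `k`. The function `toy_mask_count` is a made-up stand-in for the edge-extraction unit.

```python
def update_threshold_parameter(t, n_current, n_target, k, n_pixels=256):
    """One proportional update of the threshold parameter (ASSUMED rule)."""
    error = (n_current - n_target) / n_pixels   # > 0: too many pixels kept
    return min(max(t + k * error, 0.0), 1.0)    # keep the parameter in [0, 1]


def toy_mask_count(t):
    """Toy stand-in for the EXU: 256 pixels with gradients 0..255, one each."""
    return sum(1 for g in range(256) if g > 255 * t)


t, target, history = 0.1, 128, []
for frame in range(50):
    n = toy_mask_count(t)          # pixels kept in this frame
    history.append(n)
    t = update_threshold_parameter(t, n, target, k=0.3)   # used for next frame
print(history[:5], history[-1])
```

With this toy model the pixel count moves monotonically toward the target and then stays there; a larger `k` shortens the settling time at the cost of stability, which matches the qualitative behavior reported for the control parameter.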

III. POWER-AWARE ARCHITECTURE

Based on the ACSA, we present a semisystolic architecture, shown in Fig. 5, that builds on existing architectures such as [4]. The architecture contains an edge-extraction unit (EXU), an array of processing elements (PEs), a parallel adder tree (PAT), a shift register array (SRA), and a motion-vector selector (MVS).

Fig. 4. Block diagram of the edge-extraction unit with adaptive control mechanism.

Fig. 5. Block diagram of the power-aware ME architecture driven by the ACSA.

Given the power consumption mode, the EXU, which contains the gradient filter and the ACSM generator, extracts high-frequency (or edge) pixels from the current macro-block and generates the 0–1 ACSM to disable or enable PEs. The PE array accumulates absolute pixel differences column by column, while the parallel adder tree sums up all the results to generate the value of ACSSAD. Each PE has a datapath that consists of an absolute-difference unit and an adder unit. Once the ACSSAD is generated, the MVS performs a compare-and-select operation to choose the best motion vector.

To start with, by performing the edge extraction, the architecture first generates the ACSM to determine whether to enable or disable PEs and thus dynamically changes the switching activities of the system to reduce the power consumption. The ACSM disables a PE by using a block element (BE) that is implemented by AND gates. The BEs can nullify the input signals of the datapath. When a PE is disabled during an MV searching iteration, the circuit in the PE remains still until the next iteration starts and, thus, the consumption of transient power can be saved.
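The gating idea can be illustrated with a small behavioral model. This is only a functional sketch under stated assumptions (8-bit pixels, one mask bit per PE, hypothetical function names); it shows that AND-gating both inputs of a disabled PE makes its contribution zero, but it does not model timing or power.

```python
def block_element(pixel, enable):
    """AND every bit of an 8-bit input with the mask bit (behavioral model)."""
    return pixel & (0xFF if enable else 0x00)


def pe_abs_diff(cur_pixel, ref_pixel, enable):
    """Absolute-difference unit fed through block elements on both inputs."""
    return abs(block_element(cur_pixel, enable) - block_element(ref_pixel, enable))


def pe_array_cost(cur, ref_block, mask):
    """Accumulate column by column, then sum the columns (adder-tree stand-in)."""
    n = len(cur)
    column_sums = [0] * n
    for j in range(n):
        for i in range(n):
            column_sums[j] += pe_abs_diff(cur[i][j], ref_block[i][j], mask[i][j])
    return sum(column_sums), column_sums


cur = [[10, 20], [30, 40]]
ref_block = [[12, 25], [30, 49]]
mask = [[1, 0], [1, 1]]          # the PE at row 0, column 1 is disabled
print(pe_array_cost(cur, ref_block, mask))
```

A disabled PE sees constant zero inputs for the whole search iteration, so in hardware its datapath would not toggle; the model reflects only the arithmetic consequence of that gating.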

IV. POWER MODEL

One can consider the major power consumption of a CMOS gate as

(8)

which depends on the output capacitance of the gate, the operation frequency, the switch activity of the gate, and two constants. For an active execution unit (EU) in a VLSI system, the power consumption can be shown in

(9)

which involves the gate count of the execution unit. The following equations demonstrate the total power consumption:


Fig. 6. Quality degradation of the “table tennis” clip.

(11) (12) (13) (14) (15)

After considering the activity of the EUs, the total power consumption can be expressed as (10) and approximated as (11) by assuming that the switch activities are uniform within an execution unit. Since the average output capacitance of each execution unit is nearly the same as the average output capacitance of the total system, the total power consumption can be approximated as (14). Therefore, we can obtain an approximate power estimation model, shown in (15), which defines the gate power coefficient. In this paper, we use the gate power coefficient as the unit for estimating the power dissipation of the power-aware architecture.

V. RESULTS

The peak signal-to-noise ratio (PSNR) of the motion-compensated predicted frame compared to the original frame is adopted as a performance measure. Fig. 6 shows the quality degradation of the CIF video clip “table tennis” with parameters $N = 16$ and $p = 32$. The target subsample rates of the GSA and the ACSA are set to 4:1, 8:3, 2:1, 8:5, 4:3, 8:7, and 1:1, respectively. In the simulation, we compare our approach with the GSA and the variable-search-window approach. The variable-search-window approach is simply achieved by gradually setting the search parameter to 16, 20, 22, 25, 27, 30, and 32. As shown in the results, the quality degradation of the ACSA is less than that of the GSA and the variable-search-window approach. Fig. 7 illustrates the improvement of the ACSA over the GSA. Because the ACSA alleviates the aliasing problem for the high-frequency band, the compensated frame with the ACSA has better quality than that with the GSA on the edge of the sportsman's arm and thumb. In addition, we find that the type of gradient filter does not make much difference to the performance of the proposed algorithm. Fig. 8 illustrates the step response for varying $K$. The higher the value of $K$, the shorter the settling time and the worse the stability of the subsample rate. After testing four video clips, we found the suitable range of $K$ to be from 0.1 to 0.5. We analyzed the errors of the ACSA in Table I. The average error is as low as 1.12% and the error variance is as low as 0.00024. Because the subsample rate can be kept nearly stationary for a given target subsample rate and power mode, the ACSA is well suited to power-aware architectures.

Fig. 7. Nineteenth motion-compensated frame of the “table-tennis” clip. (a) GSA with the 8-to-3 subsample rate and (b) ACSA with the 8-to-3 subsample rate.

Fig. 8. Step response for varying K of the “table tennis” clip.

TABLE I
STATIONARY ERROR OF THE ACSA SUBSAMPLE RATE (R)
The edge-extraction unit uses the high-pass gradient filter. The control parameter K = 0.3.
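As a reference for the quality measure used in this section, the PSNR between an original frame and its motion-compensated prediction can be computed as in the minimal sketch below. It assumes 8-bit luminance samples (peak value 255) stored as lists of rows; the function name and the toy frames are illustrative only.

```python
import math


def psnr(original, predicted, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    squared_error = sum((o - p) ** 2
                        for o_row, p_row in zip(original, predicted)
                        for o, p in zip(o_row, p_row))
    pixel_count = sum(len(row) for row in original)
    if squared_error == 0:
        return math.inf                      # identical frames
    mse = squared_error / pixel_count
    return 10 * math.log10(peak * peak / mse)


original = [[100, 100], [100, 100]]
predicted = [[101, 99], [100, 100]]          # two pixels off by one level
print(round(psnr(original, predicted), 2))
```

A higher PSNR means the predicted frame is closer to the original, so quality degradation in the results corresponds to a drop in this value as the subsample rate increases.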

Table II shows the synthesis results using the TSMC 1P4M 0.35-µm cell library, where the subsample rate can be either the adaptive content-based subsample rate or the generic subsample rate, and the power is given in units of the gate power coefficient defined in (15). Compared with the general semisystolic architecture [4], the edge-extraction unit (EXU) of the proposed architecture is the major overhead of the power-aware function. As mentioned above, this paper uses one of three gradient filters to implement the EXU. According to the synthesis results, the gate counts of the three gradient filters are 595.33, 793.77, and 727.63, respectively. The differences among these values are small relative to the overall gate count of the EXU, which means that the selection of the gradient filter does not affect the overhead estimation much. Therefore, we use the high-pass filter to estimate the overhead caused by the EXU, which dominates the area/power overheads. From the results, the area overhead of the EXU is 7.68% and the power overheads are 0.8% and 0.5% for the 4-to-1 and 1-to-1 subsample rates, respectively. Fig. 9 shows an example of switching the power consumption mode for the video clip “table-tennis.” The target subsample pixel count is reduced by 48 every 40 frames. The result shows that the adaptive control mechanism can make the power consumption reach the target level within 10 frames. According to the battery properties described in Section I, the curve shows that our power-aware architecture can match the battery behavior by slowly and gradually degrading the quality. The marks A, B, C, and D correspond to the power-switching points.

TABLE II
POWER ANALYSIS OF THE POWER-AWARE ARCHITECTURE
Area Overhead = N /(N + N ); Power Overhead = P /(P + P ); N = 16, p = 32.
Cell library: TSMC 0.35-µm process.

Fig. 9. Application for switching the power mode of the video clip “table-tennis.”
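The mode-switching schedule of this experiment (target pixel count reduced by 48 every 40 frames) can be written down as a small helper. The starting value of 256 pixels (the 1-to-1 rate for a 16-by-16 macro-block) and the lower bound of 64 pixels (the 4-to-1 rate) are assumptions made for this sketch, as is the function name.

```python
def target_pixel_count(frame_index, start=256, step=48, period=40, floor=64):
    """Target number of compared pixels per macro-block at a given frame.

    start and floor are ASSUMED values: 256 is every pixel of a 16x16
    macro-block (1-to-1) and 64 is one pixel in four (4-to-1).
    """
    return max(start - step * (frame_index // period), floor)


schedule = [target_pixel_count(f) for f in (0, 39, 40, 80, 120, 160, 200)]
print(schedule)
```

Under these assumptions four reductions of 48 pixels take the target from 256 to 64, that is, from the 1-to-1 rate down to the 4-to-1 rate; a controller such as the adaptive mechanism described earlier would then track each new target.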

VI. CONCLUSION

Motivated by the concept of battery properties and the power-aware paradigm, this paper presents an architecture-level power-aware technique based on a novel adaptive content-based subsample algorithm. When the battery is at full capacity, the proposed ME architecture turns on all of the PEs to provide the best compression quality. In contrast, when the battery capacity is too low for full operation, instead of exhibiting an all-or-none behavior, the proposed architecture shifts to a lower power consumption mode by disabling some PEs to extend the battery lifetime with little quality degradation. As shown in the simulation results, the proposed algorithm successfully improves the compression quality of the generic subsample algorithm and does not introduce much power dissipation.

ACKNOWLEDGMENT

The authors would like to thank the National Chip Implementation Center (CIC), Taiwan, R.O.C., for technical support.

REFERENCES

[1] P. Kuhn, Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation. Norwell, MA: Kluwer, 1999.

[2] M. Bhardwaj, R. Min, and A.-P. Chandrakasan, “Quantifying and enhancing power awareness of VLSI systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp. 757–772, Dec. 2001.

[3] D. Linden, Handbook of Batteries, 2nd ed. New York: McGraw-Hill, 1995.

[4] C.-H. Hsieh and T.-P. Lin, “VLSI architecture for block-matching motion estimation algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol. 2, no. 2, pp. 169–175, Jun. 1992.

[5] J.-C. Tuan, T.-S. Chang, and C.-W. Jen, “On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 1, pp. 61–72, Jan. 2002.

[6] R. Li, B. Zeng, and M.-L. Liou, “A new three-step search algorithm for block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 4, pp. 438–442, Aug. 1994.

[7] J.-Y. Tham, S. Ranganath, M. Ranganath, and A.-A. Kassim, “A novel unrestricted center-biased diamond search algorithm for block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 4, pp. 369–377, Aug. 1998.

[8] C. Zhu, X. Lin, and L.-P. Chau, “Hexagon-based search pattern for fast block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 5, pp. 349–355, May 2002.

[9] K. Sauer and B. Schwartz, “Efficient block motion estimation using integral projections,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 5, pp. 513–518, Oct. 1996.

[10] J.-H. Luo, C.-N. Wang, and T. Chiang, “A novel all-binary motion estimation (ABME) with optimized hardware architectures,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 8, pp. 700–712, Aug. 2002.

[11] O.-S. Unsal and I. Koren, “System-level power-aware design techniques in real-time systems,” Proc. IEEE, vol. 91, no. 7, pp. 1055–1069, Jul. 2003.

[12] B. Liu and A. Zaccarin, “New fast algorithms for the estimation of block motion vectors,” IEEE Trans. Circuits Syst. Video Technol., vol. 3, no. 2, pp. 148–157, Apr. 1993.

[13] C.-K. Cheung and L.-M. Po, “Normalized partial distortion search algorithm for block motion estimation,” IEEE Trans. Circuits Syst. Video