This year, the project has resulted in five journal papers and one conference paper:
1. Hsien-Wen Cheng and Lan-Rong Dung, “A Power-Aware Motion Estimation Architecture Using Content-based Subsampling,” Journal of Information Science and Engineering, vol. 22, no. 4, pp. 799-818, 2006.
2. Hsien-Wen Cheng and Lan-Rong Dung, “A Content-based Methodology for Power-Aware Motion Estimation Architecture,” IEEE Transactions on Circuits and Systems II, vol. 52, no. 10, pp. 631-635, 2006.
3. Lan-Rong Dung and Hsueh-Chih Yang, “A Parallel-In Folding Technique for High-Order FIR Filter Implementation,” accepted by IEICE Transactions on Fundamentals.
4. Tsung-Hsi Chiang and Lan-Rong Dung, “System level verification on high-level synthesis of dataflow algorithms using Petri net,” accepted by WSEAS Transactions on Circuits and Systems.
5. Chuan-Sheng Lin and Lan-Rong Dung, “A NAND Flash Memory Controller for SD/MMC Flash Memory Card,” to appear in IEEE Transactions on Magnetics.
6. Tsung-Hsi Chiang and Lan-Rong Dung, “System-Level Verification on High-Level Synthesis of Dataflow Graph,” ISCAS 2006.
A Power-Aware Motion Estimation Architecture Using Content-based Subsampling
*HSIEN-WEN CHENG AND LAN-RONG DUNG Department of Electrical and Control Engineering
National Chiao Tung University Hsinchu, 300 Taiwan E-mail: [email protected]
This paper presents a novel power-aware motion estimation architecture for battery-powered multimedia devices. As the battery status changes, the proposed architecture adaptively performs graceful tradeoffs between power consumption and compression quality. The tradeoffs are graceful in that the proposed architecture scales with changing conditions and the compression quality degrades only slightly as the available energy is depleted. The key to such tradeoffs lies in a content-based subsample algorithm, first proposed in this paper. As the available energy decreases, the algorithm raises the subsample rate to maximize the battery lifetime. Unlike existing subsample algorithms, the content-based algorithm first extracts edge pixels from a macro-block and then subsamples the remaining low-frequency part. By doing so, we can alleviate the aliasing problem and thus limit the quality degradation as the subsample rate increases. Given a power consumption mode, the proposed architecture first performs edge extraction to generate a turn-off mask and then uses the turn-off mask to reduce the switch activities of processing elements (PEs) in a semi-systolic array. The reduction of switch activities results in significant power savings. To achieve a high degree of scalability and qualified power-awareness, we use an adaptive control mechanism to set the threshold value for edge determination and make the reduction of switch activities rather stationary. As shown by experimental results, the architecture can dynamically operate in different power consumption modes with little quality degradation according to the remaining capacity of the battery pack, while the power overhead of edge extraction is kept under 0.8%.
Keywords: motion estimation, image processing, VLSI architecture, video compression, power-aware system
1. INTRODUCTION
Motion estimation (ME) is widely recognized as the most critical part of many video compression applications, such as the MPEG standards and H.26x [1], since it tends to dominate the computational and hence power requirements. With increasing demand for battery-powered multimedia devices, an ME architecture that is flexible in both power consumption and compression quality is highly desirable. This requirement is driven by the user-centric perspective [2]. Basically, users have two views on using portable devices. Sometimes, users want extremely high video quality at the cost of reduced battery lifetime. At other times, users want acceptable quality with extended battery lifetime. This paper, therefore, presents a novel power-aware ME architecture that
Received February 9, 2004; revised July 6, 2004; accepted July 27, 2004.
Communicated by Pau-Choo Chung.
* This work was supported in part by the National Science Council of Taiwan, R.O.C., under grant No. NSC 92-2220-E-009-033.
uses a content-based subsample algorithm, which can adaptively perform tradeoffs between power consumption and compression quality as the battery status changes. The proposed architecture is driven by a content-based subsample algorithm that allows the architecture to work in different power consumption modes with acceptable quality degradation. Since the control mechanism and data sequences in different power consumption modes are the same in the architecture, the power-aware algorithm can switch power consumption modes very smoothly on the fly. The block diagram shown in Fig. 1 illustrates a typical application of the proposed power-aware ME architecture. The host processor monitors the remaining capacity of the battery pack and switches power consumption modes. According to the power mode, the power-aware architecture sets the subsample rate and calculates the motion vector (MV) for motion compensation. Note that most portable multimedia devices, in practice, have a battery monitor unit and power management subroutines. The host processor and battery monitor unit should not be considered as the overhead of using the power-aware architecture.
Fig. 1. The system block diagram of a portable, battery-powered multimedia device.
Many published papers have presented efficient algorithms for VLSI implementation of motion estimation, based on either high performance or low power design. However, most of them cannot dynamically adapt the compression quality to different power consumption modes. Among these proposed algorithms, the Full-Search Block-Matching (FSBM) algorithm with the Sum of Absolute Difference (SAD) criterion is the most popular approach to motion estimation because of its good quality. It is particularly attractive when extremely high quality is required. Many types of architectures have been proposed for the implementation of FSBM algorithms [3-6]. However, they require a huge number of comparison/difference operations and result in a large computation load and high power consumption. To reduce the computational complexity of FSBM, researchers have proposed various fast algorithms. They either reduce the number of search steps [7-12] or simplify the calculation of the error criterion [13-16]. By combining step-reduction and criterion-simplification, some proposed two-phase algorithms balance complexity against quality [17-19]. They first use FSBM with a simplified matching criterion to generate candidate vectors and then select the best motion vector from among these candidates using the SAD criterion. These fast-search algorithms successfully improved the block matching speed while limiting the quality degradation, thus achieving low power implementation. However, a low power implementation is not necessarily a power-aware system, because a power-aware system should adaptively modify its behavior according to changes in the power/energy status and achieve a balance between quality and battery life [20]. For an ME algorithm to be suitable for power-aware designs, it must offer a high degree of scalability in performance tradeoffs. Unfortunately, the fast algorithms mentioned above do not meet this requirement.
The authors in [21, 22] presented subsample algorithms that significantly reduce the computation cost with low quality degradation. The reduction of the computation cost implies a savings in power consumption. Since the power consumption can be reduced by simply increasing the subsample rate, the subsample algorithms have a high degree of scalability and are very suitable for power-aware ME architectures. However, applying subsample algorithms to power-aware architectures may suffer from the aliasing problem in the high-frequency band. The aliasing problem degrades the compression quality rapidly as the subsample rate increases. To alleviate this problem, we extend traditional subsample algorithms to obtain a content-based algorithm, called the content-based subsample algorithm (CSA). In this algorithm, we first use edge extraction techniques to separate the high-frequency band from a macro-block and then subsample the low-frequency band only. By combining the edge pixels and subsample pixels, the algorithm generates a turn-on mask for the architecture to limit the switch activities of processing elements (PEs) in a semi-systolic array. By doing so, we can achieve significant power savings and limit the quality degradation as the subsample rate increases. Because the number of high-frequency pixels varies with different video clips, we use an adaptive control mechanism to set a threshold value for edge determination and make the number of masked pixels stationary for a given power mode.
The CSA can be used in most existing ME architectures by turning off PEs according to the subsample rate. In this paper, we present a semi-systolic architecture with gated PEs. The proposed architecture shows that the CSA can dynamically alter the subsample rate as the power consumption mode changes. As shown by experimental results, the proposed architecture can work in different power consumption modes with acceptable and smooth quality degradation while keeping the power overhead of edge extraction under 0.8%.
The rest of the paper is organized as follows. In section 2, we introduce the background of the power-aware paradigm. Section 3 presents subsample algorithms in detail. Section 4 describes the proposed power-aware architecture and gives experimental results. Finally, in section 5, we draw conclusions of this work.
2. BACKGROUND
2.1 Battery Properties
One may simply consider a battery as a capacitor in which the charge capacity is linearly proportional to the output voltage. However, in practice, the behavior of a battery is less than ideal due to the variation in voltage and capacity. Two other important properties of batteries are the rate capacity effect and the recovery effect [23]. The first effect means that the capacity of a battery depends on the discharging rate, and the second means that a battery with an intermittent load may have a larger capacity than one with a continuous load. Fig. 2 (a) illustrates the rate capacity effect by plotting the cell voltage under two different discharging loads as time advances. As shown by the curves, when the load is halved the battery life can be more than two times longer. Fig. 2 (b) shows the recovery effect, in which a reduction of the load causes a rise in the voltage.
Therefore, one can extend the battery lifetime by gradually stepping down the power dissipation. The Intel SpeedStep technology, for instance, which is widely used in mobile CPUs, adopts the same strategy to extend the battery lifetime [24]. This technology changes the power consumption mode by scaling down the supplied voltage and operating frequency, hence degrading the performance in order to increase the battery lifetime.
(a) The rate capacity effect [25]. (b) The recovery effect.
Fig. 2. Non-ideal battery properties.
From these two properties of batteries, we can learn two things. First, we can reduce the load to achieve a longer battery lifetime because halving the current can more than double the battery lifetime. Second, optimal performance can be achieved when the battery is fully charged because the battery capacity can be recovered later by reducing the load. These properties provide strong motivation for developing power-aware designs and explain the key requirement of a power-aware architecture: a high degree of scalability in energy-quality tradeoffs.
2.2 Power Model
One can consider the major power consumption of a CMOS gate $i$ as in Eq. (1), where $C_i$ is the output capacitance, $f_i$ is the operating frequency, $r_i(0 \leftrightarrow 1)$ is the switch activity of gate $i$, and $\alpha$ and $\kappa$ are constants:

$$P_{gate_i} = \alpha \cdot C_i \cdot f_i \cdot V_{DD}^{2} = \kappa \cdot C_i \cdot r_i(0 \leftrightarrow 1). \tag{1}$$

For an execution unit $EU_j$ in a VLSI system, the power consumption can be computed using Eq. (2), where $N_{gate,j}$ is the gate count of $EU_j$:

$$P_{EU}^{j} = \sum_{i=1}^{N_{gate,j}} \kappa \cdot C_i^{j} \cdot r_i^{j}(0 \leftrightarrow 1). \tag{2}$$

After considering the activity of execution units, the total power consumption can be expressed as in Eq. (3) and approximated as in Eq. (5) by assuming that the switch activities are uniform within an execution unit; that is, $r_i^{k}(0 \leftrightarrow 1) = r^{k}(0 \leftrightarrow 1), \ \forall\, r_i^{k}(0 \leftrightarrow 1)$. Since the average output capacitance of each execution unit ($C_{avg}^{k}$) is nearly the same as the average output capacitance of the total system ($C_{avg}$), the total power consumption can be approximated as in Eq. (8). Therefore, we can obtain an approximate power estimation model as shown in Eq. (9), where $\varepsilon_{gp}$ is defined as the gate power coefficient. In this paper, we use the gate power coefficient as the unit for estimating power dissipation:
$$P_{total} = \sum_{j}^{inactive} P_{EU}^{j} + \sum_{k}^{active} P_{EU}^{k} \tag{3}$$
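As a rough illustration of Eqs. (1) and (2), the following Python sketch sums the per-gate contributions κ · C_i · r_i(0 ↔ 1) for one execution unit. The function name `eu_power`, the value of κ, and the capacitance and activity figures are hypothetical and chosen only to make the arithmetic visible; they are not taken from the paper.

```python
def eu_power(kappa, capacitances, activities):
    """Eq. (2): sum of kappa * C_i * r_i(0<->1) over the gates of one execution unit."""
    if len(capacitances) != len(activities):
        raise ValueError("one switch activity per gate is required")
    return sum(kappa * c * r for c, r in zip(capacitances, activities))


# Hypothetical values: three gates with equal output capacitance.
KAPPA = 2.0
caps = [1.0, 1.0, 1.0]

active = eu_power(KAPPA, caps, [0.5, 0.5, 0.5])  # gates are switching
gated = eu_power(KAPPA, caps, [0.0, 0.0, 0.0])   # unit turned off: no switching

print(active, gated)
```

Under this model, an execution unit whose gates do not switch contributes nothing to the dynamic power, which is why masking processing elements translates into power savings.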
3.1 Generic Subsample Algorithm
Many published papers have presented efficient algorithms for VLSI implementation of motion estimation [1, 3, 5, 6, 15, 19]. The FSBM algorithm with the SAD criterion is the most popular approach to motion estimation because of its good quality and regular data path. The algorithm uses Eq. (10) to compare each current macro-block (CMB) with all the reference macro-blocks (RMBs) in the search area and determine the best match:

$$SAD(u, v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} |S(i+u, j+v) - R(i, j)|, \quad \text{for } -p \le u, v \le p-1, \tag{10}$$

where R(i, j) is the luminance value at (i, j) of the current macro-block (CMB), and S(i + u, j + v) is the luminance value at (i, j) of the reference macro-block (RMB), which is offset by (u, v) from the CMB in the search area 2p-by-2p. The motion vector is found using Eq. (11):

$$MV = \arg\min_{(u, v)} SAD(u, v). \tag{11}$$
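The full-search procedure can be sketched in a few lines of Python. This is a minimal sketch, assuming that the SAD of Eq. (10) is the sum of absolute luminance differences over an N-by-N block and that the motion vector of Eq. (11) is the offset with the smallest SAD. The function names, the treatment of u as a row offset and v as a column offset, and the skipping of candidates outside the frame are assumptions made for the example.

```python
def sad(cur, ref, top, left, u, v, n):
    """Sum of |S(i+u, j+v) - R(i, j)| over an n-by-n macro-block."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += abs(ref[top + i + u][left + j + v] - cur[top + i][left + j])
    return total


def full_search(cur, ref, top, left, n, p):
    """Return ((u, v), cost) with the smallest SAD for -p <= u, v <= p - 1."""
    best = None
    for u in range(-p, p):
        for v in range(-p, p):
            # Assumption: candidates that leave the reference frame are skipped.
            if not (0 <= top + u and top + u + n <= len(ref)
                    and 0 <= left + v and left + v + n <= len(ref[0])):
                continue
            cost = sad(cur, ref, top, left, u, v, n)
            if best is None or cost < best[0]:
                best = (cost, (u, v))
    return best[1], best[0]


# Synthetic 8x8 frames: the current frame is the reference shifted by (1, -1).
ref = [[(7 * r + 13 * c) % 31 for c in range(8)] for r in range(8)]
cur = [[ref[r + 1][c - 1] if r + 1 < 8 and c >= 1 else 0 for c in range(8)]
       for r in range(8)]

mv, cost = full_search(cur, ref, top=3, left=3, n=2, p=2)
print(mv, cost)
```

Every candidate costs N² absolute-difference operations, and there are 4p² candidates per macro-block, which is the computational load that subsampling aims to reduce.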
Much research has addressed subsample techniques for motion estimation in order to reduce the computation load of FSBM [21, 22]. Liu and Zaccarin, pioneers in developing subsample algorithms, applied 4-to-1 subsampling to FSBM. As shown by simulation results, the 4-to-1 subsample algorithm reduces the computational load significantly while keeping the quality similar to that of exhaustive search [21]. Here, we present a generic subsample algorithm in which the subsample rate ranges from 4-to-1 to 1-to-1. The generic subsample algorithm uses Eq. (12) as a matching criterion, called the subsample sum of absolute difference (SSAD), where SM8:m is the subsample mask for the subsample rate 8-to-m as shown in Eq. (13). The subsample mask SM8:m is generated from a basic mask as shown in Eq. (14):
8:
For example, consider the subsample rate 8-to-6. The subsample mask SM8:6 can be expressed in Eq. (15) and is illustrated in Fig. 3:
8:6
Fig. 3. The subsample mask of the subsample rate 8-to-6.
Given a subsample mask, the computational cost of the SSAD calculation can be lower than that of the SAD calculation. Since a reduction of computational cost implies reduced power consumption, the generic subsample algorithm allows the system power to scale with the changing subsample rate. The higher the subsample rate, the greater the number of inactive execution units (EUs). Accordingly, the power consumption of the system is proportional to the inverse of the subsample rate. Due to its flexibility in achieving an energy-quality tradeoff, the generic subsample algorithm is suitable for implementing power-aware architectures. However, the algorithm suffers from the aliasing problem in the high-frequency band. Aliasing degrades the MV quality and results in considerable quality degradation when the high-frequency band is corrupted.
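The effect of a subsample mask on the matching cost can be illustrated with a short Python sketch. It assumes that the SSAD accumulates absolute differences only where the mask holds a 1. The mask used here (keep pixels whose row and column indices are both even) is a hypothetical 4-to-1 pattern standing in for the masks of Eqs. (13) and (14), and the function name and pixel values are likewise made up for the example.

```python
def masked_sad(mask, cur_block, ref_block):
    """Accumulate |S - R| where the mask is 1; return (cost, active pixel count)."""
    cost = 0
    active = 0
    for m_row, r_row, s_row in zip(mask, cur_block, ref_block):
        for m, r, s in zip(m_row, r_row, s_row):
            if m:  # a 0 in the mask means this pixel is not processed
                cost += abs(s - r)
                active += 1
    return cost, active


N = 4
# Hypothetical 4-to-1 mask: keep pixels whose row and column indices are both even.
mask_4to1 = [[1 if i % 2 == 0 and j % 2 == 0 else 0 for j in range(N)]
             for i in range(N)]
full_mask = [[1] * N for _ in range(N)]

cur_block = [[10] * N for _ in range(N)]
ref_block = [[12, 10, 12, 10],
             [10, 10, 10, 10],
             [12, 10, 12, 10],
             [10, 10, 10, 10]]

sub_cost, sub_active = masked_sad(mask_4to1, cur_block, ref_block)
full_cost, full_active = masked_sad(full_mask, cur_block, ref_block)
print(sub_active, full_active)
```

In this toy case the subsampled criterion touches 4 pixels instead of 16, which mirrors the inverse relationship between subsample rate and the number of active execution units described above.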
3.2 Content-Based Subsample Algorithm
As mentioned above, the generic subsample algorithm suffers from the aliasing problem at high subsample rates, leading to considerable quality degradation because the high-frequency band is corrupted. To alleviate this problem, we propose using the content-based subsample algorithm (CSA), which only subsamples the low-frequency band. The CSA procedure is shown in Fig. 4. We first use edge extraction to separate high-frequency pixels (or edge pixels) from a macro-block and then subsample the remaining pixels (or low-frequency pixels). The determination of edge pixels starts with gradient filtering. Three popular gradient filters [26] were used here to execute the content-based algorithm: the high-pass gradient filter, the Sobel gradient filter, and the morphological gradient filter. Eqs. (16) to (18) show the calculations of the three gradient filters:
High-Pass Gradient Filter:

$$G_{hpf}(i, j) = |MF(HPF_{mask}, R)(i, j)|, \tag{16}$$

where

$$HPF_{mask} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}.$$
// frame: t
Input current and reference frames, W × H;
for (y = 0; y < W/N; y++) {
    for (x = 0; x < H/N; x++) {
        Perform gradient filtering;
        Calculate the edge threshold:
            threshold = m1^t(x, y) ⋅ max{G(i, j)} + (1 − m1^t(x, y)) ⋅ min{G(i, j)};
        Determine edge pixels and edge mask;
        Generate content-based subsample mask (CSM);
        csm_cnt = total number of 1's in CSM;
        // update threshold parameter for the next frame
        m1^{t+1}(x, y) = m1^t(x, y) + Kp ⋅ (csm_cnt − trg_cnt);
    }
}
Fig. 4. The content-based subsample algorithm.
Sobel Gradient Filter:
In Eqs. (16) and (18), the MF(⋅) function is the mask filter operation as shown in Eq. (19):
$$MF(M, R)(i, j) = \sum_{p=-1}^{1} \sum_{q=-1}^{1} M(p+1, q+1) \cdot R(i+p, j+q), \tag{19}$$

where M is a 3-by-3 mask and R(i, j) is the luminance value at (i, j).
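The mask filter of Eq. (19) and the high-pass gradient of Eq. (16) can be sketched as follows. The function names are illustrative, and setting the gradient of border pixels to 0 is an assumption, since the text does not say how macro-block borders are handled.

```python
HPF_MASK = [[-1, -1, -1],
            [-1,  8, -1],
            [-1, -1, -1]]


def mask_filter(mask, img, i, j):
    """Eq. (19): sum over p, q in {-1, 0, 1} of M(p+1, q+1) * R(i+p, j+q)."""
    return sum(mask[p + 1][q + 1] * img[i + p][j + q]
               for p in (-1, 0, 1) for q in (-1, 0, 1))


def high_pass_gradient(img):
    """Eq. (16) on interior pixels; border pixels are set to 0 (an assumption)."""
    h, w = len(img), len(img[0])
    grad = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            grad[i][j] = abs(mask_filter(HPF_MASK, img, i, j))
    return grad


# 4x4 block with a vertical step edge between columns 1 and 2.
block = [[10, 10, 50, 50] for _ in range(4)]
G = high_pass_gradient(block)

# A flat block has no high-frequency content, so its gradient is zero.
flat = high_pass_gradient([[7] * 3 for _ in range(3)])
print(G[1][1], G[1][2], flat[1][1])
```

Because the mask coefficients sum to zero, uniform regions produce a zero response and only pixels near luminance transitions receive a large gradient.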
After obtaining the gradients, G, we use a floating threshold instead of a constant threshold to determine the edge pixels of the CMB. The floating threshold makes edge extraction more robust when video content varies. Eq. (20) shows the calculation of the floating threshold:

$$threshold = m_1^{t}(x, y) \cdot \max\{G(i, j)\} + (1 - m_1^{t}(x, y)) \cdot \min\{G(i, j)\}, \quad \text{for } 0 \le m_1^{t} \le 1, \tag{20}$$

where $m_1^{t}(x, y)$ is the threshold parameter of macro-block (x, y) in the t-th frame.
Following the threshold setting step, the algorithm uses the threshold value to pick the edge pixels and produce the edge mask as shown in Eq. (21):
$$EdgeMask(i, j) = \begin{cases} 1, & \text{for } G(i, j) \ge threshold \\ 0, & \text{otherwise} \end{cases}. \tag{21}$$
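A small Python sketch of the floating threshold (Eq. (20)) and the edge mask (Eq. (21)) is given below. The function names and the gradient values are hypothetical and serve only to show how the threshold parameter interpolates between the minimum and maximum gradient of a macro-block.

```python
def floating_threshold(grad, m1):
    """Eq. (20): m1 * max{G} + (1 - m1) * min{G}, with 0 <= m1 <= 1."""
    if not 0.0 <= m1 <= 1.0:
        raise ValueError("m1 must lie in [0, 1]")
    values = [g for row in grad for g in row]
    return m1 * max(values) + (1.0 - m1) * min(values)


def edge_mask(grad, threshold):
    """Eq. (21): 1 where G(i, j) >= threshold, 0 otherwise."""
    return [[1 if g >= threshold else 0 for g in row] for row in grad]


# Hypothetical gradient values for a 2x4 region.
G = [[0, 20, 100, 0],
     [0, 40, 120, 0]]

thr = floating_threshold(G, 0.5)
mask = edge_mask(G, thr)
print(thr, mask)
```

With the threshold parameter at 0.5 the threshold sits halfway between the smallest and largest gradient; raising the parameter toward 1 marks fewer pixels as edges, and lowering it toward 0 marks more.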
Finally, the content-based subsample mask (CSM) is generated by merging the edge mask and the subsample mask, as shown in Eq. (22). In Eq. (22), the operator ∨ denotes a logical OR operation. According to the calculation of the CSM, the subsample rate in the CSA (CSR), denoted as Rs, is N²-to-csm_cnt, where csm_cnt is the number of 1’s in the CSM and N² is the macro-block size. Fig. 5 shows an example of a CSM where the subsample rate is 64-to-27:
CSM(i, j) = SM8:m(i, j) ∨ EdgeMask(i, j), 0 ≤ i, j ≤ N − 1. (22)
[Fig. 5 panels: High Frequency Band (edge pixels); Low Frequency Band (background pixels); Content-Based Subsample Mask (CSM)]
Fig. 5. The components of a content-based subsample mask (CSM).
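The merge in Eq. (22) amounts to a pixel-wise OR, after which csm_cnt is simply the number of 1's in the result. The sketch below illustrates this on a 4-by-4 block; the subsample pattern and the edge mask are hypothetical inputs, not the masks used in the paper.

```python
def content_based_mask(subsample_mask, edge_mask):
    """Eq. (22): CSM(i, j) = SM(i, j) OR EdgeMask(i, j)."""
    return [[s | e for s, e in zip(s_row, e_row)]
            for s_row, e_row in zip(subsample_mask, edge_mask)]


def count_ones(mask):
    """Number of 1's in a binary mask (csm_cnt when applied to the CSM)."""
    return sum(sum(row) for row in mask)


N = 4
# Hypothetical 4-to-1 subsample mask: rows and columns with even indices.
sm = [[1 if i % 2 == 0 and j % 2 == 0 else 0 for j in range(N)] for i in range(N)]
# Hypothetical edge mask: a vertical edge along column 2.
em = [[0, 0, 1, 0] for _ in range(N)]

csm = content_based_mask(sm, em)
csm_cnt = count_ones(csm)
print(f"subsample rate: {N * N}-to-{csm_cnt}")
```

Edge pixels that already coincide with the subsample pattern are counted once, so csm_cnt is at most the sum of the two mask counts and depends on the content of the macro-block.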
Once the CSM is generated, the algorithm can then determine the motion vector (MV) with the content subsample sum of absolute difference (CSSAD) criterion. The CSSAD criterion is similar to the SSAD mentioned in section 3.1 and is shown in Eq. (23):

$$CSSAD_{8:m}(u, v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} |CSM_{8:m}(i, j) \cdot [S(i+u, j+v) - R(i, j)]|, \quad \text{for } -p \le u, v \le p-1. \tag{23}$$

The simulation results show that the CSA can significantly reduce the computational complexity with little quality degradation. However, a non-stationarity problem arises when a power-aware architecture is implemented if the designer uses constant threshold parameters $m_1^{t}$ and statically sets the floating threshold for a given power mode. Since different video clips with the same threshold parameters will have different subsample rates, setting the threshold value without considering the content variation of the video clip will make the subsample rate non-stationary; that is, power consumption will not converge within a narrow range for a given power mode. The divergence of power consumption can result in poor power-awareness. To solve this problem, we use an adaptive control mechanism to adjust the threshold parameters so that the subsample rate becomes stationary. The adaptive control mechanism is a run-time process that adjusts the threshold parameters according to the difference between the current subsample rate and the desired (or target) subsample rate.
Fig. 6. A block diagram of the edge-extraction unit with an adaptive control mechanism.
Fig. 6 shows a block diagram of the adaptive control mechanism. Given the battery status, the host processor sets the power mode and the target subsample rate. The target subsample rate is $N^2$-to-trg_cnt, where trg_cnt is the target number of 1’s in the CSM. Then, the controller recursively updates the threshold parameter, $m_1^{t+1}(x, y)$, based on the current $m_1^{t}(x, y)$ and the difference between csm_cnt and trg_cnt, as shown in Eq. (24):

$$
\begin{aligned}
& m_1^{t+1}(x, y) = m_1^{t}(x, y) + K_p \cdot (csm\_cnt - trg\_cnt); \\
& \text{if } (m_1^{t+1}(x, y) < 0)\ \{ m_1^{t+1}(x, y) = 0 \}; \\
& \text{if } (m_1^{t+1}(x, y) > 1)\ \{ m_1^{t+1}(x, y) = 1 \};
\end{aligned} \tag{24}
$$

where $m_1^{t+1}(x, y)$ is the threshold parameter of macro-block (x, y) in the (t + 1)-th frame and $K_p$ is the control parameter. The control parameter $K_p$ affects the settling time and steady-state error of the subsample rate.
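The update rule of Eq. (24) is a proportional correction followed by clamping to [0, 1]. A minimal Python sketch is shown below; the function name and the numeric values, including the gain used in the first call, are hypothetical and chosen only to exercise the three branches of the rule.

```python
def update_threshold_param(m1, csm_cnt, trg_cnt, kp):
    """Eq. (24): proportional update of m1, then clamp the result to [0, 1]."""
    m1_next = m1 + kp * (csm_cnt - trg_cnt)
    return min(1.0, max(0.0, m1_next))


# Slightly too many active pixels: the parameter (and thus the threshold) rises.
step_up = update_threshold_param(0.5, csm_cnt=130, trg_cnt=128, kp=0.01)

# Large positive error: the result saturates at 1.
clamped_high = update_threshold_param(0.9, csm_cnt=256, trg_cnt=64, kp=0.3)

# Large negative error: the result saturates at 0.
clamped_low = update_threshold_param(0.1, csm_cnt=64, trg_cnt=256, kp=0.3)

print(step_up, clamped_high, clamped_low)
```

When csm_cnt exceeds trg_cnt, the parameter grows, the floating threshold moves toward the maximum gradient, and fewer pixels are classified as edges in the next frame, which pulls csm_cnt back toward the target.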
3.3 Simulation Results
Figs. 7 and 8 show the simulation results for four 352-by-288 MPEG clips with the parameters N = 16 and p = 32. The control parameter Kp was set to 0.3. The target subsample rates were set to (4:1), (8:3), (2:1), (8:5), (4:3), (8:7), and (1:1); that is, the target subsample pixel counts were 64, 96, 128, 160, 192, 224, and 256, respectively. Note that the target subsample pixel counts are proportional to the power consumption. Thus, the figures can also be read as charts of power versus PSNR. The dashed lines indicate the results obtained using the generic subsample algorithm, and the solid lines indicate the