Advanced Electronic Design Automation Technology R&D - Subproject 6: Power-Aware High-Level Synthesis for Nanometer System-on-Chip Design (III)


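Subproject 6: Power-Aware High-Level Synthesis for Nanometer System-on-Chip Design

(3/3)

Project type: Integrated project

Project number: NSC94-2220-E-009-023-

Execution period: August 1, 2005 to July 31, 2006 (ROC years 94 to 95)

Executing institution: Department of Electrical and Control Engineering, National Chiao Tung University

Principal investigator: Lan-Rong Dung (董蘭榮)

Project participants: 楊學之、江宗錫、宋岳璋、林耕興、吳智偉、林盟淳、賴信丞、呂文豪、林毅慧、黃仕捷

Report type: Full report

Report attachments: Report on attending an international conference, and published papers

Availability: This project is open to public inquiry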


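Research Project Funded by the National Science Council, Executive Yuan

■ Final report

□ Interim progress report

Advanced Electronic Design Automation Technology R&D

Subproject 6: Power-Aware High-Level Synthesis for Nanometer System-on-Chip Design

(3/3)

Project type: □ Individual project ■ Integrated project

Project number: NSC 94-2220-E-009-023-

Execution period: August 1, 2005 to July 31, 2006 (ROC years 94 to 95)

Principal investigator: Lan-Rong Dung (董蘭榮)

Co-principal investigator:

Project participants: 楊學之、江宗錫、宋岳璋、林耕興、吳智偉、林盟淳、賴信丞、呂文豪、林毅慧、黃仕捷

Report type (submitted according to the approved budget list): □ Brief report ■ Full report

This report includes the following required attachments:

□ One report on overseas travel or study

□ One report on travel or study in mainland China

■ One report on attending an international academic conference, and one copy of each published paper

□ One overseas research report for an international cooperative research project

Availability: Except for industry-academia cooperative research projects, projects for upgrading industrial technology and cultivating talent, controlled projects, and the cases below, the report may be opened to public inquiry immediately

■ Involves patents or other intellectual property rights; open to public inquiry after □ one year □ two years

Executing institution: Department of Electrical and Control Engineering, National Chiao Tung University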


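(1) Project Abstract (Chinese)

As CMOS circuit dimensions approach their limits, balancing a circuit's power consumption against its performance and quality becomes very important. In particular, with the rapid progress of nanometer technology, design methods that are aware of energy constraints have become a key issue. Power-aware design considers not only average power consumption but also instantaneous power behavior, such as peak power and power gradient. At present, design methods that consider instantaneous power are limited to the transistor or logic-gate level. However, if power-aware solutions are found at the system design stage, integrated circuit design can markedly extend the degree of power-aware optimization achieved by lower-level techniques. The purpose of this project is to explore high-level synthesis methods for power-aware systems in order to manage and reduce transient power.

Keywords: power-aware; high-level synthesis; system-on-chip; computer-aided design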

(2) Project Abstract (English)

As we get closer to the limits of scaling in CMOS circuits, it is imperative to consider power/performance trade-offs and to develop appropriate power-aware methodologies and techniques for embedded systems. The use of nanometer technologies is making it increasingly important to consider transient characteristics of a circuit's power dissipation (e.g., peak power, and power gradient or differential) in addition to its average power consumption. State-of-the-art transient power analysis and reduction approaches operate mainly at the transistor and gate levels. However, we believe architectural solutions to transient power problems may complement and significantly extend the scope of lower-level techniques, as was the case with average power minimization. This project intends to exploit a high-level synthesis approach to transient power management and reduction, in that power-aware high-level synthesis can affect the cycle-by-cycle peak power and peak power differential of the synthesized implementation.


Chinese and English Abstracts

Table of Contents

Report Content

I. Introduction

II. Research Objectives

III. Literature Review

IV. Research Methods

V. Results and Discussion

A Power-Aware Motion Estimation Architecture Using Content-based Subsampling

A Content-based Methodology for Power-Aware Motion Estimation Architecture

A Parallel-In Folding Technique for High-Order FIR Filter Implementation

System level verification on high-level synthesis of dataflow algorithms using Petri net

A NAND Flash Memory Controller for SD/MMC Flash Memory Card

System-Level Verification on High-Level Synthesis of Dataflow Graph

References


I. Introduction

With the increasing demand for portable, power-aware multimedia devices, an architecture that is flexible in both power consumption and performance is highly desirable. As we get closer to the limits of scaling in CMOS circuits, it is imperative to consider power/performance trade-offs and to develop appropriate power-aware methodologies and techniques for embedded systems. The use of nanometer technologies is making it increasingly important to consider transient characteristics of a circuit's power dissipation (e.g., peak power, and power gradient or differential) in addition to its average power consumption. State-of-the-art transient power analysis and reduction approaches operate mainly at the transistor and gate levels. However, we believe architectural solutions to transient power problems may complement and significantly extend the scope of lower-level techniques, as was the case with average power minimization.

When circuit technology scales by shrinking the transistor feature size by a factor of $x$, the capacitance is reduced by a factor of $x$ and the squared supply-voltage term $V_{DD}^2$ by a factor of $x^2$. Since dynamic power is proportional to $C \cdot V_{DD}^2 \cdot f$, power therefore decreases by a factor of $x^3$, provided the frequency remains the same. Unfortunately, with each generational scaling of the feature size, more complex, aggressive designs are used. These designs employ a higher clock frequency, a larger chip area, and a higher total number of transistors due to the use of more aggressive speculative execution. The result is a significant increase in power dissipation. On the other hand, aggressive, complex designs increase the opportunities available for power management: there are more individual units that can be placed on standby when not needed by the application.

Another worrying trend is the increase in power density. Considering the Intel family of microprocessors, for instance, with the power density expressed in watts/cm$^2$: the current generation is getting close to the power density of a nuclear reactor. This results in more expensive cooling mechanisms and reduced reliability. The increase in total power dissipation as well as power density means that traditional power management policies centered only at the device and VLSI levels are no longer sufficient. As a result, power has propagated as an important design constraint to the


II. Research Objectives

As mentioned above, with the increasing demand for portable, power-aware multimedia devices, an architecture that is flexible in both power consumption and performance is highly desirable. This project will first investigate and characterize the power consumption of battery components and then develop high-level synthesis approaches that balance power dissipation against performance, thereby reducing power consumption while maintaining the required system performance. Given the transient power constraints, the proposed project has four goals: to have the longest battery lifetime while achieving the performance goals, to deliver task schedule and resource allocation automatically, and to synthesize the SOC architectures at the system level.

III. Literature Review

Currently, power-aware systems research at the architectural level for power saving is concentrated on the following issues: instruction set architecture (ISA) selection, instruction caches (I-cache) and the system bus, voltage and frequency scaling, battery-consciousness, and task movement.

1. Instruction Set Architecture (ISA) Level: This is an active research area in the context of general-purpose architectures; various researchers have commented on the need to take power and energy into account in ISA design. However, not much effort has been devoted to power-aware ISA design. Paper [1] employs a fine-grained off-line scheduling approach which saves power by combining multiple instructions into one complex but lower-power instruction or by using low-power versions of instructions while considering task deadlines. The proposed scheme assumes that the ISA is sufficiently flexible; however, in practice there is not much scope for the existence of complex instructions which are functionally equivalent to a group of simpler instructions in the ISA design.

2. I-cache and Buses: The control path, which governs the fetch, issue and retiring of instructions, is quite simple in typical embedded processors and occupies a relatively small portion of the chip area. The caches take up most of the chip area [2] and are responsible for a considerable percentage of the energy dissipation even though memory is more energy efficient than control logic. Paper [3] compresses the instructions in memory. This saves instruction fetch energy by using fewer bits on a fetch. An alternative strategy by paper [4] also concentrates on saving instruction energy. The authors employ a loop cache and keep the tight loop in a small loop cache instead of accessing a larger block.


over-designed, provisioning resources for the worst-case execution time. Since tasks rarely execute up to their worst case, there is significant scope for power and energy savings using dynamic voltage and frequency scaling. Papers [5]-[8] are in this category.

4. Battery Consciousness: The most important issues to be considered for battery-driven systems are the total battery capacity and the battery discharge profile. The latter is important in devising battery-aware schemes that are guided by the discharge profile. Paper [9] considers distributed real-time systems and develops a battery model, which is used in two scheduling schemes: first the authors optimize the battery discharge power profile, and then they use voltage scaling for distributed real-time systems. The overall objective is to extend the battery lifespan while meeting task deadlines and precedence requirements. The authors claim that mitigating battery capacity loss requires reducing the discharge current level and shaping its distribution.

5. Task Movement: Task movement is important in real-time systems for fault-tolerance or load balancing purposes. However, power efficient task movement heuristics have not been extensively investigated. One exception is the work of paper [10]. The paper is based on the observation that a set of processors can operate at a lower power level than a single one with the same performance if there is enough parallelism.

IV. Research Methods

We divide the project into three parts: transient power management through high-level synthesis, system-level power-aware design automation, and high-level synthesis for adaptive power-quality tradeoff in energy-aware multimedia embedded systems. The yearly schedule is as follows:

1st Year:

1. Study the power characteristics of battery-based systems.

2. Develop a static scheduling algorithm under transient power constraints.

3. Demonstrate the proposed scheduling technique using a state-of-the-art commercial design flow.


frequency scaling.

3rd Year:

1. Study on battery-conscious multimedia systems.

2. Develop the adaptive power-quality tradeoff algorithm.

3. Develop the high-level synthesis for adaptive power management.

By the end of this project, we expect the following:

1. Publications: At least two papers will be published in major international conferences each year. We will publish at least two academic journal papers in support of this three-year project. In addition, at least two Ph.D. dissertations and six master's theses will be funded by the project.

2. CAD environment: We will build a high-level synthesis tool driven by techniques from this project and embed the tool into a state-of-the-art commercial design flow.

3. Training: Two Ph.D. students and six master's students will earn their degrees within the execution period of this project.

V. Results and Discussion

This year, the project resulted in five journal papers and one conference paper:

1. Hsien-Wen Cheng and Lan-Rong Dung, “A Power-Aware Motion Estimation Architecture Using Content-based Subsampling,” Journal of Information Science and Engineering, vol. 22, no. 4, pp. 799-818, 2006.

2. Hsien-Wen Cheng and Lan-Rong Dung, “A Content-based Methodology for Power-Aware Motion Estimation Architecture,” IEEE Transactions on Circuits and Systems II, vol. 52, no. 10, pp. 631-635, 2006.

3. Lan-Rong Dung and Hsueh-Chih Yang, “A Parallel-In Folding Technique for High-Order FIR Filter Implementation,” accepted by IEICE Transactions on Fundamentals.

4. Tsung-Hsi Chiang and Lan-Rong Dung, “System level verification on high-level synthesis of dataflow algorithms using Petri net,” accepted by WSEAS Transactions on Circuits and Systems.

5. Chuan-Sheng Lin and Lan-Rong Dung, “A NAND Flash Memory Controller for SD/MMC Flash Memory Card,” to appear in IEEE Transactions on Magnetics.

6. Tsung-Hsi Chiang and Lan-Rong Dung, “System-Level Verification on High-Level Synthesis of Dataflow Graph,” ISCAS 2006.


A Power-Aware Motion Estimation Architecture Using Content-based Subsampling*

HSIEN-WEN CHENG AND LAN-RONG DUNG

Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu, 300 Taiwan

This paper presents a novel power-aware motion estimation architecture for battery-powered multimedia devices. As the battery status changes, the proposed architecture adaptively performs graceful tradeoffs between power consumption and compression quality. The tradeoffs are considered to be graceful in that the proposed architecture is scalable with changing conditions and the compression quality is slightly degraded as the available energy is depleted. The key to such tradeoffs lies in a content-based subsample algorithm, first proposed in this paper. As the available energy decreases, the algorithm raises the subsample rate for maximizing the battery lifetime. Differently from the existing subsample algorithms, the content-based algorithm first extracts edge pixels from a macro-block and then subsamples the remaining low-frequency part. By doing so, we can alleviate the aliasing problem and, thus, limit the quality degradation as the subsample rate increases. Given a power consumption mode, the proposed architecture first performs edge extraction to generate a turn-off mask and then uses the turn-off mask to reduce the switch activities of processing elements (PEs) in a semi-systolic array. The reduction of switch activities results in significant power consumption savings. To achieve a high degree of scalability and qualified power-awareness, we use an adaptive control mechanism to set the threshold value for edge determination and make the reduction of switch activities rather stationary. As shown by experimental results, the architecture can dynamically operate in different power consumption modes with little quality degradation according to the remaining capacity of the battery pack while the power overhead of edge extraction is kept under 0.8%.

Keywords: motion estimation, image processing, VLSI architecture, video compression, power-aware system

1. INTRODUCTION

Motion estimation (ME) is widely recognized as the most critical part of many video compression applications, such as the MPEG standards and H.26x [1], since it tends to dominate the computational and hence power requirements. With increasing demand for battery-powered multimedia devices, an ME architecture that is flexible in both power consumption and compression quality is highly desirable. This requirement is driven by the user-centric perspective [2]. Basically, users have two views on using portable devices. Sometimes, users want extremely high video quality at the cost of reduced battery lifetime. At other times, users want acceptable quality with extended battery lifetime. This paper, therefore, presents a novel power-aware ME architecture that uses a content-based subsample algorithm, which can adaptively perform tradeoffs between power consumption and compression quality as the battery status changes. The proposed architecture is driven by a content-based subsample algorithm that allows the architecture to work in different power consumption modes with acceptable quality degradation. Since the control mechanism and data sequences in different power consumption modes are the same in the architecture, the power-aware algorithm can switch power consumption modes very smoothly on the fly. The block diagram shown in Fig. 1 illustrates a typical application of the proposed power-aware ME architecture. The host processor monitors the remaining capacity of the battery pack and switches power consumption modes. According to the power mode, the power-aware architecture sets the subsample rate and calculates the motion vector (MV) for motion compensation. Note that most portable multimedia devices, in practice, have a battery monitor unit and power management subroutines. The host processor and battery monitor unit should therefore not be considered overhead of using the power-aware architecture.

Received February 9, 2004; revised July 6, 2004; accepted July 27, 2004.

Communicated by Pau-Choo Chung.

* This work was supported in part by the National Science Council of Taiwan, R.O.C., under grant No. NSC

Fig. 1. The system block diagram of a portable, battery-powered multimedia device.

Many published papers have presented efficient algorithms for VLSI implementation of motion estimation, based on either high performance or low power design. However, most of them cannot dynamically adapt the compression quality to different power consumption modes. Among these proposed algorithms, the Full-Search Block-Matching (FSBM) algorithm with the Sum of Absolute Difference (SAD) criterion is the most popular approach to motion estimation because of its good quality. It is particularly attractive when extremely high quality is required. Many types of architectures have been proposed for the implementation of FSBM algorithms [3-6]. However, they require a huge number of comparison/difference operations and result in a large computation load and high power consumption. To reduce the computational complexity of FSBM, researchers have proposed various fast algorithms. They either reduce the number of search steps [7-12] or simplify the calculation of the error criterion [13-16]. By combining step-reduction and criterion-simplification, some proposed two-phase algorithms balance the performance between complexity and quality [17-19]. They first use FSBM with a simplified matching criterion to generate candidate vectors and then select the best motion vector from among these candidates using the SAD criterion. These fast-search algorithms successfully improved the block matching speed while limiting the quality degradation, thus achieving low power implementation. However, a low power implementation is not necessarily a power-aware system in that a power-aware system should adaptively modify its behavior according to the change of the power/energy status and achieve a balance between quality and battery life [20]. The requirement of ME algorithms to be suitable for power-aware designs is a high degree of scalability in performance tradeoffs. Unfortunately, the fast algorithms mentioned above do not meet this requirement.

The authors in [21, 22] presented subsample algorithms that significantly reduce the computation cost with low quality degradation. The reduction of the computation cost implies a savings in power consumption. Since the power consumption can be reduced by simply increasing the subsample rate, the subsample algorithms have a high degree of scalability and are very suitable for power-aware ME architectures. However, applying subsample algorithms for power-aware architectures may suffer from the aliasing problem in the high frequency band. The aliasing problem degrades the compression quality rapidly as the subsample rate increases. To alleviate this problem, we extend traditional subsample algorithms to obtain a content-based algorithm, called the content-based subsample algorithm (CSA). In this algorithm, we first use edge extraction techniques to separate the high-frequency band from a macro-block and then subsample the low-frequency band only. By combining the edge pixels and subsample pixels, the algorithm generates a turn-on mask for the architecture to limit the switch activities of processing elements (PEs) in a semi-systolic array. By doing so, we can achieve significant power consumption savings and limit the quality degradation as the subsample rate increases. Because the number of high-frequency pixels varies with different video clips, we use an adaptive control mechanism to set a threshold value for edge determination and make the number of masked pixels stationary for a given power mode.

The CSA can be used in most existing ME architectures by turning off PEs according to the subsample rate. In this paper, we present a semi-systolic architecture with gated PEs. The proposed architecture shows that the CSA algorithm can dynamically alter the subsample rate as the power consumption mode changes. As shown by experimental results, the proposed architecture can work in different power consumption modes with acceptable and smooth quality degradation while keeping the power overhead of edge extraction under 0.8%.

The rest of the paper is organized as follows. In section 2, we introduce the background of the power-aware paradigm. Section 3 presents subsample algorithms in detail. Section 4 describes the proposed power-aware architecture and gives experimental results. Finally, in section 5, we draw conclusions of this work.

2. BACKGROUND

2.1 Battery Properties


linearly proportional to the output voltage. However, in practice, the behavior of a battery is less than ideal due to the variation in voltage and capacity. Two other important properties of batteries are the rate capacity effect and the recovery effect [23]. The first effect means that the capacity of a battery depends on the discharging rate, and the second one means that a battery with an intermittent load may have a larger capacity than one with a continuous load. Fig. 2 (a) illustrates the rate capacity effect by plotting the cell voltage under two different discharging loads as time advances. As shown by the curves, when the load is halved the battery life can be more than two times longer. Fig. 2 (b) shows the recovery effect, in which a reduction of the load causes a rise in the voltage. Therefore, one can extend the battery lifetime by gradually stepping down the power dissipation. The Intel SpeedStep technology, for instance, which is widely used in mobile CPUs, adopts the same strategy to extend the battery lifetime [24]. This technology changes the power consumption mode by scaling down the supplied voltage and operating frequency, hence degrading the performance in order to increase the battery lifetime.

(a) The rate capacity effect [25]. (b) The recovery effect. Fig. 2. Non-ideal battery properties.

From these two properties of batteries, we can learn two things. First, we can reduce the load to achieve a longer battery lifetime because halving the current can more than double the battery lifetime. Second, optimal performance can be achieved when the battery is fully charged because the battery capacity can be recovered later by reducing the load. These properties provide strong motivation for developing power-aware designs and explain the central requirement of a power-aware architecture: a high degree of scalability in energy-quality tradeoffs.

2.2 Power Model

One can consider the major power consumption of a CMOS gate $i$ as in Eq. (1), where $C_i$ is the output capacitance, $f_i$ is the operation frequency, $r_i(0 \leftrightarrow 1)$ is the switch activity of gate $i$, and $\alpha$ and $\kappa$ are constants:

$$P_{gate}^{i} = \alpha \cdot C_i \cdot f_i \cdot V_{DD}^2 = \kappa \cdot C_i \cdot r_i(0 \leftrightarrow 1). \tag{1}$$

For an execution unit $EU_j$ in a VLSI system, the power consumption can be computed using Eq. (2), where $N_{gate,j}$ is the gate count of $EU_j$:

$$P_{EU_j} = \kappa \sum_{i=1}^{N_{gate,j}} C_i^j \cdot r_i^j(0 \leftrightarrow 1). \tag{2}$$


After considering the activity of execution units, the total power consumption can be expressed as in Eq. (3) and approximated as in Eq. (5) by assuming that the switch activities are uniform within an execution unit; that is, $r_i^k(0 \leftrightarrow 1) = r^k(0 \leftrightarrow 1)$, $\forall\, r_i^k(0 \leftrightarrow 1)$. Since the average output capacitance of each execution unit ($C_{avg}^k$) is nearly the same as the average output capacitance of the total system ($C_{avg}$), the total power consumption can be approximated by Eq. (8). Therefore, we can obtain an approximate power estimation model as shown in Eq. (9), where $\varepsilon_{gp}$ is defined as the gate power coefficient. In this paper, we use the gate power coefficient as the unit for estimating power dissipation:

$$P_{total} = \sum_{\forall EU_j \in \text{inactive}} P_{EU_j} + \sum_{\forall EU_k \in \text{active}} P_{EU_k} \tag{3}$$

$$= \sum_{\forall EU_j \in \text{inactive}} \kappa \sum_{i=1}^{N_{gate,j}} C_i^j \cdot 0 + \sum_{\forall EU_k \in \text{active}} \kappa \sum_{i=1}^{N_{gate,k}} C_i^k \cdot r_i^k(0 \leftrightarrow 1) \tag{4}$$

$$\cong \kappa \sum_{\forall EU_k \in \text{active}} r^k(0 \leftrightarrow 1) \sum_{i=1}^{N_{gate,k}} C_i^k \tag{5}$$

$$= \kappa \sum_{\forall EU_k \in \text{active}} r^k(0 \leftrightarrow 1) \times \frac{\sum_{i=1}^{N_{gate,k}} C_i^k}{N_{gate,k}} \times N_{gate,k} \tag{6}$$

$$= \kappa \sum_{\forall EU_k \in \text{active}} r^k(0 \leftrightarrow 1) \times C_{avg}^k \times N_{gate,k} \tag{7}$$

$$\cong (\kappa \cdot C_{avg}) \sum_{\forall EU_k \in \text{active}} r^k(0 \leftrightarrow 1) \times N_{gate,k} \tag{8}$$

$$= \varepsilon_{gp} \sum_{\forall EU_k \in \text{active}} r^k(0 \leftrightarrow 1) \times N_{gate,k}. \tag{9}$$
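As a rough illustration of Eq. (9), the following Python sketch sums switch activity times gate count over the active execution units only. The names `ExecutionUnit` and `total_power`, and all numbers in the example, are hypothetical and chosen purely for illustration; the result is expressed in units of the gate power coefficient.

```python
from dataclasses import dataclass


@dataclass
class ExecutionUnit:
    gate_count: int          # N_gate,k
    switch_activity: float   # r^k(0<->1), assumed uniform within the unit
    active: bool = True      # inactive units contribute no switching power


def total_power(units, gate_power_coeff=1.0):
    """Eq. (9): eps_gp * sum over active units of r^k * N_gate,k."""
    return gate_power_coeff * sum(
        u.switch_activity * u.gate_count for u in units if u.active
    )


# Hypothetical example: four identical units, two of them turned off.
units = [
    ExecutionUnit(gate_count=1000, switch_activity=0.5),
    ExecutionUnit(gate_count=1000, switch_activity=0.5),
    ExecutionUnit(gate_count=1000, switch_activity=0.5, active=False),
    ExecutionUnit(gate_count=1000, switch_activity=0.5, active=False),
]
print(total_power(units))  # 1000.0
```

Under this model, turning off half of the identical units halves the estimated power, which is the property the subsample algorithms in Section 3 rely on.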

3. SUBSAMPLE ALGORITHMS

3.1 Generic Subsample Algorithm

Many published papers have presented efficient algorithms for VLSI implementation of motion estimation [1, 3, 5, 6, 15, 19]. The FSBM algorithm with the SAD criterion is the most popular approach to motion estimation because of its good quality and regular data path. The algorithm uses Eqs. (10) and (11) to compare each current macro-block (CMB) with all the reference macro-blocks (RMB) in the search area to determine the best match, and the motion vector is found in Eq. (11):

$$SAD(u, v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| S(i+u, j+v) - R(i, j) \right|, \quad -p \le u, v \le p-1. \tag{10}$$


The motion vector is found using Eq. (11):

$$MV = (u, v)\,\big|_{\min_{-p \le u, v \le p-1} SAD(u, v)}, \tag{11}$$

where the macro-block size is N-by-N and $R(i, j)$ is the luminance value at $(i, j)$ of the current macro-block (CMB). $S(i+u, j+v)$ is the luminance value at $(i, j)$ of the reference macro-block (RMB), which is offset by $(u, v)$ from the CMB in the 2p-by-2p search area.
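Eqs. (10) and (11) can be sketched in a few lines of Python. This is a minimal software illustration, not the hardware data path: the function names `sad` and `full_search`, the `top`/`left` block position arguments, and the toy frame are assumptions made for the example, and the reference frame is assumed large enough that every candidate offset stays in bounds.

```python
def sad(cur, ref, top, left, u, v):
    """Eq. (10): sum of |S(i+u, j+v) - R(i, j)| over the N-by-N macro-block.

    cur is the current macro-block R; ref is the reference frame holding S;
    (top, left) is the block position in the frame (hypothetical argument).
    """
    n = len(cur)
    return sum(
        abs(ref[top + i + u][left + j + v] - cur[i][j])
        for i in range(n)
        for j in range(n)
    )


def full_search(cur, ref, top, left, p):
    """Eq. (11): offset (u, v) in [-p, p-1] with the smallest SAD."""
    best_mv, best_sad = None, float("inf")
    for u in range(-p, p):
        for v in range(-p, p):
            cost = sad(cur, ref, top, left, u, v)
            if cost < best_sad:
                best_mv, best_sad = (u, v), cost
    return best_mv, best_sad


# Toy 8x8 reference frame with unique pixel values.
ref = [[r * 8 + c for c in range(8)] for r in range(8)]
top, left, p = 2, 2, 2
# Current 2x2 block copied from the reference at offset (1, -2).
cur = [[ref[top + 1 + i][left - 2 + j] for j in range(2)] for i in range(2)]
print(full_search(cur, ref, top, left, p))  # ((1, -2), 0)
```

The doubly nested search over $(u, v)$ with an $N^2$-term sum inside is what makes FSBM expensive, and it is this inner sum that the subsample masks below thin out.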

Much research has addressed subsample techniques for motion estimation in order to reduce the computation load of FSBM [21, 22]. Liu and Zaccarin, pioneers in developing subsample algorithms, applied 4-to-1 subsampling to FSBM and significantly reduced the computational load. As shown by simulation results, the 4-to-1 subsample algorithm reduces the computational load significantly while keeping the quality similar to that with exhaustive search [21]. Here, we will present a generic subsample algorithm in which the subsample rate ranges from 4-to-1 to 1-to-1. The generic subsample algorithm uses Eq. (12) as a matching criterion, called the subsample sum of absolute difference (SSAD), where $SM_{8:m}$ is the subsample mask for the subsample rate 8-to-m as shown in Eq. (13):

$$SSAD_{8:m}(u, v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| SM_{8:m}(i, j) \cdot \left[ S(i+u, j+v) - R(i, j) \right] \right|, \quad \text{for } -p \le u, v \le p-1, \tag{12}$$

$$SM_{8:m}(i, j) = BM_{8:m}(i \bmod 4,\; j \bmod 4). \tag{13}$$

The subsample mask $SM_{8:m}$ is generated from a basic mask as shown in Eq. (14):

$$BM_{8:m} = \begin{bmatrix} u(m-2) & u(m-5) & u(m-2) & u(m-6) \\ u(m-3) & u(m-7) & u(m-4) & u(m-8) \\ u(m-2) & u(m-5) & u(m-2) & u(m-6) \\ u(m-3) & u(m-7) & u(m-4) & u(m-8) \end{bmatrix}, \tag{14}$$

where $u(n)$ is a step function; that is,

$$u(n) = \begin{cases} 1, & \text{for } n \ge 0 \\ 0, & \text{for } n < 0. \end{cases}$$

For example, consider the subsample rate 8-to-6. The subsample mask $SM_{8:6}$ can be expressed as in Eq. (15) and is illustrated in Fig. 3:

$$SM_{8:6} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \end{bmatrix}. \tag{15}$$


Fig. 3. The subsample mask of the subsample rate 8-to-6.
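The mask construction of Eqs. (13) and (14) can be checked with a short Python sketch. The function names are hypothetical, and the entries of the basic mask follow the reading of Eq. (14) that reproduces the 8-to-6 example of Eq. (15); treat that reading as an assumption.

```python
def step(n):
    """Step function u(n): 1 for n >= 0, otherwise 0."""
    return 1 if n >= 0 else 0


def basic_mask(m):
    """4x4 basic mask BM_{8:m}, entries as read from Eq. (14)."""
    row_a = [step(m - 2), step(m - 5), step(m - 2), step(m - 6)]
    row_b = [step(m - 3), step(m - 7), step(m - 4), step(m - 8)]
    return [row_a, row_b, list(row_a), list(row_b)]


def subsample_mask(m, n):
    """Eq. (13): tile the basic mask over an n-by-n macro-block."""
    bm = basic_mask(m)
    return [[bm[i % 4][j % 4] for j in range(n)] for i in range(n)]


# Print the 8-to-6 mask for an 8x8 block; rows alternate between
# all ones and 1 0 1 0 ..., as in Eq. (15).
for row in subsample_mask(6, 8):
    print(" ".join(str(bit) for bit in row))
```

With this reading, an 8x8 mask for rate 8-to-m contains exactly 8m ones for every m from 2 to 8, so the fraction of retained pixels is m/8 as the rate name suggests.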

Given a subsample mask, the computational cost of the SSAD calculation can be lower than that of the SAD calculation. Since a reduction of computational cost implies reduced power consumption, the generic subsample algorithm allows the system power to scale with the changing subsample rate. The higher the subsample rate, the greater the number of inactive execution units (EUs). Accordingly, the power consumption of the system is proportional to the inverse of the subsample rate. Due to its flexibility in achieving an energy-quality tradeoff, the generic subsample algorithm is suitable for implementing power-aware architectures. However, the algorithm suffers from the aliasing problem in the high frequency band. The aliasing problem degrades the MV quality and results in considerable quality degradation when the high-frequency band is corrupted.

3.2 Content-Based Subsample Algorithm

As mentioned above, the generic subsample algorithm suffers from the aliasing problem due to the high subsample rate, leading to considerable quality degradation because the high frequency band is corrupted. To alleviate this problem, we propose using the content-based subsample algorithm (CSA), which only subsamples the low-frequency band. The CSA procedure is shown in Fig. 4. We first use edge extraction to separate high-frequency pixels (or edge pixels) from a macro-block and then subsample the remaining pixels (or low-frequency pixels). The determination of edge pixels starts with gradient filtering. Three popular gradient filters [26] were used here to execute the content-based algorithm; they are the high-pass gradient filter, the Sobel gradient filter, and the morphological gradient filter. Eqs. (16) to (18) show the calculations of the three gradient filters:

High-Pass Gradient Filter:

$$G_{hpf}(i, j) = \left| MF(HPF_{mask}, R)(i, j) \right|, \tag{16}$$

where

$$HPF_{mask} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}.$$


// frame: t
Input current and reference frames, W × H;
for (y = 0; y < W/N; y++) {
  for (x = 0; x < H/N; x++) {
    Perform gradient filtering;
    Calculate the edge threshold:
      threshold = m1_t(x, y) ⋅ max{G(i, j)} + (1 − m1_t(x, y)) ⋅ min{G(i, j)};
    Determine edge pixels and edge mask;
    Generate content-based subsample mask (CSM);
    csm_cnt = number of 1's in CSM;
    // update threshold parameter for the next frame
    m1_{t+1}(x, y) = m1_t(x, y) + Kp ⋅ (csm_cnt − trg_cnt);
    if (m1_{t+1}(x, y) < 0) {m1_{t+1}(x, y) = 0};
    if (m1_{t+1}(x, y) > 1) {m1_{t+1}(x, y) = 1};
    // find MV
    SSADmin(x, y) = ∞;
    for (u = −p; u < p; u++) {
      for (v = −p; v < p; v++) {
        SSAD(u, v) = Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} |CSM(i, j) ⋅ (S(i + u, j + v) − R(i, j))|;
        if SSADmin(x, y) > SSAD(u, v)
          {SSADmin(x, y) = SSAD(u, v); MV(x, y) = (u, v);}
      } // for loop index v
    } // for loop index u
  } // for loop index x
} // for loop index y

Fig. 4. The content-based subsample algorithm.

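Sobel Gradient Filter:

$$G_{sobel}(i, j) = \left| MF(SX_{mask}, R)(i, j) \right| + \left| MF(SY_{mask}, R)(i, j) \right|, \tag{17}$$

where

$$SX_{mask} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \quad \text{and} \quad SY_{mask} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}.$$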

Morphological Gradient Filter:

$$G_{morphological} = (R \oplus B) - (R \ominus B), \tag{18}$$

where

$$B = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},$$

and the operations “$\oplus$” and “$\ominus$” denote morphological dilation and erosion.

In Eqs. (16) and (17), the MF(⋅) function is the mask filter operation as shown in Eq. (19):


$$MF(M, R)(i, j) = \sum_{p=-1}^{1} \sum_{q=-1}^{1} M(p+1, q+1) \cdot R(i+p, j+q), \tag{19}$$

where $M$ is a 3-by-3 mask and $R(i, j)$ is the luminance value at $(i, j)$.
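The mask filter of Eq. (19) and the high-pass and Sobel gradients of Eqs. (16) and (17) can be sketched in Python as follows. The function names are hypothetical, and the handling of border pixels (clamping coordinates to the block) is an assumption, since the text does not state how borders are treated.

```python
HPF_MASK = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
SX_MASK = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
SY_MASK = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def mask_filter(mask, block, i, j):
    """Eq. (19): sum over p, q in {-1, 0, 1} of M(p+1, q+1) * R(i+p, j+q).

    Out-of-range coordinates are clamped to the block border (assumption).
    """
    rows, cols = len(block), len(block[0])
    total = 0
    for p in (-1, 0, 1):
        for q in (-1, 0, 1):
            r = min(max(i + p, 0), rows - 1)
            c = min(max(j + q, 0), cols - 1)
            total += mask[p + 1][q + 1] * block[r][c]
    return total


def hpf_gradient(block):
    """Eq. (16): absolute response of the high-pass mask at every pixel."""
    return [[abs(mask_filter(HPF_MASK, block, i, j))
             for j in range(len(block[0]))] for i in range(len(block))]


def sobel_gradient(block):
    """Eq. (17): sum of absolute responses of the two Sobel masks."""
    return [[abs(mask_filter(SX_MASK, block, i, j))
             + abs(mask_filter(SY_MASK, block, i, j))
             for j in range(len(block[0]))] for i in range(len(block))]


# Toy 4x4 block with a vertical luminance edge between columns 1 and 2.
edge_block = [[10, 10, 200, 200] for _ in range(4)]
print(hpf_gradient(edge_block)[1][1], sobel_gradient(edge_block)[1][1])  # 570 760
```

Both filters respond strongly next to the luminance step and return zero on a flat block, which is what lets the threshold step that follows separate edge pixels from background pixels.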

After obtaining the gradients, G, instead of using a constant threshold, we use a floating threshold to determine the edge pixels of the CMB. The floating threshold makes edge extraction more robust when video content varies. Eq. (20) shows the calculation of the floating threshold:

$$threshold = m1_t(x, y) \cdot \max\{G(i, j)\} + (1 - m1_t(x, y)) \cdot \min\{G(i, j)\}, \quad \text{for } 0 \le m1_t \le 1, \tag{20}$$

where $m1_t(x, y)$ is the threshold parameter of macro-block $(x, y)$ in the t-th frame.

Following the threshold setting step, the algorithm uses the threshold value to pick the edge pixels and produce the edge mask as shown in Eq. (21):

$$EdgeMask(i, j) = \begin{cases} 1, & \text{for } G(i, j) \ge threshold \\ 0, & \text{otherwise.} \end{cases} \tag{21}$$
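The floating threshold and the edge mask of Eq. (21) amount to a linear interpolation between the smallest and largest gradient in the block followed by a comparison. A minimal Python sketch, with hypothetical function names and a toy 2x2 gradient block:

```python
def floating_threshold(grad, m1):
    """Interpolate between min and max gradient with parameter 0 <= m1 <= 1."""
    values = [g for row in grad for g in row]
    return m1 * max(values) + (1 - m1) * min(values)


def edge_mask(grad, m1):
    """Eq. (21): 1 where the gradient reaches the threshold, else 0."""
    threshold = floating_threshold(grad, m1)
    return [[1 if g >= threshold else 0 for g in row] for row in grad]


grad = [[0, 10], [50, 100]]
print(floating_threshold(grad, 0.5))  # 50.0
print(edge_mask(grad, 0.5))           # [[0, 0], [1, 1]]
print(edge_mask(grad, 0.0))           # [[1, 1], [1, 1]]
```

Raising the parameter toward 1 moves the threshold toward the maximum gradient and therefore marks fewer pixels as edges; this is the knob the adaptive control mechanism turns.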

Finally, the content-based subsample mask (CSM) is generated by merging the edge mask and the subsample mask, as shown in Eq. (22). In Eq. (22), the operator ∨ denotes a logical OR operation. According to the calculation of the CSM, the subsample rate in the CSA (CSR), denoted as $R_s$, is $N^2$-to-csm_cnt, where csm_cnt is the number of 1's in the CSM and $N^2$ is the macro-block size. Fig. 5 shows an example of a CSM where the subsample rate is 64-to-27.

$$CSM(i, j) = SM_{8:m}(i, j) \vee EdgeMask(i, j), \quad 0 \le i, j \le N - 1. \tag{22}$$

[Fig. 5: an example content-based subsample mask (CSM), formed from the high-frequency band (edge pixels) and the low-frequency band (background pixels).]
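The merge of Eq. (22) and the resulting subsample rate can be illustrated with a small Python sketch. The function names and the 4x4 toy masks are hypothetical; the regular mask used here is a 4-to-1 pattern and the edge mask marks a single vertical edge.

```python
def content_based_mask(subsample_mask, edge_mask):
    """Eq. (22): element-wise logical OR of the two binary masks."""
    return [[s | e for s, e in zip(s_row, e_row)]
            for s_row, e_row in zip(subsample_mask, edge_mask)]


def count_ones(mask):
    """csm_cnt: number of 1's in the mask."""
    return sum(sum(row) for row in mask)


sm = [[1, 0, 1, 0],
      [0, 0, 0, 0],
      [1, 0, 1, 0],
      [0, 0, 0, 0]]
edges = [[0, 1, 0, 0] for _ in range(4)]

csm = content_based_mask(sm, edges)
csm_cnt = count_ones(csm)
print(f"{len(csm) ** 2}-to-{csm_cnt}")  # 16-to-8
```

Because edge pixels are always kept, the effective rate depends on the block content: the same regular mask yields a different csm_cnt on a block with more or fewer edges.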


Once the CSM is generated, the algorithm can then determine the motion vection (MV) with the content subsample sum of the absolute difference (CSSAD) criterion. The CSSAD criterion is similar to SSAD mentioned in section 3.1 and shown in Eq. (23):

$$CSSAD_{8:m}(u, v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| CSM_{8:m}(i, j) \cdot [S(i+u, j+v) - R(i, j)] \right|, \quad -p \le u, v \le p - 1. \tag{23}$$

The simulation results show that the CSA can significantly reduce the computational complexity with little quality degradation. However, a non-stationary problem arises when the CSA is implemented in a power-aware architecture if the designer uses constant threshold parameters $m_1^t$ and statically sets the floating threshold for a given power mode. Since different video clips with the same threshold parameters have different subsample rates, setting the threshold value without considering the content variation of the video clip makes the subsample rate non-stationary; that is, the power consumption does not converge within a narrow range for a given power mode. This divergence of power consumption results in poor power-awareness. To solve this non-stationary problem, we use an adaptive control mechanism that adjusts the threshold parameters so that the subsample rate becomes stationary. The mechanism is a run-time process that adjusts the threshold parameters according to the difference between the current subsample rate and the desired (target) subsample rate.
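The CSSAD criterion of Eq. (23) can be sketched in Python as a masked sum of absolute differences followed by a full search over the candidate vectors. The storage layout of the search area and the 2-by-2 toy data are assumptions of this sketch, not details taken from the paper.

```python
def cssad(csm, current, reference, u, v, p):
    """Eq. (23): masked sum of absolute differences for candidate (u, v).

    `reference` holds the search area so that candidate (u, v) starts at
    index (u + p, v + p); this layout is an assumption of the sketch.
    """
    n = len(current)
    total = 0
    for i in range(n):
        for j in range(n):
            if csm[i][j]:  # masked-out pixels contribute nothing
                total += abs(reference[i + u + p][j + v + p] - current[i][j])
    return total


def best_motion_vector(csm, current, reference, p):
    """Full search over -p <= u, v <= p - 1, keeping the smallest CSSAD."""
    candidates = (
        (cssad(csm, current, reference, u, v, p), (u, v))
        for u in range(-p, p)
        for v in range(-p, p)
    )
    return min(candidates)[1]


# Hypothetical toy data: N = 2, p = 1, and the block reappears at (u, v) = (-1, 0).
current = [[5, 6], [7, 8]]
reference = [
    [0, 5, 6],
    [0, 7, 8],
    [0, 0, 0],
]
full_mask = [[1, 1], [1, 1]]
print(best_motion_vector(full_mask, current, reference, p=1))  # (-1, 0)
```

Since the CSM entries are 0 or 1, skipping masked-out pixels is equivalent to multiplying the difference by the mask as written in Eq. (23).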

Fig. 6. A block diagram of the edge-extraction unit with an adaptive control mechanism.

Fig. 6 shows a block diagram of the adaptive control mechanism. Given the battery status, the host processor sets the power mode and the corresponding target subsample rate. The target subsample rate is $N^2$-to-trg_cnt, where trg_cnt is the target number of 1's in the CSM. The controller then recursively updates the threshold parameter, $m_1^{t+1}(x, y)$, based on the current $m_1^t(x, y)$ and the difference between csm_cnt and trg_cnt, as shown in Eq. (24):

$$m_1^{t+1}(x, y) = m_1^t(x, y) + K_p \cdot (csm\_cnt - trg\_cnt);$$

$$\text{if } (m_1^{t+1}(x, y) < 0) \; \{ m_1^{t+1}(x, y) = 0 \}; \tag{24}$$

where $m_1^{t+1}(x, y)$ is the threshold parameter of macro-block (x, y) in the (t + 1)-th frame and Kp is the control parameter. The control parameter Kp affects the settling time and the steady-state error of the subsample rate.
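A minimal sketch of the update rule in Eq. (24) follows. It implements only what the equation states: a proportional correction and a clamp at zero. The numeric values, including the small Kp used here, are hypothetical and say nothing about how the counts are scaled in the actual controller.

```python
def update_threshold_parameter(m, csm_cnt, trg_cnt, kp):
    """Eq. (24): proportional update of the threshold parameter, clamped at 0.

    When too many pixels are selected (csm_cnt > trg_cnt), the parameter
    grows, which raises the floating threshold for the next frame.
    """
    m_next = m + kp * (csm_cnt - trg_cnt)
    return max(m_next, 0.0)


# Hypothetical values for illustration only.
print(update_threshold_parameter(0.5, csm_cnt=130, trg_cnt=128, kp=0.01))
print(update_threshold_parameter(0.1, csm_cnt=100, trg_cnt=128, kp=0.01))  # 0.0
```

The first call raises the parameter because two more pixels than the target were selected; the second call would go negative and is therefore clamped to 0.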

3.3 Simulation Results

Figs. 7 and 8 show the simulation results for four 352-by-288 MPEG clips with the parameters N = 16 and p = 32. The control parameter Kp was set to 0.3. The target subsample rates were set to (4:1), (8:3), (2:1), (8:5), (4:3), (8:7), and (1:1); that is, the target subsample pixel counts were 64, 96, 128, 160, 192, 224, and 256, respectively. Note that the power consumption is proportional to the target subsample pixel count; thus, the figures can also be read as charts of power versus PSNR. The dashed lines indicate the results obtained using the generic subsample algorithm, and the solid lines indicate the results obtained using the content-based subsample algorithm with the three gradient filters. As shown by the results, the quality degradation due to the content-based algorithm was less than that due to the generic subsample algorithm, and the type of gradient filter did not significantly affect the performance of the proposed algorithm. In addition, the adaptive control mechanism kept the subsample rate quite stationary. Tables 1 to 3 show the CSR errors for four 40-frame clips. From the results shown in the tables, the

[Figs. 7 and 8: PSNR (dB) versus subsample pixels of a macro-block ($R_s^{-1} \cdot N^2$, from 64 to 256). Legend: + regular; * morphological content-based; ♦ high-pass content-based; o Sobel content-based.]

Fig. 7. The quality degradation of the weather clip.

Fig. 8. The quality degradation of the children clip.

Table 1. The CSR error of the content-based subsample algorithm (where the control parameter Kp = 0.3).

Video clip    Weather                  News                     Table-Tennis             Children
Target CSR    Average CSR  CSR Error   Average CSR  CSR Error   Average CSR  CSR Error   Average CSR  CSR Error
96            95.416       0.61%       95.366       0.66%       95.398       0.63%       95.490       0.53%
128           127.521      0.37%       127.720      0.22%       127.467      0.42%       127.754      0.19%
160           158.678      0.83%       159.795      0.13%       159.533      0.29%       159.430      0.36%
192           188.037      2.06%       191.313      0.36%       191.429      0.30%       189.632      1.23%
224           211.274      5.68%       221.224      1.24%       222.893      0.49%       216.001      3.57%

The edge-extraction unit uses the high-pass gradient filter.

Video clip    Weather                  News                     Table-Tennis             Children
Target CSR    Average CSR  CSR Error   Average CSR  CSR Error   Average CSR  CSR Error   Average CSR  CSR Error
96            94.985       1.06%       94.728       1.33%       95.361       0.67%       95.114       0.92%
128           127.264      0.58%       127.688      0.24%       127.445      0.43%       127.304      0.54%
160           157.104      1.81%       159.766      0.15%       159.415      0.37%       159.702      0.381%
192           184.295      4.01%       191.183      0.43%       191.249      0.39%       188.451      1.85%
224           207.612      7.32%       220.949      1.36%       222.685      0.59%       215.536      3.78%

The edge-extraction unit uses the Sobel gradient filter.

Video clip    Weather                  News                     Table-Tennis             Children
Target CSR    Average CSR  CSR Error   Average CSR  CSR Error   Average CSR  CSR Error   Average CSR  CSR Error
96            97.072       1.12%       95.547       0.47%       96.040       0.04%       95.959       0.04%
128           127.626      0.29%       127.718      0.22%       127.120      0.69%       127.267      0.57%
160           157.401      1.62%       159.659      0.21%       158.986      0.63%       158.885      0.70%
192           185.013      3.64%       191.477      0.27%       190.765      0.64%       189.279      1.42%
224           209.300      6.56%       222.434      0.70%       222.405      0.71%       218.118      2.63%

The edge-extraction unit uses the morphological gradient filter.

average CSR error was as low as 1.12%, and the CSR error variance was as low as 0.00024. Because the subsample rate can be kept nearly stationary for a given target subsample rate and power mode, we conclude that the proposed algorithm has good power-awareness and that the CSA can be applied in a power-aware architecture. The results also show that the selected value of Kp was suitable for controlling the threshold parameters. The following section further addresses the selection of the control parameter.
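The arithmetic behind the target pixel counts and the CSR errors can be checked with a short sketch. The error definition used here, the relative deviation of the average count from the target, is inferred from the values in Table 1 rather than stated in the text, and the function names are illustrative.

```python
def target_pixel_count(n, rate):
    """Pixels kept per N-by-N macro-block for a subsample rate written 'a:b'."""
    a, b = (int(x) for x in rate.split(":"))
    return n * n * b // a


def csr_error(average_cnt, target_cnt):
    """Relative deviation of the average subsample count from its target."""
    return abs(average_cnt - target_cnt) / target_cnt


rates = ["4:1", "8:3", "2:1", "8:5", "4:3", "8:7", "1:1"]
print([target_pixel_count(16, r) for r in rates])
# [64, 96, 128, 160, 192, 224, 256]

# Weather clip with the high-pass filter, first and last rows of Table 1.
print(round(csr_error(95.416, 96) * 100, 2))    # 0.61
print(round(csr_error(211.274, 224) * 100, 2))  # 5.68
```

Under this definition the computed values match the first and last Weather entries of Table 1 to two decimal places.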

3.4 Selection of the Control Parameter

As mentioned in sections 3.2 and 3.3, we use an adaptive control mechanism to obtain a stationary subsample rate while keeping the quality acceptable. However, if the control parameter is not properly selected, the settling time will be too long to achieve real-time switching, and the CSR error will be large enough to make the setting of the power consumption mode inaccurate and the power-awareness worse. The control parameter, Kp, in Fig. 6 is the major factor affecting the settling time and the CSR error. Four 30-frame video clips were simulated with an initial subsample rate of 1:1 and a target subsample rate of 8:5; the effects of the Kp selection are shown in Figs. 9 and 10. The higher the value of Kp, the shorter the settling time but the worse the stability of the CSR. As shown by the results, the suitable range of Kp was from 0.1 to 0.5.

[Figs. 9 and 10: step responses of the subsample pixel count ($R_s^{-1} \cdot N^2$) versus frame number (0 to 30) for Kp = 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0, and 1.5.]

Fig. 9. The step response for varying Kp in the case of the weather clip.

Fig. 10. The step response for varying Kp in the case of the children clip.

4. THE POWER-AWARE ARCHITECTURE

4.1 System Architecture

Based on the content-based subsample algorithm, we present a semi-systolic architecture in Fig. 11, which builds on existing architectures such as that in [5]. The architecture contains an edge-extraction unit (EXU), an array of processing elements (PEs), a parallel adder tree (PAT), a shift register array (SRA), and a motion-vector selector (MVS). Given the power consumption mode, the EXU extracts high-frequency (or edge) pixels from the current macro-block (CMB) and generates 0-1 content-based subsample masks (CSM) that the PE array uses to disable or enable PEs. The PE array, shown in Fig. 12, accumulates absolute pixel differences column by column, while the parallel adder tree sums all the results to generate the value of the CSSAD. The MVS then performs a compare-and-select operation to select the best motion vector.


Fig. 12. The architecture of the PE array and shift register array.

Using the semi-systolic architecture with the content-based subsample algorithm, the design dynamically disables some processing elements to reduce the power consumption, since we assume that the power consumption is mainly determined by the switching activity of the system [15]. Edge extraction is performed first, and a threshold is then set as the criterion for enabling or disabling processing elements, thus dynamically changing the switching activity of the system to reduce the power consumption. Fig. 13 shows the PE structure and indicates how the CSM disables or enables processing elements. The CSM disables a PE by using the block element (BE), implemented by means of AND gates. The BEs can nullify the input signals of the data path, which consists of the absolute difference unit (|a − b|) and the adder unit. When a PE is disabled during an MV searching iteration, the circuits in the PE remain idle until the next iteration starts; thus, the consumption of transient power is reduced.
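The gating behaviour described above can be modelled functionally in a few lines. This is only a behavioural sketch under the assumption of 8-bit luminance inputs; it shows that a disabled PE contributes zero to the column sum, but it cannot show the actual power saving, which comes from the gated inputs not toggling in hardware.

```python
def block_element(value, enable):
    """AND-gate blocking: pass an 8-bit value when enabled, else force 0."""
    return value & (0xFF if enable else 0x00)


def processing_element(current, reference, enable, partial_sum):
    """One PE step: the gated |a - b| is added to the incoming partial sum."""
    a = block_element(current, enable)
    b = block_element(reference, enable)
    return partial_sum + abs(a - b)


def accumulate_column(current_col, reference_col, csm_col):
    """Chain PEs down one column, as the PE array accumulates differences."""
    total = 0
    for cur, ref, enable in zip(current_col, reference_col, csm_col):
        total = processing_element(cur, ref, enable, total)
    return total


# Hypothetical column of four pixels; the second PE is disabled by the CSM.
print(accumulate_column([10, 20, 30, 40], [12, 25, 30, 50], [1, 0, 1, 1]))  # 12
```

With both inputs forced to zero, the absolute-difference unit of a disabled PE outputs zero, so the partial sum passes through unchanged.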

Fig. 13. The structure of a PE.

The edge-extraction unit contains two blocks: a gradient filter and a CSM generator. The gradient filter is implemented based on one of Eqs. (16) to (18). The proposed architecture only requires that a single gradient filter be embedded. However, we show all three implementations in order to estimate the overheads and compare them. Figs. 14 to 16 illustrate the implementations of the high-pass, Sobel, and morphological gradient filters, respectively. Multiplexers are used to prevent boundary errors from occurring at the border pixels of the CMB. The black dot in each multiplexer indicates the switching path used when processing a border pixel. Fig. 17 shows the structure of the CSM generator. The CSM generator first determines the threshold according to the gradient values and the power mode, and then generates the CSM by OR-merging the regular subsample pattern and the edge pattern, as shown in Eqs. (21) and (22).

Fig. 14. The architecture of the high-pass gradient filter.

Fig. 15. The architecture of the Sobel gradient filter.


Fig. 17. The architecture of edge-determination and the CSM generator.

4.2 Execution of Power-Aware ME

Power-aware motion estimation is performed in five phases: the initial CMB phase, the initial RMB phase, the SAD calculation phase, the filtering phase, and the edge-determination phase. The initial CMB phase loads the CMB data into the PE array, while the initial RMB phase fills the PE array with RMB data to start the SAD calculation. As shown in Fig. 18, the initial CMB phase and the initial RMB phase are executed in parallel with edge extraction; thus, the timing overhead of edge extraction is hidden by the initial phases. For p > 8, the timing overhead of edge extraction is zero.

Fig. 18. The execution phases of the power-aware architecture.

4.3 Experimental Results

Table 4 shows the synthesis results obtained using the TSMC 1P4M 0.35 um cell library, where the symbol Rs denotes the content-based subsample rate and εgp is the gate power coefficient defined in Eq. (9). Compared with the general semi-systolic architecture [5], the edge-extraction unit (EXU) of the proposed architecture is the major overhead of the power-aware function. As mentioned above, only one of the three gradient filters is used to implement the EXU. According to the synthesis results, the gate counts of the three gradient filters were 595.33, 793.77, and 727.63, respectively. The variation among these values is very small compared to the overall gate count of the EXU. For instance, the gate count of the EXU with the high-pass filter was 14745, which is much larger than the variation. This means that the selection of a gradient filter does not affect the overhead estimation very much. Therefore, we used the high-pass filter to estimate the performance overhead caused by the EXU. From Table 4, one can see that the area overhead of the EXU was 7.68%, while the worst-case power overhead, which occurs when the subsample rate is 4-to-1, was only 0.8% for motion estimation with N = 16 and p = 32.

Table 4. Power analysis of the power-aware architecture.

EU_i                       PE array (AD + Adder)        PE array (Other)   SRA            PAT + MVS      EXU
Gate count G_i             117,760                      58,708             44,640         1,800          17,121
r_i(0↔1)                   4p^2 Rs^-1 = 4096 Rs^-1      4p^2 = 4096        4p^2 = 4096    4p^2 = 4096    N^2 = 256
P_i consumption            4.8e8 ⋅ Rs^-1                2.4e8              1.8e8          7.37e6         4.38e6
P_all consumption (εgp)    4.8e8 ⋅ Rs^-1 + 4.3e8

N = 16 and p = 32. Cell library: TSMC 0.35 um process.
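The overhead figures quoted from Table 4 can be reproduced with a short calculation. The sketch assumes that each unit's power is its gate count multiplied by its transition count, which is consistent with the rows of Table 4 but is inferred here rather than quoted from Eq. (9), and that the overhead is measured relative to the architecture without the EXU. The dictionary keys are illustrative names.

```python
# Gate counts from Table 4.
GATE_COUNT = {
    "pe_ad_adder": 117_760,
    "pe_other": 58_708,
    "sra": 44_640,
    "pat_mvs": 1_800,
    "exu": 17_121,
}


def unit_power(unit, n=16, p=32, rs=1.0):
    """Power of one unit in units of the gate power coefficient.

    Assumes power = gate count * number of 0<->1 transitions (inferred).
    """
    if unit == "exu":
        transitions = n * n
    elif unit == "pe_ad_adder":
        transitions = 4 * p * p / rs  # only this part scales with 1 / Rs
    else:
        transitions = 4 * p * p
    return GATE_COUNT[unit] * transitions


def exu_overheads(rs):
    """Area and power overhead of the EXU relative to the other units."""
    baseline = [u for u in GATE_COUNT if u != "exu"]
    area = GATE_COUNT["exu"] / sum(GATE_COUNT[u] for u in baseline)
    power = unit_power("exu") / sum(unit_power(u, rs=rs) for u in baseline)
    return area, power


area, power = exu_overheads(rs=4.0)
print(f"area overhead:  {area * 100:.2f}%")   # 7.68%
print(f"power overhead: {power * 100:.1f}%")  # 0.8%
```

The subsample rate 4-to-1 is the worst case for the relative power overhead because the PE array consumes the least there while the EXU power stays constant.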

Fig. 19 shows the results for the video clip “table-tennis” under switching of the power consumption mode. The target subsample pixel count was reduced by 48 every 40 frames, and the control parameter Kp was set to 0.3. The results show that the adaptive control mechanism could enable the power consumption to reach the target level within 10 frames. According to the battery properties described in section 2, the curve shows that our power-aware architecture can extend the battery lifetime by gradually degrading the quality. A, B, C, and D correspond to the switching points in Fig. 2 (b), respectively.
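The mode-switching experiment can be sketched as a target schedule combined with the total-power expression of Table 4. The starting count of 256 pixels and the total of 200 frames are assumptions of this sketch; the text only states that the target is lowered by 48 every 40 frames.

```python
def total_power(csm_cnt, n=16):
    """Table 4 total: 4.8e8 * Rs^-1 + 4.3e8, with Rs^-1 = csm_cnt / N^2."""
    return 4.8e8 * csm_cnt / (n * n) + 4.3e8


def target_schedule(start_cnt, step, frames_per_mode, total_frames):
    """Target pixel count per frame, lowered by `step` every `frames_per_mode`."""
    return [
        max(start_cnt - step * (frame // frames_per_mode), 0)
        for frame in range(total_frames)
    ]


# Assumed starting point (all 256 pixels) and clip length (200 frames).
schedule = target_schedule(start_cnt=256, step=48, frames_per_mode=40, total_frames=200)
for frame in (0, 40, 80, 120, 160):
    cnt = schedule[frame]
    print(frame, cnt, f"{total_power(cnt):.2e}")
```

Under these assumptions each switch lowers the target power by 0.9e8 in units of the gate power coefficient, since 48 pixels correspond to 48/256 of the 4.8e8 term.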

[Fig. 19: power consumption (in units of εgp, from 6 × 10^8 to 10 × 10^8) versus frame number (40 to 200) for the table-tennis clip, with the switching points A, B, C, and D marked.]

Fig. 19. An application involving switching of the power mode in the case of the video clip “table-tennis.”

5. CONCLUSIONS

Motivated by the battery properties and the power-aware paradigm, this paper has presented an architecture-level power-aware technique based on a novel content-based subsample algorithm. When the battery capacity is full, the proposed ME architecture turns on all the PEs to provide the best compression quality. In contrast, when the battery capacity is insufficient for full operation, instead of exhibiting all-or-none behavior, the proposed architecture shifts to a lower power consumption mode by disabling some PEs in order to extend the battery lifetime with little quality degradation. Switching of the power consumption mode is accomplished smoothly; thus, the proposed architecture makes it possible to switch the power consumption mode with acceptable quality degradation. Although edge extraction plays a crucial role in dynamically adjusting the power consumption mode, it does not introduce much power dissipation, and its timing overhead can be neglected. As shown by the simulation results, the proposed algorithm improves on the compression quality of the generic subsample algorithm and switches the power consumption mode by adaptively adjusting the threshold parameters.

REFERENCES

1. P. Raghavan and C. Chakrabarti, “Battery-friendly design of signal processing algorithms,” in Proceedings of the IEEE Workshop on Signal Processing Systems, 2003, pp. 304-309.

2. M. J. Chen, L. G. Chen, and T. D. Chiueh, “One-dimensional full search motion estimation algorithm for video coding,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 4, 1994, pp. 504-509.

3. B. Liu and A. Zaccarin, “New fast algorithms for the estimation of block motion vectors,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 3, 1993, pp. 148-157.

4. H. W. Cheng and L. R. Dung, “EFBLA: a two-phase matching algorithm for fast motion estimation,” Advances in Multimedia Information Processing - PCM, LNCS 2532, 2002, pp. 112-119.

5. D. Linden, Handbook of Batteries, 2nd ed., McGraw-Hill, Inc., New York, 1995.

6. R. C. Gonzalez and R. E. Woods, Digital Image Processing, Addison Wesley, New York, 1993.

7. C. H. Hsieh and T. P. Lin, “VLSI architecture for block-matching motion estimation algorithm,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, 1992, pp. 169-175.

8. K. Sauer and B. Schwartz, “Efficient block motion estimation using integral projections,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 6, 1996, pp. 513-518.

9. J. R. Jain and A. K. Jain, “Displacement measurement and its application in interframe image coding,” IEEE Transactions on Communications, Vol. COM-29, 1981, pp. 1799-1808.

10. J. N. Kim, S. C. Byun, Y. H. Kim, and B. H. Ahn, “Fast full search motion estimation algorithm using early detection of impossible candidate vectors,” IEEE Transactions on Signal Processing, Vol. 50, 2002, pp. 2355-2365.

11. T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishigura, “Motion compensated interframe coding for video conferencing,” in Proceedings of the IEEE National Telecommunication Conference, Vol. 4, 1981, pp. G5.3.1-G5.3.5.

12. Y. K. Lai and L. G. Chen, “A data-interlacing architecture with two-dimensional data-reuse for full-search block-matching algorithm,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, 1998, pp. 124-127.

13. S. Lee and S. I. Chae, “Motion estimation algorithm using low-resolution quantization,” IEE Electronics Letters, Vol. 32, 1996, pp. 647-648.

14. J. H. Luo, C. N. Wang, and T. Chiang, “A novel all-binary motion estimation (ABME) with optimized hardware architectures,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, 2002, pp. 700-712.

15. Mobile Pentium III Processor in BGA2 and Micro-PGA2 Packages Datasheet, Intel Corporation, p. 55.

16. P. Kuhn, Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation, Kluwer Academic Publishers, U.S.A., 1999.

17. R. Li, B. Zeng, and M. L. Liou, “A new three-step search algorithm for block motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 4, 1994, pp. 438-442.

18. C. K. Cheung and L. M. Po, “Normalized partial distortion search algorithm for block motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 10, 2000, pp. 417-422.

19. S. Unsal and I. Koren, “System-level power-aware design techniques in real-time systems,” in Proceedings of the IEEE Special Issue on Real-Time Systems, Vol. 91, 2003, pp. 1055-1069.

20. W. Li and E. Salari, “Successive elimination algorithm for motion estimation,” IEEE Transactions on Image Processing, Vol. 4, 1995, pp. 105-107.

21. J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim, “A novel unrestricted center-biased diamond search algorithm for block motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, 1998, pp. 369-377.

22. J. C. Tuan, T. S. Chang, and C. W. Jen, “On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, pp. 61-72.

23. V. L. Do and K. Y. Yun, “A low-power VLSI architecture for full-search block-matching motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, 1998, pp. 393-398.

24. K. M. Yang, M. T. Sun, and L. Wu, “A family of VLSI designs for the motion compensation block-matching algorithm,” IEEE Transactions on Circuits and Systems, Vol. 36, 1989, pp. 1317-1325.

25. W. Zhang, R. Zhou, and T. Kondo, “Low-power motion-estimation architecture based on a novel early-jump-out technique,” in Proceedings of the IEEE International Symposium on Circuits and Systems, Vol. 5, 2001, pp. 187-190.

26. C. Zhu, X. Lin, and L. P. Chau, “Hexagon-based search pattern for fast block motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, 2002, pp. 349-355.

27. M. Bhardwaj, R. Min, and A. P. Chandrakasan, “Quantifying and enhancing power awareness of VLSI systems,” IEEE Transactions on Very Large Scale Integration Systems, Vol. 9, 2001, pp. 757-772.

28. C. L. Su and C. W. Jen, “Motion estimation using MSD-first processing,” IEE Proceedings of Circuits, Devices and Systems, Vol. 150, 2003, pp. 124-133.

Hsien-Wen Cheng (鄭顯文) was born in 1968. He received the B.S. degree in Control Engineering from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 1992. He was with AVerMedia Technologies Inc. from 1994 to 2001. He is currently working toward the Ph.D. degree in Electrical and Control Engineering at National Chiao Tung University. His research interests are video/image compression, motion estimation, VLSI architecture, and digital signal processing.

Lan-Rong Dung (董蘭榮) was born in 1966. He received a B.S.E.E. and the Best Student Award from Feng Chia University, Taiwan, in 1988, an M.S. in Electronics Engineering from National Chiao Tung University, Taiwan, in 1990, and a Ph.D. in Electrical and Computer Engineering from Georgia Institute of Technology in 1997. From 1997 to 1999 he was with Rockwell Science Center, Thousand Oaks, CA, as a Member of the Technical Staff. He joined the faculty of National Chiao Tung University, Taiwan, in 1999, where he is currently an Associate Professor in the Department of Electrical and Control Engineering. He received the VHDL International Outstanding Dissertation Award, presented in Washington, DC, in October 1997. His current research interests include VLSI design, digital signal processing, hardware-software codesign, and System-on-Chip architecture. He is a member of the Computer and Signal Processing societies of the IEEE.


A Content-Based Methodology for Power-Aware Motion Estimation Architecture

Hsien-Wen Cheng and Lan-Rong Dung, Member, IEEE

Abstract—This paper presents a novel power-aware motion estimation algorithm, called the adaptive content-based subsample algorithm (ACSA), for battery-powered multimedia devices. As the battery status changes, the architecture adaptively performs graceful tradeoffs between power consumption and compression quality. As the available energy decreases, the algorithm raises the subsample rate to maximize the battery lifetime. Unlike existing subsample algorithms, the content-based algorithm first extracts edge pixels from a macro-block and then subsamples the remaining low-frequency part. In this way, we can alleviate the aliasing problem and thus keep the quality degradation low as the subsample rate increases. As shown in the experimental results, the architecture can dynamically operate at different power consumption modes with little quality degradation according to the remaining capacity of the battery pack, while the power overhead of edge extraction is under 0.8%.

Index Terms—Content-based image processing, motion estimation (ME), power-aware system, subsample algorithm, very large-scale integration (VLSI) architecture, video compression, VLSI image processing.

I. INTRODUCTION

MOTION ESTIMATION (ME) has been widely recognized as the most critical part of many video compression applications, such as the MPEG standards and H.26x [1], and it dominates the computational and, hence, power requirements. With the increasing demand for portable multimedia devices, a power-aware ME that is flexible in both power consumption and compression quality has recently become highly desirable [2]. Fig. 1 illustrates a typical block diagram of a portable multimedia system powered by a battery. In practice, a battery has two important nonideal properties: the rate capacity effect and the recovery effect [3]. Fig. 2, as an example, illustrates that the system can extend the battery lifetime by gradually stepping down the power dissipation. Normally, system designers can choose points A, B, C, and D according to the discharging profile provided by the battery manufacturer. When the battery monitor unit detects a voltage drop of the battery, the host processor changes the power mode accordingly. By changing the operation mode of all power-aware components simultaneously, the architecture adapts the power dissipation to the battery status and, hence, raises the battery performance.

Manuscript received February 24, 2004; revised September 13, 2004 and February 22, 2005. This work was supported in part by the Taiwan MOE Program for Promoting Academic Excellence of Universities under Grant 91-E-FA06-4-4 and the National Science Council, Taiwan, R.O.C., under Grant NSC 93-2220-E-009-023. This paper was recommended by Associate Editor S.-F. Chang.

The authors are with the Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu 300, Taiwan, R.O.C. (e-mail:

Fig. 1. System block diagram of a portable, battery-powered multimedia device.

Fig. 2. Diagram representing the nonlinear discharging properties of battery.

This paper, therefore, presents a power-aware ME architecture driven by an adaptive content-based subsample algorithm to lengthen the battery lifetime.

Many published papers have presented efficient algorithms for very large-scale integration (VLSI) implementation of motion estimation, targeting either high-performance or low-power design. Yet most of them cannot dynamically adapt the compression quality to different power consumption modes. Among these proposed algorithms, the full-search block-matching (FSBM) algorithm with the sum of absolute difference (SAD) criterion is the most popular approach for ME because of its considerably good quality. It is particularly attractive to those who require extremely high quality. However, the huge number of comparison/difference operations results in a high computation load and power consumption [4], [5]. To reduce the computational complexity of FSBM, researchers have proposed various fast algorithms. They either reduce the search steps [6]–[8] or simplify the calculation of the error criterion [9], [10]. These fast-search algorithms have successfully improved the block matching speed and, thus, led to low-power implementations. However, a low-power implementation is not necessarily a power-aware system, in which the implementation should adaptively modify its behavior to balance quality against battery life [11]. The requirement for ME algorithms to be suitable for power-aware design is a high degree of scalability in performance tradeoffs. Unfortunately, the fast algorithms


The subsample algorithms presented in [12] and [13] are very suitable for a power-aware ME architecture because they are highly scalable. As the subsample rate increases, the unsampled processing elements in a semisystolic array are disabled to save switching activity; that is, the power consumption of the architecture decreases correspondingly. However, applying subsample algorithms to a power-aware architecture may suffer from the aliasing problem in the high-frequency band. The aliasing problem degrades the compression quality rapidly as the subsample rate increases. To alleviate the problem, we extend traditional subsample algorithms to an adaptive content-based subsample algorithm (ACSA). In the ACSA, we first use edge extraction techniques to separate the high-frequency band from a macro-block and then subsample the low-frequency band only. By merging the edge pixels and the subsampled nonedge pixels, the algorithm generates a turn-off mask for the architecture to disable the switching activities of processing elements (PEs) in a semisystolic array. This content-based algorithm keeps the quality degradation low as the subsample rate increases. Because the number of high-frequency pixels varies with different video clips, we introduce an adaptive control mechanism to set the threshold value for edge determination and make the number of masked pixels stationary for a given power mode. The ACSA can be implemented in most existing ME architectures by turning off PEs according to the subsample mask. In this paper, we present a semisystolic architecture with gated PEs. The simulation results show that the ACSA can dynamically alter the subsample rate as the power consumption mode changes and that the architecture can work at different power consumption modes with acceptable and smooth quality degradation, while the power overhead of edge extraction is under 0.8%.

II. SUBSAMPLE ALGORITHMS

A. Generic Subsample Algorithm

Here, we present a generic subsample algorithm (GSA) in which the subsample rate ranges from 4:1 to 1:1. The GSA uses the criterion in (1), called the subsample sum of absolute difference (SSAD), as a matching criterion, where $SM_{8:m}$ is the subsample mask for the subsample rate 8-to-m as shown in (2), the macro-block size is N-by-N, R(i, j) is the luminance value at (i, j) of the current macro-block, and S(i + u, j + v) is the luminance value at (i + u, j + v) of the reference macro-block, which is offset by (u, v) from the current macro-block in the searching area, subject to the conditions in (3).

Fig. 3. Procedure of the ACSA.

Due to its flexibility in energy-quality tradeoffs, the GSA is suitable for the implementation of power-aware architectures. The power consumption of the architecture is proportional to the inverse of the subsample rate. However, the algorithm suffers from the aliasing problem, which degrades the ME quality considerably when the high-frequency band is corrupted.

B. Adaptive Content-Based Subsample Algorithm

As mentioned above, the GSA has an aliasing problem at high subsample rates, which leads to considerable quality degradation because the high-frequency band is interfered with. To alleviate the problem, we propose an ACSA that subsamples only the low-frequency band. The procedure of the ACSA is described in Fig. 3. We first use edge extraction to separate high-frequency pixels (or edge pixels) from a macro-block and then subsample the remaining pixels (or low-frequency pixels). The determination of edge pixels starts from gradient filtering. In this paper, we use three popular gradient filters to exercise the content-based algorithm: the high-pass gradient filter, the Sobel gradient filter, and the morphological gradient filter. After obtaining the gradients, denoted as G, we use a floating threshold to determine the edge pixels of the current macro-block. The floating threshold makes the edge extraction more robust


Abstracts (Chinese and English)

Table of Contents

Report

1. Introduction

2. Research Objectives

3. Literature Review

4. Research Methods

5. Results and Discussion

A Power-Aware Motion Estimation Architecture Using Content-based Subsampling

A Content-based Methodology for Power-Aware Motion Estimation Architecture
