國
立
交
通
大
學
電子工程學系 電子研究所
碩 士 論 文
基於增強和高效的離散餘弦變換的熱模型的構建及
其應用於熱感知擺放
Enhanced and Efficient DCT Based Thermal Model Construction
and Its Application to Thermal-Aware Placement
研 究 生:吳永證
指導教授:陳宏明 教授
基於增強和高效的離散餘弦變換的熱模型的構建及
其應用於熱感知擺放
Enhanced and Efficient DCT Based Thermal Model Construction
and Its Application to Thermal-Aware Placement
研 究 生:吳永證 Student:Yong-Zheng Wu
指導教授:陳宏明 Advisor:Hung-Ming Chen
國 立 交 通 大 學
電子工程學系 電子研究所
碩 士 論 文
A ThesisSubmitted to Department of Electronics Engineering and Institute of Electronics
College of Electrical and Computer Engineering National Chiao Tung University
in partial Fulfillment of the Requirements for the Degree of
Master of Science in
Electronics Engineering
July 2011
Hsinchu, Taiwan, Republic of China
中華民國一00年七月
基於增強和高效的離散餘弦變換的熱模型的構建
及其應用於熱感知擺放
學生: 吳永證
指導教授: 陳宏明 教授
國立交通大學 電子工程學系 電子研究所 碩士班
摘
要
在 這 篇 論 文 中 , 我 們 提 出 了 一 種 快 速 、 準 確 的 熱 感 知 解 析 式 擺 置 器 (Thermal-Aware Analytical Placer)。熱模型(Thermal model)是以格林函數(Green Function)建構,並以增強的離散餘旋變換(Enhanced-DCT)來產生完整晶片的溫度 曲線(Full-Chip Temperature Profile)。與以往其他的熱感知擺置器相比,我們的熱 模型與力平衡擺置器緊密地結合。我們提出基於二維高斯模型(2D Gaussian model)的熱散播力(Thermal Spreading Force)以及動態熱區域控制,以降低最高晶 片溫度,能優化總半周長(HPWL)和晶片上的溫度分佈。我們的熱模型已被最新的商業工具驗證,平均偏差在 6.5%內,並加速 240 倍。比起 Capo 與 APlace2,我們的擺置器可以加速 3-4 倍並達到同樣的結果。實 驗是以 ISPD2005 年的測資(Test Bench)測試,基準高達 200 萬邏輯閘的設計。其 結果更進一步以 GSRC 計算器評估總 HPWL,並用商用軟體 ICEPAK 做溫度分 佈。
Enhanced and Efficient DCT Based Thermal
Model Construction and Its Application to
Thermal-Aware Placement
Student: Yong-Zheng Wu
Advisor: Prof. Hung-Ming Chen
Department of Electronics Engineering
National Chiao Tung University
Hsinchu, Taiwan 300, R.O.C.
2011-07
(This page will be replaced by a pdf file which is transformed from the formal cover page edited with Microsoft Word.)
Enhanced and Efficient DCT Based Thermal Model
Construction and Its Application to Thermal-Aware
Placement
Student: Yong-Zheng Wu Advisor: Prof. Hung-Ming Chen
Department of Electronics Engineering Institute of Electronics
National Chiao Tung University
ABSTRACT (Chinese)
(This page will be replaced by a pdf file which is transformed from the Chinese abstract edited with Microsoft Word.)
Enhanced and Efficient DCT Based Thermal Model
Construction and Its Application to Thermal-Aware
Placement
Student: Yong-Zheng Wu Advisor: Prof. Hung-Ming Chen
Department of Electronics Engineering Institute of Electronics
National Chiao Tung University
ABSTRACT
In this thesis, we proposed a fast and accurate thermal aware analytical placer. Thermal model is constructed based on Green function with enhanced DCT to gen-erate full chip temperature profile. Unlike other previous thermal aware placers, our thermal model is tightly integrated with a flat force directed placement. A ther-mal spreading force based on 2D Gaussian model is proposed to reduce maximum on-chip temperature with dynamic hot region size control, optimizing between total half-perimeter wirelength (HPWL) and on-chip temperature distribution.
Our thermal model is verified by the most recent commercial tool and has an average deviation within 6.5% with 240x speed up. Our placer can reach the same quality compared to Capo and APlace2 with 3-4x speed up. Experiments are tested using ISPD 2005 benchmark with up to 2 million gate design. The results are further evaluated using GSRC Bookshelf Evaluator for total HPWL, and using ICEPAK for temperature distribution.
ACKNOWLEDGEMENTS
First and foremost, I would like to acknowledge the advice and guidance of my professor Hung-Ming Chen. I also thank my lab mates for all their help and take care of me, especially Ph.D Snieaulu for all his encouragement and suggestions. Special thanks to professor Yu-Min Lee and his Ph.D student Pei-Yu Huang without their knowledge and assistance this thesis would not have been successful.
The major experiment tools for this study was provided by National Chip Im-plementation Center (CIC), Xin-Zhu, Taiwan. I would like to thank their support and their co-operation, especially Mrs. Jing-Ru Chu, the research assistant of chip stack structure and thermal analysis, without her help and tutor this experimental results could not have been finished.
Finally, I thank to my parents for supporting me throughout all my studies and thank to the National Chiao Tung Unviversity (NCTU) for providing a very good and beautiful environment.
Contents
Abstract (Chinese) i Abstract ii Acknowledgements iii Contents iv List of Tables viList of Figures vii
1 Introduction 1
1.1 Previous Work . . . 1
1.1.1 Placement . . . 1
1.1.2 Thermal Model . . . 2
1.1.3 Thermal Aware Placement . . . 3
1.2 Our Contributions . . . 4
2 Thermal Model 6 2.1 Analytical Thermal Model Based on Green Function . . . 6
2.1.1 Homogeneous Problem . . . 8
2.1.2 Inhomogeneous Problem . . . 10
2.2 Enhancement of DCT . . . 13
2.2.1 Even Extension of DFT for DCT . . . 15
2.2.2 Reordering Input Data . . . 16
3 Application to Thermal Aware Placement 18 3.1 Placement Algorithm . . . 19
3.2 Considering Thermal Effect in Placement . . . 20
3.2.1 Obtaining Thermal Anchor . . . 21
3.2.2 Determine Magnitude of Thermal Force . . . 23
4 Experimental Results 25 4.1 Thermal Model . . . 25
4.2 Placement . . . 26
4.3 Thermal Aware Placement . . . 27
List of Tables
1.1 Previous work on thermal aware placement . . . 4
4.1 Temperature profile comparison with Icepak . . . 26
4.2 Placement results in ISPD 2005 benchmarks . . . 27
List of Figures
2.1 Summary overall thermal analysis flow chart . . . 6
2.2 Schematic of cross-sectional view of a VLSI chip with packaging . . . 7
2.3 Illustration of the simplified chip model . . . 8
2.4 The illustration of power density on meshing top surface of chip . . . 11
2.5 The illustration of butterfly algorithm for DFT . . . 14
2.6 (a) The original input sequence. (b) The even extension of original
input. . . 15
2.7 The sequence of input patterns after reordering. (a) The original
input sequence. (b) The sequence of vn and wn. . . 16
3.1 Placement of cells to demonstrate interleaving mechanism between
legalization and CG for adaptec1. . . 18
3.2 Illustration of the legalization process. . . 20
3.3 Flow Diagram of our thermal aware placement. . . 21
3.4 Forces exerted on a cell by thermal anchor and legalizing anchor.
Thermal anchor is placed on a circle and red rectangle is the identified
hot region. . . 22
3.5 An illustration of 2-D Gaussian force spread model with center
4.1 (a). Power density of adaptec1. (b). Power density of bigblue3. (c). Temperature distribution of adaptec1. (d). Temperature distribution of bigblue3. (e). Temperature distribution of adaptec1 using Icepak before thermal aware placement. (f). Temperature distribution of
adaptec1 using Icepak before thermal aware placement. . . 29
Chapter 1
Introduction
The demand to alleviate temperature variation and increase chip reliability in-creases as technology continue to advance. Recent studies have shown that chip performance and reliability can deteriorate to a certain magnitude due to
temper-ature variation. Evidence have shown that a 10◦
C variation in temperature can induce 5% increase in interconnect delay, 25% increase in crosstalk-induced noise and reduce component life time by 50% [7][6].
The situation for on-chip temperature and variation will only increase as technol-ogy advances. As node size scales down, components will be placed more compact which entails higher current density per unit area. Shrinking in interconnect di-mension will increase interconnect resistance which results in higher power and heat dissipation. Viewed in this light, the effect of chip temperature and variation must be addressed to when considering chip performance and reliability.
1.1
Previous Work
1.1.1
Placement
Regarding to current state-of-the-art, placement can be categorized to three cate-gories: partition based, linear force-directed and non-linear force directed. Partition-based placer such as Capo[21] partition the given placement using Fidducia-Mattheyes
in a top-down fashion. Although it does not require any model to approximate total HPWL, it trails behind force-directed and non-linear force directed placer in both quality and run time.
Non-linear forced directed placer such as NTUPlace3[9], mPL6[5] and APlace2[15] uses log-sum-exp model to approximate total HPWL. The log-sum-exp model is a very accurate model to approximate HPWL. However, log-sum-exp model is rela-tively complex which requires longer run time and is patented which requires license to use. To accelerate run time, non-linear force directed placer generally coarsen the given placement by clustering cells to reduce problem size.
Linear force directed placer such as RQL[24], simPL[16], FastPlace3[25] and KraftWerk2[22] use bound-to-bound, star-net or clique model to approximate total HPWL. The simplicity of linear force directed model implies faster run time but is less accurate compare to log-sum-exp model. However, recent development in [24] and [16] have outperformed all non-linear force directed placer in total HPWL by a notch and is approximately 3-5 times faster.
1.1.2
Thermal Model
Regarding to thermal model, most recent approaches can be categorized to nu-merical and analytical approach. Nunu-merical approach generally uses finite-difference method(FDM) [12] or finite-element method(FEM) [23] by meshing given silicon substrate and then solve a set of linear equation to obtain temperature profile. FDM discretizes the partial differential equation of heat conduction and uses forward-difference method to approximate temperature profile. FEM discretizes the given temperature for each grid point in design space and then points elsewhere is calcu-lated using interpolation. Generally, numerical approach can achieve high accuracy at the expense of relatively long run time making it suitable for post-layout thermal verification.
Analytical approach solves the differential equation of heat problem by first generating the fundamental solution of one unit heat source and then exploit the linearity of the heat equation to obtain the general solution of overall temperature distribution [27][26][13]. In contrast to numerical approach, analytical approach can quickly obtain an approximate solution in closed form representation without an amount volume of meshing making it suitable for where approximate solution is adequate.
After two classified approaches, there is two states of analysis to be concerned, transient-state and steady-state. The transient-state is concerned about ture in time-varying but the steady-state is interested in the stabilized of tempera-ture in long term state. In this paper, we proposed to consider the steady-state of temperature.
1.1.3
Thermal Aware Placement
To implement a thermal aware placement, two main components are usually required: a thermal model to conduct full chip thermal analysis and a placement mechanism to consider the optimization between thermal effect and wirelength.
Regarding to previous works on thermal aware placement, Table 1.1 summarizes previous work on thermal aware placer based on their thermal model and place-ment algorithm. Tsai et al.[23] constructs lumped RC matrix to model substrate heat conduction and obtain thermal profile using FDM, then simulated annealing is applied evenly spread out hot spot. Kahng et al. [14] adopts its thermal model from [23] and integrate the model to its previously published placer, APlace [15]. Chen et al.[8] simplifies the model in [23] and applies partition based placement to consider thermal effect. Goplen et al.[11] uses FEM method to conduct full chip temperature analysis and uses linear force directed placer based on star net model to mediate thermal effect. Jing et al.[17] constructs RC equivalent matrix to model
heat transfer and obtain thermal profile using FDM, Fiduccia-Mattheyses partition is then applied as their placement strategy. Bernd et al.[19] proposes a methodology to consider the impact of dynamic power density by adding additional temperature cost obtained from FDM into the iterative annealing optimization.
Table 1.1: Previous work on thermal aware placement
hhh hh hh hh hh hhh Placement Thermal Model FEM FDM Simulated Annealing [23] Partition-Based [11] [8],[17]
Linear Force Directed [19]
Non-Linear Force Directed [14]
1.2
Our Contributions
In spite of past effort, thermal aware placement today still suffers deficiency in terms of (i). accuracy of full-chip analysis,(ii). quality of placement and (iii). fast execution time. The accuracy of some thermal models are unknown since it lacks evaluation with accurate model. The quality of algorithms for placement is deficient either because it greedily focuses on thermal distribution or its placement methodology is relatively naive. Some methods are impractical to deal with million gate design due to execution speed when using simulated annealing or solving com-plex thermal model requires long execution time. In addition, none of our surveyed thermal-aware placers has tested on million gate design.
In this thesis, we presented a thermal aware placement to reduce maximum temperature using analytical thermal model combined with linear force-directed placer. Both of our thermal model and placer are analytical making it inherently fast. The run time bottleneck of Green function based thermal analysis is in its post-process DCT calculation and we reduced the complexity to O(N logN ) by applying even extension and input reordering algorithm. Our thermal model is verified with commercial tool ANSYS Icepak and demonstrated that the deviation of our thermal
model is within 6.5% and is 242x times faster.
The thermal model is integrated with a flat (no clustering is required) global placement based on linear-force model which is 3-4x faster compare to non-linear and partitioned based placer with reasonable placement quality. We experimented our thermal-aware placement on a set of open source benchmarks released by IBM and evaluated our the thermal profile using Icepak[1].
We proposed the concept of thermal anchor combined with dynamic resizing scanning of hot region to optimize the trade-off between maximum temperature and total HPWL. The experiments have performed on three kinds of weight adjustment which are linear force model, quadratic force model and quadratic Gaussian force model to demonstrate the effectiveness of our thermal aware placement.
The following chapters are organized as follows: Chapter 2 describes the analyt-ical thermal model and Chapter 3 depicts the application of thermal for placement. Chapter 4 presents the experimental results and finally Chapter 5 concludes the thesis.
Chapter 2
Thermal Model
2.1
Analytical Thermal Model Based on Green
Function
In early stage of physical design, the primary concern of thermal analysis focuses more on its speed rather than accuracy. Therefore, we adopted our thermal model using Green function to quickly perform thermal analysis. The concept of the Green function is to first derive the fundamental solution then use linear superposition to obtain the general solution for overall chip temperature distribution.
(a) Thermal analysis flow
Figure 2.1: Summary overall thermal analysis flow chart
follows, first the fundamental bases solution which satisfies boundary conditions is generated from homogeneous equation. Then, the general solution can be obtained by applying generated bases solution to Green function to express power density and temperature distribution. Finally, the solution to the heat conduction based on many power sources is in a form of DCT and IDCT. Thus, the input reordering is adopted to enhance DCT and IDCT execution time. As shown in Figure 2.1,
Given the law of conservation of energy, it states that the total amount of energy in a system remains constant over time. That means the changing rate of heat energy equals to the summation of conduction heat flowing into the unit volume and its pre-existing power density. Hence, the temperature distribution inside the chip is defined as
σ∂T (r, t)
∂t = ∇ · (κ∇T (r, t)) + p(r, t) r ∈ D (2.1)
which is also known as heat diffusion equation. T denotes temperature in Celsius (◦
C), the domain r = (x, y, z) κ denotes thermal conductivity (W/m◦
C) and σ = cpρ
has unit of (J/m3◦C). c
p is the specific heat, ρ is the heat capacity, p(r, t) is the
power density of heat source (W/m3) and D = (0, L
x) ∗ (0, Ly) ∗ (−Lz, 0) are the
dimensions of die.
(a) Chip Model
Figure 2.2: Schematic of cross-sectional view of a VLSI chip with packaging Figure 2.2 depicts a flip-chip model in which top and bottom surfaces of die are
contacted by printed circuit board (PCB) and heat sink. hp denotes primary heat
flow to the heat sink and hs denotes secondary heat flow to the PCB. The model
illustrated in Figure 2.3. The boundary conditions of such assumption are derived as ∂T (r, t) ∂x |x=0,Lx= ∂T (r, t) ∂y |y=0,Ly= 0 (2.2) κ∂T (r, t) ∂z |z=−Lz= hpT (x, y, −Lz, t) (2.3) κ∂T (r, t) ∂z |z=0= −hsT (x, y, 0, t) (2.4)
Since steady-state temperature analysis is the primary concern in this thesis, temperature and power density are assumed to be stable. The temperature distri-bution equation (2.1) can be written as a form of inhomogeneous equation.
∇2T (r) = −p(r)
κ (2.5)
Figure 2.3: Illustration of the simplified chip model
To solve the inhomogeneous equation and satisfy the boundary conditions, the solution bases of homogeneous equation needs to be solved to satisfy all boundary conditions. Finally, the general solution of temperature distribution in R domain can then be derived by substituting the bases solution from homogeneous equation into inhomogeneous equation (2.5).
2.1.1
Homogeneous Problem
In this subsection, the objective is to find the solution which satisfies the homo-geneous equation and inhomohomo-geneous conditions in top and bottom boundary. The homogeneous equation is first defined by setting the external force to zero which correspond to the power density in (2.1). That is,
σ∂T (r, t)
Given that Θ(r, t) is the solution to homogeneous equation, it is then substituted into homogeneous equation (2.6).
∂Θ(r, t)
∂t = γ
∂2Θ
∂r2 (2.7)
which γ is a constant equal to κ/σ
Θ(r, t) is a function of space domain R and time domain t in which space domain and time domain are independent to each other. Hence, Θ(r, t) can be separated to two functions of variable r and t respectively.
Θ(r, t) = τ (t)υ(r) (2.8)
(2.8) substitutes to (2.7) then gather time-dependent terms on one side and space-dependent terms on the other side.
τ′
τ = γ
υ′′
υ (2.9)
The left hand side is a function of time t and the right hand side is a function of space R. Thus, both sides of (2.9) must be equal to the same constant. By denoting
the constant as −α and let α = γω2 with ω > 0.
τ′
+ ατ = 0 (2.10)
υ′′+ ω2υ = 0 (2.11)
The fundamental solution of τ (t) and υ(r) that satisfies the homogeneous bound-ary conditions can then be obtained by using complex variable method.
τ (t) = e−αt (2.12)
For υ(z), a combination of trigonometric function is applied to satisfy the inho-mogeneous of top and bottom conditions. That is,
υ(z) = κcos(ωz(z + Lz)) + hp ωz sin(ωz(z + Lz)) (2.14) where ωz satisfies κ2ω2 z − hphs κωz(hp + hs) = cot(ωzLz) (2.15)
Since cosine is periodic, ωL must be an integer multiple of π. The general solution of homogeneous heat equation at infinite time is expressed as the summation of numerous solution bases υ(r).
Θ(x, y, z, ∞) = ∞ X i=0 ∞ X j=0 cos(iπ Lx x)cos(jπ Ly y)υ(z) (2.16)
2.1.2
Inhomogeneous Problem
To solve the inhomogeneous equation subject to specific initial and boundary
conditions, Green function denoted by G(r, r′
) is applied. Let G be the temperature
distribution of a unit point power source denoted by δ(r, r′
). The complete equation of G is
∇2G(r, r′) = −δ(r, r
′
)
κ (2.17)
and the boundary conditions is
∂G(r, r′ ) ∂x |x=0,a = ∂G(r, r′ ) ∂y |y=0,b (2.18) κ∂G(r, r ′ ) ∂z |z=−Lz = hpG(x, y, −Lz) (2.19) κ∂G(r, r ′ ) ∂z |z=0 = −hsG(x, y, 0) (2.20) where δ(r, r′) = 1 0 , r = r′ r 6= r′ (2.21)
The bases solution (2.16) obtained by solving homogeneous equation is a set of
orthogonal bases which satisfies the characteristics of Green function with Nx, Ny
and Nz truncation point. In addition, the integration of two orthogonal bases is
equal to δ. Therefore, (2.16) can be substituted to G(r, r′
). G(r, r′ ) = 1 κω2 Nx−1 X i=0 Ny−1 X j=0 Nz−1 X k=0 υij(x ′ , y′ )υk(z ′ ) (2.22)
The general solution of T (r) is obtained by using superposition of integrals of
power density distribution P (r) with G(r, r′
). T (r, ∞) = Z D G(r, r′ )P (r′ )dr′ (2.23)
Since power density is a function only of x and y direction, it may be also written in the form of superposition of orthogonal bases of x and y direction.
P (r) ≈ Pij = Z Lx 0 Z Ly 0 p(x′ , y′ )υij(x′, y′)dx′dy′ (2.24)
Figure 2.4: The illustration of power density on meshing top surface of chip In Figure 2.4, the overall chip power distribution is approximated by dividing
chip into M ∗ N grids with power density pmn in each grid. The power density in
each bin is the summation of power density of contained cells. For cells that are placed on the boundary between two bins, power density of the cell is divided based on cell area covered to each bin. Following equation describes the approximated
power density Pij for each bin. Pij = M −1 X m=0 N −1 X n=0 pmn Z (m+1)LxM mLx/M Z (n+1)LyN nLy/N υij(x′, y′)dx′dy′ = 4LxLy ijπ2 sin( iπ 2Lx )sin( jπ 2Ly ) ∗ M −1 X m=0 N −1 X n=0 pmncos( iπ(2m + 1) 2M )cos( jπ(2n + 1) 2N ) (2.25)
The second term on the right hand side of (2.25) is a form of type-I DCT, thus, the formula can be simplified as
Pij = Cij ∗ ˆPij (2.26) where Cij = 4LxLy ijπ2 sin( iπ 2Lx )sin( jπ 2Ly ) (2.27) ˆ Pij = DCT [pmn] (2.28)
Similar to power density distribution, temperature distribution can also be ap-proximated by dividing into numerous small bins. To describe temperature
distri-bution Tmn in each grid, (2.23) is substituted by (2.25).
T (r, ∞) ≈ T (m, n) = Z Nr 0 1 κω2 Nx−1 X i=0 Ny−1 X j=0 Nz−1 X k=0 υij(x′, y′)υk(z′)Pijdr′ (2.29)
Due to the bases in z direction described in (2.29) is independent to x and y
derived as T (m, n) = Kz Nx−1 X i=0 Ny−1 X j=0 CijPˆij Z Nx 0 Z Ny 0 υij(x ′ , y′)dx′dy′ = Kz Nx−1 X i=0 Ny−1 X j=0 CijDijPˆijcos( iπ(2m + 1) 2M )cos( jπ(2n + 1) 2N ) = Kz Nx−1 X i=0 Ny−1 X j=0 EijPˆijcos( iπ(2m + 1) 2M )cos( jπ(2n + 1) 2N ) = Kz · IDCT [EijPˆij] (2.30) where Kz = 1 κω2 Z Nz 0 υk(z ′ )dz′ (2.31) Eij = Cij ∗ Dij (2.32) Dij = 4NxNy ijπ2 sin( iπ 2Nx )sin( jπ 2Ny ) (2.33)
When i = 0 and j = 0, (2.32) cannot be evaluated. l’Hopital’s rule is then
applied to approximate the value of Eij when i ≈ 0 and j ≈ 0. Eij can then be
written as Eij = △x△y 4N Ly△xsin2(2Njπ) j2π2 4M Lx△ysin2(2Miπ) i2π2
16M N LxLysin2(2Miπ)sin2(2Njπ)
i2j2π4 , i = 0, j = 0 i = 0, j 6= 0 i 6= 0, j = 0 i 6= 0, j 6= 0 (2.34)
(2.30) is also a form of type-I inverse DCT (IDCT) of Eij∗ ˆPij. Solving DCT and
IDCT directly requires complexity of O(N4). However, O(N logN ) can be achieved
by applying input reordering. The runtime can be further enhanced by calculating
Kz and Eij beforehand since both terms are independent of the geometry of power
density distribution.
2.2
Enhancement of DCT
DFT can achieve complexity of O(N logN ) by applying butterfly algorithm, it can also be applied in DCT by operating on real data with even symmetry. Figure
2.5 is a schematic of butterfly algorithm used in DFT.
Figure 2.5: The illustration of butterfly algorithm for DFT
Given N real number of data input x(n), 0 ≤ n ≤ N − 1. DCT and DFT
transform can described as follows in which Ckis denoted as DCT and Fkis denoted
as DFT. Ck= DCT [xn] = N −1 X n=0 xncos π(2n + 1)k 2N , 0 ≤ k ≤ N − 1 (2.35) Fk = DF T [xn] = N −1 X n=0 xn· e−j 2πnk N , 0 ≤ k ≤ N − 1 (2.36)
Note that, ejn is a Euler’s formula in which j is an imaginary number.
ejn= cos(n) + jsin(n) (2.37)
By defining WN = e−j2π/N as twiddle factor, DFT can be expressed as
Fk=
N −1
X
n=0
xnWNnk (2.38)
In (2.35) and (2.36), 4N inputs are required in DFT to generate N outputs for DCT with complexity of O(4N log4N ). Even extension method[4] can be im-plemented to reduce complexity to O(2N log2N ). Further improvement to reduce complexity to O(N logN ) can be achieved by reordering input patterns [18].
2.2.1
Even Extension of DFT for DCT
The concept of even extension is to make the mirror input of DFT to eliminate the imaginary number generated by DFT itself.
(a) input (b) even-extension
Figure 2.6: (a) The original input sequence. (b) The even extension of original input.
yn is a series of 2N points that is the even extension of xn
yn= xxn
2N −n−1 ,
0 ≤ n ≤ N − 1
N ≤ n ≤ 2N − 1 (2.39)
By substituting (2.39) into (2.38), 2N output points Fk from DFT can be
ob-tained. Fk= N −1 X n=0 xnW2Nnk + 2N −1 X n=N x2N −n−1W2Nnk (2.40) where 0 ≤ k ≤ 2N − 1
By changing the summation variable in the second term of the right hand side
and factoring out the term W2N−k/2, equation can be derived as 2.41. Noting that
W2kN
2N = 1 for integer k and W2N2nk = WNnk.
Fk = W −k/2 2N N −1 X n=0 xn[W2NnkW k/2 2N + W −nk 2N W −k/2 2N ] (2.41) which is, Fk = W −k/2 2N · 2Re[W k/2 2N N −1 X n=0 xnW2Nnk] (2.42)
or Fk = W −k/2 2N · 2 N −1 X n=0 xncos π(2n + 1)k 2N (2.43)
By substituting C(k) in (2.35) to (2.43), the relation between DFT and DCT can be found as Ck = 1 2W k/2 2N Fk , 0 ≤ k ≤ 2N − 1 (2.44)
2.2.2
Reordering Input Data
Regarding to (2.42), applying even extension to input data still have complexity of O(2N log2N ). Further improvement can be achieved by applying input reordering which will be explained in this subsection. The concept of input reordering is to avoid extending number of input data to 2N points. Thus, instead of mirroring real number, elimination of the imaginary number can also be achieved by reordering input.
(a) input (b) reordering
Figure 2.7: The sequence of input patterns after reordering. (a) The original input
sequence. (b) The sequence of vn and wn.
The reordered sequence vn and wn can be obtained by retrieving even and odd
element from yn which also corresponds to the original input sequence xn.
vn = xx2n 2N −2n−1 , 0 ≤ n ≤ N −12 N +1 2 ≤ n ≤ N − 1 (2.45)
where wn is a reverse of vn. v(n) and w(n) are then substituted into DFT
Fk = N −1 X n=0 vnW2N2nk+ N −1 X n=0 wnW2N(2n+1)k (2.46) Fk = N −1 X n=0 vnW2N2nk+ N −1 X n=0 vN −n−1W2N2nkW2Nk (2.47)
With few manipulations of summation variable and utilizing the relation of Ck and
Fk, equation (2.49) can then be obtained.
Ck= W2Nk/2[ N −1 X n=0 vnWNnk+ N −1 X n=0 vnW −nk N W −k 2N] (2.48) which is, Ck= N −1 X n=0 vnWNnkW4Nk + N −1 X n=0 vnWN−nkW −k 4N (2.49)
Then, (2.49) also be written as
Ck = 2Re[W4Nk N −1 X n=0 vnWNnk] (2.50) where 0 ≤ k ≤ N − 1
DCT can then be solved by re-ordering Ck back to original input patterns using
N points in(2.50) with complexity of O(N logN ). IDCT can be solved using exact same approach as forward DCT.
Chapter 3
Application to Thermal Aware
Placement
(a) Iteration 1 (b) Legalization after
Itera-tion 1
(c) Iteration 11
(d) Legalization after Itera-tion 11
(e) Iteration 21 (f) Legalization after
Itera-tion 21
(g) Iteration 31 (h) Legalization after
Itera-tion 31
Figure 3.1: Placement of cells to demonstrate interleaving mechanism between le-galization and CG for adaptec1.
In this chapter, the application of thermal analysis for placement is presented and the general flow is depicted in Figure 3.3. The idea of force directed placement
is to model each net as a spring system. A force exists between two cells and pulls two cells toward each other if two cells are connected. To construct the force model for the given placement, each multi-pin net is converted to several two-pin nets and force is calculated for each two-pin net. The force between a two-pin net is defined as follows.
Fij = Wij[(xi− xj)2+ (yi− yj)2] (3.1)
where xi, yi and xj, yj are coordinate of cell i and cell j. The summation of all
the forces for every single two-pin net is
φ(x, y) =X i,j Fij = 1 2x TQx + dT xx + 1 2y TQy + dT yy (3.2)
The summation of force in x and y coordinate can be solved separately and force
equilibrium can be obtained by solving Qx = −dx and Qy = −dy.
Wij =
2
(p − 1)|xi− xj|
(3.3) After matrix is built, we applied Jacobi preconditioned Conjugate Gradient method and used Polak-Ribiere as line search direction parameter to solve the prob-lem.
3.1
Placement Algorithm
The placement strategy implemented in this thesis is based on simPL[16]. Similar to simPL, two placement algorithms are implemented. One greedily minimizes the total wirelength using Precondition Conjugate Gradient Method(CG) and the other serves as a legalizer that quickly generate a rough legalized placement. CG will try to pull every cell together to minimize wirelength and legalizer will try to pull
cell to free area to remove overlap. The result generated by legalizer will act as an anchor pulling cells toward free area. The two algorithms interleave with each other, solution generated by one algorithm will serve as input to another. The process will stop when solution generated by two algorithms are converge.
Figure 3.2: Illustration of the legalization process.
3.2
Considering Thermal Effect in Placement
To achieve a flat thermal profile, high power cells should be evenly spread across the chip. However, greedily spreading out cells is impractical since it dramatically increases the total HPWL which will cause resulting placement unroutable. In this thesis, we addressed thermal effect by minimizing maximum on-chip temperature. Experimental results show that a smoother temperature profile can be obtained by reducing maximum on chip
The placement implemented in this thesis is a flat placer which requires no clustering of cells. A flat placer is ideal to consider thermal effect since each cell can
be treated individually. To consider thermal effect, additional anchor force can be added to pull cells away from high temperature region. However, additional anchor force will increase total HPWL. Thus, the trade off between reducing maximum temperature and increase total HPWL should be carefully dealt with.
Figure 3.3: Flow Diagram of our thermal aware placement.
3.2.1
Obtaining Thermal Anchor
After hot region is identified, an imaginary circle with radius R using center from the obtained hot region will be formed. Anchors will be created on the perimeter of circle pulling cells away from the hot region. Each cell will be pulled by an anchor that is closest in distance.
Without considering thermal effect, a cell will only suffer forces exerted from ev-ery other node it connects to and the force exerted from the pseudo anchor produced during legalization stage. Considering thermal effect, additional thermal force will be exerted on the cells within hot region. Here, we defined the anchor that serves to reduce maximum temperature as thermal anchor and anchor that serves to
remove overlap as legalizing anchor. Figure 3.4 illustrates the concept of thermal anchor and legalizing anchor.
Figure 3.4: Forces exerted on a cell by thermal anchor and legalizing anchor. Ther-mal anchor is placed on a circle and red rectangle is the identified hot region.
To obtain the thermal anchor position for each cell, the vector to direct cell
away from center is calculated. (xi, yi) are the coordinate of cell i and (xo, yo) are
the coordinate of center point of hot region.
~x = xi − xo, ~y = yi− yo (3.4)
Magnitude of the vector is calculated to position thermal anchor on the perimeter
of the circle. R is the radius of the circle and (xt, yt) are the coordinates of the
thermal anchor. mag = 2 s R ~x2+ ~y2 (3.5) xt= xi+ mag ∗ xi (3.6) yt= yi+ mag ∗ yi (3.7)
The radius of circle and force exerted from each anchor determines the magnitude of perturbation of cells and how far it will move away from its original position. To reduce maximum temperature, the larger the circle is, the farther away cell is pulled away from the center.
3.2.2
Determine Magnitude of Thermal Force
As iteration number increases, the general structure of placement begins to form as cells becomes less congested. The force exerted by thermal anchor begin to in-crease such that high power density cells can be spread out more evenly. However, as structure of placement begins to form, perturbing too many cells to reduce maximum temperature will deviate the general structure of placement. To avoid perturbing too many cell and deviate cell too much from its original position, the size of hot re-gion and radius of circle will gradually decrease in size as iteration number increase. We initially set the size of hot region to 13 bins in width and height and decrease by average of 2.5% in each iteration.
In early iterations, cells are congested in center. Anchors generated by legalizer which pull cells from congested region towards white space also serves as a force to pull cells away from hot region. Thus, we let the overlap removal force dominate thermal driven force in early stage of iteration. To adjust the proportion of force between legalizing anchor and thermal anchor, the weight of the thermal anchor Wthermal in iteration n is set proportional to n2/N and weight of legalizing anchor
Wlegal is set proportional to n. Note that, N is total iteration number. Thermal
anchors have negligible effect on cells in early iterations and its impact will quadrat-ically increase as iteration number increases.
To minimize perturbation of cell and to prevent sharp increase in temperature at the perimeter of hot region, a 2D Gaussian model shown in Figure 3.5 is used to determine the magnitude of the force exerted for each thermal anchor to connecting
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5
Figure 3.5: An illustration of 2-D Gaussian force spread model with center positioned at center of hot region.
cells. The 2-D Gaussian has the form
g(x, y) = Ae−(
(xi−xo)2 2σ2x +
(yi−yo)2
2σ2y ) (3.8)
xoand yoare the center point of the hot region. σx and σy are standard deviation
in xi and yidirection of the cells contained within hot region. Thus, the force exerted
from each thermal anchor Ftherm at iteration n can be derived as follow
FthermX = g(x, y) n2 N 1 |xt− xi| (3.9) FthermY = g(x, y) n2 N 1 |yt− yi| (3.10)
where, (xt,yt) is the coordinate of thermal anchor and (xi,yi) is the coordinate
Chapter 4
Experimental Results
In this chapter, experimental results are presented in three sections. First sec-tion presents the accuracy of the thermal model comparing to Icepak. The second presents the results of our placer comparing with partition-based placer and non-linear force directed placer. In third section, temperature distribution and total HPWL with and without considering thermal effect is presented.
The entire thermal aware placer is written in standard C++ language and com-piled using g++ 4.1.2. The experiment is conducted using Intel Xeon E5530 Quad Core machine operating at 2.4GHz. ISPD 2005 placement benchmark [2] is used for input benchmark. Final legalization and detail placement is delegated to an external binary FastDP [20]. An open-source C-code based FFT solver [10] is used in computation to generate temperature profile. All of the testcases are performed using same parameter without tuning for a specific testcase. The total HPWL is evaluated using open source GSRC bookshelf evaluator [3] and every temperature profile is evaluated with Icepak after legalization.
4.1
Thermal Model
To examine the accuracy of our thermal model, we reference the parameters from [12] and apply in our thermal model. Chip width and height are given in each
testcases and thickness of chip is set to 0.5 mm. Thermal conductivity κ of silicon
is set to 148 W/(m ·◦
C), heat transfer rate hp and hs is set to 8700 W/(m2· C) and
2017 W/(m2· C) respectively. Power density of each cell is a random value ranging
from 0 to 2∗106W/m2. Ambient temperature is set to 22.02◦
C and the grid number
Nx and Ny used in thermal model is set to 128.
Figure 4.1 is the comparison of temperature profile obtained by running our simulation model with ANSYS Icepak. The experimental results of max temperature and average temperature running in our thermal simulation and Icepak are presented as shown in Table 4.1. On average, our thermal simulation achieves an accuracy within 6.5% deviation compare to Icepak with 240 times speedup. Note that, the deviation of temperature distribution for each testcase is calculated by taking the average of the temperature difference for each grid between temperature calculated in our thermal simulation and temperature obtained from Icepak.
Table 4.1: Temperature profile comparison with Icepak
Without Thermal With Thermal Max Temp.
Our Icepak Our Icepak Red. (%)
# Max. Avg. Time(s) Max. Avg. Time(s) Deviation Max. Time(s) Max. Time(s) Deviation Our IcePak adaptec1 64.64 46.82 0.74 63.78 50.2 219.71 8.58 60.21 0.51 61.62 247.54 9.22 5.85 3.39 adaptec2 52.67 35.05 0.64 48.02 35.71 235.02 5.75 50.45 0.43 48.38 235.82 5.37 -0.02 -0.75 adaptec3 42.56 31.96 0.68 45.20 32.83 236.67 3.43 44.01 0.67 42.22 240.66 3.34 6.10 6.60 adaptec4 47.57 32.73 0.71 46.82 32.65 242.68 3.13 45.18 0.71 45.86 250.83 3.13 2.98 2.05 bigblue1 74.44 50.12 0.51 73.41 55.08 225.46 10.71 70.77 0.52 71.36 234.69 10.53 3.77 2.79 bigblue2 55.07 39.82 0.75 53.56 40.82 209.26 6.11 51.03 0.75 53.25 247.39 6.36 2.03 0.58 bigblue3 72.49 36.17 1.24 71.52 36.48 218.39 5.62 65.46 1.35 64.61 231.09 5.55 11.29 9.66 bigblue4 75.35 43.02 2.19 77.67 45.06 222.37 6.76 72.37 2.42 75.49 234.55 8.41 3.95 2.81 Avg. - - 0.93 - - 226.20 6.26 - 0.92 - 240.32 6.49 4.50 3.39
4.2
Placement
In this section, placement results are compared with Capo [21], APlace2 [15], mPL6 [5] and simPL [16]. Capo is a partition based placer, APlace2 is a non-linear force-directed placer, mPL6 is currently strongest non-non-linear force directed placer and simPL is currently one of the strongest placer in ISPD 2005 benchmark.
Our placer is on par in terms of total HPWL compare to APlace2 and outperform Capo10.5 by a nose with 3-4 times speed up. Comparing with state-of-the-art placer, our placer trail behind by 10% and is 4 times slower compare to simPL and 2 times faster compare to mPL6. However, although total HPWL trails behind state-of-the-art placers, the results presented in this section demonstrates that our placer can achieve reasonable placement quality within 10% comparing to the best placers with relatively fast execution speed comparing to partioned-based and non-linear force directed placer as be shown in Table 4.2.
Table 4.2: Placement results in ISPD 2005 benchmarks
Our Capo 10.5 APlace 2.0 simPL mPL6
benchmarks Cell # HPWL Runtime HPWL Runtime HPWL Runtime HPWL Runtime HPWL Runtime adaptec1 211K 87.69 7.82 88.14 25.95 78.35 35.02 77.73 2.27 77.93 18.36 adaptec2 255K 98.22 12.33 100.25 36.06 95.70 50.57 90.36 3.48 92.04 19.91 adaptec3 452K 242.49 25.03 276.80 78.19 218.52 119.53 208.95 7.04 214.16 58.92 adaptec4 496K 212.09 24.40 231.30 79.32 209.28 131.57 187.40 5.30 193.89 55.95 bigblue1 278K 107.46 13.20 110.92 41.78 100.02 44.91 97.42 4.01 96.80 22.82 bigblue2 558K 155.53 22.57 162.81 80.55 153.75 100.96 145.78 8.28 152.34 61.55 bigblue3 1110K 378.09 79.05 405.40 182.94 411.59 209.24 339.78 13.79 344.10 85.23 bigblue4 2180K 897.69 194.00 1016.19 567.15 871.29 489.05 808.22 35.80 829.44 189.83 Norm. - 1.00 1.00 1.07 3.07 0.97 3.97 0.90 0.26 0.92 1.89
4.3
Thermal Aware Placement
In Table 4.3, when weight of each thermal anchor increases linearly in each itera-tion, max temperature is reduced by 2.2% at the expense of 15% increase in HPWL. This is primarily due to large perturbation of cells in early iterations. By adjusting weight of thermal anchor to increase quadratically, better max temperature reduc-tion can be obtained with less increase in total HPWL. In addireduc-tion, we observed that moving cells in the hottest region is most effective to reduce maximum temperature, however, moving every single cells inside the hot region will cause temperature in-crease around the perimeter of hot region. Thus, we apply Gaussian force model to let cells closer to the hot region have more energy to move away while cells farther away from the hot region have less energy to move. With additional of thermal force
adjustment, 4.5% reduction in max temperature with 7% increase in total HPWL. Figure 4.1 is the temperature distribution of adaptec1 with and without con-sidering thermal effect. Note that region in red only denote the hottest region and does not correspond to a specific temperature. Maximum temperature for Figure 4.1(c) is 5.85% less than Figure 4.1(a) with 3.5% increase in total HPWL. It can be observed that a smoother temperature distribution can be obtained by consid-ering thermal effect. Noting that in Table 4.3, LW is the linear width and QW is Quadratic Width.
Table 4.3: Various approaches to reduce maximum temperature
Original LW increase W = n QW increase W = n2/N QW +Gaussian
benchmarks Cell# HPWL Temp. HPWL Temp. Red.(%) HPWL Temp. Red.(%) HPWL Temp. Red.(%) adaptec1 211K 87.69 63.95 91.40 60.72 5.05 89.33 60.43 5.50 90.76 60.21 5.85 adaptec2 255K 98.22 50.44 115.86 49.25 2.36 109.94 52.96 -5.00 105.76 50.45 -0.02 adaptec3 452K 242.49 46.87 282.51 49.68 -6.00 274.06 45.02 3.94 258.81 44.01 6.10 adaptec4 496K 212.09 46.57 219.20 44.43 4.59 228.01 45.33 2.66 219.12 45.18 2.98 bigblue1 278K 107.46 73.54 126.37 71.39 2.92 123.76 71.25 3.11 118.61 70.77 3.77 bigblue2 558K 155.53 52.09 160.73 53.22 -2.17 166.15 51.46 1.21 163.46 51.03 2.03 bigblue3 1110K 378.09 73.79 524.75 67.42 8.63 474.70 62.28 15.60 439.27 65.46 11.29 bigblue4 2180K 897.69 75.35 935.53 66.55 2.05 947.73 66.88 11.24 909.50 72.37 3.95 Norm. - 1.00 1.00 1.15 0.98 2.18 1.11 0.95 4.79 1.07 0.95 4.50
(a) (b)
(c) (d)
(e) (f)
Figure 4.1: (a). Power density of adaptec1. (b). Power density of bigblue3. (c). Temperature distribution of adaptec1. (d). Temperature distribution of bigblue3. (e). Temperature distribution of adaptec1 using Icepak before thermal aware place-ment. (f). Temperature distribution of adaptec1 using Icepak before thermal aware placement.
(a) (b)
(c) (d)
Chapter 5
Conclusions
In this thesis, we proposed a thermal aware placement using analytical thermal model and used a force directed algorithm for our placer. The analytical thermal model is implemented using Green function and temperature profile is solved using DCT with input reordering to enhance speed. Analytical thermal model can offer closed form thermal representation with negligible run time. Although analytical thermal model is less accurate but can obtain a temperature profile much faster within reasonable accuracy. Such characteristic is very suitable for thermal analysis during placement or floorplanning stage where thermal analysis does not require to be exact.
Regarding to our placement, we adopt our placer based on [16] with few modifi-cations of our own. Although it trails behind state-of-the-art placers, experimental result shows that the quality of our placer is within reasonable range. The general structure of our placer is relatively simple allowing thermal model easily integrated. Considering thermal effect, we proposed a new methodology by adding thermal anchors around identified hot region to reduce maximum temperature. The size of hot region is dynamic such that perturbation of cells is limited without sacrificing too much HPWL. The magnitude of the thermal anchor is calculated using a 2-D Gaussian model to obtain a smooth temperature profile.
Bibliography
[1] Ansys icepak. http://www.ansys.com/Products/Simulation+Technology/ Fluid+Dynamics/ANSYS+Icepak.
[2] Ispd 2005 benchmark. http://archive.sigda.org/ispd2006/contest.html. [3] S. N. Adya and I. L. Markov. Executable placement utilities. http://vlsicad.
eecs.umich.edu/BK/PlaceUtils/.
[4] N. Ahmed, T. Natarajan, and K. Rao. Discrete cosine transfom. volume C-23, pages 90 – 93, jan. 1974.
[5] T. F. Chan, J. Cong, J. R. Shinnerl, K. Sze, and M. Xie. mpl6: enhanced mul-tilevel mixed-size placement. In International Symposium on Physical Design, pages 212–214, 2006.
[6] S. Chaudhury.
[7] D. Chen, E. Li, E. Rosenbaum, and S.-M. Kang. Interconnect thermal modeling for accurate simulation of circuit timing and reliability. volume 19, pages 197 –205, feb 2000.
[8] G. Chen and S. Sapatnekar. Partition-driven standard cell thermal placement. In Proceedings of the 2003 international symposium on Physical design, ISPD ’03, pages 75–80, New York, NY, USA, 2003. ACM.
[9] T.-C. Chen, Z.-W. Jiang, T.-C. Hsu, H.-C. Chen, and Y.-W. Chang. Ntuplace3: An analytical placer for large-scale mixed-size designs with preplaced blocks and density constraints. volume 27, pages 1228–1240, July 2008.
[10] M. Frigo and S. G. Johnson. Fftw. http://www.fftw.org/links.html. [11] B. Goplen and S. Sapatnekar. Efficient thermal placement of standard cells in 3d
ics using a force directed approach. In Proceedings of the 2003 IEEE/ACM in-ternational conference on Computer-aided design, ICCAD 03, pages 86–, Wash-ington, DC, USA, 2003. IEEE Computer Society.
[12] B. Goplen and S. S. Sapatnekar. Efficient thermal placement of standard cells in 3d ics using a force directed approach. In Proc. Int. Conf. on Computer-Aided Design, pages 86–89, 2003.
[13] P.-Y. Huang, C.-K. Lin, and Y.-M. Lee. Full-chip thermal analysis for the early design stage via generalized integral transforms. In Design Automation Conference, 2008. ASPDAC 2008. Asia and South Pacific, pages 462 –467, march 2008.
[14] A. B. Kahng, S. mo Kang, W. Li, and B. Liu. Analytical thermal placement for vlsi lifetime improvement and minimum performance variation. In International Conference on Computer Design, pages 71–77, 2007.
[15] A. B. Kahng and Q. Wang. A faster implementation of aplace. In International Symposium on Physical Design, pages 218–220, 2006.
[16] M.-C. Kim, D.-J. Lee, and I. Markov. Simpl: An effective placement algo-rithm. In Computer-Aided Design (ICCAD), 2010 IEEE/ACM International Conference on, pages 649 –656, nov. 2010.
[17] J. Li and H. Miyashita. Thermal-aware placement based on fm partition scheme and force-directed heuristic. volume E89-A, pages 989–995, Oxford, UK, April 2006. Oxford University Press.
[18] J. Makhoul. A fast cosine transform in one and two dimensions. volume 28, pages 27 – 34, feb 1980.
[19] B. Obermeier and F. M. Johannes. Temperature-aware global placement. In Proceedings of the 2004 Asia and South Pacific Design Automation Conference, ASP-DAC ’04, pages 143–148, Piscataway, NJ, USA, 2004. IEEE Press.
[20] M. Pan, N. Viswanathan, and C. C. N. Chu. An efficient and effective detailed placement algorithm. In International Conference on Computer Aided Design, pages 48–55, 2005.
[21] J. Roy. Capo: Robust and scalable open-source min-cut floorplacer. In Inter-national Symposium on Physical Design, pages 218–220, 2006.
[22] P. Spindler, U. Schlichtmann, and F. M. Johannes. Kraftwerk2 - a fast force-directed quadratic placement approach using an accurate net model. volume 27, pages 1398–1411, 2008.
[23] C. H. Tsai and S. M. Kang. Cell-level placement for improving substrate ther-mal distribution. In IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., volume 19, pages 253–266, Feb. 2000.
[24] N. Viswanathan, G.-J. Nam, C. Alpert, P. Villarrubia, H. Ren, and C. Chu. Rql: Global placement via relaxed quadratic spreading and linearization. In Design Automation Conference, 2007. DAC ’07. 44th ACM/IEEE, pages 453 –458, june 2007.
[25] N. Viswanathan, M. Pan, and C. C. N. Chu. Fastplace 3.0: A fast multilevel quadratic placement algorithm with placement congestion control. In Asia and South Pacific Design Automation Conference, pages 135–140, 2007.
[26] B. Wang and P. Mazumder. Accelerated chip-level thermal analysis using mul-tilayer green’s function. volume 26, pages 325 –344, feb. 2007.
[27] Y. Zhan and S. Sapatnekar. High-efficiency green function-based thermal simu-lation algorithms. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 26(9):1661 –1675, sept. 2007.