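National Science Council, Executive Yuan: Research Project Final Report

Chip Design and Implementation of Fast CORDIC (Coordinate Rotation) Algorithms for Special-Purpose Digital Signal Processing

Research Results Report (Condensed Version)

Project type: Individual

Project number: NSC 97-2221-E-216-044-

Execution period: August 1, 2008 to July 31, 2009

Executing institution: Department of Microelectronics Engineering, Chung Hua University

Principal investigator: 宋志雲 (T. Y. Sung)

Co-principal investigator: 辛錫進 (H. C. Hsin)

Project participants: Master's student and part-time assistant: 柯律廷

Report attachments: Report on attending an international conference, and the paper presented

Handling: This project may be publicly accessed

July 20, 2009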

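National Science Council, Executive Yuan, Subsidized Research Project: █ Final Report □ Interim Progress Report

Chip Design and Implementation of Fast CORDIC (Coordinate Rotation) Algorithms for Special-Purpose Digital Signal Processing

Project type: █ Individual project □ Integrated project

Project number: NSC 97-2221-E-216-044-

Execution period: August 1, 2008 to July 31, 2009

Principal investigator: 宋志雲 Co-principal investigator: 辛錫進 Project participant: 柯律廷

Report type (submitted according to the approved budget list): □ Condensed report □ Full report

This report includes the following required attachments:

□ One report on an overseas business trip or study visit

□ One report on a business trip or study visit to mainland China

█ One report on attending an international academic conference and one copy of the published paper

□ One overseas research report for an international cooperative research project

Handling: Except for industry-academia cooperative research projects, projects for upgrading industrial technology and cultivating talent, controlled projects, and the cases below, the report may be made publicly available immediately

□ Involves patents or other intellectual property rights; publicly available after □ one year □ two years

Executing institution: Chung Hua University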

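Abstract (translated from the Chinese)

The first part of this report describes a CORDIC (coordinate rotation)-based split-radix fast Fourier transform (FFT) core for orthogonal frequency division multiplexing (OFDM) systems such as ultra-wideband (UWB) wireless networks, asymmetric digital subscriber line (ADSL), digital audio broadcasting (DAB), digital video broadcasting – terrestrial (DVB-T), very high bitrate DSL (VHDSL) and worldwide interoperability for microwave access (WiMAX). High-speed 128/256/512/1024/2048/4096/8192-point FFT processors have been designed as chips in the TSMC 0.18 μm (1P6M) process, with all control circuits integrated on a single chip. The resulting FFT processors achieve better efficiency in both power consumption and chip area.

Keywords: Intellectual Property (IP), fast Fourier transform, split-radix, CORDIC, OFDM.

The second part of this report describes a VLSI architecture for two-dimensional forward and inverse 8×8 cosine transform processors. The architecture has parallel and pipelined structures and contains a 64-word SRAM and a 6-word ROM for the coefficients, which saves memory space compared with architectures based on conventional algorithms. All multipliers in the processor are replaced by CORDIC processors, which saves considerable hardware, reduces power consumption and increases efficiency.

Keywords: forward and inverse cosine transform, parallel and pipelined structure, low power, CORDIC.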

Abstract (English)

The first part of this report presents a Coordinate Rotation Digital Computer (CORDIC)-based split-radix fast Fourier transform (FFT) core for Orthogonal Frequency Division Multiplexing (OFDM) systems, for example, Ultra Wide Band (UWB), Asymmetric Digital Subscriber Line (ADSL), Digital Audio Broadcasting (DAB), Digital Video Broadcasting – Terrestrial (DVB-T), Very High Bitrate DSL (VHDSL), and Worldwide Interoperability for Microwave Access (WiMAX). High-speed 128/256/512/1024/2048/4096/8192-point FFT processors have been implemented in a 0.18 μm (1P6M) process at 1.8 V, in which all the control signals are generated internally.

These FFT processors outperform the conventional ones in terms of both power consumption and core area.

Key-Words: - Intellectual Property (IP), FFT, split-radix, CORDIC, OFDM.

Two-dimensional discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) have been widely used in many image processing systems. In the second part of this report, efficient architectures with parallel and pipelined structures are proposed to implement 8×8 DCT and IDCT processors, in which only one bank of SRAM (64 words) and one coefficient ROM (6 words) are utilized to save memory space. The kernel arithmetic unit, i.e. the multiplier, which is demanding in the implementation of DCT and IDCT processors, has been replaced by simple adders and shifters based on the CORDIC algorithm. The proposed architectures for 2-D DCT and IDCT processors not only simplify the hardware but also reduce the power consumption while maintaining high performance.

Key-Words: - DCT, IDCT, parallel and pipelined architecture, low-power, CORDIC.

Report Content:

Part I. VLSI Architecture of a Reconfigurable Fast Fourier Transform

1 Introduction

A high-performance fast Fourier transform (FFT) processor is needed, especially for real-time digital signal processing applications. Specifically, the computation of the discrete Fourier transform (DFT) ranging from 128 to 8192 points is required for the orthogonal frequency division multiplexing (OFDM) of the following standards: Ultra Wide Band (UWB), Asymmetric Digital Subscriber Line (ADSL), Digital Audio Broadcasting (DAB), Digital Video Broadcasting – Terrestrial (DVB-T), Very High Bitrate DSL (VHDSL) and Worldwide Interoperability for Microwave Access (WiMAX) [1]-[8]. Thompson [9] proposed an efficient VLSI architecture for FFT in 1983. Wold and Despain [10] proposed pipelined and parallel-pipelined FFT for VLSI implementations in 1984. Widhe [11] developed efficient processing elements of FFT in 1997.

To reduce the computation complexity, the split-radix 2/4, 2/8, and 2/16 FFT algorithms were proposed in [12]-[15].

As the Booth multiplier is not suitable for hardware implementations of large FFTs, we propose a CORDIC-based multiplier. Moreover, we develop a ROM-free twiddle factor generator using simple shifters and adders only [1], which obviates the need to store all the twiddle factors in a large ROM space. As a result, the proposed CORDIC-based split-radix FFT core with the ROM-free twiddle factor generator is suitable for wireless local area network (WLAN) applications. In this report, a high-performance 128/256/512/1024/2048/4096/8192-point FFT processor is presented for the European and Japanese standards. The remainder of this report proceeds as follows. In Section 2, the split-radix 2/8 FFT algorithm and the CORDIC algorithm are reviewed briefly. In Section 3, the reusable IP 128-point CORDIC-based split-radix FFT core is proposed. In Section 4, the hardware implementations of the FFT processors are described. The performance analysis is presented in Section 5. Finally, the conclusion is given in Section 6.

2 Review of Split-Radix FFT and CORDIC Algorithm

2.1 Split-Radix FFT

The idea behind the split-radix FFT algorithm is to compute the even and odd terms of FFT separately. The even term of the split-radix 2/8 FFT algorithm is given by

$$X(2k) = \sum_{n=0}^{N/2-1} \left( x(n) + x\left(n + \tfrac{N}{2}\right) \right) W_{N/2}^{nk} \tag{1}$$

where $W_N = e^{-j2\pi/N}$ and $k = 0, 1, 2, \ldots, (N/2) - 1$. The odd term is as follows:

$$X(8k+l) = \sum_{n=0}^{N/8-1} \Big[ \Big( x(n) + x\left(n+\tfrac{N}{8}\right) W_8^{l} + x\left(n+\tfrac{2N}{8}\right) W_4^{l} + x\left(n+\tfrac{3N}{8}\right) W_8^{3l} + x\left(n+\tfrac{4N}{8}\right) W_2^{l} + x\left(n+\tfrac{5N}{8}\right) W_8^{5l} + x\left(n+\tfrac{6N}{8}\right) W_4^{3l} + x\left(n+\tfrac{7N}{8}\right) W_8^{7l} \Big) W_N^{nl} \Big] W_{N/8}^{nk} \tag{2}$$

where $k = 0, 1, 2, \ldots, (N/8) - 1$ and $l = 1, 3, 5, 7$.

The split-radix 2/8 FFT algorithm, combined with radix-2 and radix-4, proves effective for developing a reusable IP 128-point FFT core.
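As a numerical cross-check of the decomposition, the following minimal sketch evaluates equations (1) and (2) literally and compares the result with a direct DFT. It is not the processor's butterfly schedule; the function names (`twiddle`, `direct_dft`, `split_radix_2_8`) are illustrative assumptions introduced here.

```python
import cmath


def twiddle(base, exponent):
    """W_base^exponent = exp(-j*2*pi*exponent/base)."""
    return cmath.exp(-2j * cmath.pi * exponent / base)


def direct_dft(x):
    """Reference O(N^2) DFT, used only for comparison."""
    size = len(x)
    return [sum(x[n] * twiddle(size, n * k) for n in range(size))
            for k in range(size)]


def split_radix_2_8(x):
    """Evaluate equations (1) and (2) literally; len(x) must be a multiple of 8."""
    size = len(x)
    half, eighth = size // 2, size // 8
    spectrum = [0j] * size
    # Even-indexed outputs X(2k), equation (1).
    for k in range(half):
        spectrum[2 * k] = sum((x[n] + x[n + half]) * twiddle(half, n * k)
                              for n in range(half))
    # Odd-indexed outputs X(8k + l), l = 1, 3, 5, 7, equation (2).
    for l in (1, 3, 5, 7):
        for k in range(eighth):
            total = 0j
            for n in range(eighth):
                # Eight-input butterfly; W_8^{pl} covers the W_8, W_4, W_2 terms.
                inner = sum(x[n + p * eighth] * twiddle(8, p * l) for p in range(8))
                total += inner * twiddle(size, n * l) * twiddle(eighth, n * k)
            spectrum[8 * k + l] = total
    return spectrum
```

The even outputs need only one addition per input pair before a half-length transform, while each odd output group shares the same eight-input butterfly, which is what makes the 2/8 split attractive for a reusable core.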

2.2 CORDIC Algorithm

The CORDIC algorithm in the circular coordinate system is as follows [16].

$$x(i+1) = x(i) - \sigma_i 2^{-i} y(i) \tag{3}$$

$$y(i+1) = y(i) + \sigma_i 2^{-i} x(i) \tag{4}$$

$$z(i+1) = z(i) - \sigma_i \alpha(i) \tag{5}$$

$$\alpha(i) = \tan^{-1}(2^{-i}) \tag{6}$$

where $\sigma_i = \mathrm{sign}(z(i))$ with $z(i) \to 0$ in the rotation mode, and $\sigma_i = -\mathrm{sign}(x(i)) \cdot \mathrm{sign}(y(i))$ with $y(i) \to 0$ in the vectoring mode. The scale factor $k(i)$ is equal to $\sqrt{1 + \sigma_i^2 2^{-2i}}$. After $n$ micro-rotations, the product of the scale factors is given by

$$K_c = \prod_{i=0}^{n-1} k(i) = \prod_{i=0}^{n-1} \sqrt{1 + 2^{-2i}} \tag{7}$$

Notice that CORDIC in the circular coordinate system with rotation mode can be written by

$$\begin{bmatrix} x_n \\ y_n \end{bmatrix} = K_c \begin{bmatrix} \cos z_0 & -\sin z_0 \\ \sin z_0 & \cos z_0 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}, \tag{8}$$

where $\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$ and $\begin{bmatrix} x_n \\ y_n \end{bmatrix}$ are the input vector and the output vector, respectively, $z_0$ is the rotation angle, and $K_c$ is the scale factor. In [1], the circular rotation computation of CORDIC was used for complex multiplication with $e^{-j\theta}$, which is given by

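$$\begin{bmatrix} \mathrm{Re}[X'] \\ \mathrm{Im}[X'] \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \mathrm{Re}[X] \\ \mathrm{Im}[X] \end{bmatrix} \tag{9}$$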
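The rotation-mode recurrences can be sketched in floating point as follows. This is an illustrative model only, assuming a 16-iteration default and dividing out the scale factor afterwards; the function names are hypothetical and the sketch does not model the fixed-point pipeline described later.

```python
import math


def cordic_rotate(x, y, z, n_iter=16):
    """Rotation-mode CORDIC, equations (3)-(6): drives z toward 0."""
    for i in range(n_iter):
        sigma = 1.0 if z >= 0 else -1.0
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        z -= sigma * math.atan(2.0 ** -i)
    return x, y, z


def scale_factor(n_iter=16):
    """Product of the per-stage scale factors, equation (7)."""
    gain = 1.0
    for i in range(n_iter):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain


def multiply_by_exp_minus_j_theta(value, theta, n_iter=16):
    """Complex multiply by e^{-j*theta}, done as a CORDIC rotation by -theta."""
    x, y, _ = cordic_rotate(value.real, value.imag, -theta, n_iter)
    gain = scale_factor(n_iter)
    return complex(x / gain, y / gain)
```

For example, `multiply_by_exp_minus_j_theta(3 + 4j, 0.6)` approximates `(3 + 4j) * cmath.exp(-0.6j)`; without extra initial iterations the angle must stay within roughly ±1.74 rad.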

3 Reusable IP 128-Point CORDIC-Based Split-Radix FFT Core

Figure 1 shows the proposed 128-point CORDIC-based split-radix FFT processor, which can be used as a reusable IP core for various FFTs with multiples of 128 points. Notice that the modified split-radix 2/8 FFT butterfly processor and the ROM-free twiddle factor generator are used. In addition, an internal (128 × 32-bit) SRAM is used to store the input and output data for hardware efficiency, through the use of the in-place computation algorithm [1].

3.1 CORDIC-Based Split-Radix 2/8 FFT Processor

For the butterfly computation of the proposed CORDIC-based split-radix 2/8 FFT processor,

sixteen complex additions, two constant multiplications (CM), and four CORDIC operations are

needed, as shown in Figure 2. The CORDIC algorithm has been widely used in various DSP

applications because of the hardware simplicity. According to equation (9), the twiddle factor


The pipelined CORDIC arithmetic unit can be obtained by decomposing the CORDIC algorithm into a sequence of operational stages. In [17], we derived the error analysis of fixed-point CORDIC arithmetic, based on which the number of CORDIC stages can be determined effectively. For example, the number of CORDIC stages is 12 if the overall relative error of 16-bit CORDIC arithmetic is required to be less than $10^{-3}$. The pipelined CORDIC arithmetic unit therefore has 12 stages and an additional pre-scaling stage, in which the pre-calculated scaling factor is $K_c \approx 1.64676$, represented in the Booth binary recoded format as 1.101001. The main concern for the design of the CORDIC arithmetic unit is throughput rather than latency. The gate count of the proposed CORDIC arithmetic unit is significantly less than that of four real multipliers. In addition, the power consumption can be reduced significantly by using the proposed CORDIC arithmetic unit; it has been reduced by 30% according to the report of PrimePower® distributed by Synopsys.
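The scale factor quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the recoded constant 1.101001 is read as an ordinary binary fraction (any signed digits of the recoding are not modelled), and the helper names are hypothetical.

```python
import math


def cordic_gain(stages):
    """Accumulated CORDIC gain after the given number of stages, equation (7)."""
    gain = 1.0
    for i in range(stages):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain


def binary_value(digits):
    """Value of a plain binary string such as '1.101001' (unsigned digits assumed)."""
    whole, frac = digits.split(".")
    return int(whole, 2) + int(frac, 2) / 2 ** len(frac)


gain_12 = cordic_gain(12)
approx = binary_value("1.101001")
relative_error = abs(gain_12 - approx) / gain_12
```

Under that reading, the seven-digit constant differs from the 12-stage gain by a few tenths of a percent, which gives a feel for the precision traded for a shorter shift-and-add chain.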

As the twiddle factors $W_8^1$ and $W_8^3$ are equal to $\frac{\sqrt{2}}{2}(1 - j)$ and $-\frac{\sqrt{2}}{2}(1 + j)$, respectively, a complex number, say $(a + bj)$, times $W_8^1$ or $W_8^3$ can be written as

$$(a + bj) \times \frac{\sqrt{2}}{2}(1 - j) = \frac{\sqrt{2}}{2}\big((a + b) + j(-a + b)\big) \tag{10}$$

$$(a + bj) \times \left(-\frac{\sqrt{2}}{2}(1 + j)\right) = \frac{\sqrt{2}}{2}\big((-a + b) - j(a + b)\big) \tag{11}$$

where $\sqrt{2}$ can be represented as 1.0101010 using the Booth binary recoded form (BBRF).

Thus, the CM unit can be implemented by using simple adders and shifters only. Figure 3 shows the pipelined CM architecture, which uses three subtractions/additions and therefore improves on the computation speed significantly.
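Equations (10) and (11) can be checked numerically with the short sketch below. The constant scaling is done with a floating-point multiply here, whereas the CM unit described above would use shifts and adds; the function names are illustrative assumptions.

```python
import cmath
import math

HALF_SQRT2 = math.sqrt(2.0) / 2.0


def times_w8_1(a, b):
    """(a + bj) * W_8^1 via equation (10): one add, one subtract, one scaling."""
    return complex(HALF_SQRT2 * (a + b), HALF_SQRT2 * (b - a))


def times_w8_3(a, b):
    """(a + bj) * W_8^3 via equation (11)."""
    return complex(HALF_SQRT2 * (b - a), -HALF_SQRT2 * (a + b))


def w8(power):
    """Reference twiddle factor W_8^power = exp(-j*2*pi*power/8)."""
    return cmath.exp(-2j * cmath.pi * power / 8)
```

Both products reuse the same two sums, $a + b$ and $b - a$, so a single scaling path by $\sqrt{2}/2$ serves either twiddle factor.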

Based on the above-mentioned CORDIC arithmetic unit and CM unit, the computational circuit and hardware architecture of the CORDIC-based split-radix 2/8 FFT butterfly computation are realized. As one can see, the pipelined CORDIC arithmetic unit aims at increasing the throughput of complex multiplications.

3.2 ROM-Free Twiddle Factor Generator

In the conventional FFT processor, a large ROM space is needed to store all the twiddle factors.

To reduce the chip area, a twiddle factor generator is thus proposed. Figure 4 shows the ROM-free twiddle factor generator using simple adders and shifters for 128-point FFT, in which the 16-bit accumulator generates the value $2n\pi$ for each index $n$ up to $2^{\log_2 N - 3} - 1$, the 16-bit shifter divides $2n\pi$ by $N$, and the 16-bit shifter/adder produces the twiddle factors $\theta_N^{1n}$, $\theta_N^{3n}$, $\theta_N^{5n}$ and $\theta_N^{7n}$. By using the twiddle factor generator, the chip area and power consumption can be reduced significantly at the cost of an additional logic circuit. Table 1 shows the gate counts of the full ROM storing all the twiddle factors, the CORDIC twiddle factor generator [1] and the ROM-free twiddle factor generator.
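One way to picture the accumulator/shifter/adder chain is with integer phase words. The sketch below is an assumption-laden model, not the circuit of Figure 4: it assumes a 16-bit phase word in which $2^{16}$ represents $2\pi$, realizes the division by $N$ as a shift, and forms the odd multiples with one shift and one add or subtract each. All names are hypothetical.

```python
import math

PHASE_BITS = 16  # assumed word length; 2**16 phase units represent 2*pi


def twiddle_phases(n, log2_size):
    """Phase words for the angles 2*pi*l*n/N with l = 1, 3, 5, 7 and N = 2**log2_size."""
    mask = (1 << PHASE_BITS) - 1
    p1 = (n << (PHASE_BITS - log2_size)) & mask  # 2*pi*n/N: divide by N as a shift
    p3 = (p1 + (p1 << 1)) & mask                 # 3*p1 = p1 + 2*p1
    p5 = (p1 + (p1 << 2)) & mask                 # 5*p1 = p1 + 4*p1
    p7 = ((p1 << 3) - p1) & mask                 # 7*p1 = 8*p1 - p1
    return p1, p3, p5, p7


def phase_to_radians(phase):
    """Convert a phase word back to an angle in radians."""
    return 2.0 * math.pi * phase / (1 << PHASE_BITS)
```

The resulting angles would then feed the CORDIC rotators directly, so no sine or cosine table is needed.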


4 Implementation of FFT Processors

In the 128/256/512/1024/2048/4096/8192-point FFT processors, the radix-2 and split-radix 2/4 butterfly processors [1] using the pipelined CORDIC arithmetic units and twiddle factor generators are implemented; moreover, two memory banks (4096/2048/1024/512/256/0 × 32-bit and 8192/4096/2048/1024/512/256/128 × 32-bit) are allocated for increased efficiency by using the in-place computation algorithm [1]. The hardware architecture is shown in Figure 5.

The hardware code written in Verilog® runs on a workstation with the ModelSim® simulation tool and the Synopsys® synthesis tool (Design Compiler). The chips are synthesized with the TSMC 0.18 μm 1P6M CMOS cell libraries [18]. The physical circuit is synthesized by the Astro® tool. The circuits are evaluated by DRC, LVS and PVS [19].

The layout view of the 8192-point FFT processor is shown in Figure 6. The core areas, power consumptions and clock rates of the 128-point, 256-point, 512-point, 1024-point, 2048-point, 4096-point and 8192-point FFT processors are shown in Table 2. All the control signals are internally generated on-chip. The chip provides both high throughput and low gate count.

5 Performance Analysis of The Proposed FFT Architecture

FFT processors used to compute 128/256/512/1024/2048/4096/8192-point FFT are composed mainly of the 128-point CORDIC-based split-radix 2/8 FFT core; the computation complexity using a single 128-point FFT core is $O(N/6)$ for N-point FFT. The log-log plot of the CORDIC computations versus the number of FFT points is shown in Figure 7. As one can see, the proposed FFT architecture is able to improve the power consumption and computation speed significantly.

6 Conclusion

This report presents low-power and high-speed FFT processors based on CORDIC and split-radix techniques for OFDM systems. The architectures are mainly based on a reusable IP 128-point CORDIC-based split-radix FFT core. The pipelined CORDIC arithmetic unit is used to compute the complex multiplications involved in FFT, and moreover the required twiddle factors are obtained by using the proposed ROM-free twiddle factor generator rather than storing them in a large ROM space.

The CORDIC-based 128/256/512/1024/2048/4096/8192-point FFT processors have been implemented in 0.18 μm CMOS, which take 395 μs, 176.8 μs, 77.9 μs, 33.6 μs, 14 μs, 5.5 μs and 1.88 μs to compute 8192-point, 4096-point, 2048-point, 1024-point, 512-point, 256-point and 128-point FFT, respectively.

The CORDIC-based FFT processors are designed by using the portable and reusable Verilog®. The 128-point FFT core is a reusable intellectual property (IP), which can be implemented in various processes and combined with an efficient use of hardware resources for the trade-offs of performance, area, and power consumption.


[2] J. C. Kuo, C. H. Wen, A. Y. Wu, “Implementation of a programmable 64~2048-point FFT/IFFT processor for OFDM-based communication systems,” Proceedings of the 2003 International Symposium on Circuits and Systems, Volume 2, 25-28 May 2003, pp.II-121 - II-124.

[3] L. Xiaojin, Z. Lai, C. J. Cui, “A low power and small area FFT processor for OFDM demodulator,” IEEE Transactions on Consumer Electronics, Volume 53, Issue 2, May 2007, pp. 274 – 277.

[4] J. Lee, H. Lee, S. I. Cho, S. S. Choi, “A high-speed, low-complexity radix-216 FFT processor for MB-OFDM UWB systems,” Proceedings of the 2006 IEEE International Symposium on Circuits and Systems, May 2006.

[5] A. Cortes, I. Velez, J. F. Sevillano, A. Irizar, “An approach to simplify the design of IFFT/FFT cores for OFDM systems,” IEEE Transactions on Consumer Electronics, Volume 52, Issue 1, Feb. 2006, pp.26 – 32.

[6] Y. H. Lee, T. H. Yu, K. K. Huang, A. Y. Wu, “Rapid IP design of variable-length cached-FFT processor for OFDM-based communication systems,” IEEE Workshop on Signal Processing Systems Design and Implementation, Oct. 2006 pp.62-65.

[7] C. L. Wey, W. C. Tang, S. Y. Lin, “Efficient memory-based FFT architectures for digital video broadcasting (DVB-T/H),” 2007 International Symposium on VLSI Design, Automation and Test, 25-27 April 2007, pp.1-4.

[8] Y. W. Lin, H. Y. Liu, C. Y. Lee, “A 1-GS/s FFT/IFFT processor for UWB applications,” IEEE Journal of Solid-State Circuits, Volume 40, Issue 8, Aug. 2005, pp.1726-1735.

[9] C. D. Thompson, “Fourier transform in VLSI,” IEEE Transactions on Computers, Vol.32, No. 11, 1983, pp.1047-1057.

[10] E. H. Wold, A. M. Despain, “Pipelined and parallel-pipelined FFT processor for VLSI implementation,” IEEE Transactions on Computers, Vol.33, No. 5, 1984, pp.414-426.

[11] T. Widhe, “Efficient implementation of FFT processing elements,” Linkoping Studies in Science and Technology, Thesis No. 619, Linkoping University, Sweden, 1997.

[12] P. Duhamel, H. Hollmann, “Implementation of "split-radix" FFT algorithms for complex, real, and real symmetric data.” IEEE International Conference on Acoustics, Speech, and Signal Processing, Volume 10, April 1985, pp.784 – 787.

[13] A. A. Petrovsky, S. L. Shkredov, “Automatic generation of split-radix 2-4 parallel-pipeline FFT processors: hardware reconfiguration and core optimizations,” 2006 International Symposium on Parallel Computing in Electrical Engineering, pp.181-186.

[14] S. Bouguezel, M. O. Ahmad, M. N. S. Swamy, “A new radix-2/8 FFT algorithm for length-q×2^m DFTs,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Volume 51, Issue 9, 2004, pp.1723-1732.

[15] W. C. Yeh, C. W. Jen, “High-speed and low-power split-radix FFT.” IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 51, Issue 3, March 2003, pp.864 – 874.

[16] M. D. Ercegovac, T. Lang, “CORDIC algorithm and implementations.” Digital Arithmetic, Morgan Kaufmann Publishers, 2004, Chapter 11.

[17] T. Y. Sung, H. C. Hsin, “Fixed-point error analysis of CORDIC arithmetic for special-purpose signal processors,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Vol.E90-A, No.9, Sep. 2007, pp.2006-2013.


[18] “TSMC 0.18 CMOS Design Libraries and Technical Data, v.3.2,” Taiwan Semiconductor Manufacturing Company, Hsinchu, Taiwan, and National Chip Implementation Center (CIC), National Science Council, Hsinchu, Taiwan, R.O.C., 2006.

[19] Cadence Design Systems: http://www.cadence.com/products/pages/default.aspx.

Fig. 1 The proposed 128-point CORDIC-based split-radix FFT processor

Fig. 2 Data flow of the butterfly computation of the modified split-radix 2/8 FFT

Fig. 3 Constant multiplier (CM) architecture for the modified split-radix 2/8 FFT

Fig. 4 Proposed ROM-free twiddle factor generator for 128-point FFT

Fig. 5 Hardware architecture of the 128/256/512/1024/2048/4096/8192-point FFT processor

Table 1 Hardware requirements of the full-ROM, the CORDIC twiddle factor generator [1], and the ROM-free twiddle factor generator

Approaches compared: full twiddle factor ROM; CORDIC twiddle factor generator (T. Y. Sung, 2006) [1]; ROM-free twiddle factor generator (Sung, Hsin and Cheng, 2008)
Components listed: 8192-point ROM (4K × 16 bit); 11-bit adder; 11-bit shifter; 16-bit CORDIC; 16-bit shifter; 16-bit adder; 16-bit accumulator; 16-bit shifter/adder; 16-bit register
Gate counts listed: ~150 gates; ~50 gates; ~90 gates; ~200 gates; ~32 gates (1 bit ~ 1 gate)

Table 2 Core areas, power consumptions, clock rates of 128-, 256-, 512-, 1024-, 2048-, 4096- and 8192-point FFT processors

FFT Size      Core Area    Power Consumption    Clock Rate
128-point     2.28 mm²     80 mW                200 MHz
256-point     2.37 mm²     84 mW                200 MHz
512-point     2.49 mm²     88 mW                200 MHz
1024-point    2.62 mm²     94 mW                200 MHz
2048-point    2.81 mm²     99 mW                200 MHz
4096-point    3.10 mm²     106 mW               200 MHz
8192-point    3.62 mm²     117 mW               200 MHz

Fig. 7 Log-log plot of the CORDIC computations versus the number of FFT points

Part II. High-Efficiency CORDIC-Based 2-D Forward and Inverse Cosine Transform Processors

1 Introduction

With the rapid growth of modern communication applications and computer technologies, image compression continues to be in great demand. Three categories of image compression techniques have been developed: differential pulse code modulation, transform coding and subband coding.

In many cases, transform coding is preferable. The simplest transform coding approach is based on the Walsh-Hadamard transform, in which the kernel matrix involves simple additions and subtractions only [1]. The Karhunen-Loeve transform (K-LT) is optimal; however, the computation involved is too complicated to be implemented. The discrete cosine transform (DCT), which approximates the K-LT well [2], is efficient and therefore adopted by the Joint Photographic Experts Group (JPEG) standard. Many DCT algorithms have been developed [3]-[11]; the VLSI implementations of DCT for real-time applications can be found in [12]-[25].

Though the fast Fourier transform (FFT) can be utilized to implement DCT, it requires complex-valued computations. In addition, the order of N-point DCT through the use of FFT is $O(\log_2 N + 1)$. CORDIC (COordinate Rotation DIgital Computer) is a well-known algorithm, which evaluates various elementary functions including sine and cosine functions by using simple adders and shifters. CORDIC is suitable for the design of high-performance chips using VLSI technology. In this paper, the CORDIC approach to the development of fast, memory-efficient DCT and IDCT is proposed to simplify hardware implementations and reduce power consumption.

The remainder of this paper proceeds as follows. In Section 2, the CORDIC algorithm is reviewed briefly. In Section 3, fast and efficient CORDIC-based 2-D DCT and IDCT algorithms are presented. The implementations of the proposed low-power, parallel and pipelined architectures for 2-D DCT and IDCT processors are given in Section 4. Finally, the conclusion is given in Section 5.

2 Review of CORDIC Algorithm

The basic CORDIC algorithm is given by [26]-[27]

$$x_{i+1} = x_i - \sigma_i 2^{-i} y_i \tag{1}$$

$$y_{i+1} = y_i + \sigma_i 2^{-i} x_i \tag{2}$$

$$z_{i+1} = z_i - \sigma_i \alpha_i \tag{3}$$

where $i = 0, 1, 2, \ldots, n-1$, and

$$\alpha_i = \arctan(2^{-i}) \tag{4}$$

In the $i$-th micro-rotation, the direction of rotation denoted by $\sigma_i$ is determined by $\mathrm{sign}(z_i)$ with $z_n \to 0$ in the rotation mode; $\sigma_i = -\mathrm{sign}(x_i) \cdot \mathrm{sign}(y_i)$ with $y_n \to 0$ in the vectoring mode; and the scale factor is

$$K = \prod_{i=0}^{n-1} k_i = \prod_{i=0}^{n-1} \sqrt{1 + \sigma_i^2 2^{-2i}} = \prod_{i=0}^{n-1} \sqrt{1 + 2^{-2i}} \tag{5}$$

One may take the iteration sequence {0, 0, 0, 1, 2, …, n} for the CORDIC algorithm in the circular coordinate system to expand the convergence range of angles as follows.

$$\theta_{\max} = 2 \cdot \arctan(2^{0}) + \sum_{j=0}^{n} \arctan(2^{-j}) \cong 3.3141 \cong 189^\circ > 180^\circ \tag{6}$$

Thus, the convergence range of angles is expanded to cover the whole plane, $[-180^\circ, 180^\circ]$; in other words, the input angle is unconstrained [28]-[29].
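The bound in equation (6) is easy to confirm numerically. The following short sketch sums the micro-rotation angles for the iteration sequence {0, 0, 0, 1, …, n}; the function name is an illustrative assumption.

```python
import math


def max_rotation_angle(n):
    """Total angle reachable with the iteration sequence {0, 0, 0, 1, ..., n}."""
    repeated = 2.0 * math.atan(2.0 ** 0)                       # two extra i = 0 steps
    regular = sum(math.atan(2.0 ** -j) for j in range(n + 1))  # j = 0, 1, ..., n
    return repeated + regular


theta_max = max_rotation_angle(16)
theta_max_degrees = math.degrees(theta_max)
```

With 16 regular iterations the sum is already within about $2 \times 10^{-5}$ rad of its limit, so the two repeated 45° steps are what push the range past 180°.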

3 The CORDIC-Based DCT and IDCT Algorithm

The N-point 1-D DCT is defined as

$$Y(m) = \sqrt{\frac{2}{N}}\, K_m \sum_{n=0}^{N-1} x(n) \cos\left[\frac{(2n+1) m \pi}{2N}\right] \tag{7}$$

where $m = 0, \ldots, N-1$, $K_m = 1/\sqrt{2}$ for $m = 0$, and $K_m = 1$ for $m > 0$.

For image applications, a separable 2-D DCT can be obtained by using the tensor product of two 1-D DCTs. Specifically, the $M \times N$-point 2-D DCT is defined as

$$Z(u,v) = \frac{2\, c(u)\, c(v)}{\sqrt{M N}} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(m,n) \cos\left[\frac{(2m+1) u \pi}{2M}\right] \cos\left[\frac{(2n+1) v \pi}{2N}\right] \tag{8}$$

where $u = 0, \ldots, M-1$, $v = 0, \ldots, N-1$, $c(k) = 1/\sqrt{2}$ for $k = 0$, and $c(k) = 1$ for $k > 0$. Equation (8) can be rewritten as

$$Z(u,v) = \sqrt{\frac{2}{M}}\, c(u) \sum_{m=0}^{M-1} \cos\left[\frac{(2m+1) u \pi}{2M}\right] \left\{ \sqrt{\frac{2}{N}}\, c(v) \sum_{n=0}^{N-1} x(m,n) \cos\left[\frac{(2n+1) v \pi}{2N}\right] \right\} \tag{9}$$

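For the $8 \times 8$ DCT, let

$$T = \frac{1}{\sqrt{8}} \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ a & c & d & f & -f & -d & -c & -a \\ b & e & -e & -b & -b & -e & e & b \\ c & -f & -a & -d & d & a & f & -c \\ 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\ d & -a & f & c & -c & -f & a & -d \\ e & -b & b & -e & -e & b & -b & e \\ f & -d & c & -a & a & -c & d & -f \end{bmatrix} \tag{10}$$

where $a = \sqrt{2}\cos\left(\frac{\pi}{16}\right)$, $b = \sqrt{2}\cos\left(\frac{\pi}{8}\right)$, $c = \sqrt{2}\cos\left(\frac{3\pi}{16}\right)$, $d = \sqrt{2}\cos\left(\frac{5\pi}{16}\right)$, $e = \sqrt{2}\cos\left(\frac{3\pi}{8}\right)$, and $f = \sqrt{2}\cos\left(\frac{7\pi}{16}\right)$.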

The transform coefficients $Z(u,v)$ of the $8 \times 8$ DCT can be grouped into an array denoted by $Z$, which can be written as

$$Z = T Y^t \tag{11}$$

where $Y = T X^t$. Thus, the computation of the separable 2-D DCT can be obtained by using the 1-D DCT computation as follows.

2-D DCT(X) = 1-D DCT((1-D DCT(X))^t) (12)

Similarly, a separable $M \times N$-point 2-D IDCT can be obtained, which is given by

$$x(m,n) = \frac{2}{\sqrt{M N}} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} c(u)\, c(v)\, Z(u,v) \cos\left[\frac{(2m+1) u \pi}{2M}\right] \cos\left[\frac{(2n+1) v \pi}{2N}\right] \tag{13}$$

where $m = 0, \ldots, M-1$, $n = 0, \ldots, N-1$, $c(k) = 1/\sqrt{2}$ for $k = 0$, and $c(k) = 1$ for $k > 0$.

The 2-D IDCT computation using 1-D IDCT computation is as follows.

2-D IDCT(Z)=1-D IDCT((1-D IDCT(Z))^t ) (14)

In which, $X = T^t Z T$, $Y = T^t Z^t$, and therefore

$$X = T^t Y^t \tag{15}$$
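The row-column structure of equations (11), (12), (14) and (15) can be illustrated with plain matrix arithmetic. The sketch below is a floating-point model under the assumption of square blocks; it builds $T$ from the 1-D definition (7) rather than from the CORDIC data path, and the helper names are hypothetical.

```python
import math


def dct_matrix(size=8):
    """Matrix T of equation (10), generated from the 1-D DCT definition (7)."""
    rows = []
    for m in range(size):
        k_m = 1.0 / math.sqrt(2.0) if m == 0 else 1.0
        rows.append([math.sqrt(2.0 / size) * k_m
                     * math.cos((2 * n + 1) * m * math.pi / (2 * size))
                     for n in range(size)])
    return rows


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    b_t = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in b_t] for row in a]


def dct_2d(x):
    """2-D DCT as two 1-D passes with a transposition in between."""
    t = dct_matrix(len(x))
    y = matmul(t, transpose(x))       # Y = T X^t
    return matmul(t, transpose(y))    # Z = T Y^t, equation (11)


def idct_2d(z):
    """2-D IDCT with the same two-pass structure."""
    t_t = transpose(dct_matrix(len(z)))
    y = matmul(t_t, transpose(z))     # Y = T^t Z^t
    return matmul(t_t, transpose(y))  # X = T^t Y^t, equation (15)
```

The transposition between the two passes is the role played by the SRAM bank in the architecture: the first pass writes along one direction and the second reads along the other.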

3.1 Fast 1-D DCT Algorithm

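Matrix $T$ defined by equation (10) can be further decomposed to obtain a fast algorithm for 1-D DCT. Specifically, the fast 8-point DCT is given by

$$\begin{bmatrix} Y(0) \\ Y(2) \\ Y(4) \\ Y(6) \end{bmatrix} = \frac{1}{\sqrt{8}} \begin{bmatrix} 1 & 1 & 1 & 1 \\ b & e & -e & -b \\ 1 & -1 & -1 & 1 \\ e & -b & b & -e \end{bmatrix} \begin{bmatrix} x(0) + x(7) \\ x(1) + x(6) \\ x(2) + x(5) \\ x(3) + x(4) \end{bmatrix} \tag{16}$$

$$\begin{bmatrix} Y(1) \\ Y(3) \\ Y(5) \\ Y(7) \end{bmatrix} = \frac{1}{\sqrt{8}} \begin{bmatrix} a & c & d & f \\ c & -f & -a & -d \\ d & -a & f & c \\ f & -d & c & -a \end{bmatrix} \begin{bmatrix} x(0) - x(7) \\ x(1) - x(6) \\ x(2) - x(5) \\ x(3) - x(4) \end{bmatrix} \tag{17}$$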

Figure 1 shows the data flow of 8-point DCT, where the blocks named CORDIC(2) and CORDIC(5) are constructed with the same structure with rotation angle π/16; the blocks named CORDIC(3) and CORDIC(4) are of the same structure with rotation angle 5π/16. The details of blocks CORDIC(1), CORDIC(2) and CORDIC(3) are shown in Figure 2.
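The even/odd split of equations (16) and (17) can be verified against the direct definition (7). The sketch below uses ordinary multiplications in place of the CORDIC blocks, so it checks the algebra only; the names `fast_dct8` and `direct_dct8` are illustrative assumptions.

```python
import math

ROOT2 = math.sqrt(2.0)
# Constants a..f of equation (10): sqrt(2) * cos(k*pi/16) for k = 1, 2, 3, 5, 6, 7.
a, b, c, d, e, f = (ROOT2 * math.cos(k * math.pi / 16) for k in (1, 2, 3, 5, 6, 7))

EVEN = [[1, 1, 1, 1], [b, e, -e, -b], [1, -1, -1, 1], [e, -b, b, -e]]  # Y(0), Y(2), Y(4), Y(6)
ODD = [[a, c, d, f], [c, -f, -a, -d], [d, -a, f, c], [f, -d, c, -a]]   # Y(1), Y(3), Y(5), Y(7)


def fast_dct8(x):
    """8-point DCT via the sum/difference butterflies of equations (16) and (17)."""
    sums = [x[n] + x[7 - n] for n in range(4)]
    diffs = [x[n] - x[7 - n] for n in range(4)]
    scale = 1.0 / math.sqrt(8.0)
    y = [0.0] * 8
    for r in range(4):
        y[2 * r] = scale * sum(EVEN[r][n] * sums[n] for n in range(4))
        y[2 * r + 1] = scale * sum(ODD[r][n] * diffs[n] for n in range(4))
    return y


def direct_dct8(x):
    """Reference 8-point DCT straight from definition (7)."""
    out = []
    for m in range(8):
        k_m = 1.0 / ROOT2 if m == 0 else 1.0
        out.append(0.5 * k_m * sum(x[n] * math.cos((2 * n + 1) * m * math.pi / 16)
                                   for n in range(8)))
    return out
```

Splitting the 8×8 product into two 4×4 products halves the number of multiplications, and each 2×2 sub-block with a shared angle is what a single rotation block can absorb.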

3.2 Fast 1-D IDCT Algorithm

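Similarly, the fast 8-point IDCT can be obtained by further decomposing matrix $T$, which is given by

$$\begin{bmatrix} x(0) \\ x(7) \\ x(4) \\ x(3) \end{bmatrix} = \frac{1}{\sqrt{8}} \begin{bmatrix} 1 & b & e & a & f & c & d \\ 1 & b & e & -a & -f & -c & -d \\ 1 & -b & -e & -f & a & d & -c \\ 1 & -b & -e & f & -a & -d & c \end{bmatrix} \begin{bmatrix} Y(0) + Y(4) \\ Y(2) \\ Y(6) \\ Y(1) \\ Y(7) \\ Y(3) \\ Y(5) \end{bmatrix} \tag{18}$$

$$\begin{bmatrix} x(1) \\ x(6) \\ x(2) \\ x(5) \end{bmatrix} = \frac{1}{\sqrt{8}} \begin{bmatrix} 1 & e & -b & c & -d & -f & -a \\ 1 & e & -b & -c & d & f & a \\ 1 & -e & b & d & c & -a & f \\ 1 & -e & b & -d & -c & a & -f \end{bmatrix} \begin{bmatrix} Y(0) - Y(4) \\ Y(2) \\ Y(6) \\ Y(1) \\ Y(7) \\ Y(3) \\ Y(5) \end{bmatrix} \tag{19}$$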

Figure 3 shows the data flow of 8-point IDCT, where the blocks named R0 and R2 are constructed with the same structure with rotation angles π/16 and 5π/16, and block R1 involves a rotation of angle 6π/16. The details of blocks R0 and R1 are shown in Figure 4.

3.3 CORDIC-Based 2-D DCT and IDCT Architectures

The demanding operation of both DCT and IDCT, i.e. multiplication, can be simplified significantly by using CORDIC. Specifically, the multipliers required for the implementation of both DCT and IDCT can be replaced by simple shifters and adders through the use of a CORDIC processor, in which each CORDIC operates in the circular coordinate system with some fixed angle that can be pre-decomposed into a sequence of micro-rotations, $\{\sigma_i\}$, and stored in ROM. Based on constant-rotation CORDIC [30]-[31], each multiplier involved in DCT and IDCT can be implemented by using two shifters and two adders. Figure 5 depicts a constant-rotation CORDIC in the circular coordinate system. It is noted that each CORDIC processor used as the arithmetic unit (AU) in the proposed fast DCT/IDCT architectures can save hardware by one third to achieve low power consumption.
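The idea of a constant-rotation CORDIC can be sketched as follows: the sign sequence for a fixed angle is computed once (the part that would be stored in ROM), and the data path then needs only shifts and add/subtract operations on integers. This is a hedged illustration with an assumed 16 micro-rotations and no scale-factor compensation; the function names are hypothetical and do not come from [30]-[31].

```python
import math


def decompose_angle(theta, n_iter=16):
    """Pre-compute the micro-rotation directions {sigma_i} for a fixed angle."""
    sigmas, residual = [], theta
    for i in range(n_iter):
        sigma = 1 if residual >= 0 else -1
        sigmas.append(sigma)
        residual -= sigma * math.atan(2.0 ** -i)
    return sigmas


def constant_rotate(x, y, sigmas):
    """Rotate the integer pair (x, y) using only shifts and adds/subtracts.

    The result is scaled by the CORDIC gain (about 1.64676), which is left
    uncompensated here.
    """
    for i, sigma in enumerate(sigmas):
        x_shift, y_shift = x >> i, y >> i
        if sigma > 0:
            x, y = x - y_shift, y + x_shift
        else:
            x, y = x + y_shift, y - x_shift
    return x, y
```

For instance, `constant_rotate(1 << 14, 0, decompose_angle(math.pi / 16))` returns integers close to the gain times $2^{14}\cos(\pi/16)$ and $2^{14}\sin(\pi/16)$, up to truncation error from the right shifts.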

4 The Proposed 2-D DCT and IDCT Processors

Based on equations (11) and (15), an efficient pipelined architecture has been developed for 2-D DCT/IDCT. Figure 6 shows the proposed architecture for 8×8 DCT and IDCT processors, where one 64-word SRAM bank, two 8-point DCT/IDCT processors and a control unit are used.

The 8-point 1-D DCT/IDCT input processor, denoted by P1, writes the intermediate result into the rows and columns of the SRAM bank alternately. P2 denotes the 8-point 1-D DCT/IDCT output processor, which first reads data from the columns and rows of the SRAM bank alternately and then outputs the final result. Figure 7 shows the finite state machine (FSM) of the control unit, which manages data flow and timing for 2-D DCT/IDCT operations.

4.1 Implementation of 1-D DCT/IDCT Processor

Constant-rotation CORDIC processors are used to implement the 8-point DCT and IDCT processors, which are shown in Figure 8 and Figure 9, respectively. Note that, because the transform matrices of DCT and IDCT are column-symmetric and row-symmetric, respectively, the shuffle structures are simplified and no multipliers are utilized.

4.2 Implementation of 2-D DCT/IDCT Processors

One of the crucial issues with the conventional single-processor DCT/IDCT architectures is long latency and low throughput [6]-[8]. Moreover, the conventional DCT/IDCT architectures with a single memory bank cannot be pipelined [21]-[25], [29]. In [18]-[20], Sung proposed a pipelined architecture with dual memory banks to double the throughput of DCT/IDCT at the cost of increased memory space. To save memory space while increasing throughput, we propose a pipelined 2-D DCT/IDCT architecture with a single memory bank, which is shown in Figure 6. The latency of the 1-D DCT/IDCT processor is 8 clocks, the hardware complexity is of order $O(N \log_2 N)$, and the throughput is 8 outputs per cycle. Note that the multiplier is replaced by simple adders and shifters based on constant-rotation CORDIC so that many desirable properties, e.g. small area, low power and high throughput, can be obtained. Specifically, the proposed 2-D DCT/IDCT architecture doubles the throughput of the conventional architectures and saves 50% of the memory space. Table 1 shows comparisons between the proposed architecture and the conventional architectures [6]-[8], [18]-[20] with dual memory banks, and [21]-[25]. Table 2 shows comparisons with other commonly used architectures [9]-[12].

The proposed pipelined architectures for 32-bit fixed-point 2-D DCT and IDCT processors have been written in Verilog® and synthesized with the TSMC 0.18 μm 1P6M CMOS cell libraries [32]. Core size and power consumption can be obtained from the reports of Synopsys® Design Analyzer and PrimePower® [33], respectively. The core sizes of the proposed 2-D DCT and IDCT processors are 2372×2372 μm² and 2396×2396 μm², respectively; their respective power dissipations are 127.7 mW at 1.8 V with a clock rate of 34.4 MHz and 116.7 mW at 1.8 V with a clock rate of 35.7 MHz. The layout views of the implemented 2-D DCT and IDCT processors are shown in Figure 10 and Figure 11, respectively; the processors are well suited to JPEG and MPEG applications.

Owing to budget constraints, a platform for architecture development and verification has been implemented: the architecture is realized on a Xilinx XC2V6000 FPGA emulation board [34] with an 8051 microcontroller and USB 2.0 interface circuits [35], as shown in Figure 12. The 8051 microcontroller uses the USB 2.0 bus to read the original image from the PC through a DMA channel and to write the processed image back to the PC. The Xilinx XC2V6000 FPGA chip implements DCT and IDCT. Figure 13 shows the architecture development and verification board. The original 512×512 Lena image is shown in Figure 14; the reconstructed image is shown in Figure 15.

With the proposed architectures for 32-bit fixed-point DCT/IDCT, the peak signal-to-noise ratio (PSNR) of the reconstructed image is 44.6 dB. The proposed 2-D DCT/IDCT processors have been applied to various images with satisfactory results.
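For reference, the PSNR figure can be related to the mean squared error (MSE) between the original and reconstructed images. The sketch below assumes the common definition for 8-bit grayscale images, PSNR = 10·log10(255²/MSE); the report does not state which definition was used, and the function name and list-of-rows image format are illustrative only.

```python
import math

def psnr(original, reconstructed, peak=255):
    """PSNR in dB between two equally sized images given as lists of rows."""
    sq_errors = [(a - b) ** 2
                 for row_a, row_b in zip(original, reconstructed)
                 for a, b in zip(row_a, row_b)]
    mse = sum(sq_errors) / len(sq_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)
```

Under this assumed definition, a PSNR of 44.6 dB corresponds to an MSE of roughly 2.25 on a 0-255 intensity scale.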

5 Conclusion

By taking into account the symmetry properties of the fast DCT and IDCT algorithms, high-efficiency architectures with parallel and pipelined structures have been proposed to implement DCT and IDCT processors. For image applications, a separable 2-D DCT/IDCT can be obtained by using the tensor product of two 1-D DCT/IDCT operations. The proposed 2-D DCT/IDCT processor is composed of two successive 1-D DCT/IDCT kernels with a single memory bank. In the constituent 1-D DCT/IDCT processors, the CORDIC algorithm in rotation mode in the circular coordinate system has been utilized for the arithmetic unit (AU) involved, i.e. the multiplication computation. The proposed DCT/IDCT architectures are not only regularly structured but also highly scalable and flexible. The DCT and IDCT processors are [...] performances, area and power consumption trade-offs. The proposed 2-D DCT and IDCT processors are well suited to applications of the JPEG, MPEG-4 and H.264 standards.
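The separable structure summarized above, two successive 1-D passes with a transposition in between, can be sketched in software as follows. This is a floating-point illustration only: it assumes the orthonormal DCT-II scaling, which may differ from the normalization used in the processors, and it does not model the CORDIC arithmetic unit or the memory organization. All function names are hypothetical.

```python
import math

def _scale(k, n):
    # Orthonormal DCT-II scaling factor (an assumed normalization).
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct_1d(v):
    n = len(v)
    return [_scale(k, n) * sum(v[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                               for i in range(n))
            for k in range(n)]

def idct_1d(c):
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                for k in range(n))
            for i in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def separable_2d(block, kernel):
    """Apply a 1-D kernel to every row, transpose, apply it again, transpose back."""
    rows_done = [kernel(row) for row in block]
    cols_done = [kernel(col) for col in transpose(rows_done)]
    return transpose(cols_done)

def dct_2d(block):
    return separable_2d(block, dct_1d)

def idct_2d(coeffs):
    return separable_2d(coeffs, idct_1d)

# Usage: forward and inverse transform of an 8x8 test block.
block = [[(3 * r + 5 * c) % 17 for c in range(8)] for r in range(8)]
recon = idct_2d(dct_2d(block))
```

Up to floating-point rounding, `recon` reproduces `block`; the transposition step stands in, loosely, for the intermediate storage between the two 1-D kernels.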

References:

[1] D. F. Elliott, K. R. Rao, Fast Transforms: Algorithms, Analyses, Applications, Chapter 8, Walsh-Hadamard Transform, Prentice-Hall, 1982, pp.301-303.

[2] R. J. Clarke, Relation between the Karhunen-Loève and Cosine Transform, IEEE Proceedings, Part F, vol. 128, no. 6, Nov. 1981, pp.359-360.

[3] M. J. Narasimha, A. M. Peterson, On the Computation of the Discrete Cosine Transform, IEEE Transactions on Communications, vol. 26, no. 6, June 1978, pp. 934-936.

[4] R. M. Haralick, A Storage Efficient Way to Implement the Discrete Cosine Transform, IEEE Transactions on Computers, July 1976, pp.764-765.

[5] W. H. Chen, C. H. Smith, S. C. Fralick, Fast Computational Algorithm for the Discrete Cosine Transform, IEEE Transactions on Communications, vol. 25, no. 9, Sept. 1977, pp.1004-1009.

[6] T. Y. Sung, VLSI Parallel and Distributed Computation Algorithms for DCT Processors, Proceedings IEEE International Phoenix Conference on Computer and Communications, Scottsdale, Arizona, USA, 1990, pp.121-125.

[7] T. Y. Sung, VLSI Parallel and Distributed Processing Algorithms for Multidimensional Discrete Cosine Transforms, 1990 A Two-Track International Conference on Databases, Parallel Architectures, and their Applications, Miami Beach, Florida, USA, March 1990, pp.36-39.

[8] T. Y. Sung, Novel Parallel VLSI Architectures for Discrete Cosine Transforms, Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, Albuquerque, New Mexico, USA, April 1990, pp.998-1001.

[9] Y. P. Lee, T. H. Chen, L. G. Chen, C. W. Ku, A Cost-Effective Architecture for 8×8 Two-Dimensional DCT/IDCT Using Direct Method, IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 1, June 1997, pp.459-467.

[10] Y. T. Chang, C. L. Wang, New Systolic Array Implementation of the 2-D Discrete Cosine Transform and Its Inverse, IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 1, April 1995, pp.150-157.

[11] S. F. Hsiao, W. R. Shiue, A New Hardware-Efficient Algorithm and Architecture for Computation of 2-D DCTs on a Linear Array, IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, Nov. 2001, pp.1149-1159.

[12] S. F. Hsiao, J. M. Tseng, New Matrix Formulation for Two-Dimensional DCT/IDCT Computation and its Distributed-Memory VLSI Implementation, IEE Proc.-Vis. Image Signal Process, vol. 149, no. 2, April 2002, pp.97-107.

[13] S. F. Hsiao, Y. H. Hu, T. B. Juang, C. H. Lee, Efficient VLSI Implementations of Fast Multiplierless Approximated DCT Using Parameterized Hardware Modules for Silicon Intellectual Property Design, IEEE Trans. Circuits and Systems, Part-I: Regular Papers, vol. 52, no. 8, Aug. 2005, pp.1568-1579.

[14] V. Srinivasan, K. J. R. Liu, VLSI Design of High-Speed Time-Recursive 2-D DCT/IDCT Processor for Video Applications, IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 1, Feb. 1996, pp.87-96.

[15] T. Kuroda, A 0.9-V, 150-MHz, 10-mW, 4mm², 2-D Discrete Cosine Transform Core Processor with Variable Threshold-Voltage (VT) Scheme, IEEE Journal of Solid-State Circuits, vol. 31, no. 11, Nov. 1996, pp.1770-1778.

[16] R. Rambaldi, A. Uguzzoni, R. Guerrieri, A 35μW 1.1 V Gate Array 8×8 IDCT Processor for Video-Telephony, Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, 1998, pp.2993-2996.

[17] T. H. Chen, A Cost-Effective 8×8 2-D IDCT Core Processor with Folded Architecture, IEEE Transactions on Consumer Electronics, vol. 45, no. 2, May 1999, pp.333-339.

[18] T. Y. Sung, Y. H. Sung, A Novel Implementation of Cost-Effective Parallel-Pipelined 8×8 DCT Processor, The Fourth IEEE Asia-Pacific Conference on Advanced System Integrated Circuits (AP-ASIC) 2004, Fukuoka, Japan, August 3-5, 2004, pp.200-203.

[19] T. Y. Sung, Y. S. Shieh, M. J. Sun, A High-Throughput and Memory-Efficiency 2-D DCT Architecture Based on CORDIC Rotation, The 23rd Workshop on Combinatorial Mathematics and Computation Theory (Algo-2006), Changhua, Taiwan, April 28~29, 2006, pp.369-372.

[20] T. Y. Sung, M. J. Sun, H. C. Hsin, C. W. Yu, Low-Power and High-Speed Architectures for 2-D DCT and IDCT Based on CORDIC Rotation, 19th Computer Vision, Graphics, and Image Processing Conference, Taiwan, Aug. 13-15, 2006, pp.1024-1029.

[21] Y. H. Hu, Z. Wu, An Efficient CORDIC Array Structure for the Implementation of Discrete Cosine Transform, IEEE Transactions on Signal Processing, vol. 43, no. 1, Jan. 1995, pp.331-336.

[22] H. Jeong, J. Kim, W. K. Cho, Low-Power Multiplierless DCT Architecture Using Image Data Correlation, IEEE Transactions on Consumer Electronics, vol. 50, no. 1, Feb. 2004, pp.262-267.

[23] D. Gong, Y. He, Z. Gao, New Cost-Effective VLSI Implementation of a 2-D Discrete Cosine Transform and Its Inverse, IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 4, April 2004, pp. 405-415.

[24] V. Dimitrov, K. Wahid, G. Jullien, Multiplication-Free 8×8 2D DCT Architecture Using Algebraic Integer Encoding, Electronics Letters, vol. 40, no. 20, Sept. 2004, pp.1310-1311.

[25] M. Alam, W. Badawy, G. Jullien, A New Time Distributed DCT Architecture for MPEG-4 Hardware Reference Model, IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 5, May 2005, pp.726-730.

[26] J. E. Volder, The CORDIC Trigonometric Computing Technique, IRE Transactions on Electronic Computers, vol. EC-8, 1959, pp.330-334.

[27] J. S. Walther, A Unified Algorithm for Elementary Functions, Spring Joint Computer Conference Proceedings, vol. 38, 1971, pp.379-385.

[28] X. Hu, R. G. Harber, S. C. Bass, Expanding the Range of Convergence of the CORDIC Algorithm, IEEE Transactions on Computers, vol. 40, no. 1, 1991, pp.13-21.

[29] T. Y. Sung, Y. H. Sung, The Quantization Effects of CORDIC Arithmetic for Digital Signal Processing Applications, The 21st Workshop on Combinatorial Mathematics and Computation Theory, Taiwan, May 21~22, 2004, pp.16-25.

[30] T. Y. Sung, Y. H. Hu, T. M. Parng, Design and Implementation of a VLSI CORDIC Processor, Proc. 1986 IEEE Int. Symp. Circuits Syst., vol. 3, 1986, pp.934-935.


[32] TSMC 0.18 μm CMOS Design Libraries and Technical Data, v.3.2, Taiwan Semiconductor Manufacturing Company, Hsinchu, Taiwan, and National Chip Implementation Center (CIC), National Science Council, Hsinchu, Taiwan, R.O.C., 2006.

[33] Synopsys products, http://www.synopsys.com/products.

[34] Xilinx FPGA products, http://www.xilinx.com/products.

[35] T. Y. Sung, C. W. Yu, Y. S. Shieh, A High-Efficient and Cost-Effective LCD Signal Processor, 7th International Conference on Computer Vision, Pattern Recognition and Image Processing, Taiwan, 2006, pp.939-942.

National Science Council, Executive Yuan: Research Project Results Report