
A Novel Linear Array for Discrete Cosine Transform

3 A Linear Array for DCT and IDCT

Proceedings of the 9th WSEAS Int. Conference on INSTRUMENTATION, MEASUREMENT, CIRCUITS and SYSTEMS

Based on the proposed approach to fast DCT computation shown in Fig. 8, this section presents an efficient architecture for the fast DCT/IDCT processor. Recall that the DCT of a signal $x_8$ can be efficiently obtained by $C_8 = \hat{F}_8 R_8 x_8$. Let $y_8 = R_8 x_8$; then $C_8 = \hat{F}_8 y_8$. The matrix-vector multiplication $R_8 x_8$ uses six CSA(3,2)s (carry-save adders (3,2)) and one CLA (carry-look-ahead adder) [9-10], so four simple-addition times and one CLA computation time are required to compute each element of $y_8$. The multiplier array (MA), consisting of four multipliers, and the CLA array (CA), consisting of eight CLAs, compute the matrix-vector product $\hat{F}_8 y_8$; thus, only one multiplication time and one CLA computation time are needed to compute each element of $C_8$, i.e. each DCT coefficient. Fig. 9 depicts the data flow of the proposed fast DCT processor with pipelined linear-array architecture [11]. As a result, only five multiplication cycles and five addition cycles are needed to compute the 8-point DCT. In general, for the N-point DCT, the computation time and hardware complexity of the proposed fast DCT processor are O(5N/8) and O(N/2), respectively.
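Six CSA(3,2)s in four addition levels is exactly what a carry-save tree needs to reduce eight operands to two words before one final carry-propagate addition. The Python sketch below illustrates this under the assumption, not stated in the text, that each element of $y_8$ is formed from eight operands. The names `csa32` and `csa_tree` are illustrative, and the final `+` stands in for the CLA.

```python
def csa32(a, b, c):
    """Carry-save adder (3,2): compress three integers into a sum word and a carry word."""
    partial_sum = a ^ b ^ c                      # bitwise sum without carry propagation
    carry = ((a & b) | (b & c) | (a & c)) << 1   # majority bits, shifted to their weight
    return partial_sum, carry


def csa_tree(operands):
    """Reduce at least two operands to two words using levels of CSA(3,2)s.

    Returns (sum_word, carry_word, csa_count, levels).
    """
    words = list(operands)
    csa_count = 0
    levels = 0
    while len(words) > 2:
        next_words = []
        full_groups = len(words) // 3
        for g in range(full_groups):
            s, c = csa32(*words[3 * g:3 * g + 3])
            next_words += [s, c]
            csa_count += 1
        next_words += words[3 * full_groups:]    # leftovers pass to the next level
        words = next_words
        levels += 1
    return words[0], words[1], csa_count, levels


operands = [3, -7, 12, 5, -1, 8, 0, 4]           # eight hypothetical operands
sum_word, carry_word, csa_count, levels = csa_tree(operands)
total = sum_word + carry_word                    # final addition (the CLA in hardware)
print(total, csa_count, levels)
```

For eight operands the tree shrinks the word count 8, 6, 4, 3, 2, which gives the six compressors and four levels quoted above.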

Fig. 10 shows the data flow of the proposed fast IDCT algorithm [11], where $C_8$ is the DCT of an 8-point signal $x_8$, $z_8 = \hat{F}_8^{-1} C_8$, and $x_8 = R_8^{-1} z_8$. The so-called full-CSA(4,2) (FCSA(4,2)), consisting of two CSA(3,2)s and one CLA, computes $z_8$ [21-22]. Note that the CLA array of eight CLAs can also be used to compute $x_8$. As shown in Fig. 10, only five multiplication cycles and three addition cycles are needed to compute the 8-point IDCT. The computation time and hardware complexity of the proposed fast IDCT architecture are therefore the same as those of the proposed fast DCT architecture. In addition, only 16 words of RAM/registers and 10 words of ROM are required to store the intermediate results and constants, respectively, and the latency is only five multiplication cycles.
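The FCSA(4,2) described here can be modelled at the bit level as two cascaded CSA(3,2) stages followed by one carry-propagate addition. The following Python sketch assumes, as an illustration only, that the four inputs are the four integer (fixed-point) products delivered by the multiplier array in one cycle; `csa32` and `fcsa42` are hypothetical names.

```python
def csa32(a, b, c):
    """Carry-save adder (3,2): three integers in, (sum word, carry word) out."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1


def fcsa42(p0, p1, p2, p3):
    """Full CSA(4,2): two CSA(3,2) stages, then one final addition (the CLA)."""
    s1, c1 = csa32(p0, p1, p2)   # first stage compresses three products
    s2, c2 = csa32(s1, c1, p3)   # second stage folds in the fourth product
    return s2 + c2               # carry-propagate addition resolves the result


print(fcsa42(10, -3, 7, 2))
```

Only the last step propagates carries, which is why the two compressor stages are cheap compared with the CLA.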

Fig. 11 shows the system block diagram of the proposed fast DCT/IDCT architecture. A platform for architecture development and verification has been designed and implemented in order to evaluate the development cost. The proposed architecture improves throughput while keeping the same computation accuracy as the conventional one at the same word length. Thus, the proposed programmable DCT/IDCT architecture is able to improve power consumption and computation speed significantly. The proposed DCT/IDCT processor, used to compute the 8/16/32/64-point DCT/IDCT, is composed mainly of the 8-point DCT/IDCT core; when a single 8-point DCT/IDCT core is extended to the N-point DCT/IDCT, the computation complexity is O(5N/8).
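To make the O(5N/8) figure concrete, the short Python sketch below tabulates it for the supported transform sizes. It assumes that the bound can be read as an exact count of five multiplication cycles per 8-point pass of a single core, which is an interpretation rather than a statement from the text; `multiplication_cycles` is an illustrative name.

```python
def multiplication_cycles(n_points, core_size=8, cycles_per_pass=5):
    """Multiplication cycles for an N-point transform on one 8-point core.

    Assumption: the O(5N/8) bound is taken as an exact count.
    """
    if n_points <= 0 or n_points % core_size != 0:
        raise ValueError("n_points must be a positive multiple of the core size")
    return cycles_per_pass * n_points // core_size


for n in (8, 16, 32, 64):
    print(n, multiplication_cycles(n))
```

Under this reading the 8-point case reproduces the five multiplication cycles stated for the core itself.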

Moreover, the reusable intellectual property (IP) DCT/IDCT core has also been implemented in Matlab® for functional simulations. The hardware code, written in Verilog®, runs on a workstation with the ModelSim® simulation tool and the Xilinx® ISE smart compiler.
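A functional simulation of this kind is usually checked against a direct-form reference transform. The Python sketch below is such a reference model, not the authors' Matlab code. It assumes the orthonormal DCT-II definition, since the normalization used in the paper is not stated here, and the names `dct_ii` and `idct` are illustrative.

```python
import math


def _scale(k, n):
    """Orthonormal scaling factor for coefficient k of an n-point DCT-II."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct_ii(x):
    """Direct-form orthonormal DCT-II of a real sequence."""
    n = len(x)
    return [
        _scale(k, n)
        * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def idct(c):
    """Inverse of dct_ii (an orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(_scale(k, n) * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n))
        for i in range(n)
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
coefficients = dct_ii(x)
reconstructed = idct(coefficients)
max_error = max(abs(a - b) for a, b in zip(x, reconstructed))
print(max_error)
```

Outputs of a fixed-point core could be compared against `dct_ii` with a tolerance chosen for the word length in use.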

4 Conclusion

By taking advantage of subband decomposition, a high-efficiency architecture with pipelined structures is proposed for fast DCT/IDCT computation.

Specifically, the proposed DCT/IDCT architecture not only more than doubles the throughput of the conventional architectures [2-6], but also saves memory space significantly [1]. Table 1 compares the proposed architecture with the conventional architectures [2-6].

Table 2 shows comparisons with other commonly used architectures [1], [7-8]. In addition, the proposed fast DCT/IDCT architecture is highly regular, scalable, and flexible. The DCT/IDCT processor, designed in portable and reusable Verilog®, is a reusable IP that can be implemented in various processes. Combined with efficient use of hardware resources for trade-offs among performance, area, and power consumption, it is well suited to JPEG and MPEG-1/2 applications.

References:

[1] T. Y. Sung, “Memory-Efficient and high-performance 2-D DCT and IDCT processors based on CORDIC rotation”, WSEAS Trans. Electronics, Issue 12, Vol. 3, Dec. 2006, pp.565-574.

[2] Y. H. Hu, Z. Wu, “An efficient CORDIC array structure for the implementation of discrete cosine transform”, IEEE Trans. Signal Processing, No.1, Vol.43, Jan. 1995, pp.331-336.

[3] H. Jeong, J. Kim, W. K. Cho, “Low-power multiplierless DCT architecture using image data correlation”, IEEE Trans. Consumer Electronics, No.1, Vol.50, Feb. 2004, pp.262-267.

[4] D. Gong, Y. He, Z. Gao, “New cost-effective VLSI implementation of a 2-D discrete cosine transform and its inverse”, IEEE Trans. Circuits Syst. for Video Technology, No.4, Vol.14, April 2004, pp.405-415.

[5] V. Dimitrov, K. Wahid, G. Jullien, “Multiplication-free 8×8 2D DCT architecture using algebraic integer encoding”, Electronics Letters, No.20, Vol.40, Sept. 2004, pp.1310-1311.

[6] M. Alam, W. Badawy, G. Jullien, “A new time distributed DCT architecture for MPEG-4 hardware reference model”, IEEE Trans. Circuits Syst. for Video Technology, No.5, Vol.15, May 2005, pp.726-730.

[7] S. F. Hsiao, J. M. Tseng, “New matrix formulation for two-dimensional DCT/IDCT computation and its distributed-memory VLSI implementation”, IEE Proc.-Vis. Image Signal Process., No.2, Vol.149, April 2002, pp.97-107.

[8] H. S. Hou, “A fast recursive algorithm for computing the discrete cosine transform”, IEEE Trans. Acoust., Speech, Signal Processing, No.10, Vol. ASSP-35, Oct. 1987, pp.1455-1461.

[9] I. Koren, Computer Arithmetic Algorithms, Second edition, A. K. Peters, Natick, MA, 2002, Chapter 5.

[10] T. Y. Sung, H. C. Hsin, “Design and simulation of reusable IP CORDIC core for special-purpose processors”, IET Computers & Digital Techniques, No.5, Vol.1, Sept. 2007, pp.581-589.

[11] G. H. Golub, C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, 1996, Chapter 6: Parallel matrix computations, pp.275-307.

[Figs. 1-4, block diagrams. Labels: SB_DCT and SB_DST blocks; decomposition matrices $M_2$, $M_4$ and $M_8$, whose nonzero entries have magnitude 0.5.]

Fig. 1 Data flow of computing the 2-point subband DCT

Fig. 2 Data flow of computing $\hat{C}_{LL,4}$ and $C_{LL,2}$ based on subband decomposition

Fig. 3 Data flow of computing $C_{LH,2}$ and $\hat{S}_{LH,4}$ based on subband decomposition

Fig. 4 Data flow of computing $C_{L,4}$ and $C_{H,4}$ using 4-point subband DCT and DST

The conventional architectures (parallel architectures with single memory-bank [2]-[6]) and the proposed high-efficiency architecture (this work [Shieh, Sung, Hsin, 2010]):

8-point DCT/IDCT          Conventional [2]-[6]    This work
Processors                8                       --
Real-multipliers          16                      4
Real-adders               18                      26
RAM (registers, words)    64                      16
ROM (words)               6                       10
Hardware complexity       O(Nlog₂N1)              O(N/2)
Computation complexity    O(2N)                   O(5N/8)
Latency                   16                      5
Pipelinability            no                      yes
Scalability               poor                    better
Power consumption         poor                    better

8-point DCT/IDCT          Hsiao [7]    Hsiao [8]    This work [Shieh, Sung, Hsin, 2010]
Real-multipliers          -            -            4
CORDIC                    -            3            -
Real-adders               10           14           26
Complex-multipliers       3            -
Delay elements (words)    171          -            -
Hardware complexity       O(logN)      O(logN)      O(N/2)
Computation complexity    O(NlogN)     O(NlogN)     O(5N/8)
Pipelinability            no           yes          yes
Scalability               good         good         better

Table 2 Comparisons of the proposed architecture and other commonly used architectures

[Figs. 2-4, block diagrams. Labels: 2-point SB_DCT, 2-point SB_DST, 2-point DCT, 4-point SB_DCT, 4-point SB_DST; matrices $M_2$, $M_4$, $M_8$; outputs $\hat{C}_{LL,4}$, $\hat{S}_{LH,4}$, $C_{L,4}$, $\hat{C}_{HL,4}$, $\hat{S}_{HH,4}$, $C_{H,4}$.]

Fig. 5 Data flow of computing $\hat{C}_{L,8}$ and $C_{L,4}$ based on subband decomposition

Fig. 6 Data flow of computing $\hat{S}_{H,8}$ and $C_{H,4}$ based on subband decomposition

Fig. 7 Data flow of computing $C_8$ using 8-point subband DCT and DST

Fig. 8 Block diagram of the proposed (8-point) fast DCT algorithm based on subband decomposition

[Fig. 11, block labels: FA, MA, FCSA(4,2), CA; DCT path from x[n] to C[n]; IDCT path from C[n] to x[n].]

Fig. 11 System block diagram of the proposed DCT/IDCT architecture

Cycles   FA     MA                                                         CA
Add._1   y[0]   --                                                         C[0]
Add._2   y[1]   --                                                         C[1]
Add._3   y[2]   --                                                         --
Mul._1   y[3]   y[2]×0.9239, y[2]×(0.3827), y[3]×0.3827, y[3]×(0.9239)
Add._4   y[4]   --                                                         C[2], C[3]
Mul._2   y[5]   y[4]×0.9062, y[4]×(0.1802), y[4]×(0.3182), y[4]×0.2126
Mul._3   y[6]   y[5]×0.3754, y[5]×(0.0746), y[5]×0.7682, y[5]×0.5133
Mul._4   y[7]   y[6]×0.1802, y[6]×0.9062, y[6]×0.2126, y[6]×0.3182
Mul._5   ----   y[7]×(0.0746), y[7]×(0.3754), y[7]×0.5133, y[7]×0.7682
Add._5   ----   --                                                         C[4], C[5], C[6], C[7]

Fig. 9 Data flow of the proposed fast DCT processor with pipelined linear-array architecture (Add._: addition cycle, Mul._: multiplication cycle)
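Ten distinct constants appear in the MA column of Fig. 9, which matches the 10-word ROM quoted in Section 3. As a purely numerical observation, not a derivation given in the source, each constant agrees to about four decimal places with cos(π/8), sin(π/8), or a product of one of these with the cosine or sine of π/16 or 3π/16. The Python sketch below checks this; the pairing of each listed value with a particular product is an assumption made for illustration.

```python
import math

c8, s8 = math.cos(math.pi / 8), math.sin(math.pi / 8)
c16, s16 = math.cos(math.pi / 16), math.sin(math.pi / 16)
c3, s3 = math.cos(3 * math.pi / 16), math.sin(3 * math.pi / 16)

# Keys: constants as printed in Fig. 9. Values: assumed closed forms.
rom_constants = {
    0.9239: c8,
    0.3827: s8,
    0.9062: c16 * c8,
    0.1802: s16 * c8,
    0.3754: c16 * s8,
    0.0746: s16 * s8,
    0.7682: c3 * c8,
    0.5133: s3 * c8,
    0.3182: c3 * s8,
    0.2126: s3 * s8,
}

for listed, computed in rom_constants.items():
    print(f"{listed:.4f}  {computed:.6f}  diff={abs(listed - computed):.1e}")

max_diff = max(abs(listed - computed) for listed, computed in rom_constants.items())
```

Signs are left out of the comparison because only the magnitudes are compared here.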

Cycles   MA                                                         FCSA         CA
Mul._1   C[2]×0.9239, C[3]×(0.3827), C[2]×0.3827, C[3]×0.9239       z[0], z[1]   --
Mul._2   C[4]×0.9062, C[5]×(0.1802), C[6]×(0.3182), C[7]×0.2126     z[2], z[3]   C_0+C_1=C_01
Mul._3   C[4]×0.3754, C[5]×0.3754, C[6]×0.7682, C[7]×(0.5133)       z[4]         C_01+C_2=C_02
Mul._4   C[4]×(0.3182), C[5]×0.7682, C[6]×0.2126, C[7]×0.5144       z[5]         C_02+C_3=C_03
Mul._5   C[4]×0.2126, C[5]×(0.5133), C[6]×0.3182, C[7]×0.7682       z[6]         C_03+C_4=C_04
Add._1   --                                                         z[7]         C_04+C_5=C_05
Add._2   --                                                         --           C_05+C_6=C_06
Add._3   --                                                         --           C_06+C_7=C_07; x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7]

Fig. 10 Data flow of the proposed fast IDCT processor with pipelined linear-array architecture

[Figs. 5-8, block diagrams. Labels: 4-point SB_DCT, 4-point SB_DST, 4-point DCT, 8-point SB_DCT, 8-point SB_DST; outputs $\hat{C}_{L,8}$, $\hat{S}_{H,8}$, $C_8$; Fig. 8 chain $x_8$, $R_8$, $\hat{F}_8$, $C_8$.]

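NSC-Funded Project: Derived R&D Results Promotion Data Sheet

Date: 2 September 2010

NSC-funded project
Project title: Chip Design and Fabrication of 2-D and 3-D Special-Purpose Signal Processors Based on the Coordinate Rotation (CORDIC) Algorithm (I)
Principal investigator: 宋志雲
Project number: 98-2221-E-216-037-
Field: Integrated circuits and system design

Title of R&D result: High-SFDR and Multiplierless Direct Digital Frequency Synthesizer (高無寄生動態範圍及無乘法器之直接數位頻率合成器)

Institution owning the result: Chung Hua University (中華大學)

Inventor (creator): 宋志雲

Technical description (translated from the Chinese): A direct digital frequency synthesizer (DDFS) is designed and implemented using a hybrid CORDIC algorithm. The architecture is multiplierless and comprises a small ROM (16×4-bit) and a pipelined data path; the resulting spurious-free dynamic range (SFDR) exceeds 84.4 dBc. The system chip was designed in the TSMC 1P6M CMOS process and emulated on a Xilinx FPGA. The results show that the hybrid-CORDIC-based DDFS is suitable for VLSI implementation and has advantages in hardware cost, power consumption, and SFDR. Under high-frequency conditions the synthesizer attains a higher SFDR of 169.7 dBc, which is very good compared with existing DDFSs.

Technical description (English): This research presents a hybrid COordinate Rotation DIgital Computer (CORDIC) algorithm for the design and implementation of the direct digital frequency synthesizer (DDFS). The proposed multiplierless architecture with a small ROM (16×4-bit) and a pipelined data path provides a spurious-free dynamic range (SFDR) of more than 84.4 dBc. An SoC (System on Chip) has been designed in 1P6M CMOS and then emulated on the Xilinx FPGA. It is shown that the hybrid CORDIC-based architecture is suitable for VLSI implementations of the DDFS in terms of hardware cost, power consumption, and SFDR. In the case of 16-bit word length, the high-frequency SFDR is 169.7 dBc. As one can see, the proposed DDFS is superior in terms of SFDR, hardware cost, and …

Industry category: Electrical and electronic machinery and equipment industry

Scope of technology/product application: Wireless digital high-bandwidth network equipment and chips

Feasibility of technology transfer and expected benefits: The chip design source code and related experimental data can be transferred to improve the performance of related equipment.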

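Summary of Research Results for the ROC Year 98 Research Project

Principal investigator: 宋志雲
Project number: 98-2221-E-216-037-
Project title: Chip Design and Fabrication of 2-D and 3-D Special-Purpose Signal Processors Based on the Coordinate Rotation (CORDIC) Algorithm (I)

Columns: result item; number actually achieved (accepted or published); expected total (including the number actually achieved); actual contribution of this project; unit. A remarks column (qualitative notes, e.g. results shared by several projects, or results featured as a journal cover story) is empty.

Domestic
Result item                               Achieved   Expected total   Contribution   Unit
Publications (unit: papers)
  Journal papers                          2          2                100%
  Research/technical reports              0          0                100%
  Conference papers                       2          2                100%
  Books                                   0          0                100%
Patents
  Applications pending                    0          0                100%
  Granted                                 0          0                100%           items
Technology transfer
  Cases                                   0          0                100%           items
  Royalties                               0          0                100%           NT$ thousand
Project personnel, domestic nationals (unit: person-times)
  Master's students                       1          0                100%
  PhD students                            1          0                100%
  Postdoctoral researchers                0          0                100%
  Full-time assistants                    0          0                100%

International
Result item                               Achieved   Expected total   Contribution   Unit
Publications (unit: papers)
  Journal papers                          4          4                100%
  Research/technical reports              0          0                100%
  Conference papers                       4          4                100%
  Books                                   0          0                100%           chapters/books
Patents
  Applications pending                    0          0                100%
  Granted                                 0          0                100%           items
Technology transfer
  Cases                                   0          0                100%           items
  Royalties                               0          0                100%           NT$ thousand
Project personnel, foreign nationals (unit: person-times)
  Master's students                       1          0                100%
  PhD students                            1          0                100%
  Postdoctoral researchers                0          0                100%
  Full-time assistants                    0          0                100%

Other results (results that cannot be expressed quantitatively, such as organizing academic activities, receiving awards, important international cooperation, international impact of the research results, and other concrete benefits to industrial technology development, described in text): Served as a conference program committee member and session chair.

Additional items for Department of Science Education projects
Result item                                                   Quantity   Name or brief description
Assessment tools (qualitative and quantitative)               0
Courses/modules                                               0
Computer and network systems or tools                         0
Teaching materials                                            0
Activities/competitions held                                  0
Seminars/workshops                                            0
Newsletters, websites                                         0
Participants (audience) in promotion of project results       0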

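NSC-Funded Research Project Results Report: Self-Evaluation Form

Please give an overall assessment of the following: the degree to which the research content matches the original plan; the extent to which the expected goals were achieved; the academic or applied value of the research results (briefly describe their significance, value, impact, or potential for further development); whether the results are suitable for publication in an academic journal or for a patent application; and the main findings or other relevant value.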
