Low Power 2D DCT Chip Design for Wireless
Multimedia Terminals
Liang-Gee Chen, Juing-Ying Jiu, Hao-Chieh Chang, Yung-Pin Lee, and Chung-Wei Ku
DSP/IC Design Lab, Department of Electrical Engineering, National Taiwan University
Taipei, Taiwan, R.O.C.
Tel: 886-2-363-5251 ext. 443
Fax: 886-2-363-8247
e-mail: {lgchen, howard}@video.ee.ntu.edu.tw
Abstract - In this paper, a low power 2-D DCT architecture based on the direct 2-D approach is proposed. The direct 2-D approach reduces computational complexity. Based on this algorithm, a parallel distributed arithmetic (DA) architecture operating at reduced supply voltage is derived. In the circuit implementation of the chip, a low-power adder is designed, as well as a power-saving ROM and a low-voltage two-port SRAM with sequential access. The resulting 2-D DCT chip is realized in a 0.6 µm single-poly double-metal technology. Critical path simulation indicates a maximum input rate of 133 MHz, and the chip consumes 138 mW at 100 MHz.
I. Introduction
The Discrete Cosine Transform (DCT) is the most popular and effective of the various transforms used in image and video compression standards such as JPEG, MPEG, H.261 and H.263. Since these standards have recently been applied to battery-operated systems such as portable (notebook) computers, personal digital assistants (PDAs) and wireless communication equipment, it has become imperative to develop a low power DCT chip as a component of these energy-critical systems.
Since the DCT has been standardized in recent years, many researchers and companies have devoted considerable resources to implementing it. The conventional row-column approach has the advantage of regularity for VLSI implementation, which is why most 2-D DCT chips are designed this way. However, the computational complexity of the row-column approach is higher than that of the direct method, and a low computation count is the main consideration for low power at the algorithm level. Although the direct method introduces irregularity in realizing 2-D DCT chips, its low computational complexity remains attractive for low power DCT chip design. This motivates our search for a regular 2-D DCT architecture with fewer computations, suitable for real chip implementation with the direct method.
As to low power DCT design, T. Kuroda et al. [3] proposed a 0.9 V, 150 MHz, 10 mW 2-D DCT with a variable threshold-voltage scheme, implemented in 0.3 µm CMOS triple-well technology. However, that chip achieved low power by considering only the circuit and device levels, not the algorithm level. Therefore, we propose a 2-D DCT chip incorporating low power considerations at the algorithm, architecture, and circuit design levels.
In Section II, the direct 2-D DCT algorithm is briefly discussed. The architecture exploiting this algorithm is described in Section III. In Section IV, the main circuit module designs, including adders and memories, are presented. The core characteristics are shown in Section V. Finally, a conclusion is given in Section VI.
II. The Direct 2-D DCT Algorithm
The 2-D DCT of an N x N real signal $x_{n_1,n_2}$, with the kernel factor $2c(n_1)c(n_2)/N$ neglected, is defined as:
$$Y_{k_1,k_2} = \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} x_{n_1,n_2} \cos\left[\frac{2\pi(2n_1+1)k_1}{4N}\right] \cos\left[\frac{2\pi(2n_2+1)k_2}{4N}\right], \quad n_1, n_2, k_1, k_2 = 0, 1, \ldots, N-1. \qquad (1)$$
In the following, N is assumed to be a power of 2. The signal $x_{n_1,n_2}$ can be permuted as:
$$y_{n_1,n_2} = \begin{cases}
x_{2n_1,\,2n_2}, & n_1 = 0, \ldots, N/2-1, \quad n_2 = 0, \ldots, N/2-1 \\
x_{2N-2n_1-1,\,2n_2}, & n_1 = N/2, \ldots, N-1, \quad n_2 = 0, \ldots, N/2-1 \\
x_{2n_1,\,2N-2n_2-1}, & n_1 = 0, \ldots, N/2-1, \quad n_2 = N/2, \ldots, N-1 \\
x_{2N-2n_1-1,\,2N-2n_2-1}, & n_1 = N/2, \ldots, N-1, \quad n_2 = N/2, \ldots, N-1
\end{cases}$$
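Thus, $Y_{k_1,k_2}$ can be rewritten as:
$$Y_{k_1,k_2} = \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} y_{n_1,n_2} \cos\left[\frac{2\pi(4n_1+1)k_1}{4N}\right] \cos\left[\frac{2\pi(4n_2+1)k_2}{4N}\right], \quad n_1, n_2, k_1, k_2 = 0, 1, \ldots, N-1. \qquad (2)$$
Now consider the following expression:
$$U_{k_1,k_2} = \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} y_{n_1,n_2} W_{4N}^{(4n_1+1)k_1 + (4n_2+1)k_2}, \qquad (3)$$
where $W_{4N} = \exp\left(-j\frac{2\pi}{4N}\right)$.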
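It is not difficult to find that $Y_{k_1,k_2}$ can be computed from $U_{k_1,k_2}$ by the following set of expressions:
$$Y_{k_1,k_2} = \tfrac{1}{2}\left[\mathrm{Re}\{U_{k_1,k_2}\} - \mathrm{Im}\{U_{k_1,N-k_2}\}\right], \qquad Y_{k_1,N-k_2} = \tfrac{1}{2}\left[\mathrm{Re}\{U_{k_1,N-k_2}\} - \mathrm{Im}\{U_{k_1,k_2}\}\right]. \qquad (4)$$
Note that (4) requires $U_{k_1,k_2}$ in (3) to be computed for all $k_1$ and only a sufficient subset of $k_2$ such that $\{k_2, N-k_2\}$ covers all possible values of $k_2$.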
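The chain from the permuted signal to $Y_{k_1,k_2}$ can be checked numerically. The sketch below is only an illustration, not the chip's datapath: it assumes N = 8 and recovers Y through the relation $Y_{k_1,k_2} = \tfrac{1}{2}[\mathrm{Re}\,U_{k_1,k_2} - \mathrm{Im}\,U_{k_1,N-k_2}]$, which follows from (1)-(3) by the product-to-sum identity for cosines. Whether this is the exact printed form of (4) is an assumption, and all function names are hypothetical.

```python
import cmath
import math


def permute(x):
    """Permutation of Section II: even-indexed samples first, odd-indexed ones in reverse."""
    N = len(x)

    def src(n):
        return 2 * n if n < N // 2 else 2 * N - 2 * n - 1

    return [[x[src(n1)][src(n2)] for n2 in range(N)] for n1 in range(N)]


def dct2_direct(x):
    """Unscaled 2-D DCT evaluated literally from definition (1)."""
    N = len(x)

    def c(n, k):
        return math.cos(2 * math.pi * (2 * n + 1) * k / (4 * N))

    return [[sum(x[n1][n2] * c(n1, k1) * c(n2, k2)
                 for n1 in range(N) for n2 in range(N))
             for k2 in range(N)] for k1 in range(N)]


def u_term(y, k1, k2):
    """U_{k1,k2} of expression (3), with W_4N = exp(-j*2*pi/(4N))."""
    N = len(y)
    total = 0j
    for n1 in range(N):
        for n2 in range(N):
            m = (4 * n1 + 1) * k1 + (4 * n2 + 1) * k2
            total += y[n1][n2] * cmath.exp(-2j * math.pi * m / (4 * N))
    return total


def dct2_from_u(x):
    """Recover Y from U (assumed form of (4)): Y = (Re U[k1,k2] - Im U[k1,N-k2]) / 2."""
    N = len(x)
    y = permute(x)
    # k2 runs over 0..N so that both k2 and the mirrored index N - k2 are available.
    U = {(k1, k2): u_term(y, k1, k2) for k1 in range(N) for k2 in range(N + 1)}
    return [[0.5 * (U[k1, k2].real - U[k1, N - k2].imag) for k2 in range(N)]
            for k1 in range(N)]
```

Comparing `dct2_from_u(x)` with `dct2_direct(x)` on an arbitrary 8 x 8 block is a quick way to confirm that the permutation and the complex expression are consistent with the original definition.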
By the following relation [4]:
$$4n_2 + 1 = (4t+1)(4n_1+1) \bmod 4N, \qquad (5)$$
where $0 \le t, n_1, n_2 \le N-1$, the signal $y_{n_1,n_2}$ is mapped as $y_{n_1,t}$. If $n_1$ is fixed, the mapping from $n_2$ to $t$ is one-to-one. However, with different $n_1$, the mapping order is not the same. By substituting (5) into (3), (3) can be rewritten as:
$$U_{k_1,k_2} = \sum_{n_1=0}^{N-1} \sum_{t=0}^{N-1} y_{n_1,t}\, W_{4N}^{(4n_1+1)[k_1+(4t+1)k_2]}. \qquad (6)$$
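The index mapping of relation (5) can be tabulated directly. The following is a minimal sketch, assuming Python 3.8 or later for the modular inverse via `pow(x, -1, m)`. The helper names `t_of` and `mapping_table` are hypothetical and only illustrate that, for each fixed $n_1$, the map from $n_2$ to $t$ is a permutation.

```python
def t_of(n1, n2, N):
    """Solve 4*n2 + 1 = (4*t + 1) * (4*n1 + 1) mod 4N for t, as in relation (5)."""
    modulus = 4 * N
    inv = pow(4 * n1 + 1, -1, modulus)      # 4*n1 + 1 is odd, hence invertible mod 4N
    value = (4 * n2 + 1) * inv % modulus    # congruent to 1 mod 4, so value = 4*t + 1
    return (value - 1) // 4


def mapping_table(N):
    """Row n1 lists the t assigned to n2 = 0..N-1."""
    return [[t_of(n1, n2, N) for n2 in range(N)] for n1 in range(N)]


if __name__ == "__main__":
    for n1, row in enumerate(mapping_table(8)):
        print(n1, row)
```

For N = 8, every row of the table is a permutation of 0..7, the row for $n_1 = 0$ is the identity, and the diagonal entries ($n_2 = n_1$) all map to $t = 0$. The rows differ from one another, which reflects the remark that the mapping order depends on $n_1$.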
In the deduction of (6), we let $k_1 + (4t+1)k_2 = aN + b$, where $a$ is an integer and $0 \le b \le N-1$. The summation over $n_1$ in $U_{k_1,k_2}$ is then indeed an N-point 1-D DCT. An N x N 2-D DCT can therefore be realized by N N-point 1-D DCTs plus some additions. Compared with the row-column method, which needs 2N N-point 1-D DCTs to perform an N x N 2-D DCT, this approach has lower operation complexity and is therefore more suitable for low power design at the algorithm level.
III. Low Power 2-D DCT Architecture
Since the direct 2-D DCT algorithm discussed above reduces the computational complexity, an architecture based on it is a natural starting point for low power design. The low power 2-D DCT architecture is shown in Figure 1. Since the DCT input and output range from -255 to 255 and from -2040 to 2040, respectively, the word-length of the input data is 9 bits and that of the output data is 12 bits. However, for convenience, the kernel factor $2c(n_1)c(n_2)/N$ is neglected in deriving the direct 2-D DCT method. Therefore, the word-length of the output data becomes 16 bits in order to cover the whole output range. Besides, since the 1-D DCT computation is implemented with the DA method, two-port SRAMs are used for re-ordering the input and output data. Hence, 9-bit input data are fed word-serially and, through the input SRAM, converted into 64 bit-serial data streams for the 2-D DCT. After these data are processed, the output SRAM changes the 64 word-parallel data into 16-bit bit-parallel data for the next stage, usually a zig-zag scan. The proposed 2-D DCT architecture with parallel 1-D DCT computation implemented in the DA method is shown in Figure 2.
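The distributed arithmetic (DA) principle used for the 1-D DCT can be illustrated with a small behavioural model. This is a generic sketch of the DA technique under stated assumptions, not the chip's actual datapath: the coefficients are arbitrary integers chosen for illustration, the inputs are two's-complement words processed LSB first, and the names `build_rom` and `da_inner_product` are hypothetical.

```python
def build_rom(coeffs):
    """Pre-compute the look-up table: ROM[addr] is the sum of the coefficients
    whose corresponding address bit is 1."""
    K = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << K)]


def da_inner_product(rom, xs, bits):
    """Compute sum(c_i * x_i) bit-serially, using one ROM look-up per bit position.

    xs are two's-complement integers in [-2**(bits-1), 2**(bits-1) - 1].
    """
    acc = 0
    for b in range(bits):
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> b) & 1) << i      # bit b of every input forms the ROM address
        partial = rom[addr] << b             # weight the looked-up sum by 2**b
        # The most significant bit of a two's-complement word carries negative weight.
        acc += -partial if b == bits - 1 else partial
    return acc


if __name__ == "__main__":
    coeffs = [3, -5, 7, 2]                   # illustrative values only
    rom = build_rom(coeffs)
    xs = [-255, 100, 0, 37]
    print(da_inner_product(rom, xs, bits=9), sum(c * x for c, x in zip(coeffs, xs)))
```

The multiplier is replaced by a ROM look-up and a shift-accumulate, which is why the architecture needs bit-serial, word-parallel input data and an accumulating adder per DA unit.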
Fig. 1. Low Power 2-D DCT chip architecture
[Fig. 2 index labels, eight pairs per row:
00 22 44 66 77 55 33 11
07 25 43 61 70 52 34 16
04 26 47 65 73 51 30 12
03 21 40 62 74 56 37 15
02 23 46 60 75 54 31 17
05 24 41 67 72 53 36 10
06 20 45 64 71 57 32 13
01 27 42 63 76 50 35 14]
Fig. 2. The proposed parallel DA 2-D DCT architecture
IV. Chip Implementation
The proposed low power 2-D DCT chip consists mainly of adders, memories and registers. Thus, reducing the power consumption in these components will make the whole chip consume less power.
A. Adder Design
The adder is used as the accumulator in calculating the 1-D DCT result. Since the adder also operates at low voltage, parallelism is employed to compensate for the speed loss. First, the adder adopts the square-root carry-select structure shown in Figure 3. After the large adder is divided into several stages, each stage is implemented with a Manchester adder, which improves on carry-lookahead by using a single gate to generate the carry $C_i$.
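The carry-select idea can be sketched behaviourally as follows. This is only an illustration of the general technique: the 16-bit width and the stage widths (2, 2, 3, 4, 5) are assumptions chosen to show the growing stage sizes of a square-root arrangement, not the partition used in the chip, and the Manchester carry chain inside each stage is abstracted into an ordinary addition.

```python
def carry_select_add(a, b, width=16, stages=(2, 2, 3, 4, 5)):
    """Add two unsigned `width`-bit numbers stage by stage.

    Each stage produces two candidate sums, one for carry-in 0 and one for
    carry-in 1. In hardware both are computed in parallel, and the real
    incoming carry only drives a multiplexer.
    """
    assert sum(stages) == width
    carry, result, pos = 0, 0, 0
    for w in stages:
        mask = (1 << w) - 1
        xa = (a >> pos) & mask
        xb = (b >> pos) & mask
        sum_if_0 = xa + xb          # candidate for carry-in = 0
        sum_if_1 = xa + xb + 1      # candidate for carry-in = 1
        chosen = sum_if_1 if carry else sum_if_0
        result |= (chosen & mask) << pos
        carry = chosen >> w         # carry-out selects the next stage's candidate
        pos += w
    return result, carry


if __name__ == "__main__":
    print(carry_select_add(0x1234, 0xF0F0))
```

Later stages are wider because their candidate sums have more time to settle while the selected carry ripples through the multiplexers of the earlier stages.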
Therefore, a wide adder is formed by combining the square-root carry-select adder at the architecture level with the Manchester adder at the stage circuit level. This adder inherits one characteristic from each: carry-select for high speed and Manchester for low power.
B. Power-Saving ROM
Since the 1-D DCT in our chip is implemented by the DA method, a ROM is needed to hold the pre-computed look-up table. In order to eliminate the static power consumption due to the DC path existing in a static pseudo-nMOS ROM, a better approach is to use pre-charged logic. The ROM decoder and data circuits are shown in Figure 4. An address transition detection (ATD) circuit is employed to generate the pre-charge signal pre, which is activated only when the input addresses change. During the pre-charge phase, pre = 0 and the bit-lines are pre-charged to $V_{DD}$. Meanwhile, the AND gates in the decoder ensure that all pull-down paths through the NMOS transistors are off during pre-charging. In the evaluation phase, pre = 1 and, if the word-line is activated high, the bit-line is discharged. Because the PMOS and NMOS transistors are never turned on simultaneously during the pre-charge or evaluation phase, there is no DC path from $V_{DD}$ to GND and thus no static DC power dissipation.
Fig. 4. ROM decoder and ROM data circuit in the proposed power-saving ROM
Fig. 5. The core of the two-port SRAM circuits includes memory cells, write buffers, sense amplifiers and precharging circuits
C. Low-Voltage Two-Port SRAM
Since the proposed 2-D DCT is implemented with a parallel DA architecture, data reordering is needed for bit-serial word-parallel operation. Thus, the two-port SRAM shown in Figure 5 is used for data mapping and data reordering. Note that the input port size n is different from the output port size m. While the two-port SRAM with (n=9, m=64) is for the input ping-pong mode, the two-port SRAM with (n=64, m=16) is for the output ping-pong mode. The sense amplifier consists of a cross-coupled pair of PMOS transistors and NMOS input devices. This differential pair applies positive feedback to accelerate sensing.
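The reordering performed by the input SRAM (n=9, m=64) amounts to a bit-level transposition: 64 words of 9 bits are written word-serially and read out as 9 slices of 64 bits. A minimal sketch is given below. It assumes two's-complement words and LSB-first slice order, neither of which is stated in the text, and the function names are hypothetical.

```python
def words_to_bit_serial(words, bits=9):
    """Return `bits` slices; slice b holds bit b of every word (LSB first, assumed)."""
    return [[(w >> b) & 1 for w in words] for b in range(bits)]


def bit_serial_to_words(slices):
    """Inverse transposition, interpreting the last slice as the sign bit."""
    bits = len(slices)
    words = []
    for i in range(len(slices[0])):
        value = sum(slices[b][i] << b for b in range(bits))
        if slices[bits - 1][i]:
            value -= 1 << bits      # two's-complement sign correction
        words.append(value)
    return words


if __name__ == "__main__":
    block = list(range(-32, 32))    # 64 illustrative 9-bit samples
    slices = words_to_bit_serial(block)
    print(len(slices), len(slices[0]))
```

Each 64-bit slice supplies one address bit to every DA unit in a given cycle, which is why the read port is much wider than the write port.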
V. Chip Performance and Specifications
By incorporating the module circuits discussed above, the proposed low power 2-D DCT chip with the direct method is implemented. The core characteristics are summarized in Table I.
Besides, in order to understand the power distribution in the designed chip in more detail, a power simulation at 100 MHz broken down by components is shown in Table II. From this table, the registers consume more power than any other module. Excluding the clock buffers, the next largest consumers are the memory modules. Hence, reducing the power consumption of registers and memories contributes most to lowering the power of the proposed chip. That is why we design low power components such as registers, memories and adders.
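The component breakdown can be cross-checked with a few lines of arithmetic. The sketch below uses the per-module power figures of Table II and assumes that the percentages are taken relative to the 138 mW total. The dictionary keys are shorthand labels introduced here for illustration.

```python
# Simulated power per module in mW at 100 MHz, taken from Table II.
MODULE_POWER_MW = {
    "Registers": 35.38,
    "Clock buffers": 29.35,
    "SRAM 32x16": 21.76,
    "ROM": 17.51,
    "16-bit adder": 15.48,
    "SRAM 64x9": 11.08,
    "1-bit ALU": 5.81,
    "Controller": 1.27,
}
TOTAL_MW = 138.0  # total simulated power; assumed to be the base of the percentages


def shares(power=MODULE_POWER_MW, total=TOTAL_MW):
    """Percentage of the total power per module, rounded to two decimals."""
    return {name: round(100.0 * p / total, 2) for name, p in power.items()}


if __name__ == "__main__":
    for name, pct in shares().items():
        print(f"{name:14s} {pct:6.2f} %")
    print("sum of modules:", round(sum(MODULE_POWER_MW.values()), 2), "mW")
```

The module figures add up to 137.64 mW, which rounds to the quoted 138 mW.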
Since the DCT has recently been applied to portable applications, power consumption has become a critical point in designing a 2-D DCT chip. The implementation in [1] and the product presented in [2] are not dedicated to low power design, and thus consume more power. The chip reported in [3], which utilizes a variable threshold-voltage scheme that controls the back-bias voltage, together with a more advanced technology, achieves a 10 mW 2-D DCT core processor. The main features of these chips are compared in Table III.
Although the chip presented in [3] consumes low power, its implementation lacks low power consideration at the algorithm level. Our chip is designed by taking low power algorithm, architecture, and circuits into consideration. The ideas in the two chips do not conflict. Hence, combining the low power algorithm and architecture of our chip with the variable threshold-voltage scheme of [3] would lead to a 2-D DCT chip with lower power dissipation than either chip.
Authors                  Tech.    Core area    Trans.   Voltage  Clock rate  Power
D. Slawecki et al. [1]   2 µm     72.68 mm²    67929    5 V      50 MHz      1 W
SGS-THOMSON [2]          -        -            -        5 V      20 MHz      1.5 W
T. Kuroda et al. [3]     0.3 µm   4 mm²        120000   0.9 V    150 MHz     10 mW
Our chip                 0.6 µm   50.6 mm²     152017   2 V      100 MHz     138 mW
TABLE I
Chip Characteristics
Internal word-length     16 bits
Technology               0.6 µm CMOS SPDM
No. of transistors       152017
Core size                7.85 mm x 6.45 mm
Die size                 8.98 mm x 7.79 mm
Clock rate               100 MHz
Latency                  198 cycles
Block size               8 x 8
Supply voltage           2.0 V
Power                    138 mW
VI. Conclusion
A low-power, high-performance 2-D DCT chip is implemented. The design features that contribute most to this result are as follows. First, the use of the direct 2-D DCT algorithm reduces the 2-D DCT to 1-D DCTs and some additions. Also, a fast 1-D DCT algorithm is employed. Both decrease the computational complexity, which means lower power consumption per block operation. Besides, a parallel distributed arithmetic (DA) architecture with the direct 2-D DCT approach is proposed in order to compensate for the speed loss due to the reduced internal supply voltage.
In addition to the considerations at the algorithm and architecture levels, low power design methodologies at the logic-style and circuit levels are applied to the real circuit implementation of the proposed 2-D DCT. Since adders, memories and registers are the main modules of the proposed DCT design, power savings in these circuits contribute significantly to the goal.
Finally, the proposed low power 2-D DCT chip with the direct method is implemented. The simulated maximum frequency of the chip is 133 MHz. It meets the requirement of real-time HDTV signal processing for the 4:2:0 and 4:2:2 chrominance formats. The simulated power is 138 mW at 100 MHz in 0.6 µm single-poly double-metal technology.
TABLE II
Simulated Power Dissipation by Components
Module          Count   Power (mW)   Percentage (%)
Registers       2923    35.38        25.64
Clock buffers   1       29.35        21.27
SRAM 32x16      4       21.76        15.77
ROM             64      17.51        12.69
16-bit adder    64      15.48        11.22
SRAM 64x9       2       11.08        8.03
1-bit ALU       320     5.81         4.21
Controller      1       1.27         0.92
References
[1] D. Slawecki and W. Li, "DCT/IDCT processor design for high data rate image coding," IEEE Trans. Circuits Syst. Video Technol., vol. 2, pp. 135-146, June 1992.
[2] SGS-THOMSON Microelectronics, "2-D Discrete Cosine Transform Image Processor," product no. IMSA121.
[3] T. Kuroda et al., "A 0.9V, 150 MHz, 10-mW, 4 mm², 2-D Discrete Cosine Transform Core Processor with Variable Threshold-Voltage (VT) Scheme," IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1770-1779, Nov. 1996.
[4] Y. P. Lee, T. H. Chen, L. G. Chen, M. J. Chen, and C. W. Ku, "A Cost-Effective Architecture for 8 x 8 2-D DCT/IDCT Using Direct Method," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 3, pp. 459-467, June 1997.
[5] J. Yuan and C. Svensson, "High-Speed CMOS Circuit Technique," IEEE J. Solid-State Circuits, vol. 24, no. 1, pp. 62-70, Feb. 1989.
TABLE III Processor Comparison