A block-based architecture for lifting scheme discrete wavelet transform


PAPER

A Block-Based Architecture for Lifting Scheme Discrete Wavelet Transform

Chung-Hsien YANG†a), Nonmember, Jia-Ching WANG†, Member, Jhing-Fa WANG†, and Chi-Wei CHANG†, Nonmembers

SUMMARY Two-dimensional discrete wavelet transform (DWT) for image processing is conventionally implemented with line-based architectures, which are simple and have low complexity. However, they suffer from two main shortcomings: the memory required for storing intermediate data and the long latency of computing wavelet coefficients. This work presents a new block-based architecture for computing lifting-based 2-D DWT coefficients. This architecture requires a significantly smaller buffer. Additionally, the latency is reduced from $N^2$ down to 3N as compared to line-based architectures. The proposed architecture supports the JPEG2000 default filters and has been realized on an ARM-based ALTERA EPXA10 Development Board at a frequency of 44.33 MHz.

key words: discrete wavelet transform, JPEG2000, lifting scheme, line-based DWT, VLSI

1. Introduction

Over the past decade, the discrete wavelet transform (DWT) has been widely applied in image processing. The DWT is used in the decorrelation step of still-picture compression systems. Several research results indicate that wavelets outperform the discrete cosine transform (DCT) in terms of image quality at high compression ratios by avoiding the block distortion problem suffered by DCT-based solutions. The DWT has traditionally been implemented by convolution, which requires both a large number of computations and a large amount of storage. In 1994, the lifting scheme, a new method known to be superior to the conventional convolution-based DWT, was proposed in [1], [2]. In addition to providing a significant reduction in memory and computational complexity, the lifting scheme provides in-place computation of the wavelet coefficients by overwriting the memory locations that contain the input sample values. Furthermore, it requires less hardware and offers faster computation. Therefore, the specification of the DWT kernels in JPEG2000 is provided only in terms of the lifting coefficients and not the convolutional filters.

Memory is an important constraint in many image compression applications. Existing DCT-based compression algorithms, including those defined under the JPEG standard, use memory very efficiently because, if required, they can operate on individual image blocks such that the minimum amount of memory required is very low. Although wavelet-based coders outperform DCT-based coders in terms of compression efficiency, their implementations have not yet matured. Memory efficiency is in fact one of the most important issues to be addressed before wavelet-based techniques can be widely deployed, and it is currently an area of extensive research activity related to the JPEG2000 standard.

Manuscript received December 7, 2005. Manuscript revised September 29, 2006. Final manuscript received February 8, 2007.

† The authors are with the Department of Electrical Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan (R.O.C.).

a) E-mail: [email protected]

DOI: 10.1093/ietfec/e90–a.5.1062

In the JPEG2000 verification model [9], the following wavelet filters are proposed: (5, 3) (5-tap highpass filter, 3-tap lowpass filter), (9, 7), C(13, 7), S(13, 7), (2, 6), (2, 10) and (6, 10). To be compliant with JPEG2000, a codec has to implement the (5, 3) filter in lossless mode and the (9, 7) filter in lossy mode. Some proposed architectures [3]–[7] do not implement all of the filters, and their data paths are line-based, resulting in a large buffer size and the late production of wavelet coefficients. In [3], [4], the DWT is processed by two main modules: a row module and a column module. Another structure was presented in [5] to implement all stages of the transform using a recursive architecture. A direct implementation of the lifting scheme was described in [6], and the architecture in [7] improves upon this direct implementation with a folded structure. All of these methods use a line-based data flow to process the DWT and suffer from large intermediate data storage. This paper proposes a new block-based architecture that implements the lifting scheme DWT and significantly reduces the amount of memory required. This memory efficiency is also advantageous in terms of computation speed: in the proposed system, the enforced “locality” of the filtering operations makes it more likely that strips of the image are loaded into the on-chip memory only once.

The rest of this paper is organized as follows. Section 2 briefly reviews the lifting scheme. Section 3 analyzes the precision and the data flow. Section 4 explains the proposed architectures. Section 5 presents the FPGA implementation results and comparisons with other work. Finally, Sect. 6 draws conclusions.

2. Lifting Scheme

Copyright © 2007 The Institute of Electronics, Information and Communication Engineers

Fig. 1 Lifting scheme DWT.

Fig. 2 Lifting steps for (a) the (5, 3) filter-bank and (b) the (9, 7) filter-bank.

The basic concept that underlies the lifting scheme is the factorization of the polyphase matrix of a wavelet filter into a sequence of alternating upper and lower triangular matrices and a diagonal matrix. Let $h(z)$ and $g(z)$ be the low-pass and high-pass analysis filters. The corresponding polyphase matrix is defined as

$$P(z) = \begin{bmatrix} h_e(z) & h_o(z) \\ g_e(z) & g_o(z) \end{bmatrix}, \tag{1}$$

where $h_e(z)$ contains the even coefficients of $h(z)$, $h_o(z)$ contains the odd coefficients of $h(z)$, $g_e(z)$ contains the even coefficients of $g(z)$, and $g_o(z)$ contains the odd coefficients of $g(z)$. Then, $P(z)$ can be factored into lifting steps as

$$P(z) = \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix} \prod_{i=1}^{m} \begin{bmatrix} 1 & p_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ u_i(z) & 1 \end{bmatrix}. \tag{2}$$

As shown in Fig. 1, the factorization of $P(z)$ involves three steps:

(1) Prediction step, in which the even samples are multiplied by the time-domain equivalent of $p_i(z)$ and then added to the odd samples;

(2) Update step, in which the updated odd samples are multiplied by the time-domain equivalent of $u_i(z)$ and then added to the even samples;

(3) Scaling step, in which the even samples are multiplied by 1/K and the odd samples by K.

The inverse DWT is performed by traversing in the reverse direction: changing the factor K to 1/K and the factor 1/K to K, and reversing the signs of the coefficients in $p_i(z)$ and $u_i(z)$.

The original 1-D signal $\{s_0^0, d_0^0, s_1^0, d_1^0, s_2^0, d_2^0, \ldots\}$ is split into even-indexed and odd-indexed subsequences, and these values are then modified using alternating prediction and update steps. The computational steps are summarized as

$$d_i^n = d_i^{n-1} + \sum_k p_n(k)\, s_k^{n-1}, \tag{3}$$

$$s_i^n = s_i^{n-1} + \sum_k u_n(k)\, d_k^n, \quad n \in [1, 2, \ldots, M], \tag{4}$$

where $\{s_i^n\}$ and $\{d_i^n\}$ are, respectively, the even and odd sequences, $p_n(k)$ and $u_n(k)$ are, respectively, the prediction and update weights at the nth iteration, and M is the number of lifting iterations. For the (5, 3), C(13, 7), S(13, 7), (2, 6), and (2, 10) filter-banks, M = 1, while for the (9, 7) and (6, 10) filter-banks, M = 2. Equation (3) describes the prediction step, which predicts each odd sample and subtracts the prediction from the odd sample to form the prediction error $\{d_i^n\}$. Equation (4) describes the update step, which updates the even samples by adding to them a linear combination of the already modified odd samples, $\{d_i^n\}$, to form the updated sequence $\{s_i^n\}$. The output of the final prediction step gives the high-pass coefficients up to a scaling factor K, while the output of the final update step gives the low-pass coefficients up to a scaling factor 1/K. For the (9, 7) filter-bank, K = 1.230174104914001. The lifting steps of the (5, 3) filter-bank and the (9, 7) filter-bank [8] are depicted in Fig. 2.

Table 1 Computational complexity comparison between convolution and lifting schemes.

The number of computations required to calculate a high-pass, low-pass pair of wavelet coefficients using the convolution and lifting schemes is given in Table 1. Compared with convolution, the lifting scheme significantly reduces the number of multiplications for odd-tap filters. For even-tap filters, the convolution scheme has fewer or an equal number of multiplications. The number of additions for the lifting scheme is lower for both odd-tap and even-tap filters. Such a reduction in computational complexity makes lifting schemes attractive for both high-throughput and low-power applications.
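The prediction and update steps of Eqs. (3) and (4) can be illustrated for the (5, 3) filter-bank (M = 1). The following Python sketch is only an illustration under stated assumptions: it uses the reversible integer form of the (5, 3) lifting steps (prediction weight −1/2, update weight 1/4, with floor rounding), whole-sample symmetric extension at the borders, and an even-length input. The function names `lift53_forward` and `lift53_inverse` are hypothetical and are not part of the proposed architecture.

```python
def lift53_forward(x):
    """One level of 1-D (5, 3) lifting on an even-length list of integers."""
    assert len(x) % 2 == 0 and len(x) >= 2
    s = list(x[0::2])  # even-indexed samples s_i^0
    d = list(x[1::2])  # odd-indexed samples d_i^0
    n = len(d)
    # Prediction step, Eq. (3): d_i^1 = d_i^0 - floor((s_i^0 + s_{i+1}^0) / 2)
    for i in range(n):
        right = s[i + 1] if i + 1 < n else s[i]  # symmetric extension (assumed)
        d[i] -= (s[i] + right) >> 1
    # Update step, Eq. (4): s_i^1 = s_i^0 + floor((d_{i-1}^1 + d_i^1 + 2) / 4)
    for i in range(n):
        left = d[i - 1] if i > 0 else d[i]  # symmetric extension (assumed)
        s[i] += (left + d[i] + 2) >> 2
    return s, d


def lift53_inverse(s, d):
    """Undo lift53_forward by running the steps in reverse with opposite signs."""
    s, d = list(s), list(d)
    n = len(d)
    # Undo the update step first; d still holds the high-pass coefficients.
    for i in range(n):
        left = d[i - 1] if i > 0 else d[i]
        s[i] -= (left + d[i] + 2) >> 2
    # Undo the prediction step using the recovered even samples.
    for i in range(n):
        right = s[i + 1] if i + 1 < n else s[i]
        d[i] += (s[i] + right) >> 1
    x = [0] * (2 * n)
    x[0::2] = s
    x[1::2] = d
    return x
```

Because the inverse traverses the same steps in the reverse direction with the signs reversed, the input is recovered exactly; no scaling step is needed in this integer form.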

3. Precision Analysis

The drawback of using a fixed-point data format when implementing application-specific integrated circuit (ASIC) chips is that precision can be reduced. To overcome this drawback, additional bits are needed to ensure precision, and their number is determined by image quality analysis.

The filter coefficients of the seven JPEG2000 filters considered herein range from 0.003906 to 2 [4]. To convert the filter coefficients to integers, they are multiplied by 256. The values of the coefficients then range from 1 to 512, so that 10 bits can be used to represent the coefficients in 2’s complement form. At the end of the multiplication, the product is shifted right by eight bits to yield the required result. The rounding is applied to the individual product terms instead of the result of the filter operation.

Fig. 3 General lifting-based structures.
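The coefficient conversion described above can be sketched in Python as follows. The numerical lifting coefficients of the (9, 7) filter-bank used here are the commonly quoted values and are an assumption of this example, since the paper does not list them; the helper names (`quantize_coeff`, `mul_shift`, `fits_twos_complement`) and the round-half-up convention are likewise illustrative.

```python
# Commonly quoted (9, 7) lifting coefficients (assumed; not listed in the paper).
LIFT97 = {
    "alpha": -1.586134342,
    "beta": -0.05298011854,
    "gamma": 0.8829110762,
    "delta": 0.4435068522,
}


def quantize_coeff(a, frac_bits=8):
    """Integer coefficient A = Round(2^frac_bits * a); frac_bits = 8 gives x256."""
    return int(round(a * (1 << frac_bits)))


def mul_shift(A, x, frac_bits=8):
    """Multiply sample x by integer coefficient A and shift right by frac_bits.

    Adding half an LSB before the shift rounds this single product term
    (round-half-up is an assumed convention).
    """
    return (A * x + (1 << (frac_bits - 1))) >> frac_bits


def fits_twos_complement(value, bits):
    """True if value is representable in a bits-wide 2's complement word."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))
```

For example, `quantize_coeff(LIFT97["alpha"])` yields the integer coefficient that is multiplied with a sample by `mul_shift`, so that no floating-point hardware is required.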

Now we consider the format of the signal values for hardware implementation. The signal values must be shifted left to increase the precision. The extent of the shift is determined by image quality analysis. Consider the general structure of lifting schemes, as indicated in Fig. 3, which computes

$$y = a(x_1 + x_2) + b(x_3 + x_4) + x_5, \tag{5}$$

where a and b are the coefficients, $x_k$, $1 \le k \le 5$, are the signal inputs, and y is the transformed value. With $A = \mathrm{Round}(256 \times a)$ and $B = \mathrm{Round}(256 \times b)$, Eq. (5) can be expressed as

$$y \approx \frac{1}{256}\left[A(x_1 + x_2) + B(x_3 + x_4)\right] + x_5. \tag{6}$$

If the input values $x_k$ are shifted left by S extension bits, then

$$y \approx \frac{1}{2^S}\left\{\frac{1}{256}\left[A(2^S x_1 + 2^S x_2) + B(2^S x_3 + 2^S x_4)\right] + 2^S x_5\right\}. \tag{7}$$

The order of the computation is changed to improve its precision:

$$\hat{y}(S) = \frac{1}{2^S}\left\{\left[\frac{A(2^S x_1 + 2^S x_2)}{256}\right]_{\mathrm{round}} + \left[\frac{B(2^S x_3 + 2^S x_4)}{256}\right]_{\mathrm{round}} + 2^S x_5\right\}_{\mathrm{round}}, \tag{8}$$

where the subscript round denotes rounding. Rounding is applied as soon as each term has been calculated. The SNR values obtained with different numbers of extension bits for the Baboon, Lenna, Elaine, and Boat images after three levels of forward and inverse transforms are given in Table 2. For a given set of images, the extension bit number S was varied to find the value at which the SNR performance saturates, that is, the value beyond which additional bits introduce only a slight SNR improvement. According to Figs. 4 and 5, the SNR improves only slightly when S > 5, so the proposed architecture uses five extension bits for processing the DWT.
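The effect of the extension bits in Eq. (8) can be checked numerically. The Python sketch below is a hedged illustration, not the authors' evaluation procedure: it compares the term-wise rounded value against the unrounded value of Eq. (6), leaves the final division by $2^S$ unrounded so that the residual error remains visible, and uses round-half-up for each product term. The function names and sample values are hypothetical.

```python
def round_shift8(v):
    """Round-half-up division of an integer by 256 (assumed rounding rule)."""
    return (v + 128) >> 8


def y_reference(A, B, x):
    """Eq. (6): unrounded value computed with the integer coefficients A and B."""
    x1, x2, x3, x4, x5 = x
    return (A * (x1 + x2) + B * (x3 + x4)) / 256 + x5


def y_hat(A, B, x, S):
    """Eq. (8) with S extension bits; each product term is rounded separately.

    The final division by 2^S is returned as a float (no final rounding),
    which is a simplification made for this illustration.
    """
    x1, x2, x3, x4, x5 = (v << S for v in x)  # shift inputs left by S bits
    total = round_shift8(A * (x1 + x2)) + round_shift8(B * (x3 + x4)) + x5
    return total / (1 << S)


def rounding_error(A, B, x, S):
    """Absolute difference between Eq. (8) and Eq. (6)."""
    return abs(y_hat(A, B, x, S) - y_reference(A, B, x))
```

Each rounded term is off by at most one half in the shifted domain, so the two product terms together contribute at most $2^{-S}$ after the division by $2^S$. With S = 5, the error of this sketch is therefore bounded by 1/32 of one signal unit.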


Table 2 SNR values after three levels of DWT.

Fig. 4 SNR values for different numbers of extension bits after a three-level DWT using the (5, 3) filter.

Fig. 5 SNR values for different numbers of extension bits after a three-level DWT using the (9, 7) filter.

Once the number of extension bits is chosen, the width of the data path must be determined. This can be done by observing the maximum and minimum values of the forward and inverse transforms at the end of each level. Table 3 presents the maximum and minimum values for the Baboon, Lenna, Elaine and Boat images with five extension bits. This table indicates that 16 bits are required to represent the transformed values in 2’s complement representation.

The multiplier multiplies a 16-bit number by a 10-bit number and then reduces the product to a 16-bit output by rounding off the eight LSBs (to account for the increased precision of the filter coefficients) and discarding the two MSBs. (Sixteen bits are required to represent the outputs, and therefore the two MSBs are sign extension bits.)

Table 3 Maximum and minimum values with five extension bits.

4. Proposed VLSI Architectures

4.1 Proposed Data Flow Diagram

For each level of the DWT using the line-based method, the filtering along columns is performed after the completion of the filtering along rows, as shown in Fig. 6. For instance, in image processing, it requires $N^2$ words for intermediate data storage. This may be too large to fit on a single chip even for moderately sized images. While the line-based method can be efficient for 1-D applications, 2-D line-based architectures suffer from the bottleneck that the required memory equals the input data size. Besides this disadvantage, the line-based approach does not lend itself to parallel processing.

In this paper, the proposed data flow for the DWT does not follow the line-based method; a new block-based data flow is presented instead. When the input image is divided into several blocks, the coefficients of each layer (i.e., LL, LH, HL, HH) can be obtained concurrently within a block. This method can be thought of as a window sliding over the image, and the overlapping design lets the window slide smoothly across the image. The idea behind the overlapping block architecture is to take only as many inputs as are required to compute a set of outputs. For example, a 1-D version requires only one filter length (L) of inputs and produces two outputs: a low-pass and a high-pass coefficient. The 2-D case takes $L^2$ inputs and produces four outputs. In general, an n-dimensional transform needs $L^n$ inputs to produce $2^n$ outputs. Figure 7 presents an example of the data flow using a (5, 3) filter-bank. The size of the input image is assumed to be 5 × 5 pixels, and a block of 3 × 3 pixels is used. Three intermediate data are produced in Fig. 7(a). Figure 7(b) shows that the three intermediate data are used to generate the first transformed datum Z (the black circle, for the HH layer) and, simultaneously, to generate the other intermediate data. Once the transform coefficient Z within a block has been calculated, the corresponding intermediate datum Y (the gray circle) no longer needs to be buffered. In summary, Figs. 7(a) to (k) show how the output data Z are calculated from the input data X.
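The sliding-window idea can be sketched in one dimension for the (5, 3) filter-bank, where a window of L = 5 inputs yields one low-pass and one high-pass output. The Python sketch below is an illustration under assumptions rather than the proposed hardware data flow: it uses the reversible integer (5, 3) lifting steps, whole-sample symmetric extension at the borders, and an even-length input, and it recomputes the overlapped high-pass value in every window instead of keeping it in a register. The names `window53` and `sliding53` are hypothetical.

```python
def window53(w):
    """Low-pass/high-pass output pair from one window of L = 5 samples.

    w holds x[2i-2], x[2i-1], x[2i], x[2i+1], x[2i+2].
    """
    d_prev = w[1] - ((w[0] + w[2]) >> 1)        # high-pass d_{i-1} (overlap)
    d_cur = w[3] - ((w[2] + w[4]) >> 1)         # high-pass d_i
    s_cur = w[2] + ((d_prev + d_cur + 2) >> 2)  # low-pass s_i
    return s_cur, d_cur


def sliding53(x):
    """Slide the 5-sample window over x in steps of two samples."""
    n = len(x)
    assert n % 2 == 0 and n >= 4
    # Symmetric extension (assumed): x[-2] = x[2], x[-1] = x[1], x[n] = x[n-2].
    ext = [x[2], x[1]] + list(x) + [x[n - 2]]
    return [window53(ext[2 * i: 2 * i + 5]) for i in range(n // 2)]
```

Applying the same window along both image dimensions gives the 2-D case, in which $L^2$ inputs produce the four outputs LL, LH, HL, and HH. Consecutive windows overlap by three samples, which is why storing the overlapped data avoids repeated computation and memory access.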


Fig. 6 Procedure of 2-D line-based DWT.

Fig. 7 Data flow diagrams of the proposed DWT transform.

4.2 Proposed Architectures

The proposed block-based architecture for 2-D DWT is depicted in Fig. 8. The outputs in each level are LL, LH, HL, and HH. The LL data are used for the next level of decomposition. This system has three primary stages. In the first stage, the input data are read and the block controller forms a “block” according to the double-buffer scheme. Once a “block” of input data is ready for processing, it is sent to the pipeline register for the next stage.

The second stage is the PE Y controller, which processes the intermediate transform data within a block and stores the data in BUFFER Y. The last stage, the PE Z controller, processes the final transform coefficients. BUFFER Z is used only for the 4M filters, because two passes of the one-dimensional transform are calculated in a round. The registers in Fig. 8 are used for storing the second-pass input. The details are discussed in the following subsections.

4.3 Block Controller Modules

The block controller modules read the image input data. BUFFER X stores the input data and is used to segment the image data into sub-blocks. BUFFER X contains two banks (MEM1 and MEM2) to implement the double-buffer scheme. In the first step, data are read from the External Memory into MEM1 (see Fig. 8). In the second step, once MEM1 is full of image data, MEM2 starts to receive image data while MEM1 is simultaneously read out, forming a “block” for processing; MEM2 then waits until the processing of MEM1 is completed. The third step is similar to the second step but with the roles of MEM1 and MEM2 exchanged. The first step is executed only once, after which the second and third steps are performed alternately until the entire image has been processed. A simplified finite state machine of the block controller is shown in Fig. 9.
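The three steps of the double-buffer scheme can be summarized as a small schedule. The Python sketch below is only a behavioural illustration and does not reproduce the finite state machine of Fig. 9: the function name `double_buffer_schedule`, the notion of a numbered "strip" of input rows, and the tuple layout are assumptions made for this example.

```python
def double_buffer_schedule(num_strips):
    """Return a list of (step, bank_being_filled, bank_being_processed).

    num_strips is the (hypothetical) number of bank-sized portions of the
    image; None means that no bank is filled or processed in that phase.
    """
    banks = ("MEM1", "MEM2")
    # Step 1 (executed once): fill MEM1; nothing can be processed yet.
    schedule = [(1, "MEM1", None)]
    for k in range(1, num_strips + 1):
        # While strip k is processed, the other bank receives strip k + 1.
        fill = banks[k % 2] if k < num_strips else None
        process = banks[(k - 1) % 2]
        step = 2 if k % 2 == 1 else 3  # steps 2 and 3 alternate
        schedule.append((step, fill, process))
    return schedule
```

For three strips, the schedule is step 1 (fill MEM1), step 2 (fill MEM2, process MEM1), step 3 (fill MEM1, process MEM2), and a final step 2 in which only MEM1 is processed.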

4.4 Processing Elements (PE) Modules

Two PE modules are used in our design. PE Y reads a block of data from BUFFER X, calculates the intermediate data Y, and writes the data into BUFFER Y, while PE Z reads a block of data from BUFFER Y, calculates the transform data Z, and writes the data into BUFFER Z. The basic computation unit, the MAC, is shown in Fig. 10. Figures 11 and 12 show the structures of the processing elements for the 2M and 4M filter banks, respectively. REG1 and REG2 are used for storing the overlapped data of the block in the 2M filter banks. While the 4M filter banks are being processed, all four registers are used to reduce the number of memory accesses. Repeated memory accesses are thus avoided, which reduces the power consumption.

In our algorithm, a block has two frames. In each frame, the processing element calculates a high-pass and low-pass pair of coefficients. PE Y and PE Z can perform their transforms simultaneously once PE Z has enough input data to do so. Thus, the computational time can be significantly reduced.

Fig. 8 Proposed architecture of 2-D discrete wavelet transform.

Fig. 9 Illustration of the finite state machine for the block controller.

Fig. 10 Architecture of basic computation unit, MAC.

Fig. 11 Architecture of processing element for 2M filters.

Fig. 12 Architecture of processing element for 4M filters.

4.5 Memory Modules

The double-buffer and overlapping structures are adopted, so the size of each of MEM1 and MEM2 in the proposed block-based architecture is N × 2, where N is the width of the input image. While the MEM1 (MEM2) data are being handled, all of them are processed in PE Y and stored in BUFFER Y. At this time, PE Z starts to process the other dimension, since it has sufficient data for processing. The memory size is much lower than that of line-based architectures, whose memory requirement is N × N/2 [3], [4].

BUFFER Y and BUFFER Z each have size N × 4. As shown in Fig. 13, while one row of intermediate data is being processed, the three other rows can be accessed for simultaneous processing of the other dimension. These four rows are rewritten circularly.
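The circular rewriting of the four rows can be sketched as a ring buffer indexed modulo four. The Python class below is a software illustration under assumptions, not a description of the on-chip memory organization of Fig. 13; the class name `CircularRowBuffer` and its methods are hypothetical.

```python
class CircularRowBuffer:
    """N x 4 buffer whose four rows are overwritten in circular order."""

    def __init__(self, width, rows=4):
        self.rows = rows
        self.data = [[0] * width for _ in range(rows)]
        self.count = 0  # total number of rows written so far

    def write_row(self, row):
        """Store a new row, overwriting the oldest one once the buffer is full."""
        self.data[self.count % self.rows] = list(row)
        self.count += 1

    def read_row(self, logical_index):
        """Return row number logical_index if it is still held in the buffer."""
        oldest = max(0, self.count - self.rows)
        if not oldest <= logical_index < self.count:
            raise IndexError("row is no longer (or not yet) in the buffer")
        return self.data[logical_index % self.rows]
```

With this organization, the storage stays at four rows regardless of the image height, because each newly produced row replaces the row that is no longer needed.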

Fig. 13 Organization of BUFFER Y and BUFFER Z.

Table 4 (a) Schedule of PE Y for the (5, 3) filter applied on a 5× 5 image. (b) Schedule of PE Z for the (5, 3) filter applied on a 5× 5 image.


4.6 Scheduling

A detailed schedule of the (5, 3) filter-bank has been generated, as shown in Tables 4(a) and (b). In the example of a 5 × 5 image, the input data are x(i, j), where i and j are the vertical and horizontal indices, respectively, with 1 ≤ i, j ≤ 5. In the 11th cycle, the last element required for calculating the first second-dimension coefficient is ready for processing. Thus, in the subsequent cycle, the first horizontal wavelet coefficient, Z(2, 2), can be calculated. Afterwards, the DWT coefficients are generated in every cycle. The total computational time for one level of decomposition on an N × N image, using the (5, 3) filter, is 2 × [N/2] × [N/2] + 2.

5. FPGA Implementation

To realize the proposed architecture, the ALTERA EPXA10 Development Board (ALTERA™ EXCALIBUR™ EPXA10F1020C2) was utilized. Figure 14 shows the system architecture of the embedded stripe and the interfaces to the PLD portion of the device [11]. This architecture promotes maximum integration with minimal system cost and allows the embedded stripe and the PLD to be independently optimized for maximum performance. Two AMBA-compliant AHBs ensure that the embedded processor activity is unaffected by peripheral and memory operation. Three bidirectional AHB-to-AHB bridges enable embedded peripherals and PLD-implemented peripherals to exchange data with the embedded processor or with other peripherals. With these interfaces, the performance of the ARM922T is uncompromised and is equivalent to an ASIC implementation on a 0.18-µm CMOS process. The implementation results are summarized in Table 5. The critical path of the system is about 22.557 ns, which means that the maximum operating frequency is roughly 44.33 MHz. As shown in Fig. 8, the critical path is the path between two pipeline registers (through a multiplexer and a PE controller). Figure 15 depicts the schematic view of the whole system. A photo of the prototype system is given in Fig. 16.

Table 5 The implementation results of the FPGA prototype.

Fig. 15 Schematic view of the whole system.

The buffer size, hardware utilization, and computational time of the proposed architecture are compared below with those of other architectures. In the proposed architecture, the buffer memory is significantly reduced, as shown in Table 6. Table 6 also shows that, while block-based architectures may use more computing time, the work can be divided among many processors. In the proposed architecture, the first wavelet transform coefficient is generated as soon as possible. The total computational time can also be reduced in comparison with that of other architectures, which facilitates quantization in JPEG2000 image compression and represents another advantage of the proposed block-based structure.

Fig. 16 Photo of prototype system.

6. Conclusions

Line-based DWT architectures are efficient for 1-D applications. In 2-D (or higher-dimensional) transforms, they suffer from two main problems: memory requirements and latency. For example, image processing requires $N^2$ words for storing intermediate data, which may not fit on a single chip even for moderately sized images. Also, the latency depends on the input size: at least O(N) clock cycles are required to generate the first output. These problems are inherent in line-based architectures.

This paper offers a new data processing path and presents a new VLSI architecture that implements the 2-D lifting scheme DWT with a small memory. The DWT coefficients are computed using a block-based data path. This architecture reduces the latency to 3N and also reduces the total required memory. Finally, the proposed design has been successfully verified using an ARM-based ALTERA EPXA10 Development Board.

References

[1] W. Sweldens, “The lifting scheme: A new philosophy in biorthogonal wavelet constructions,” Proc. SPIE: Wavelet Applications in Signal and Image Processing III, vol.2569, pp.68–79, 1995.

[2] I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting schemes,” J. Fourier Analysis and Applications, vol.4, pp.247–269, 1998.

[3] K. Andra, C. Chakrabarti, and T. Acharya, “A VLSI architecture for lifting based wavelet transform,” Proc. IEEE Workshop Signal Process. Syst., pp.70–79, Oct. 2000.

[4] T. Acharya, K. Andra, and C. Chakrabarti, “A VLSI architecture for lifting-based forward and inverse wavelet transform,” IEEE Trans. Signal Process., vol.50, no.4, pp.966–977, April 2002.

[5] B.F. Cockburn, H. Liao, and M.K. Mandal, “Novel architectures for the lifting-based discrete wavelet transform,” Proc. IEEE Conf. on Electrical and Computer Engineering, vol.2, pp.1020–025, 2002.

[6] C.-C. Liu, Y.-H. Shiau, and J.-M. Jou, “Design and implementation of a progressive image coding chip based on the lifted wavelet transform,” Proc. 11th VLSI Design/CAD Symposium, pp.49–52, Aug. 2000.

[7] C.-J. Lian, K.-F. Chen, H.-H. Chen, and L.-G. Chen, “Lifting based discrete wavelet transform architecture for JPEG2000,” Proc. IEEE International Symposium on Circuits and Systems, vol.2, pp.445–448, 2001.

[8] J.M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Trans. Signal Process., vol.41, no.12, pp.3445–3462, Dec. 1993.

[9] D. Taubman, “JPEG2000 verification model vm3a,” ISO/IEC JTC1/SC29/WG1N1143, Feb. 1999.

[10] S. Movva and S. Srinivasan, “A novel architecture for lifting-based discrete wavelet transform for JPEG2000 standard suitable for VLSI implementation,” Proc. 16th International Conference on VLSI Design, pp.202–207, Jan. 2003.

[11] Altera Corporation, Altera Device Package Information Data Sheet, http://www.altera.com/literature/lit-index.html

[12] H. Liao, M. Mandal, and B. Cockburn, “Efficient architecture for 1-D and 2-D lifting-based wavelet transforms,” IEEE Trans. Signal Process., vol.52, no.5, pp.1315–1326, May 2004.

[13] P.-C. Wu and L.-G. Chen, “An efficient architecture for two-dimensional discrete wavelet transform,” IEEE Trans. Circuits Syst. Video Technol., vol.11, no.4, pp.536–545, April 2001.

[14] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, “VLSI architecture for forward discrete wavelet transform based on b-spline factorization,” J. VLSI Signal Process., vol.40, no.3, pp.343–353, July 2005.

Chung-Hsien Yang received the B.S. degree in Computer Science and Information Engineering from Tunghai University, Taichung, Taiwan, in 1997 and the M.S. degree in Computer Science and Information Engineering from National Cheng Kung University, Tainan, Taiwan, in 1999. He is a Ph.D. candidate in the Department of Electrical Engineering at National Cheng Kung University. His research areas include stochastic processes and VLSI design.

Jia-Ching Wang received the M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 1997 and 2002, respectively. His research interests include signal processing and VLSI architecture design. Dr. Wang is an honor member of Phi Tau Phi. He is also a member of IEEE and ACM.

Jhing-Fa Wang is a Chair Professor at National Cheng Kung University, Tainan, Taiwan. He received his Bachelor and Master degrees from the Department of Electrical Engineering, National Cheng Kung University, Taiwan, in 1973 and 1979, respectively, and his Ph.D. degree from the Department of Computer Science and Electrical Engineering, Stevens Institute of Technology, U.S.A., in 1983. He was elected an IEEE Fellow in 1999 and is now the Chairman of the IEEE Tainan Section. He received outstanding awards from the Institute of Information Industry in 1991 and from the National Science Council of Taiwan in 1990, 1995, and 1997. He was invited to give a keynote speech at PACLIC 12 (Pacific Asia Conference on Language, Information and Computation), Singapore, and served as the general chairman of the International Symposium on Communication (ISCOM 2001), Taiwan. He developed a Mandarin speech recognition system called Venus-Dictate, known as a pioneering system in Taiwan. He was an associate editor for IEEE Transactions on Neural Networks and VLSI Systems. He is currently leading a multidisciplinary research group for the development of Advanced Ubiquitous Media for Created Cyberspace. He has published about 91 journal papers and 217 conference papers and obtained 5 patents since 1983. His research areas include wireless content-based media processing, speech recognition and natural language understanding.


Chi-Wei Chang received the B.S. degree in Biomedical Engineering from Chung Yuan Christian University, Chung Li, Taiwan, in 1998, and the M.S. degree in Electrical Engineering from National Cheng Kung University, Tainan, Taiwan, in 2003. His research areas include discrete wavelet transform, image processing and VLSI design.