
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 48, NO. 2, FEBRUARY 2000 561

A Generalized Output Pruning Algorithm for Matrix-Vector Multiplication and Its Application to Compute Pruning Discrete Cosine Transform

Yuh-Ming Huang, Ja-Ling Wu, and Chi-Lun Chang

Abstract—In this correspondence, a generalized output pruning algorithm for matrix-vector multiplication is proposed. It is shown that, for a given decomposition of the transform kernel matrix and a given pruning pattern, the unnecessary operations in computing an output pruning discrete cosine transform (DCT) can be eliminated thoroughly by using the proposed algorithm.

I. INTRODUCTION

Recently, many one-dimensional (1-D) and two-dimensional (2-D) fast pruning DCT algorithms for computing only the lower frequency components have been proposed [1]–[3]. However, to the best of our knowledge, no known generalized pruning method can be directly applied to any orthogonal discrete transform (ODT), such as the DCT, the discrete Fourier transform (DFT), or the discrete Hartley transform (DHT). In this correspondence, a generalized output pruning algorithm for computing matrix-vector multiplication of any order is presented. It is shown that, for a given decomposition of the matrix, the unnecessary operations can be eliminated thoroughly. An efficient pruning DCT algorithm can then be derived from the prescribed pruning algorithm. The applicability of the proposed output pruning algorithm is not limited to the DCT; it can be applied to all well-known discrete orthogonal transforms, such as the DFT, the DHT, and the discrete sine transform (DST). In this work, however, the pruning DCT algorithm is our only focus.

II. GENERALIZED OUTPUT PRUNING ALGORITHM FOR MATRIX-VECTOR MULTIPLICATION

Consider a general matrix-vector multiplication of order $N$, say, $D_N = A_{N \times N} \times B_N$, and assume that only some of the multiplication outputs $D_N[j]$ (where $D_N[j]$ is the $j$th entry of the vector $D_N$, $1 \le j \le N$) are required. It follows that we can speed up the aforecited computation by pruning the unnecessary operations.

To reduce the computational complexity, we decompose the matrix $A_{N \times N}$ into a product of a sequence of sparser matrices of the same order, that is, $A_{N \times N} = \prod_{i=0}^{k-1} C^{i}_{N \times N}$. By the associative property of matrix-vector multiplication, $D_N$ can be computed recursively as

$$B^{0}_N = B_N, \qquad B^{i}_N = C^{k-i}_{N \times N} \times B^{i-1}_N, \quad 1 \le i \le k, \qquad D_N = B^{k}_N. \tag{1}$$

Since there are $k$ stages of matrix-vector multiplication of order $N$ in (1), no matter what the output pruning pattern is, $k \times N$ bits are required to record whether each $B^{i,j}_N$ has to be computed or not, where $B^{i,j}_N$ is the inner product of the $j$th row vector of $C^{k-i}_{N \times N}$ and the output vector $B^{i-1}_N$ of the previous stage.
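The staged evaluation in (1) can be illustrated with a small Python sketch. The helper names (`matvec`, `staged_product`) and the two 2 × 2 factors are hypothetical and chosen only for illustration; indices are 0-based, unlike the 1-based indices of the text.

```python
def matvec(C, v):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(c * x for c, x in zip(row, v)) for row in C]


def staged_product(factors, b):
    """Evaluate (C^0 C^1 ... C^{k-1}) b stage by stage, as in recursion (1).

    factors[i] plays the role of C^i, so the last factor is applied first.
    Returns the list [B^0, B^1, ..., B^k]; the final entry is D.
    """
    stages = [list(b)]                    # B^0 = B
    for C in reversed(factors):           # stage i multiplies by C^{k-i}
        stages.append(matvec(C, stages[-1]))
    return stages


# Hypothetical example with k = 2 factors of order N = 2.
C0 = [[1, 1], [1, -1]]
C1 = [[2, 0], [0, 3]]
stages = staged_product([C0, C1], [1, 1])
print(stages[-1])  # D = (C0 C1) b
```

Each intermediate vector in `stages` corresponds to one $B^{i}_N$, which is the granularity at which the pruning decisions discussed next are made.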

Manuscript received January 19, 1999; revised June 15, 1999. The associate editor coordinating the review of this paper and approving it for publication was Dr. Xiang-Gen Xia.

Y.-M. Huang is with the Department of Information Engineering, National Chi-Nan University, Puli, Taiwan, R.O.C.

J.-L. Wu and C.-L. Chang are with Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.

Publisher Item Identifier S 1053-587X(00)00970-3.

In this section, a more efficient algorithm for computing output-pruning matrix-vector multiplication is presented. In this algorithm, only $\lceil \log(k+1) \rceil \times N$ bits are required to record whether the partial results $B^{i,j}_N$ have to be computed or not. In other words, we need an array, say, $M$ of order $N$, with each entry $\lceil \log(k+1) \rceil$ bits in size, to record which operations are required and which are unnecessary.

If the computation of $D_N[j]$ is necessary, then initially let $M[j] = 0$; otherwise, let $M[j] = 255$ or a large integer. The final value of each entry of $M$ evolves gradually through the processing of $C^{0}_{N \times N}$ up to that of $C^{k-1}_{N \times N}$, and it is precomputed and stored according to the characteristics of the concerned matrix $C^{i}_{N \times N}$, as described below.
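The initialization of $M$ and the storage comparison can be sketched as follows. This is an illustration under stated assumptions: indices are 0-based, the logarithm is taken to be base 2 (since the text counts bits), and the names `init_mask`, `bits_per_entry`, and `UNNEEDED` are hypothetical.

```python
UNNEEDED = 255  # sentinel used in the text for outputs that are not required


def init_mask(n, wanted):
    """Initial M: 0 for every required output index, the sentinel otherwise."""
    return [0 if j in wanted else UNNEEDED for j in range(n)]


def bits_per_entry(k):
    """ceil(log2(k + 1)), computed exactly with integer arithmetic."""
    return k.bit_length()


# Hypothetical case: keep the 3 lowest-index outputs of an order-8 product
# that has been decomposed into k = 6 factors.
n, k = 8, 6
M = init_mask(n, {0, 1, 2})
print(M)
print("per-stage flags:", k * n, "bits; array M:", bits_per_entry(k) * n, "bits")
```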

A. Encoding Processes

Let $T$ be a control (threshold) parameter whose value is initially set to zero.

1) $C^{i}_{N \times N}$ is a permutation matrix, that is, for any vector $V_N$ of order $N$, the result of the matrix-vector multiplication $C^{i}_{N \times N} \times V_N$ is just a position swapping of $V_N$. In this case, the entries of $M$ are unchanged in value but permuted according to the inverse permutation matrix $(C^{i}_{N \times N})^{-1}$, and the value of $T$ is unchanged.

2) $C^{i}_{N \times N}$ is a diagonal matrix, that is, all the entries of $C^{i}_{N \times N}$ are equal to zero except the diagonal components. In this case, the values of each entry of $M$ and of $T$ are unchanged.

3) $C^{i}_{N \times N}$ is a general diagonal matrix, that is, all of its diagonal components are nonzero, and no constraint is set on the nondiagonal components. In this case, the value of $T$ is increased by one. The new value of $T$ (denoted by $T_t$) is used as a threshold marking the following situation: in the matrix-vector multiplication stage $C^{i}_{N \times N} \times B^{k-i-1}_N = B^{k-i}_N$, some output entry $B^{k-i,s}_N$ is unnecessary (i.e., $M[s] = 255$), whereas the entry $B^{k-i-1,s}_N$ of the input vector $B^{k-i-1}_N$ is required to compute some other output entry $B^{k-i,r}_N$. That is, the $s$th input entry $B^{k-i-1,s}_N$ has to be computed correctly before dealing with the matrix-vector multiplication $C^{i}_{N \times N} \times B^{k-i-1}_N$, but after that, the $s$th output entry $B^{k-i,s}_N$ is of no use for later stages. In other words, if $M[r] < T_t$ for an output $r$ that uses the $s$th input, and $M[s] = 255$, then we set $M[s] = T_t$.

4) $C^{i}_{N \times N}$ can be decomposed into a product of a general diagonal matrix and a permutation matrix, or vice versa. In this case, the array $M$ is processed by merging the methods presented in 1) and 3).

5) The other matrix forms that do not belong to any of the above four types are categorized as type 5). Notice that the matrices discussed in 1)–3) are special subsets of 4). Hence, by definition, matrices of type 5) cannot be decomposed into a product of a general diagonal matrix and a permutation matrix. Moreover, according to the following corollary, each matrix of type 5) is a linearly dependent matrix.

Corollary 1: If a matrix of size $N \times N$ cannot be decomposed into a product of a general diagonal matrix and a permutation matrix of the same size, then its determinant is equal to 0.

By Corollary 1, any well-defined discrete transform matrix $A_{N \times N}$, which is linearly independent, will never be categorized as a type 5) matrix.
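One way to see why Corollary 1 holds is through the Leibniz expansion of the determinant. The following is a sketch of the contrapositive argument, not necessarily the authors' own proof; $a_{r,s}$ denotes the entries of the matrix $A$, and $\sigma$ ranges over the permutations of $\{1, \dots, N\}$ (both notations are introduced here only for this sketch).

```latex
\det A \;=\; \sum_{\sigma} \operatorname{sgn}(\sigma) \prod_{r=1}^{N} a_{r,\sigma(r)} \;\neq\; 0
\quad\Longrightarrow\quad
\exists\, \sigma \ \text{such that}\ a_{r,\sigma(r)} \neq 0 \ \text{for all } r .
% Let P_\sigma be the permutation matrix with [P_\sigma]_{s,c} = 1 iff s = \sigma(c).
% Then (A P_\sigma)_{r,r} = a_{r,\sigma(r)} \neq 0, so A P_\sigma is a general diagonal matrix and
A \;=\; (A P_\sigma)\, P_\sigma^{-1} .
```

Hence a matrix with nonzero determinant can always be written as a general diagonal matrix times a permutation matrix, which is the contrapositive of the corollary.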

For the sake of convenience, those sets composed of the matrices discussed in 1)–4) are, respectively, denoted by P, D, GD, and PGD.
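The encoding rules 1)–3) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the 0-based indexing, and the four example factors are assumptions, and any PGD factor is assumed to have been split beforehand into a separate permutation factor and a general diagonal factor, so that rule 4) reduces to rules 1) and 3).

```python
UNNEEDED = 255  # sentinel for "output not required", as in the text


def is_permutation(C):
    """True if every row and every column holds a single 1 and zeros elsewhere."""
    unit = [0] * (len(C) - 1) + [1]
    return all(sorted(r) == unit for r in C) and all(sorted(c) == unit for c in zip(*C))


def is_diagonal(C):
    n = len(C)
    return all(C[r][s] == 0 for r in range(n) for s in range(n) if r != s)


def is_general_diagonal(C):
    return all(C[j][j] != 0 for j in range(len(C)))


def encode(factors, wanted):
    """Process C^0, ..., C^{k-1} in that order; return the final M and T_f."""
    n = len(factors[0])
    M = [0 if j in wanted else UNNEEDED for j in range(n)]
    T = 0
    for C in factors:
        if is_permutation(C):
            # Rule 1: output r is a copy of input s, so input s inherits M[r].
            new_M = [UNNEEDED] * n
            for r in range(n):
                new_M[C[r].index(1)] = M[r]
            M = new_M
        elif is_diagonal(C):
            pass                          # Rule 2: M and T are unchanged.
        elif is_general_diagonal(C):
            T += 1                        # Rule 3: T becomes the threshold T_t.
            for r in range(n):
                if M[r] < T:              # output r is still needed
                    for s in range(n):
                        if C[r][s] != 0 and M[s] == UNNEEDED:
                            M[s] = T      # input s is needed only up to this stage
        else:
            raise ValueError("factor is not of type P, D, or GD")
    return M, T


# Hypothetical order-4 decomposition A = P * Dg * G1 * G2 (G2 is applied first).
P = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
Dg = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
G1 = [[1, 1, 0, 0], [1, -1, 0, 0], [0, 0, 1, 1], [0, 0, 1, -1]]
G2 = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, -1, 0], [0, 1, 0, -1]]

M, T_f = encode([P, Dg, G1, G2], wanted={0, 1})
print(M, T_f)
```

In this example only the first two outputs are requested; the resulting array marks positions 1 and 3 as needed only as inputs to the general diagonal stage `G1`, not as its outputs.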

Once the final values of the entries of $M$ have been obtained through the processing of $C^{0}_{N \times N}$ up to that of $C^{k-1}_{N \times N}$, then, with the help of $M$, all the unnecessary operations in the computation of $C^{0}_{N \times N} \times C^{1}_{N \times N} \times \cdots \times C^{k-1}_{N \times N} \times B_N$ can be eliminated thoroughly. Let the final accumulated value of $T$ be denoted by $T_f$. This means that, among the matrices $C^{i}_{N \times N}$, $0 \le i \le k-1$, there are $T_f$ matrices that belong to GD or PGD. In other words, once the matrix decomposition is given, the value of $T_f$ is precomputable.

TABLE I
NUMBERS OF REQUIRED MULTIPLICATIONS AND ADDITIONS FOR THE RESULTANT PRUNING DCT ALGORITHMS WITH RESPECT TO DIFFERENT MATRIX DECOMPOSITIONS AND DIFFERENT PRUNING PATTERNS

Now, let us show how the unnecessary operations in the computation of $C^{0}_{N \times N} \times C^{1}_{N \times N} \times \cdots \times C^{k-1}_{N \times N} \times B_N$ can be eliminated thoroughly with the aid of $M$ and $T$.

B. Decoding Process

First, let $T = T_f$. The elimination of unnecessary operations proceeds stage by stage, from the computation of $C^{k-1}_{N \times N} \times B_N$ to that of $C^{0}_{N \times N} \times B^{k-1}_N$.

1) If $C^{i}_{N \times N} \in P$, then the entries of $M$ are swapped according to the permutation pattern defined by $C^{i}_{N \times N}$.

2) If $C^{i}_{N \times N} \in D$, then the $j$th output $B^{k-i,j}_N$ needs to be computed only when $M[j] \le T$; otherwise, it can be left out, pruning the unnecessary operations.

3) If $C^{i}_{N \times N} \in GD$, then the $j$th output $B^{k-i,j}_N$ needs to be computed only when $M[j] < T$; otherwise, it can be left out, pruning the unnecessary operations. Furthermore, the value of $T$ is decreased by one in this case.

4) If $C^{i}_{N \times N} \in PGD$, it follows that $C^{i}_{N \times N}$ can be decomposed into a product of a general diagonal matrix $D_g$ and a permutation matrix $P_p$ (or vice versa). Then, the entries of $M$ are first permuted according to $P_p$, and the $j$th output $B^{k-i,j}_N$ has to be computed only when $M[j] < T$; otherwise, it can be left out for pruning. The value of $T$ is also decreased by one in this case.

The above statements describe the detailed procedures of the proposed pruning algorithm. Because the final value of $T$, i.e., $T_f$, is never greater than $k$, only $\lceil \log(k+1) \rceil \times N$ bits are required to record the evolution of $M$.
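The decoding rules 1)–3) can likewise be sketched in Python. The sketch below is illustrative only: names and 0-based indexing are assumptions, PGD factors are again assumed to be pre-split into P and GD factors, and the values of `M_final` and `T_f` are what the encoding rules give when worked by hand for the hypothetical factors shown, with outputs 0 and 1 requested. Pruned entries are represented by `None`, and the count of computed entries is a deliberately simple cost measure, not the multiplication and addition counts of Table I.

```python
def is_permutation(C):
    """True if every row and every column holds a single 1 and zeros elsewhere."""
    unit = [0] * (len(C) - 1) + [1]
    return all(sorted(r) == unit for r in C) and all(sorted(c) == unit for c in zip(*C))


def is_diagonal(C):
    n = len(C)
    return all(C[r][s] == 0 for r in range(n) for s in range(n) if r != s)


def pruned_multiply(factors, b, M_final, T_f):
    """Apply C^{k-1}, ..., C^0 to b, skipping entries that M marks as unnecessary."""
    n = len(b)
    M, v, T = list(M_final), list(b), T_f
    computed = 0
    for C in reversed(factors):
        if is_permutation(C):
            # Rule 1: move the data and the entries of M together; no arithmetic.
            src = [row.index(1) for row in C]
            v = [v[s] for s in src]
            M = [M[s] for s in src]
            continue
        diagonal = is_diagonal(C)
        out = [None] * n
        for j in range(n):
            # Rule 2 (D): keep when M[j] <= T.  Rule 3 (GD): keep when M[j] < T.
            needed = M[j] <= T if diagonal else M[j] < T
            if needed:
                out[j] = sum(C[j][s] * v[s] for s in range(n) if C[j][s] != 0)
                computed += 1
        v = out
        if not diagonal:
            T -= 1                        # one general diagonal stage consumed
    return v, computed


# Hypothetical order-4 decomposition A = P * Dg * G1 * G2 (G2 is applied first).
P = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
Dg = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
G1 = [[1, 1, 0, 0], [1, -1, 0, 0], [0, 0, 1, 1], [0, 0, 1, -1]]
G2 = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, -1, 0], [0, 1, 0, -1]]

M_final, T_f = [0, 1, 0, 1], 2            # hand-derived from the encoding rules
result, computed = pruned_multiply([P, Dg, G1, G2], [1, 2, 3, 4], M_final, T_f)
print(result, computed)
```

For this input the unpruned product is `[10, -12, -4, 0]`, so the two requested outputs agree with it, while 8 rather than 12 stage entries are evaluated.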

Corollary 2: Let $A_{N \times N}$ $(= \prod_{i=0}^{k-1} C^{i}_{N \times N})$ be a linearly independent matrix. From the proposed algorithm, it can be deduced that in the computation of $A_{N \times N} \times B_N$, $B_N[j]$ is necessary only when $M[j] \le T_f$.

Lemma 1: Let $A_{N \times N}$ $(= \prod_{i=0}^{k-1} C^{i}_{N \times N})$ be a linearly independent matrix. Then, all the unnecessary operations can be eliminated thoroughly by the proposed pruning algorithm.

From Lemma 1, for a linearly independent matrix $A_{N \times N}$, we know that the unnecessary operations can be eliminated thoroughly when only partial outputs of the matrix-vector multiplication $A_{N \times N} \times B_N$ are required. However, for a particular pruning pattern, does there exist another scheme that can further reduce the number of required operations? In the next corollary, we show that the number of required operations cannot be reduced merely by applying the permutation technique. Moreover, for $C^{i}_{N \times N} \in PGD$, the gain of pruning does not change even if we apply a different decomposition to $C^{i}_{N \times N}$. This will be shown in Lemma 2.

Corollary 3: Let $A_{N \times N}$ be a linearly independent matrix and $P_c$ be a permutation matrix. For any pruning pattern, in computing the expressions $A_{N \times N} \times B_N$, $(A_{N \times N} P_c)(P_c^{-1} B_N)$, and $P_c (P_c^{-1} A_{N \times N}) B_N$, the simplification gains obtained from pruning the unnecessary operations are the same.

Lemma 2: Let $A_{N \times N}$ be a linearly independent matrix. Based on the proposed pruning algorithm, the simplification gain remains unchanged even if we apply a different decomposition to the matrix $C^{i}_{N \times N}$ $(\in PGD)$.

Therefore, a more effective decomposition of the matrix $A_{N \times N}$ is necessary if we want to obtain a better simplification gain.

III. APPLICATION OF THE PROPOSED OUTPUT PRUNING ALGORITHM TO THE COMPUTATION OF PRUNING DCT

Since the DCT is an orthogonal discrete transform, its transform kernel matrix must be a linearly independent matrix. That is, the pruning algorithm presented in Section II can be directly applied to derive efficient pruning DCT algorithms. Moreover, all well-known DCT algorithms (such as [4]–[6]) and pruning DCT algorithms (such as [1]–[3]) can be modeled as a matrix-vector multiplication with a known decomposition of the DCT transform kernel matrix.

Since the optimality of the proposed pruning algorithm is decomposition dependent, we can not only derive effective pruning DCT algorithms but also compare the effectiveness of the matrix decomposition corresponding to each existing fast algorithm by checking the complexities of the pruning algorithms so obtained.

The following data are obtained by applying the proposed output pruning algorithm to derive efficient pruning DCT algorithms, based on the matrix decompositions presented in [1], [5], and [6]. For the 1-D DCT of length 64, Table I lists the numbers of required multiplications and additions for the corresponding pruning DCT algorithms with respect to different pruning patterns.

The most well-known pruning DCT algorithm, presented in [1], gives the same complexities as those listed in the first column of Table I. This fact verifies the correctness and effectiveness of the proposed pruning algorithm. As for the other two algorithms (or matrix decompositions), the gain obtained from pruning is less significant. The number of pruned multiplications is larger in Winograd's approach, whereas the number of pruned additions is larger in Lee's approach. These characteristics can be observed and explained from the corresponding algorithm structures. In Winograd's DCT algorithm, the required multiplications are post-processing oriented, whereas in Lee's DCT algorithm, most of the post-processing oriented operations are additions. That is, if the complexity of multiplication is the major concern, then the pruning gain will be more significant when the required multiplications of the algorithm are mostly post-processing oriented.

IV. CONCLUSIONS

In this correspondence, an index-registration technique is presented to establish an effective framework for developing efficient pruning algorithms for various ODT's. Moreover, with the aid of the proposed technique, an automatic optimal output pruning ODT program generator can be developed. This is currently under investigation.

REFERENCES

[1] Z. Wang, “Pruning the fast discrete cosine transform,” IEEE Trans. Commun., vol. 39, pp. 640–643, May 1991.

[2] A. N. Skodras, “Fast discrete cosine transform pruning,” IEEE Trans. Signal Processing, vol. 42, pp. 1833–1837, July 1994.

[3] C. A. Christopoulos, J. Bormans, J. Cornelis, and A. N. Skodras, “The vector-radix fast cosine transform: Pruning and complexity analysis,” Signal Process., vol. 43, pp. 197–205, May 1998.

[4] H. S. Hou, “A fast recursive algorithm for computing the discrete cosine transform,” IEEE Trans. Acoust., Speech Signal Processing, vol. ASSP-35, pp. 1455–1461, Oct. 1987.

[5] B. G. Lee, “A new algorithm to compute the discrete cosine transform,” IEEE Trans. Acoust., Speech Signal Processing, vol. ASSP-32, pp. 1243–1245, 1984.

[6] E. Feig and S. Winograd, “Fast algorithms for the discrete cosine transform,” IEEE Trans. Signal Processing, vol. 40, pp. 2174–2193, Sept. 1992.

[7] I. W. Selesnick and C. S. Burrus, “Automatic generation of prime length FFT programs,” IEEE Trans. Signal Processing, vol. 44, pp. 14–24, Jan. 1996.

A Novel Design Technique for Biorthogonal Filterbank Systems

Youhong Lu and Joel M. Morris

Abstract—In this correspondence, we present a design technique for the cosine-modulated FIR biorthogonal filter bank systems. The system achieves perfect reconstruction with a given analysis or synthesis prototype filter. In particular, if the analysis filter is a good approximation of an ideal lowpass filter, then so is the synthesis filter, and the difference is a measure of ideality of the lowpass analysis filter. The advantage of the technique is that we have more freedom in the choice of prototype filters.

Index Terms—Biorthogonal, cosine modulation, filterbank, Gabor expansion, perfect reconstruction condition.

I. INTRODUCTION

Multirate analysis and synthesis filter systems are useful in signal analysis and representation [1]. There are many techniques for designing this kind of system so that it satisfies the perfect reconstruction property, for example, the halfband filter-based technique, the power complementary-based technique, the lapped lattice-based technique, and the paraunitary-based technique [1]. The most efficient technique for implementing this system, we believe, is the tree-structured filter bank system [1].

The cosine-modulated analysis and synthesis filter system has been studied in depth by many researchers because its design is simpler and more realizable than that of a general filter bank system [1]–[14]. Most past and current design techniques set a fixed relationship between the analysis and synthesis filter banks, for example, $h_m(k) = f_m(N - k - 1)$, where $h_m(k)$ and $f_m(k)$ are the $m$th band analysis and synthesis filters, respectively, and $N$ is the length of the filters. The system design problem, therefore, becomes a set of analysis or synthesis filter design problems. This design usually requires us to solve a nonlinear equation, and consequently, nonlinear optimization methods have to be used.

In many applications, a set of desired analysis or synthesis filters might be required. For example, in echo cancellation based on time-frequency techniques for telecommunication systems, the analysis filters have to be designed for maximum performance, and the synthesis filters are then designed based on the designed analysis filters to keep the distortion as small as possible [13]; in image processing, modulated-Gaussian filters are frequently used to extract image features such as edges and textures. In this work, we mainly discuss the design of the set of synthesis filters for a given desired set of analysis filters. Since the filter bank system still holds if we exchange the set of analysis filters with the set of synthesis filters, based on our biorthogonal-like sequence concept [12], this is equivalent to the design of the set of analysis filters for a given desired set of synthesis filters. We will call this filter bank system the biorthogonal filter bank.

Manuscript received October 7, 1995; revised May 12, 1999. The associate editor coordinating the review of this paper and approving it for publication was Editor-in-Chief Prof. José M. F. Moura.

Y. Lu is with the DSP Department, 3Com Corporation, Mount Prospect, IL 60056 USA.

J. M. Morris is with the Department of Computer Science and Electrical Engineering, University of Maryland-Baltimore County, Catonsville, MD 21228 USA.