
Automatic Rank Determination in Projective Nonnegative Matrix Factorization

Zhirong Yang, Zhanxing Zhu, and Erkki Oja

Department of Information and Computer Science, Aalto University School of Science and Technology, P.O. Box 15400, FI-00076 Aalto, Finland
{zhirong.yang,zhanxing.zhu,erkki.oja}@tkk.fi

Abstract. Projective Nonnegative Matrix Factorization (PNMF) has demonstrated advantages in both sparse feature extraction and clustering. However, PNMF requires users to specify the column rank of the approximative projection matrix, whose value is unknown beforehand. In this paper, we propose a method called ARDPNMF to automatically determine the column rank in PNMF. Our method is based on automatic relevance determination (ARD) with the Jeffreys prior. After deriving the multiplicative update rule for ARDPNMF using the expectation-maximization technique, we test it on various synthetic and real-world datasets for feature extraction and clustering applications to show the effectiveness of our algorithm. For the FERET faces and the Swimmer dataset, the correct, interpretable numbers of features are obtained by our algorithm. Several UCI datasets for clustering are also tested, on which we find that ARDPNMF can estimate the number of clusters quite accurately, with low deviation and good cluster purity.

1 Introduction

Since its introduction by Lee and Seung [1] as a new machine learning method, Nonnegative Matrix Factorization (NMF) has been applied successfully in many areas, including signal processing, text clustering, and gene expression studies (see [2] for a survey). Recently, much progress on NMF has been reported in both theory and practice, and several variants extending the original NMF have been proposed (e.g. [3–5]). Projective Nonnegative Matrix Factorization (PNMF), introduced in [6–8], approximates a data matrix by its nonnegative subspace projection. Compared with NMF, PNMF has a number of benefits, such as better generalization, a sparser factorizing matrix without ambiguity, and a close relation to principal component analysis, which are advantageous in both feature extraction and clustering [8].

However, a remaining difficult problem is how to determine the dimensionality of the approximating subspace in PNMF in practical applications. In most cases, one has to guess a suitable component number, e.g. the number of features needed to encode facial images. Such trial-and-error procedures can be tedious in practice. In this work, we propose a variant of PNMF called ARDPNMF that can automatically determine the dimensionality of the factorizing matrix. Our method is based on the automatic relevance determination (ARD) [9] technique, which has been used in Bayesian PCA [10] and adaptive sparse supervised learning [11]. The proposed algorithm is free of user-specified parameters. Such a property is especially desirable for exploratory analysis of the data structure. Empirical results on several synthetic and real-world datasets demonstrate that our method can effectively discover the number of features or clusters.

⋆ Supported by the Academy of Finland in the project Finnish Centre of Excellence in Adaptive Informatics Research.

This paper is organized as follows. In Section 2, we summarize the essence of PNMF and model selection in NMF. Then, we derive our algorithm ARDPNMF in Section 3. In Section 4, the experimental results of the proposed algorithm on a variety of synthetic and real datasets for feature extraction and clustering are presented. Section 5 concludes the paper.

2 Related Work

2.1 Projective Nonnegative Matrix Factorization

Given a nonnegative input matrix $\mathbf{X} \in \mathbb{R}_+^{m \times n}$, Projective Nonnegative Matrix Factorization (PNMF) seeks a nonnegative matrix $\mathbf{W} \in \mathbb{R}_+^{m \times r}$ such that

$$\mathbf{X} \approx \mathbf{W}\mathbf{W}^T\mathbf{X}. \qquad (1)$$

Compared with the NMF approximation scheme $\mathbf{X} \approx \mathbf{W}\mathbf{H}$, PNMF replaces the matrix $\mathbf{H}$ with $\mathbf{W}^T\mathbf{X}$. As a result, PNMF has a number of advantages over NMF [8], including high sparseness in the factorizing matrix $\mathbf{W}$, closer equivalence to clustering, easy nonlinear extension to a kernel version, and fast approximation of newly arriving samples without heavy re-computation. The name "projective" comes from the fact that $\mathbf{W}\mathbf{W}^T$ is very close to a projection matrix, because the $\mathbf{W}$ learned by PNMF is highly orthogonal. It can be made fully orthogonal by post-processing.

PNMF based on the Euclidean distance solves the following optimization problem:

$$\min_{\mathbf{W} \geq 0} \; J_F(\mathbf{W}) = \frac{1}{2} \sum_{ij} \left[ X_{ij} - \left( \mathbf{W}\mathbf{W}^T\mathbf{X} \right)_{ij} \right]^2. \qquad (2)$$

Previously, Yuan and Oja [6] presented a multiplicative algorithm that iteratively applies the following update rules for the above minimization:

$$W_{ik} \leftarrow W_{ik} \frac{A_{ik}}{B_{ik}}, \qquad (3)$$

$$\mathbf{W}^{\text{new}} \leftarrow \mathbf{W} / \|\mathbf{W}\|, \qquad (4)$$

where $\mathbf{A} = 2\mathbf{X}\mathbf{X}^T\mathbf{W}$, $\mathbf{B} = \mathbf{W}\mathbf{W}^T\mathbf{X}\mathbf{X}^T\mathbf{W} + \mathbf{X}\mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{W}$, and $\|\mathbf{W}\|$ denotes the square root of the maximal eigenvalue of $\mathbf{W}^T\mathbf{W}$, i.e. the spectral norm.
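To make the iteration concrete, the following is a minimal NumPy sketch of updates (3)–(4). The function name, the random initialization, the fixed iteration count, and the small constant guarding against division by zero are illustrative choices, not taken from the paper.

```python
import numpy as np

def pnmf_euclidean(X, r, n_iter=500, seed=0):
    """Sketch of Euclidean PNMF via the multiplicative updates (3)-(4)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], r))   # nonnegative random initialization
    XXt = X @ X.T                     # precompute X X^T once (m x m)
    for _ in range(n_iter):
        XXtW = XXt @ W
        A = 2.0 * XXtW                               # A = 2 X X^T W
        B = W @ (W.T @ XXtW) + XXtW @ (W.T @ W)      # B = W W^T X X^T W + X X^T W W^T W
        W = W * (A / np.maximum(B, 1e-12))           # multiplicative step, Eq. (3)
        W = W / np.linalg.norm(W, 2)                 # divide by spectral norm, Eq. (4)
    return W
```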


2.2 Model Selection in NMF

In NMF, Tan and Févotte [12] addressed the model selection problem based on automatic relevance determination. First, a prior is placed on the columns of the matrix W and the rows of H, and a Bayesian NMF model with this prior is built. After maximizing the posterior, they obtain a multiplicative update rule that performs both factorization and determination of the component number simultaneously. The limitation of this method is that the prior distribution still depends on hyper-parameters. For real-world applications, the hyper-parameters must be chosen suitably in advance to obtain reasonable results. In this sense, the method is not totally automatic in determining the component number.

In the following section, we overcome this problem and apply the ARD method to PNMF by selecting the Jeffreys prior [13] to get rid of the hyper-parameters. The resulting algorithm is then totally automatic, without any user-specified parameters.

3 ARDPNMF

Firstly, we construct a generative model for PNMF based on the Euclidean distance, where the likelihood function is a normal distribution:

$$p(X_{ij}|\mathbf{W}) = \mathcal{N}\!\left(X_{ij} \,\middle|\, \left(\mathbf{W}\mathbf{W}^T\mathbf{X}\right)_{ij},\, 1\right). \qquad (5)$$

Following the approach of Bayesian PCA [10], we place a prior on the $k$th column of $\mathbf{W}$ with variance $\gamma_k$. Due to the nonnegativity in PNMF, we model each entry of this column with a half-normal distribution:

$$p(W_{ik}|\gamma_k) = \mathcal{HN}(W_{ik}|0,\gamma_k) = \frac{\sqrt{2}}{\sqrt{\pi\gamma_k}} \exp\!\left(-\frac{W_{ik}^2}{2\gamma_k}\right) \qquad (6)$$

for $W_{ik} \geq 0$, and zero otherwise.

Similar to [13], we impose a non-informative Jeffreys hyper-prior on the variances $\gamma$ to control the sparseness of $\mathbf{W}$:

$$p(\gamma_k) \propto \frac{1}{\gamma_k}. \qquad (7)$$

We choose this prior because it expresses ignorance with respect to scale, and the resulting model is parameter-free, which plays a significant role in determining the component number automatically.

The posterior of $\mathbf{W}$ for the above model is given by

$$p(\mathbf{W}|\mathbf{X},\gamma) \propto p(\mathbf{X}|\mathbf{W})\, p(\mathbf{W}|\gamma). \qquad (8)$$

Because $\gamma$ is unobserved, we apply the Expectation-Maximization (EM) algorithm, regarding $\gamma$ as a hidden variable.

E-step. Given the current parameter estimates and the observed data, the E-step computes the expectation of the complete log-posterior, known as the Q-function:

$$Q(\mathbf{W}|\mathbf{W}^{(t)}) = \int \log p(\mathbf{W}|\mathbf{X},\gamma)\, p(\gamma|\mathbf{W}^{(t)},\mathbf{X})\, d\gamma. \qquad (9)$$


Thanks to the properties of the Jeffreys prior, we obtain a concise form of the Q-function following the derivation in [13]:

$$Q(\mathbf{W}|\mathbf{W}^{(t)}) = -J_F(\mathbf{W}) - \frac{1}{2}\,\mathrm{Tr}\!\left(\mathbf{W}\mathbf{V}^{(t)}\mathbf{W}^T\right), \qquad (10)$$

where $J_F(\mathbf{W})$ is the original objective function of PNMF (see Equation (2)), and $\mathbf{V}^{(t)}$ is a diagonal matrix with $V_{ii}^{(t)} = \left\|\mathbf{w}_i^{(t)}\right\|^{-2}$, where $\left\|\mathbf{w}_i^{(t)}\right\|$ is the $L_2$-norm of the $i$th column of the matrix $\mathbf{W}^{(t)}$. Note that we ignore the constants independent of $\mathbf{W}$ to present a simplified version of the Q-function.

M-step. This step maximizes the Q-function with respect to the parameters:

$$\mathbf{W}^{(t+1)} = \arg\max_{\mathbf{W}} Q(\mathbf{W}|\mathbf{W}^{(t)}), \qquad (11)$$

which is equivalent to minimizing its negative,

$$Q_{\text{ARD}}(\mathbf{W}|\mathbf{W}^{(t)}) = -Q(\mathbf{W}|\mathbf{W}^{(t)}) = J_F(\mathbf{W}) + \frac{1}{2}\,\mathrm{Tr}\!\left(\mathbf{W}\mathbf{V}^{(t)}\mathbf{W}^T\right). \qquad (12)$$

The derivative of $Q_{\text{ARD}}(\mathbf{W}|\mathbf{W}^{(t)})$ with respect to $\mathbf{W}$ is

$$\frac{\partial Q_{\text{ARD}}(\mathbf{W}|\mathbf{W}^{(t)})}{\partial W_{ik}} = -A_{ik} + B_{ik} + \left(\mathbf{W}\mathbf{V}^{(t)}\right)_{ik}, \qquad (13)$$

where $\mathbf{A}$ and $\mathbf{B}$ are as defined after Equation (3).

A commonly used principle for forming multiplicative update rules in NMF is

$$W_{ik} \leftarrow W_{ik}^{(t)} \frac{\nabla^-_{ik}}{\nabla^+_{ik}}, \qquad (14)$$

where $\nabla^-$ and $\nabla^+$ denote the negative and positive parts of the derivative [1]. Applying this principle to the gradient given in Equation (13), we obtain the multiplicative update rule for ARDPNMF:

$$W_{ik} \leftarrow W_{ik}^{(t)} \frac{A_{ik}^{(t)}}{B_{ik}^{(t)} + \left(\mathbf{W}^{(t)}\mathbf{V}^{(t)}\right)_{ik}}. \qquad (15)$$

The ARDPNMF algorithm is summarized in Algorithm 1. After the algorithm converges, we apply a simple thresholding that keeps the columns of W whose norms are larger than a small constant ε. In practice such thresholding is insensitive to the choice of ε, because the ARD prior forces the norms towards two extremes, as demonstrated in Section 4.1.

4 Experimental Results

We have implemented the ARDPNMF algorithm and tested it on various synthetic and real-world datasets to evaluate its effectiveness. The focus is on feature extraction and clustering.


Algorithm 1. ARDPNMF based on the Euclidean distance

Usage: W ← ARDPNMF(X, r), where r < m is a large initial component number.

  Initialize W^(0); t ← 0.
  repeat
    V^(t) ← diag(‖w_1^(t)‖^(−2), ..., ‖w_r^(t)‖^(−2))
    W_ik ← W_ik^(t) · A_ik^(t) / (B_ik^(t) + (W^(t) V^(t))_ik)
    W^(t+1) ← W / ‖W‖
    t ← t + 1
  until the convergence conditions are satisfied
  Check the diagonal elements of the matrix V, and keep the columns of W with large L2-norms as the effective components.
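The following is a minimal NumPy sketch of Algorithm 1. The iteration count, the numerical guards, and the threshold value tol_eps are our own illustrative assumptions; the paper only states that the threshold is a small constant to which the result is insensitive.

```python
import numpy as np

def ardpnmf(X, r=36, n_iter=1000, tol_eps=1e-3, seed=0):
    """Sketch of Algorithm 1 (ARDPNMF, Euclidean case)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], r))   # nonnegative random initialization
    XXt = X @ X.T
    for _ in range(n_iter):
        # V = diag(||w_1||^-2, ..., ||w_r||^-2), from the E-step (Eq. 10)
        inv_sq_norms = 1.0 / np.maximum(np.sum(W ** 2, axis=0), 1e-12)
        XXtW = XXt @ W
        A = 2.0 * XXtW
        B = W @ (W.T @ XXtW) + XXtW @ (W.T @ W)
        WV = W * inv_sq_norms                      # equals W @ diag(inv_sq_norms)
        W = W * (A / np.maximum(B + WV, 1e-12))    # ARD update, Eq. (15)
        W = W / np.linalg.norm(W, 2)               # normalize by the spectral norm
    # keep the effective components: columns whose L2-norm exceeds the threshold
    return W[:, np.linalg.norm(W, axis=0) > tol_eps]
```

Note that the only extra work relative to plain PNMF is the diagonal matrix V; columns whose norms the ARD prior drives towards zero receive an ever larger penalty in the denominator, so they decay to zero and are pruned at the end.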

Fig. 1. Some sample images from the Swimmer dataset.

4.1 Swimmer Dataset

The Swimmer dataset [14] consists of 256 images of size 32 × 32, each of which depicts a figure with one static part (torso) and four moving parts (limbs). Each moving part has four different positions. Four of the 256 images are displayed in Figure 1. The task here is to extract the 16 limb positions and 1 torso position.

Firstly, we vectorized each image matrix and treated it as one column of the input matrix X. The initial component number was set to r = 36. Each column of W learned by ARDPNMF has the same dimensionality as the input column vectors and can thus be displayed as a basis image, as in Figure 2. We found that our algorithm correctly extracts all 17 desired features. The L2-norms of all the columns of W are shown in Figure 3. The L2-norms of the ineffective basis images are equal to zero or very close to zero; the three values between 0 and 1 correspond to three duplicates of the torso.
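As a hypothetical usage sketch of this experiment, assuming the images are available as a NumPy array (the file name and loading step are assumptions):

```python
import numpy as np

images = np.load("swimmer.npy")        # assumed array of shape (32, 32, 256)
X = images.reshape(32 * 32, 256)       # vectorize each image into one column
W = ardpnmf(X, r=36)                   # generous initial rank, as in the paper
print("effective components:", W.shape[1])   # ideally 17 (16 limbs + 1 torso)
```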

4.2 FERET Faces Dataset

The FERET face dataset [15], used here for feature extraction, consists of the inner part of 2409 faces of size 32 × 32. We normalized the images by dividing the pixel values by their maximum value, 255. In ARDPNMF, the initial component number was chosen as r = 64. Figure 4 shows the resulting basis images, which exhibit high sparseness in the factorizing matrix W and capture nearly all facial parts.

Fig. 4. The 64 basis images of the FERET dataset. 55 of them are effective bases; the L2-norms of the remaining gray ones are zero or close to zero.


Fig. 2. The 36 basis images of the Swimmer dataset. The gray cells correspond to columns whose L2-norms are zero or very close to zero.

Fig. 3. The L2-norms of the 36 basis vectors of the Swimmer dataset, plotted against the basis index.

4.3 Clustering for UCI Datasets

Clustering is another important application of PNMF. Here we construct the input matrix X by treating each sample vector as a row. The index of the maximal value in a row of W then indicates the cluster membership of the corresponding sample [8]. We adopt a widely used measure called purity [8] for quantitative analysis of the clustering results, defined as

$$\text{purity} = \frac{1}{n} \sum_{k=1}^{r} \max_{1 \leq l \leq q} n_{lk}, \qquad (16)$$

where $q$ is the true number of classes, $r$ is the effective number of components (clusters), $n_{lk}$ is the number of samples in cluster $k$ that belong to the original class $l$, and $n$ is the total number of samples. A larger purity value indicates a better clustering result, and the value 1 indicates total agreement with the ground truth.
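A short sketch of Equation (16), assuming integer-coded ground-truth labels; the function name is our own:

```python
import numpy as np

def cluster_purity(W, labels):
    """Purity of the PNMF clustering; labels are nonnegative integer classes."""
    clusters = np.argmax(W, axis=1)            # cluster index = argmax of each row
    correct = 0
    for k in np.unique(clusters):
        members = labels[clusters == k]        # true classes inside cluster k
        correct += np.bincount(members).max()  # majority class count, max_l n_lk
    return correct / len(labels)
```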



Table 1. Clustering performance

Dataset                   iris         ecoli        glass        wine        parkinsons
Number of classes         3            5            6            3           2
Estimated cluster number  4.34 ± 0.71  2.74 ± 0.60  3.34 ± 0.61  3 ± 0.40    4.37 ± 0.58
Purity                    0.95 ± 0.01  0.68 ± 0.06  0.67 ± 0.05  0.9 ± 0.09  0.77 ± 0.02


We chose several commonly used datasets from the UCI repository¹ as experimental data. On each dataset, ARDPNMF was run 100 times with different random seeds for the initialization of W, and the initial cluster number was set to r = 36. Table 1 shows the mean and standard deviation of the number of clusters and of the purities, together with the numbers of ground-truth classes. ARDPNMF automatically estimates cluster numbers that are not far from the true class numbers, with small deviations. Furthermore, our method achieves reasonably good clustering performance, especially when the estimated r is close to the ground truth.

¹ http://www.ics.uci.edu/~mlearn/MLRepository.html
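A hypothetical sketch of this evaluation protocol, assuming a data matrix X (one sample per row) and integer labels y are already loaded; all names come from the earlier sketches, not from the paper:

```python
import numpy as np

estimates, purities = [], []
for seed in range(100):                      # 100 restarts with different seeds
    W = ardpnmf(X, r=36, seed=seed)
    estimates.append(W.shape[1])             # effective number of clusters
    purities.append(cluster_purity(W, y))
print(f"clusters: {np.mean(estimates):.2f} +/- {np.std(estimates):.2f}")
print(f"purity:   {np.mean(purities):.2f} +/- {np.std(purities):.2f}")
```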

5 Conclusion

In this paper, using a Bayesian construction and the EM algorithm, we have presented the ARDPNMF algorithm, which can automatically determine the rank of the projection matrix in PNMF. By using the Jeffreys prior as the model prior, we have made our algorithm totally free of human tuning of algorithm parameters. Through experiments on various synthetic and real-world datasets for feature extraction and clustering, ARDPNMF demonstrates its effectiveness in model selection for PNMF. Moreover, our algorithm readily extends to other dissimilarity measures, such as the α- or β-divergences [2]. Our method could, however, be sensitive to the initialization of the factorizing matrix in some cases, which we plan to address in future work for a more robust estimate of the rank.

References

1. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401 (1999) 788–791
2. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis. John Wiley (2009)
3. Dhillon, I.S., Sra, S.: Generalized nonnegative matrix approximations with Bregman divergences. In: Advances in Neural Information Processing Systems. Volume 18. (2006) 283–290
4. Choi, S.: Algorithms for orthogonal nonnegative matrix factorization. In: Proceedings of the IEEE International Joint Conference on Neural Networks. (2008) 1828–1832
5. Ding, C., Li, T., Jordan, M.I.: Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(1) (2010) 45–55
6. Yuan, Z., Oja, E.: Projective nonnegative matrix factorization for image compression and feature extraction. In: Proc. of the 14th Scandinavian Conference on Image Analysis (SCIA 2005), Joensuu, Finland (June 2005) 333–342
7. Yang, Z., Yuan, Z., Laaksonen, J.: Projective non-negative matrix factorization with applications to facial image processing. International Journal of Pattern Recognition and Artificial Intelligence 21(8) (2007) 1353–1362
8. Yang, Z., Oja, E.: Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks (2010) In press.
9. MacKay, D.J.C.: Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems 6(3) (1995) 469–505
10. Bishop, C.M.: Bayesian PCA. In: Advances in Neural Information Processing Systems. (1999) 382–388
11. Tipping, M.E.: Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1 (2001) 211–244
12. Tan, V.Y.F., Févotte, C.: Automatic relevance determination in nonnegative matrix factorization. In: Proceedings of the 2009 Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS'09). (2009)
13. Figueiredo, M.A.: Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9) (2003) 1150–1159
14. Donoho, D., Stodden, V.: When does non-negative matrix factorization give a correct decomposition into parts? In: Advances in Neural Information Processing Systems 16. (2003) 1141–1148
15. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(10) (2000) 1090–1104
