
Projected Barzilai-Borwein Methods for Non-negative Matrix Factorization

Chen-Tse Tsai

Dept. of Computer Science National Taiwan University,

Taipei 106, Taiwan r98922028@csie.ntu.edu.tw

Chih-Jen Lin

Dept. of Computer Science National Taiwan University,

Taipei 106, Taiwan cjlin@csie.ntu.edu.tw

Abstract. Non-negative matrix factorization (NMF) is a useful dimension reduction technique. Currently, the most effective way to minimize NMF optimization problems is by alternately solving non-negative least squares sub-problems. Some recent studies have shown that projected Barzilai-Borwein methods are very efficient for solving each sub-problem. In this article, we study variants of the projected Barzilai-Borwein methods and discuss some useful implementation techniques. We provide an efficient implementation that replaces our popular NMF code based on a projected gradient method.

1 Introduction

Non-negative matrix factorization (NMF) (Paatero and Tapper, 1994; Lee and Seung, 1999) is a useful dimension reduction technique. Given an $n \times m$ data matrix $V$ with $V_{ij} \ge 0$ and a pre-specified positive integer $r < \min(n, m)$, NMF finds two non-negative matrices $W \in \mathbb{R}^{n \times r}$ and $H \in \mathbb{R}^{r \times m}$ such that

$$V \approx WH.$$

If each column of $V$ represents an object, NMF approximates it by a linear combination of $r$ "basis" columns in $W$.

A standard approach to find $W$ and $H$ is by minimizing the difference between $V$ and $WH$:

$$\min_{W,H} \; f(W,H) \equiv \frac{1}{2}\|V - WH\|_F^2 \quad \text{subject to } W_{ia} \ge 0,\ H_{bj} \ge 0,\ \forall i, a, b, j, \qquad (1)$$

where $\|\cdot\|_F$ is the Frobenius norm.

Most existing methods solve (1) by an alternating non-negative least-squares approach. At the $k$th iteration, $W$ is updated while $H$ is fixed, and then $H$ is updated while $W$ is fixed:

$$W^{k+1} = \text{an exact or approximate solution of } \min_{W \ge 0} f(W, H^k), \qquad (2)$$
$$H^{k+1} = \text{an exact or approximate solution of } \min_{H \ge 0} f(W^{k+1}, H). \qquad (3)$$

That is, at each iteration we solve two non-negative least squares sub-problems.
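As an illustration of this alternating framework, a minimal MATLAB sketch (our own, not the released code) of the outer loop is given below; the handle nnls_solve is a hypothetical placeholder for any routine that approximately solves a non-negative least squares sub-problem.

function [W, H] = nmf_alternating(V, W, H, nnls_solve, maxiter)
% Outer loop of the alternating scheme (2)-(3) (a sketch, our own names).
% nnls_solve(V, W, Hinit) is a function handle that approximately solves
%     min_{H >= 0} 0.5 * || V - W*H ||_F^2
% starting from Hinit, e.g., a projected gradient or Barzilai-Borwein solver.
  for k = 1:maxiter
    % Sub-problem (2): update W with H fixed; note that
    % min_{W>=0} ||V - W*H||_F equals min_{W'>=0} ||V' - H'*W'||_F.
    W = nnls_solve(V', H', W')';
    % Sub-problem (3): update H with W fixed.
    H = nnls_solve(V, W, H);
  end
end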

Lee and Seung (2001) consider a very rough approximation so that (2)-(3) become simple and effective multiplicative update rules. Lin (2007) proposes solving the sub-problems more accurately by a projected gradient method, whose convergence is shown to be faster than that of Lee and Seung. Lin's method takes the gradient direction and conducts a projected line search to ensure non-negativity and convergence. Recently, some new studies (e.g., Han et al. (2009); Bonettini (2011); Zdunek and Cichocki (2008)) have shown that applying Barzilai-Borwein step lengths to the gradient direction converges faster than Lin's projected gradient method.

The above-mentioned methods are popular because they can be easily implemented by simple matrix operations. Instead of using languages such as C, these methods can be easily implemented by tools like MATLAB without sacrificing running time. Many other methods have been applied to solve sub-problem (2) or (3). For example, Kim et al. (2008) propose a quasi-Newton method, while Kim and Park (2008a) and Kim and Park (2008b) consider active-set methods. However, these methods are usually more sophisticated and may require additional assumptions. While an implementation using MATLAB is still possible, one must be careful to avoid slow nested loops in such an environment.

For example, Kim and Park (2008b) assume that $W^k$ and $H^k$ have full column and row rank, respectively.


While the projected Barzilai-Borwein method is promising for NMF, it has many possible variants. Further, some subtle implementation issues may significantly affect the performance. This article aims to give a comprehensive study of the projected Barzilai-Borwein method. We will propose an efficient and easy-to-implement setting.

This paper is organized as follows. Section 2 discusses projected gradient methods and projected Barzilai-Borwein methods for solving NMF problems. Section 3 discusses algorithmic and implementation issues in detail. Experiments on image and text data are presented in Section 4. Finally, Section 5 concludes this work.

2 Projected Gradient and Projected Barzilai-Borwein Methods

The two least-squares sub-problems in (2) and (3) are bound-constrained optimization problems. Both can be written in the following general form:

$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to } l_i \le x_i \le u_i,\ i = 1, \dots, n, \qquad (4)$$

where $f: \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable convex function, and $l$ and $u$ are lower and upper bounds, respectively. A commonly used method to minimize (4) is by iteratively finding a search direction and projecting a new point back to the feasible region. Algorithm 1 describes a generalized projected method, in which we use the following projection

$$P[x_i] = \begin{cases} x_i & \text{if } l_i < x_i < u_i,\\ u_i & \text{if } x_i \ge u_i,\\ l_i & \text{if } x_i \le l_i, \end{cases}$$

to map a point back to the bounded feasible region. In Algorithm 1, a line search procedure is usually needed for ensuring the convergence.
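In MATLAB, this projection is a single element-wise operation. A minimal sketch (our own, with hypothetical names) for general bounds and for the NMF sub-problems, where the only constraint is non-negativity:

% Element-wise projection onto the box [l, u]; l and u may be scalars or
% arrays of the same size as X.
proj_box = @(X, l, u) min(max(X, l), u);

% The NMF sub-problems have only the lower bound 0, so the projection is
proj_nonneg = @(X) max(X, 0);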

2.1 Slow Convergence of Projected Gradient Methods

If the negative gradient direction is taken, then Algorithm 1 becomes the projected gradient method (Bertsekas, 1976). That is,

$$d^k = -\nabla f(x^k). \qquad (6)$$


Algorithm 1 A generalized projected method for bound-constrained minimization

1. Initialize any feasible $x^1$.
2. For $k = 1, 2, \dots$
   (a) Find a direction $d^k$.
   (b) Conduct a line search to find a $\lambda_k$ such that
       $$x^{k+1} = P[x^k + \lambda_k d^k] \qquad (5)$$
       satisfies a certain sufficient decrease condition.

This method has been used for NMF in Lin (2007), which employs a backtracking line search by trying $\lambda_k = 1, \beta, \beta^2, \dots$, where $0 < \beta < 1$, until

$$f(x^{k+1}) \le f(x^k) + \sigma \nabla f(x^k)^T (x^{k+1} - x^k). \qquad (7)$$

The inequality in (7) with $0 < \sigma < 1$ is often called the sufficient decrease condition in optimization. This projected gradient procedure is closely related to a steepest descent approach that takes

$$\lambda_k = \arg\min_{\lambda} f(P[x^k + \lambda d^k]). \qquad (8)$$

Note that (8) is often expensive because $P[x^k + \lambda d^k]$ is a piecewise function of $\lambda$. Thus, we use a backtracking line search instead. The convergence of projected gradient methods can be seen in, for example, Calamai and Moré (1987).

It is well known in optimization that a steepest descent approach suffers from slow local convergence. We can replace the gradient direction in (8) with quasi-Newton or Newton directions, but the implementation is more complicated. The Barzilai-Borwein method in the next section still considers the gradient direction, but finds a suitable step length for faster convergence.
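A minimal MATLAB sketch of one projected gradient iteration with the backtracking line search (7) is given below (our own illustration in the spirit of this description, not the code of Lin (2007)); f and gradf are assumed to be function handles and l, u the bound vectors.

function x = projgrad_step(f, gradf, x, l, u, beta, sigma)
% One iteration of the projected gradient method (6) with the backtracking
% line search that enforces the sufficient decrease condition (7).
  g  = gradf(x);
  fx = f(x);
  lambda = 1;
  while true
    xnew = min(max(x - lambda*g, l), u);        % P[x + lambda*d], d = -grad f
    if f(xnew) <= fx + sigma * (g(:)' * (xnew(:) - x(:)))   % condition (7)
      break;
    end
    lambda = lambda * beta;                     % try 1, beta, beta^2, ...
  end
  x = xnew;
end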

2.2 Variants of Projected Barzilai-Borwein Methods

Barzilai and Borwein (1988) proposed using

$$d^k = -\alpha_k \nabla f(x^k),$$

where

$$\alpha_k = \frac{s_k^T s_k}{s_k^T y_k} \qquad (9)$$

or

$$\alpha_k = \frac{s_k^T y_k}{y_k^T y_k}, \qquad (10)$$

and

$$s_k = x^k - x^{k-1}, \qquad y_k = \nabla f(x^k) - \nabla f(x^{k-1}).$$

Barzilai and Borwein consider only unconstrained problems. The above $\alpha_k$ values are obtained using the secant condition of quasi-Newton methods. The method is designed to do a direct update without line search, but convergence holds only for certain cases. Thus, in general a line search procedure is still needed. For the convergence, we restrict $\alpha_k \in [\alpha_{\min}, \alpha_{\max}]$, where $\alpha_{\min}$ and $\alpha_{\max}$ are two pre-specified positive values. Many works have either studied the theoretical properties of the Barzilai-Borwein method or applied it to solve real problems; see, for example, Fletcher (2005) and references therein.
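In code, both step lengths are two inner products. A minimal MATLAB sketch (our own, with hypothetical names) that also enforces the restriction to $[\alpha_{\min}, \alpha_{\max}]$:

function alpha = bb_step(s, y, rule, alpha_min, alpha_max)
% Barzilai-Borwein step lengths from s = x_k - x_{k-1} and
% y = grad f(x_k) - grad f(x_{k-1}); s and y may be vectors or matrices.
  sy = s(:)' * y(:);
  if strcmp(rule, 'bb1')
    alpha = (s(:)' * s(:)) / sy;     % rule (9)
  else
    alpha = sy / (y(:)' * y(:));     % rule (10)
  end
  alpha = min(max(alpha, alpha_min), alpha_max);   % safeguard for convergence
end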

Extensions of the Barzilai-Borwein method for bound-constrained problems have been well studied. A detailed report for quadratic bound-constrained minimization is in Dai and Fletcher (2005). As projection is often needed in the line search procedure, we refer to the method as the projected Barzilai-Borwein method. Many variants of the projected Barzilai-Borwein methods are available, and they differ in the following aspects:

• The selection of $\alpha_k$. We can use (9), (10), or others.

• The setting of the direction $d^k$; see the two possibilities in (11) and (12).

• The line search procedure for finding the step size $\lambda_k$.

We summarize some conclusions that have emerged from recent studies:

1. An alternating use of (9) and (10) is helpful. Grippo and Sciandrone (2002) proposed using

$$\alpha_k = \begin{cases} \dfrac{s_k^T s_k}{s_k^T y_k} & \text{if } k \text{ is odd},\\ \dfrac{s_k^T y_k}{y_k^T y_k} & \text{if } k \text{ is even}. \end{cases}$$

They showed that this setting is more efficient than using only (9) or (10).


For strictly convex quadratic objective functions, Serafini et al. (2005) proved that the $\alpha$ value by (9) is larger than that by (10). Therefore, a step by (10) may be too conservative, while (9) may result in too many line search steps. They thus design certain mechanisms to decide when to use which one.

2. Using

$$d^k = P[x^k - \alpha_k \nabla f(x^k)] - x^k \qquad (11)$$

is a competitive alternative to

$$d^k = (x^k - \alpha_k \nabla f(x^k)) - x^k = -\alpha_k \nabla f(x^k). \qquad (12)$$

If (12) is used, $P[x^k + \lambda d^k]$ is a piecewise linear function of $\lambda$. At each step of the backtracking line search, not only is the projection operation $P[x^k + \lambda d^k]$ needed, but the function value $f(P[x^k + \lambda d^k])$ must also be calculated. Birgin et al. (2000) propose using (11) instead. Because $x^k + d^k$ is then feasible, $P[x^k + \lambda d^k] = x^k + \lambda d^k$ is a linear function of $\lambda \in [0, 1]$. For a quadratic problem like (2) or (3),

$$f(x^k + \lambda d^k) = f(x^k) + \lambda \nabla f(x^k)^T d^k + \frac{\lambda^2}{2} {d^k}^T \nabla^2 f(x^k) d^k \qquad (13)$$

can be easily obtained because $\nabla f(x^k)^T d^k$ and ${d^k}^T \nabla^2 f(x^k) d^k$ are pre-calculated before trying various $\lambda$ (a MATLAB sketch of this line search appears after this list). Birgin et al. (2000) reported that for general bound-constrained problems, if the Barzilai-Borwein step size in (9) is taken, using (11) yields results comparable to (12). Interestingly, traditional projected gradient methods (e.g., Bertsekas (1999)) always use (12) with $\alpha_k = 1$. This is a subtle difference between projected gradient and Barzilai-Borwein methods because it seems that the former can use only (12), while the latter can use both. We suspect that because $\alpha_k = 1$ is a crude estimate of the step size, if (11) is used for projected gradient methods, $d^k$ may not be a good direction regardless of $\lambda$. In contrast, $\alpha_k$ from the Barzilai-Borwein method may already give a reasonable direction in (11), so we can conduct the line search along that direction.

Among existing projected Barzilai-Borwein methods for NMF, Bonettini (2011) and Zdunek and Cichocki (2008) have used (11). Han et al. (2009) have tried both and reported that using (11) is slightly better than (12) on some randomly generated data. In Section 4, we will conduct some experiments to compare the two approaches of using (11) and (12).

3. A non-monotone line search seems to be better than a monotone one. As the step sizes given by the Barzilai-Borwein method are considered useful, we should avoid modifying them in the line search procedure, which is employed only for theoretical convergence. Many works (e.g., Grippo et al. (1986); Raydan (1997); Birgin et al. (2000)) have proposed replacing $f(x^k)$ on the right-hand side of (7) with a larger value. For example, Raydan (1997) and Grippo et al. (1986) used the following value for the unconstrained and bound-constrained situations, respectively:

$$\max\{f(x^{k-i}) \mid 0 \le i \le \min(k, M-1)\}, \qquad (14)$$

where $M$ is a pre-specified positive integer. That is, $f(x^k)$ in (7) is replaced by the maximal function value of the previous $M$ iterations.

Dai and Fletcher (2005) point out that if $M$ is too small, then the reference value on the right-hand side of (7) may not be large enough. Therefore, they propose an adaptive setting. Let $f_r$ be the reference function value on the right-hand side of (7) and $f_{best}$ the best function value over all past iterations. The value $f_r$ is reduced to a candidate function value $f_c$ if the method fails to improve $f_{best}$ in at most $L$ iterations, where $L$ is a small predefined positive integer and $f_c$ is the maximal function value since $f_{best}$ was found. The strategy of updating $f_r$ is described in Algorithm 2. Dai and Fletcher (2005) initialize $f_r = \infty$, $f_{best} = f_c = f(x^1)$, and $l = 0$. Different from (14), this choice of $f_r$ allows $f(x^k) \ge f(x^1)$ in early iterations; thus, the step size is less frequently modified by the line search.

4. For NMF, solving the non-negative least squares problem (3) as a whole seems to be better than separating (3) into $m$ independent problems. The $j$th problem is

$$\min_{H_{:,j}} \; \frac{1}{2}\|V_{:,j} - W H_{:,j}\|^2 \quad \text{subject to } H_{bj} \ge 0,\ \forall b, \qquad (15)$$


Algorithm 2 Updating the reference function value $f_r$ in the adaptive non-monotone line search (Dai and Fletcher, 2005)

If $f(x^{k+1}) < f_{best}$
    $f_{best} \leftarrow f(x^{k+1})$, $f_c \leftarrow f(x^{k+1})$, $l \leftarrow 0$
Else
    $f_c \leftarrow \max\{f_c, f(x^{k+1})\}$, $l \leftarrow l + 1$
    If $l = L$
        $f_r \leftarrow f_c$, $f_c \leftarrow f(x^{k+1})$, $l \leftarrow 0$

where $V_{:,j} \in \mathbb{R}^n$ and $H_{:,j} \in \mathbb{R}^r$ are the $j$th columns of $V$ and $H$, respectively. A possible advantage of this approach is that we can tune the step size for each problem (15). However, a matrix-based implementation is slightly more complicated, especially if line search is conducted. Zdunek and Cichocki (2008, Algorithm 4) used a projected Barzilai-Borwein method to solve the $m$ independent problems (15) without line search. Han et al. (2009, Section 5) reported that on some randomly generated data, solving (3) as a whole is much faster. The same result was obtained in the extensive comparison of Bonettini (2011). We have also conducted some experiments, and the results show that solving (3) as $m$ problems is not better. We give the following possible explanation. By obtaining a step size for each problem (15), we get a better step size to reduce the function value. The algorithm is then closer to the steepest descent algorithm and therefore may suffer from slow convergence. Interestingly, the suggestion here to stay away from the steepest descent method is also the motivation of Barzilai-Borwein methods.
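As mentioned in item 2, the inner products needed by (13) make the backtracking essentially free. The MATLAB sketch below (our own illustration for the sub-problem (17), with $P[\cdot] = \max(\cdot, 0)$ and hypothetical names; not the released code) performs one such line search given the current gradient G, the pre-computed W'W, the current objective value, and a reference value for the (possibly non-monotone) decrease test.

function [Hnew, fnew, lambda] = quad_linesearch(H, G, WtW, fval, fref, alpha, beta, sigma)
% Line search along the feasible direction (11) for sub-problem (17),
% using the quadratic expansion (13); each trial of lambda costs O(1).
  d  = max(H - alpha*G, 0) - H;      % direction (11)
  g  = G(:)' * d(:);                 % <grad f(H), d>, computed once
  Wd = WtW * d;
  h  = d(:)' * Wd(:);                % <d, (W'W) d>,   computed once
  lambda = 1;
  fnew = fval + lambda*g + 0.5*lambda^2*h;             % formula (13)
  while fnew > fref + sigma*lambda*g                    % decrease test
    lambda = lambda * beta;
    fnew = fval + lambda*g + 0.5*lambda^2*h;
  end
  Hnew = H + lambda*d;
end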

2.3 The Proposed Procedure

For the selection of $\alpha_k$ in a Barzilai-Borwein method, we empirically observe that using (9) only is usually better than using (10) only. An exception occurs when the projected gradient norm $\|\nabla^P f(x)\|$ at the current iteration is larger than the previous one, where the projected gradient is defined as

$$\nabla^P f(x)_i = \begin{cases} \nabla f(x)_i & \text{if } l_i < x_i < u_i,\\ \min(0, \nabla f(x)_i) & \text{if } x_i = l_i,\\ \max(0, \nabla f(x)_i) & \text{if } x_i = u_i. \end{cases}$$

(Bonettini (2011) attributes the inferior results of Zdunek and Cichocki (2008) to their using only the first Barzilai-Borwein rule (9). However, Han et al. (2009) use only (9) as well, so we think another reason is that Zdunek and Cichocki (2008) solve $m$ independent problems.)

According to the optimality condition, $x$ is a stationary point of (4) if and only if $\|\nabla^P f(x)\| = 0$.

We argue that when the projected gradient norm increases, a conservative step size like (10) is more suitable than (9). First, a larger projected gradient norm may imply a longer direction $d^k$; in this case, a smaller step size is suitable, and taking (9) for $\alpha_k$ may cause many line search steps. Second, because we hope $\nabla^P f(x^k)$ gradually approaches zero, an increase of $\|\nabla^P f(x^k)\|$ means that the algorithm is not running well. Thus, we should update $x^k$ more carefully by taking a conservative step size. Therefore, we use (9) to obtain $\alpha_k$ when the projected gradient norm decreases and switch to (10) once the projected gradient norm rises.

Regarding the direction used in line search, following the earlier discussion, we use (11) instead of (12). To avoid reducing Barzilai-Borwein step sizes in line search, we use the non-monotone line search strategy proposed by Dai and Fletcher (2005). In our experiments, we observe that in almost all cases, λ = 1 succeeds and no line search steps are needed. The proposed procedure to solve (4) is summarized in Algorithm 3.
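A minimal MATLAB sketch (our own, with hypothetical names) of the projected gradient for the NMF sub-problems (all lower bounds zero, no upper bounds) and of the switching rule just described:

function [alpha, pgnorm] = bb_alpha_switch(X, G, pgnorm_prev, s, y, amin, amax)
% X: current (feasible) iterate, G = grad f(X), pgnorm_prev: previous
% ||grad^P f||_F, s and y: the Barzilai-Borwein differences.
  PG = G;
  PG(X <= 0) = min(G(X <= 0), 0);        % projected gradient at the bound
  pgnorm = norm(PG, 'fro');
  sy = s(:)' * y(:);
  if pgnorm > pgnorm_prev                % norm rose: conservative rule (10)
    alpha = sy / (y(:)' * y(:));
  else                                   % norm fell: rule (9)
    alpha = (s(:)' * s(:)) / sy;
  end
  alpha = min(max(alpha, amin), amax);
end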

3 Algorithmic and Implementation Issues

The overall procedure of alternately solving sub-problems is described in Algorithm 4. Because solving each sub-problem also requires an iterative procedure, we have two levels of iterations. We refer to the procedure of updating $(W^k, H^k)$ to $(W^{k+1}, H^{k+1})$ as an outer iteration, and to each step in solving a sub-problem as an inner iteration. At each outer iteration, Algorithm 5, which is a specific implementation of Algorithm 3, is used to solve sub-problems (2) and (3).

In this section, we discuss stopping conditions and other implementation issues.

Note that in Section 2.2, we mentioned that the step size given by (9) is larger than that given by (10).


Algorithm 3 A projected Barzilai-Borwein method

1. Given $0 < \beta < 1$, $0 < \sigma < 1$, an integer $L$, and $\alpha_{\max} > \alpha_{\min} > 0$. Initialize any feasible $x^1$. Set $f_{best} = f_c = f(x^1)$, $f_r = \infty$, and $\alpha_1 = 1$.
2. For $k = 1, 2, \dots$
   (a) $d^k \leftarrow P[x^k - \alpha_k \nabla f(x^k)] - x^k$
   (b) $\lambda \leftarrow 1$, $x^{k+1} = x^k + \lambda d^k$
       While $f(x^{k+1}) > f_r + \sigma \lambda \nabla f(x^k)^T d^k$
           $\lambda \leftarrow \lambda \cdot \beta$
           $x^{k+1} = x^k + \lambda d^k$
   (c) If $\|\nabla^P f(x^{k+1})\| > \|\nabla^P f(x^k)\|$
           Get $\alpha_{k+1}$ by (10)
       Else
           Get $\alpha_{k+1}$ by (9)
   (d) $\alpha_{k+1} = \min(\max(\alpha_{k+1}, \alpha_{\min}), \alpha_{\max})$
   (e) Update $f_r$ by Algorithm 2

3.1 Stopping Conditions

We adopt the stopping condition suggested by Lin (2007):

$$\|\nabla^P f(W^k, H^k)\|_F \le \epsilon\, \|\nabla^P f(W^1, H^1)\|_F. \qquad (16)$$

Note that although the gradient is a vector, we represent it as a matrix. For approximately solving the sub-problems (2) and (3), the stopping conditions are

$$\|\nabla^P_W f(W^{k+1}, H^k)\|_F \le \bar{\epsilon}_W \quad \text{and} \quad \|\nabla^P_H f(W^{k+1}, H^{k+1})\|_F \le \bar{\epsilon}_H,$$

respectively, where we set

$$\bar{\epsilon}_W = \bar{\epsilon}_H \equiv \max(10^{-3}, \epsilon)\, \|\nabla^P f(W^1, H^1)\|_F$$

in the beginning and $\epsilon$ is the tolerance in (16). If the number of inner iterations for solving (2) is zero, we decrease the stopping tolerance by

$$\bar{\epsilon}_W \leftarrow \bar{\epsilon}_W / 10.$$

For the sub-problem (3), $\bar{\epsilon}_H$ is reduced in a similar way.
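A small MATLAB sketch (our own, with hypothetical names) of this stopping and tolerance logic; pg_norm0 denotes $\|\nabla^P f(W^1, H^1)\|_F$ and eps_out the tolerance $\epsilon$ in (16):

function [stop_outer, eps_W, eps_H] = check_stopping(pg_norm, pg_norm0, ...
                                eps_out, eps_W, eps_H, inner_W, inner_H)
% pg_norm: current ||grad^P f(W^k,H^k)||_F.  inner_W / inner_H: numbers of
% inner iterations just spent on sub-problems (2) and (3).
  stop_outer = (pg_norm <= eps_out * pg_norm0);     % condition (16)
  if inner_W == 0, eps_W = eps_W / 10; end          % tighten sub-problem
  if inner_H == 0, eps_H = eps_H / 10; end          % tolerances if needed
end

Both tolerances start at max(1e-3, eps_out) * pg_norm0, as stated above.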


Algorithm 4 A projected Barzilai-Borwein method to solve the NMF problem (1)

Input: $V$, $W^1$, and $H^1$.
1. Initialize $f(W^1, H^1) = \frac{1}{2}\|V - W^1 H^1\|_F^2$, $\alpha^W_1 \leftarrow 1$, and $\alpha^H_1 \leftarrow 1$.
2. For $k = 1, 2, \dots$ (outer iterations)
   (a) If the stopping condition (16) is satisfied, terminate the procedure.
   (b) Run Algorithm 5 with
       input: $V^T$, $(H^k)^T$, $(W^k)^T$, $\alpha^W_k$, and $f(W^k, H^k)$;
       output: $W^{k+1}$, $\alpha^W_{k+1}$, and $f(W^{k+1}, H^k)$.
   (c) Run Algorithm 5 with
       input: $V$, $W^{k+1}$, $H^k$, $\alpha^H_k$, and $f(W^{k+1}, H^k)$;
       output: $H^{k+1}$, $\alpha^H_{k+1}$, and $f(W^{k+1}, H^{k+1})$.
Output: $W^k$ and $H^k$.

3.2 Implementation Details

We discuss Algorithm 5 for solving (3) in detail, where some tricks effectively reduce the computational time. Assume $W$ is fixed and $H$ is updated; the objective function of (3) can be rewritten as

$$\min_{H} \; \bar{f}(H) \equiv \frac{1}{2}\|V - WH\|_F^2 \quad \text{subject to } H_{bj} \ge 0,\ \forall b, j, \qquad (17)$$

where $V$ and $W$ are constant matrices. The objective function of (2) can be represented in a similar form.

At the first inner iteration of solving a sub-problem, the lack of information prevents us from using a Barzilai-Borwein step size. We found that a good initial step size may lead to faster convergence. Therefore, we pass the step size from the previous outer iteration to the next sub-problem as the initial step size. Notice that the step size of $\min_{W \ge 0} f(W, H^k)$ is passed to the sub-problem $\min_{W \ge 0} f(W, H^{k+1})$ rather than to $\min_{H \ge 0} f(W^{k+1}, H)$; see $\alpha^W_k$ and $\alpha^H_k$ in steps (b) and (c) of Algorithm 4, respectively.

At each iteration of Algorithm 5, the computational bottleneck is the line search and updating the function value.


Algorithm 5 Applying Algorithm 3 to solve the sub-problem (17). Note that $\langle \cdot, \cdot \rangle$ indicates the sum of the component-wise product of two matrices.

Input: $V$, $W$, $H^1$, $\alpha_1$, $\bar{f}(H^1)$.
1. Set $\beta = 0.1$, $\sigma = 0.01$, $L = 10$, $\alpha_{\min} = 10^{-30}$, $\alpha_{\max} = 10^{30}$, $f_{best} = f_c = \bar{f}(H^1)$, and $f_r = \infty$.
2. $\nabla\bar{f}(H^1) = (W^TW)H^1 - W^TV$
3. For $t = 1, 2, \dots$ (inner iterations)
   (a) If a stopping condition is satisfied, terminate the procedure.
   (b) $d \leftarrow P[H^t - \alpha_t \nabla\bar{f}(H^t)] - H^t$
   (c) $g \leftarrow \langle \nabla\bar{f}(H^t), d \rangle$
       $y \leftarrow (W^TW)d$
       $h \leftarrow \langle y, d \rangle$
   (d) $\lambda \leftarrow 1$
       $\bar{f}(H^{t+1}) = \bar{f}(H^t) + \lambda g + 0.5\lambda^2 h$
       While $\bar{f}(H^{t+1}) > f_r + \sigma\lambda g$
           $\lambda \leftarrow \lambda \cdot \beta$
           $\bar{f}(H^{t+1}) = \bar{f}(H^t) + \lambda g + 0.5\lambda^2 h$
   (e) $d \leftarrow \lambda d$, $y \leftarrow \lambda y$, $h \leftarrow \lambda^2 h$
       $H^{t+1} = H^t + d$
       $\nabla\bar{f}(H^{t+1}) = \nabla\bar{f}(H^t) + y$
   (f) If $\|\nabla^P\bar{f}(H^{t+1})\| > \|\nabla^P\bar{f}(H^t)\|$
           $\alpha_{t+1} \leftarrow \min(\max(h/\langle y, y\rangle, \alpha_{\min}), \alpha_{\max})$
       Else
           $\alpha_{t+1} \leftarrow \min(\max(\langle d, d\rangle/h, \alpha_{\min}), \alpha_{\max})$
   (g) If $\bar{f}(H^{t+1}) < f_{best}$
           $f_{best} \leftarrow \bar{f}(H^{t+1})$, $f_c \leftarrow \bar{f}(H^{t+1})$, $l \leftarrow 0$
       Else
           $f_c \leftarrow \max\{f_c, \bar{f}(H^{t+1})\}$, $l \leftarrow l + 1$
           If $l = L$
               $f_r \leftarrow f_c$, $f_c \leftarrow \bar{f}(H^{t+1})$, $l \leftarrow 0$

Output: $H^{t+1}$, $\alpha_{t+1}$, $\bar{f}(H^{t+1})$.


Because the direction in (11) is used, following (13), the line search is very cheap if $\langle \nabla\bar{f}(H^t), d \rangle$ and $\langle d, (W^TW)d \rangle$ are available, where $d$ is defined in step (b) and $\langle \cdot, \cdot \rangle$ is the sum of the component-wise product of two matrices. The cost of $(W^TW)d$ is $O(mr^2)$ if $W^TW$ has been pre-calculated at the beginning of Algorithm 5. For the gradient $\nabla\bar{f}(H^t)$ needed in $\langle \nabla\bar{f}(H^t), d \rangle$, if $W^TW$ and $W^TV$ have been pre-calculated as constants independent of $H$, then

$$\nabla\bar{f}(H^t) = (W^TW)H^t - W^TV$$

needs $O(mr^2)$ operations. Because $(W^TW)d$ has already been obtained for updating function values, using

$$\nabla\bar{f}(H^t + d) = \nabla\bar{f}(H^t) + (W^TW)d \qquad (18)$$

in step (e), we can reduce the cost to $O(mr)$.

The preceding discussion has shown how to update the function value. For the initial function value, because the two sub-problems (2) and (3) have the same form of objective function, we can reuse the last function value of the previous sub-problem. For example, in Algorithm 4, the $f(W^{k+1}, H^k)$ obtained at step (b) is directly passed to step (c).

To further save running time, we find that some values obtained for calculating function values in the line search can be reused to update the gradient and the Barzilai-Borwein step size; see steps (c), (e), and (f) of Algorithm 5. In summary, the complexity of using Algorithm 5 to solve a sub-problem is

$$O(nmr) + \#\text{iterations} \times O(mr^2),$$

where $O(nmr)$ is for calculating $W^TV$ in step 2.§

The convergence of Algorithm 5 easily follows from the proof in Dai and Fletcher (2005) because $\bar{f}(H)$ is twice continuously differentiable in $H$ and is bounded below. For Algorithm 4, if the sub-problems are exactly solved, the convergence proof has been established; see the discussion in Section 2 of Lin (2007), which cites the proof of Grippo and Sciandrone (2000). However, we do not yet have a convergence proof if the sub-problems are approximately solved.

§ We assume that the number of line search steps is no more than a constant.
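For completeness, the following MATLAB sketch (written by us from the description above; it is not the released nmf.m, and the stopping test is simplified to a tolerance on $\|\nabla^P\bar{f}\|_F$) assembles the pieces of Algorithm 5: the pre-computed $W^TW$ and $W^TV$, the direction (11), the cheap line search (13), the gradient update (18), the step-length switching rule of Section 2.3, and the reference-value update of Algorithm 2.

function [H, alpha, fval] = nnls_bb(V, W, H, alpha, fval, tol, maxiter)
% A sketch of Algorithm 5 (our own code, not the released implementation).
% It approximately solves
%     min_{H >= 0} 0.5 * || V - W*H ||_F^2
% by projected Barzilai-Borwein steps with the adaptive non-monotone line
% search of Dai and Fletcher (2005).
% Inputs : initial H, initial step length alpha, fval = 0.5*norm(V-W*H,'fro')^2,
%          tol on ||grad^P||_F, and an inner-iteration cap.
% Outputs: updated H, the last step length (passed to the next outer
%          iteration as in Algorithm 4), and the final objective value.
  beta = 0.1;  sigma = 0.01;  L = 10;
  amin = 1e-30;  amax = 1e30;
  fbest = fval;  fc = fval;  fr = inf;  l = 0;

  WtW = W' * W;                          % computed once, O(n r^2)
  WtV = W' * V;                          % computed once, O(n m r)
  G   = WtW * H - WtV;                   % gradient of 0.5*||V - W*H||_F^2
  PG  = G;  PG(H <= 0) = min(G(H <= 0), 0);
  pgnorm = norm(PG, 'fro');

  for t = 1:maxiter
    if pgnorm <= tol, break; end

    % direction (11) and the two scalars needed by (13)
    D  = max(H - alpha*G, 0) - H;
    g  = G(:)' * D(:);                   % <grad, d>
    Y  = WtW * D;                        % reused below, O(m r^2)
    h  = D(:)' * Y(:);                   % <d, (W'W) d>

    % adaptive non-monotone backtracking line search using (13)
    lambda = 1;
    fnew = fval + lambda*g + 0.5*lambda^2*h;
    while fnew > fr + sigma*lambda*g
      lambda = lambda * beta;
      fnew = fval + lambda*g + 0.5*lambda^2*h;
    end

    % update iterate and gradient via the trick (18)
    D = lambda*D;  Y = lambda*Y;  h = lambda^2*h;
    H = H + D;
    G = G + Y;
    fval = fnew;

    % Barzilai-Borwein step for the next iteration (switching rule of 2.3);
    % after the scaling above, s = D and s'y = h
    PG = G;  PG(H <= 0) = min(G(H <= 0), 0);
    pgnorm_new = norm(PG, 'fro');
    if pgnorm_new > pgnorm
      alpha = h / (Y(:)' * Y(:));        % conservative rule (10)
    else
      alpha = (D(:)' * D(:)) / h;        % rule (9)
    end
    alpha = min(max(alpha, amin), amax);
    pgnorm = pgnorm_new;

    % reference value f_r update (Algorithm 2)
    if fval < fbest
      fbest = fval;  fc = fval;  l = 0;
    else
      fc = max(fc, fval);  l = l + 1;
      if l == L,  fr = fc;  fc = fval;  l = 0;  end
    end
  end
end

In the spirit of Algorithm 4, this routine would be called alternately on the transposed problem (to update W) and on the original problem (to update H), passing along the step length and the function value as described above.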


4 Experiments

We compare the following five methods:

1. pbba: the proposed projected Barzilai-Borwein method with adaptive non-monotone line search. The implementation details are described in Algorithms 4 and 5.

2. projgrad: the projected gradient method proposed by Lin (2007). The code is available at http://www.csie.ntu.edu.tw/~cjlin/nmf/. This method differs from pbba in the following aspects:

• A projected gradient direction is used.

• The trick to maintain and update the gradient by (18) is not used.

3. blockpivot: an active-set method proposed by Kim and Park (2008b). The code is available at http://www.cc.gatech.edu/~jingu/nmf/nmf_bpas.zip.

4. pbbnls: a projected Barzilai-Borwein method proposed by Han et al. (2009). Their method is different from pbba in the following points:

• The initial $\alpha_1$ of each sub-problem is $1/\|\nabla f(H^1)\|_{\max}$, where $\|\cdot\|_{\max}$ is the largest absolute value in the matrix.

• They only consider (9) as the Barzilai-Borwein step size.

• They use (14) as the reference function value in the non-monotone line search.

• Their implementation is not as optimized as ours in Section 3.2.

The code is available at http://www.math.uconn.edu/~neumann/nmf/PBBNLS2.m.

5. cbgp: a projected Barzilai-Borwein method proposed by Bonettini (2011). Their method is different from pbba in the following points:

• They use the standard sufficient decrease condition in (7) instead of a non-monotone line search procedure.


• They choose the Barzilai-Borwein step size by the following rule:

  If $\alpha^2_{k+1}/\alpha^1_{k+1} \le \tau_k$
      $\alpha_{k+1} \leftarrow \min\{\alpha^2_j \mid j = \max\{1, k+1-M\}, \dots, k+1\}$, $\tau_{k+1} \leftarrow 0.9\,\tau_k$
  Else
      $\alpha_{k+1} \leftarrow \alpha^1_{k+1}$, $\tau_{k+1} \leftarrow 1.1\,\tau_k$,

  where $\alpha^1_{k+1}$ is calculated by (9), $\alpha^2_{k+1}$ is calculated by (10), $M$ is a pre-specified non-negative integer, and $\tau_k \in (0, 1)$. That is, they select the Barzilai-Borwein step size by comparing the ratio $\alpha^2_{k+1}/\alpha^1_{k+1}$ with a threshold $\tau_k$. Changing the threshold $\tau_k$ dynamically can avoid using the same step-size rule in too many consecutive iterations, and makes the choice of the initial $\tau_0$ less important.

The code is not publicly available, so we implement it ourselves. All implementation tricks mentioned in Section 3.2 are incorporated. We follow Bonettini (2011) and set $M = 2$ and $\tau_1 = 0.5$.

We do not consider the implementation by Zdunek and Cichocki (2008) because pbbnls and cbgp have shown better performance in earlier studies. All implementations were written in MATLAB (http://www.mathworks.com). We conduct experiments on a 64-bit machine with an Intel Xeon 2.5GHz CPU and 16GB main memory. Results of using image data and text data are presented in the following subsections. Our code for experiments is available at http://www.csie.ntu.edu.tw/~cjlin/nmf/.

4.1 Image Data

We consider six image problems:

1. CBCL face image database.

http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html.

2. ORL face image database.

http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.


Table 1: Data statistics. The column "nnz" indicates the number of non-zero elements in the matrix V.

(a) Image data

Problem       n       m     r         nnz
CBCL        361   2,429    49     876,869
ORL      10,304     400    25   4,121,478
NAT         288  10,000    72   2,880,000
UMIST    10,304     300    20   3,091,200
YALE     77,760     165    15  12,830,400
ESSEX     9,000     150    40   1,350,000

(b) Text data: RCV1

     n       m    r        nnz
10,374   1,100    6     84,881
17,899   6,524   12    333,626
25,843  10,608   24    634,852
36,786  15,544   43  1,025,997

3. NAT Natural image data set.

http://redwood.berkeley.edu/bruno/sparsenet/.

4. UMIST face image database.

http://www.sheffield.ac.uk/eee/research/iel/research/face.html.

5. YALE face image database.

http://cvc.yale.edu/projects/yalefaces/yalefaces.html.

6. ESSEX face image database.

http://cswww.essex.ac.uk/mv/allfaces/faces94.html.

The settings of the first three problems are the same as those in Hoyer (2004).

The preprocessing of UMIST and YALE is done by the same method as for ORL: first, each image is reshaped as a column vector of the matrix V. Second, each column of V is linearly scaled to values in [0, 255] and every element is divided by 10,000. Thus, n is the size of an image and m is the number of images in the database. The value r is set to the number of people in the database.
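A small MATLAB sketch (our own; imgs is a hypothetical h-by-w-by-m array of grayscale images) of this preprocessing:

function V = preprocess_images(imgs)
% Reshape each image into a column of V, scale each column linearly to
% [0, 255], and divide every element by 10,000, as described above.
  V = reshape(double(imgs), [], size(imgs, 3));
  Vmin = min(V);  Vmax = max(V);                   % per-column range
  V = (V - Vmin) ./ max(Vmax - Vmin, eps) * 255;   % columns scaled to [0, 255]
  V = V / 10000;
end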

The preprocessing of ESSEX is done by the code provided by Bonettini (2011).

We present data statistics in Table 1a.


Table 2: The average of $f(W^*, H^*)$ values using 20 initial points

Problem        pbba   blockpivot      pbbnls    projgrad        cbgp
CBCL        793.519      792.189     792.034     792.346     792.835
ORL          13.601       13.543      26.263      13.612      13.604
NAT      344085.896   344183.559   346773.46  343618.664  343987.146
UMIST         9.239        9.181      12.679       9.258       9.276
YALE         55.197       54.446     102.277      55.167      55.203
ESSEX         2.057        1.874       8.278       2.057       1.990

Since the NMF problem (1) is not convex, different methods may converge to different local minima. Under the same initial point, we compare these methods in terms of the speed of reducing the function value toward each method's own optimum:

$$\frac{f(W^k, H^k) - f(W^*, H^*)}{f(W^1, H^1) - f(W^*, H^*)}. \qquad (19)$$

We obtain $f(W^*, H^*)$ of each method by running it until the strict stopping condition (16) holds for three consecutive iterations. Here, we set $\epsilon$ in (16) to $10^{-7}$. The subtraction of $f(W^*, H^*)$ in the denominator of (19) avoids favoring a method with a larger $f(W^*, H^*)$.

Due to the non-convexity, one method may converge faster than another with one initial point but slower with a different initial point. Therefore, we compare these methods by averaging the results of 20 runs. For each problem, we generate 20 initial $(W^1, H^1)$ by the MATLAB function randn and take the absolute value to make them feasible. We then run each method 20 times to obtain the reference $(W^*, H^*)$ points. Table 2 describes the average $f(W^*, H^*)$ of each method on each problem. We can see that the $f(W^*, H^*)$ values are slightly different, and pbbnls gives much larger values on most problems. To see the convergence of a method, at each outer iteration we average the 20 values of CPU time and relative function reduction in (19). Because the number of iterations depends on the initial point, we only report results of iterations having values in all 20 experiments. Figure 1 presents the results of CPU time versus the relative function value (19).

Clearly, pbba converges faster than the other methods on the CBCL, NAT, and ESSEX data sets. The method cbgp performs almost the same as pbba in early iterations but becomes slower in the end. This slower final convergence is because, in the line search, cbgp enforces the function-value decrease by the condition in (7). We observe some line search steps for cbgp, but throughout the experiments, the adaptive non-monotone line search in pbba always succeeds at $\lambda = 1$. For pbbnls, although it is also a projected Barzilai-Borwein method, its performance is clearly worse than cbgp and pbba. Among all the differences between pbbnls and pbba listed at the beginning of this section, the value of $\alpha_1$ affects performance the most. Although pbbnls applies a non-monotone line search procedure, many line search steps are observed. We have tried setting $\alpha_1$ in pbbnls to the Barzilai-Borwein step size at the end of the previous outer iteration. The number of line search steps is reduced and the performance is much better. See some detailed discussion in Section 4.3.

The active-set method blockpivot outperforms other methods on ORL, UMIST, and YALE data sets. While blockpivot is generally very competitive, surprisingly, it is the slowest on NAT. Later we will conduct further analysis to explain this result.

The method projgrad converges slowly on most problems. Because we use the same experimental framework for pbba and projgrad, and they differ only in the step size, this result clearly indicates the superiority of the Barzilai-Borwein method. However, from the fact that projgrad performs better than pbbnls on ORL, UMIST, and YALE, we see that implementation details are equally important.

Next, we investigate how fast these methods decrease the norm of the projected gradient. Figure 2 shows the running time versus the normalized projected-gradient norm

$$\frac{\|\nabla^P f(W^k, H^k)\|_F}{\|\nabla^P f(W^1, H^1)\|_F}.$$

Different from Figure 1, blockpivot is less competitive. On ORL, UMIST, and YALE, blockpivot is only similar to pbba, while on CBCL, NAT, and ESSEX, it is worse. Another notable difference is that pbbnls shows better final convergence than projgrad. This result is consistent with Figure 4.3 in Han et al. (2009). Therefore, different evaluation criteria (e.g., function values, gradients, or others) can easily affect the conclusions of comparisons.

In Figure 1, the results on NAT are different from the others because blockpivot is the slowest. NAT has the largest r among all problems, so we suspect that different r values may affect the running speed. We experiment with different r values on ORL by choosing r = 25, 50, 75, 100, 125, and 150. Other settings are the same as those of the previous experiments. The results of time versus function values are shown in Figure 3. As r increases, blockpivot becomes slower than pbba and cbgp. We give the following explanation. When W is fixed, blockpivot applies an active-set method to the m independent problems (15). At each inner iteration, the following m linear systems are solved:

$$(W_{:,F_j}^T W_{:,F_j})\, H_{F_j, j} = W_{:,F_j}^T V_{:,j}, \qquad j = 1, \dots, m,$$

where $F_j$ is the active set of the $j$th problem. Because problems having the same active set share the same matrix $W_{:,F}^T W_{:,F}$, blockpivot groups them together and applies only one Cholesky factorization to the matrix $W_{:,F}^T W_{:,F}$. As r increases, the chance that some of the m problems (15) have the same active set decreases. Thus, blockpivot may be less competitive when r is large.
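The following MATLAB sketch (our own illustration of the grouping idea, not the blockpivot code) shows how columns of H that share an active set can be solved with a single Cholesky factorization; the logical matrix F, whose column F(:, j) marks the variables solved for in problem (15), is assumed to be given.

function H = solve_grouped_ls(V, W, F)
% For each distinct active set, factorize W(:,free)'*W(:,free) once and
% solve all columns of H that share it (assuming full column rank).
  [r, m] = size(F);
  H = zeros(r, m);
  [patterns, ~, gid] = unique(double(F'), 'rows');  % group columns by pattern
  for p = 1:size(patterns, 1)
    free = logical(patterns(p, :)');       % variables solved in this group
    if ~any(free), continue; end
    idx = (gid == p);                       % columns sharing this active set
    R = chol(W(:, free)' * W(:, free));     % one factorization per group
    B = W(:, free)' * V(:, idx);
    H(free, idx) = R \ (R' \ B);            % two triangular solves per group
  end
end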

4.2 Text Data

NMF is useful for document clustering, so we next consider the text set RCV1 (Lewis et al., 2004). This set is an archive of manually categorized newswire stories from Reuters Ltd. We use the training set prepared by Bekkerman and Scholz (2008), where the label hierarchy is reorganized and multi-labelled instances are removed. More details as well as the data set are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

We further remove classes which have fewer than five documents. Finally, we obtain a set of 15,544 instances in 43 classes. Using r = 6, 12, 24, and 43, we randomly permute the classes and select the first r classes of documents to construct the $n \times m$ matrix V, where n is the size of the vocabulary set and m is the number of documents. This setting ensures that more documents are included as r increases. The parameter r is the number of clusters that we intend to assign documents to. Because some words never appear in the selected documents and cause zero rows in V, we remove these rows before the experiments. Table 1b describes the sizes of the selected subsets. This document set is much sparser than the image sets used in Section 4.1, so it is important to store V as a sparse matrix.


Otherwise, not only does V consume a lot of memory, but the calculation of $W^TV$ is also time consuming (see step 2 of Algorithm 5; in blockpivot, $W^TV$ is also needed).
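A small MATLAB sketch (our own; the matrix below is a random stand-in with roughly the dimensions and density of the largest RCV1 subset, not the real data) of why sparse storage matters:

% Dense storage of a 36,786 x 15,544 matrix would need more than 4 GB,
% and W'*V would cost O(n*m*r) operations.  With a sparse V, the product
% costs O(r * nnz(V)) and the memory is proportional to nnz(V).
V   = sprand(36786, 15544, 0.002);   % random sparse stand-in, ~1.1M non-zeros
W   = rand(36786, 43);
WtV = W' * V;                        % fast sparse-dense product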

Similar to the previous experiments, by setting $\epsilon$ in (16) to $10^{-7}$ and averaging 20 experimental results, the relationship between relative function values and execution time is shown in Figure 4. The two methods pbba and blockpivot are the fastest. Contrary to the image problems, the difference between pbba and cbgp is now more significant. We observe that the difference mainly comes from the strategy of choosing the Barzilai-Borwein step size between (9) and (10). More precisely, if cbgp uses the rule in step (c) of Algorithm 3 to choose between (9) and (10), then its performance becomes closer to pbba. For the other methods, the results are similar to those on the image data. In Section 4.1, we pointed out that blockpivot may become less competitive as r increases. That result is for the situation where m and n are fixed. Here, both m and n also increase along with r, so the grouping strategy of blockpivot is still effective.

4.3 The Importance of $\alpha_1$

In Section 4.1 we reported that using the Barzilai-Borwein step size at the end of the previous outer iteration as the initial $\alpha_1$ can improve the performance of pbbnls. In this section, we give detailed experimental results. Figure 5 shows the relative function value versus the running time on ORL and UMIST. All the experimental settings here are the same as those in Section 4.1. In Figure 5, to see the difference at the final iterations, the x-axis is linearly scaled. The method pbbnls1 uses the new setting for $\alpha_1$. Its performance is much better than pbbnls and is very close to pbba. We observe that not only is the number of line search steps in the first inner iteration reduced, but subsequent inner iterations also require fewer line search steps. It is interesting that the step size of the first inner iteration significantly influences the overall performance of the projected Barzilai-Borwein method.

4.4 A Comparison Between (11) and (12)

Section 2.2 discusses the use of (11) or (12) as the direction at each inner iteration. Here, we conduct some comparisons on pbba, projgrad, and cbgp.



For pbba, in our experiments, $\lambda = 1$ always succeeds in the adaptive line search, so using (11) is the same as using (12). For projgrad, using (11) leads to very slow convergence and often reaches the specified maximal number of inner iterations. For cbgp, both (11) and (12) work well. However, we observe minor differences because, contrary to pbba, $\lambda = 1$ sometimes fails and several line search steps are needed. We present a comparison in Figure 6, where cbgp and cbgp1 use (11) and (12), respectively. Among the six image problems, except on CBCL and YALE, the performances of cbgp and cbgp1 are almost the same. Thus, only the results on CBCL and YALE are shown. In Figure 6, cbgp is faster than cbgp1 on CBCL, whereas the opposite occurs on YALE. The overall difference is not significant.

In summary, from our experiments, both (11) and (12) are applicable for projected Barzilai-Borwein methods, but only (12) is useful for projected gradient methods.

5 Conclusions

This paper has comprehensively studied the projected Barzilai-Borwein method for NMF problems. We investigate some important differences between variants of projected Barzilai-Borwein methods. The experimental settings in this paper are designed to compare different methods fairly. The experimental results concretely confirm that a carefully implemented projected Barzilai-Borwein method is better than projected gradient methods. We point out several important implementation techniques. Experiments also indicate that state-of-the-art active-set methods are efficient in reducing function values, but may not be better at reducing the gradient norm.

The pbba implementation proposed in this paper is available at http://www.csie.ntu.edu.tw/~cjlin/nmf/nmf.m. It replaces our earlier code implementing a projected gradient method.


References

Jonathan Barzilai and Jonathan M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8:141–148, 1988.

Ron Bekkerman and Martin Scholz. Data weaving: Scaling up the state-of-the-art in data clustering. In Proceedings of CIKM, pages 1083–1092, 2008.

Dimitri P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transactions on Automatic Control, 21:174–184, 1976.

Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA 02178-9998, second edition, 1999.

Ernesto G. Birgin, José M. Martínez, and Marcos Raydan. Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization, 10:1196–1211, 2000.

Silvia Bonettini. Inexact block coordinate descent methods with application to non-negative matrix factorization. IMA Journal of Numerical Analysis, 2011. To appear.

Paul H. Calamai and Jorge J. Moré. Projected gradient methods for linearly constrained problems. Mathematical Programming, 39:93–116, 1987.

Yu-Hong Dai and Roger Fletcher. Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numerische Mathematik, 100:21–47, 2005.

Roger Fletcher. On the Barzilai-Borwein method. Optimization and Control with Applications, 96:235–256, 2005.

L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Operations Research Letters, 26:127–136, 2000.

Luigi Grippo and Marco Sciandrone. Nonmonotone globalization techniques for the Barzilai-Borwein gradient method. Computational Optimization and Applications, 23:143–169, 2002.

Luigi Grippo, Francesco Lampariello, and Stefano Lucidi. A nonmonotone line search technique for Newton's method. SIAM Journal on Numerical Analysis, 23:707–716, 1986.

Lixing Han, Michael Neumann, and Upendra Prasad. Alternating projected Barzilai-Borwein methods for nonnegative matrix factorization. Electronic Transactions on Numerical Analysis, pages 54–82, 2009.

Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457–1469, 2004.

Dongmin Kim, Suvrit Sra, and Inderjit S. Dhillon. Fast Newton-type methods for the least squares nonnegative matrix approximation problem. Statistical Analysis and Data Mining, pages 38–51, 2008.

Hyunsoo Kim and Haesun Park. Non-negative matrix factorization based on alternating non-negativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications, 30(2):713–730, 2008a.

Jingu Kim and Haesun Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In Eighth IEEE International Conference on Data Mining, pages 353–362. IEEE, 2008b.

Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 556–562. MIT Press, 2001.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

Chih-Jen Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 19:2756–2779, 2007. URL http://www.csie.ntu.edu.tw/~cjlin/papers/pgradnmf.pdf.

Pentti Paatero and Unto Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111–126, 1994.

Marcos Raydan. The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7:26–33, 1997.

Thomas Serafini, Gaetano Zanghirati, and Luca Zanni. Gradient projection methods for quadratic programs and applications in training support vector machines. Optimization Methods and Software, 20:353–378, 2005.

Rafal Zdunek and Andrzej Cichocki. Fast nonnegative matrix factorization algorithms using projected gradient approaches for large-scale problems. Computational Intelligence and Neuroscience, 2008:1–13, 2008.

Figure 1: Averaged relative function value (19) versus execution time for the image problems: (a) CBCL, (b) ORL, (c) NAT, (d) UMIST, (e) YALE, (f) ESSEX. Both the x-axis and y-axis are log scaled.

Figure 2: Averaged norm of projected gradients versus execution time for the image problems: (a) CBCL, (b) ORL, (c) NAT, (d) UMIST, (e) YALE, (f) ESSEX. Both the x-axis and y-axis are log scaled.

Figure 3: Averaged relative function value (19) versus execution time with different r values ((a) r = 25, (b) r = 50, (c) r = 75, (d) r = 100, (e) r = 125, (f) r = 150) for the ORL data set. Both the x-axis and y-axis are log scaled.

Figure 4: Averaged relative function value (19) versus execution time for the RCV1 text data with (a) r = 6, (b) r = 12, (c) r = 24, and (d) r = 43. Both the x-axis and y-axis are log scaled.

Figure 5: Experiments on the importance of $\alpha_1$. The method pbbnls1 uses the new setting for $\alpha_1$. We report the averaged relative function value (19) versus execution time for (a) ORL and (b) UMIST. Only the y-axis is log scaled.

Figure 6: Averaged relative function value (19) versus execution time for (a) CBCL and (b) YALE. The method cbgp uses (11) to decide the direction, whereas cbgp1 uses (12). Only the y-axis is log scaled.
