
Projected Barzilai-Borwein Methods for Non-negative Matrix Factorization

Chen-Tse Tsai

Dept. of Computer Science National Taiwan University,

Taipei 106, Taiwan r98922028@csie.ntu.edu.tw

Chih-Jen Lin

Dept. of Computer Science National Taiwan University,

Taipei 106, Taiwan cjlin@csie.ntu.edu.tw

Abstract. Non-negative matrix factorization (NMF) is a useful dimension reduction technique. Currently, the most effective way to minimize NMF optimization problems is by alternately solving non-negative least squares sub-problems. Some recent studies have shown that projected Barzilai-Borwein methods are very efficient for solving each sub-problem. In this article, we study variants of the projected Barzilai-Borwein methods and discuss some useful implementation techniques. We provide an efficient implementation that replaces our popular NMF code based on a projected gradient method.

1 Introduction

Non-negative matrix factorization (NMF) (Paatero and Tapper, 1994; Lee and Seung, 1999) is a useful dimension reduction technique. Given an $n \times m$ data matrix $V$ with $V_{ij} \ge 0$ and a pre-specified positive integer $r < \min(n, m)$, NMF finds two non-negative matrices $W \in \mathbb{R}^{n \times r}$ and $H \in \mathbb{R}^{r \times m}$ such that

$$V \approx WH.$$

If each column of $V$ represents an object, NMF approximates it by a linear combination of $r$ "basis" columns in $W$.

A standard approach to find $W$ and $H$ is by minimizing the difference between $V$ and $WH$:

$$\min_{W,H} \; f(W,H) \equiv \frac{1}{2}\|V - WH\|_F^2 \quad \text{subject to } W_{ia} \ge 0,\ H_{bj} \ge 0,\ \forall i, a, b, j, \qquad (1)$$

where $\|\cdot\|_F$ is the Frobenius norm.

Most existing methods solve (1) by an alternating non-negative least-squares approach. At the $k$th iteration, $W$ is updated while $H$ is fixed, and then $H$ is updated while $W$ is fixed:

$$W^{k+1} = \text{an exact or approximate solution of } \min_{W \ge 0} f(W, H^k), \qquad (2)$$
$$H^{k+1} = \text{an exact or approximate solution of } \min_{H \ge 0} f(W^{k+1}, H). \qquad (3)$$

That is, at each iteration we solve two non-negative least squares sub-problems.
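As an illustration of this alternating framework, a minimal MATLAB sketch (our own, not the released code) of the outer loop is given below; the handle nnls_solve is a hypothetical placeholder for any routine that approximately solves a non-negative least squares sub-problem.

function [W, H] = nmf_alternating(V, W, H, nnls_solve, maxiter)
% Outer loop of the alternating scheme (2)-(3) (a sketch, our own names).
% nnls_solve(V, W, Hinit) is a function handle that approximately solves
%     min_{H >= 0} 0.5 * || V - W*H ||_F^2
% starting from Hinit, e.g., a projected gradient or Barzilai-Borwein solver.
  for k = 1:maxiter
    % Sub-problem (2): update W with H fixed; note that
    % min_{W>=0} ||V - W*H||_F equals min_{W'>=0} ||V' - H'*W'||_F.
    W = nnls_solve(V', H', W')';
    % Sub-problem (3): update H with W fixed.
    H = nnls_solve(V, W, H);
  end
end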

Lee and Seung (2001) consider a very rough approximation so that (2)-(3) become simple and effective multiplicative update rules. Lin (2007) proposes solving the sub-problems more accurately by a projected gradient method, whose convergence is shown to be faster than that of Lee and Seung. Lin's method takes the gradient direction and conducts a projected line search to ensure non-negativity and convergence. Recently, some new studies (e.g., Han et al. (2009); Bonettini (2011); Zdunek and Cichocki (2008)) have shown that applying Barzilai-Borwein step lengths to the gradient direction converges faster than Lin's projected gradient method.

The above-mentioned methods are popular because they can be easily implemented by simple matrix operations. Instead of using languages such as C, these methods can be easily implemented by tools like MATLAB without sacrificing running time. Many other methods have been applied to solve sub-problem (2) or (3). For example, Kim et al. (2008) propose a quasi-Newton method, while Kim and Park (2008a) and Kim and Park (2008b) consider active-set methods. However, these methods are usually more sophisticated and may require additional assumptions. While an implementation using MATLAB is still possible, one must be careful to avoid slow nested loops in such an environment.

For example, Kim and Park (2008b) assume that $W^k$ and $H^k$ have full column and row rank, respectively.


While the projected Barzilai-Borwein method is promising for NMF, it has many possible variants. Further, some subtle implementation issues may significantly affect the performance. This article aims to give a comprehensive study of the projected Barzilai-Borwein method. We will propose an efficient and easy-to-implement setting.

This paper is organized as follows. Section 2 discusses projected gradient methods and projected Barzilai-Borwein methods for solving NMF problems. Section 3 discusses algorithmic and implementation issues in detail. Experiments on image and text data are presented in Section 4. Finally, Section 5 concludes this work.

2 Projected Gradient and Projected Barzilai-Borwein Methods

The two least-squares sub-problems in (2) and (3) are bound-constrained optimization problems. Both can be written in the following general form:

$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to } l_i \le x_i \le u_i,\ i = 1, \dots, n, \qquad (4)$$

where $f: \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable convex function, and $l$ and $u$ are lower and upper bounds, respectively. A commonly used method to minimize (4) is by iteratively finding a search direction and projecting a new point back to the feasible region. Algorithm 1 describes a generalized projected method, in which we use the following projection

$$P[x_i] = \begin{cases} x_i & \text{if } l_i < x_i < u_i,\\ u_i & \text{if } x_i \ge u_i,\\ l_i & \text{if } x_i \le l_i, \end{cases}$$

to map a point back to the bounded feasible region. In Algorithm 1, a line search procedure is usually needed for ensuring the convergence.
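In MATLAB, this projection is a single element-wise operation. A minimal sketch (our own, with hypothetical names) for general bounds and for the NMF sub-problems, where the only constraint is non-negativity:

% Element-wise projection onto the box [l, u]; l and u may be scalars or
% arrays of the same size as X.
proj_box = @(X, l, u) min(max(X, l), u);

% The NMF sub-problems have only the lower bound 0, so the projection is
proj_nonneg = @(X) max(X, 0);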

2.1 Slow Convergence of Projected Gradient Methods

If the negative gradient direction is taken, then Algorithm 1 becomes the projected gradient method (Bertsekas, 1976). That is,

$$d^k = -\nabla f(x^k). \qquad (6)$$


Algorithm 1 A generalized projected method for bound-constrained minimization

1. Initialize any feasible $x^1$.
2. For $k = 1, 2, \dots$
   (a) Find a direction $d^k$.
   (b) Conduct a line search to find a $\lambda_k$ such that
       $$x^{k+1} = P[x^k + \lambda_k d^k] \qquad (5)$$
       satisfies a certain sufficient decrease condition.

This method has been used for NMF in Lin (2007), which employs a backtracking line search by trying $\lambda_k = 1, \beta, \beta^2, \dots$, where $0 < \beta < 1$, until

$$f(x^{k+1}) \le f(x^k) + \sigma \nabla f(x^k)^T (x^{k+1} - x^k). \qquad (7)$$

The inequality in (7) with $0 < \sigma < 1$ is often called the sufficient decrease condition in optimization. This projected gradient procedure is closely related to a steepest descent approach that takes

$$\lambda_k = \arg\min_{\lambda} f(P[x^k + \lambda d^k]). \qquad (8)$$

Note that (8) is often expensive because $P[x^k + \lambda d^k]$ is a piecewise function of $\lambda$. Thus, we use a backtracking line search instead. The convergence of projected gradient methods can be seen in, for example, Calamai and Moré (1987).

It is well known in optimization that a steepest descent approach suffers from slow local convergence. We can replace the gradient direction in (8) with quasi-Newton or Newton directions, but the implementation is more complicated. The Barzilai-Borwein method in the next section still considers the gradient direction, but finds a suitable step length for faster convergence.
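A minimal MATLAB sketch of one projected gradient iteration with the backtracking line search (7) is given below (our own illustration in the spirit of this description, not the code of Lin (2007)); f and gradf are assumed to be function handles and l, u the bound vectors.

function x = projgrad_step(f, gradf, x, l, u, beta, sigma)
% One iteration of the projected gradient method (6) with the backtracking
% line search that enforces the sufficient decrease condition (7).
  g  = gradf(x);
  fx = f(x);
  lambda = 1;
  while true
    xnew = min(max(x - lambda*g, l), u);        % P[x + lambda*d], d = -grad f
    if f(xnew) <= fx + sigma * (g(:)' * (xnew(:) - x(:)))   % condition (7)
      break;
    end
    lambda = lambda * beta;                     % try 1, beta, beta^2, ...
  end
  x = xnew;
end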

2.2 Variants of Projected Barzilai-Borwein Methods

Barzilai and Borwein (1988) proposed using

$$d^k = -\alpha_k \nabla f(x^k),$$

where

$$\alpha_k = \frac{s_k^T s_k}{s_k^T y_k} \qquad (9)$$

or

$$\alpha_k = \frac{s_k^T y_k}{y_k^T y_k}, \qquad (10)$$

and

$$s_k = x^k - x^{k-1}, \qquad y_k = \nabla f(x^k) - \nabla f(x^{k-1}).$$

Barzilai and Borwein consider only unconstrained problems. The above $\alpha_k$ values are obtained using the secant condition of quasi-Newton methods. The method is designed to do a direct update without line search, but convergence holds only for certain cases. Thus, in general a line search procedure is still needed. For the convergence, we restrict $\alpha_k \in [\alpha_{\min}, \alpha_{\max}]$, where $\alpha_{\min}$ and $\alpha_{\max}$ are two pre-specified positive values. Many works have either studied the theoretical properties of the Barzilai-Borwein method or applied it to solve real problems; see, for example, Fletcher (2005) and references therein.
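In code, both step lengths are two inner products. A minimal MATLAB sketch (our own, with hypothetical names) that also enforces the restriction to $[\alpha_{\min}, \alpha_{\max}]$:

function alpha = bb_step(s, y, rule, alpha_min, alpha_max)
% Barzilai-Borwein step lengths from s = x_k - x_{k-1} and
% y = grad f(x_k) - grad f(x_{k-1}); s and y may be vectors or matrices.
  sy = s(:)' * y(:);
  if strcmp(rule, 'bb1')
    alpha = (s(:)' * s(:)) / sy;     % rule (9)
  else
    alpha = sy / (y(:)' * y(:));     % rule (10)
  end
  alpha = min(max(alpha, alpha_min), alpha_max);   % safeguard for convergence
end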

Extensions of the Barzilai-Borwein method for bound-constrained problems have been well studied. A detailed report for quadratic bound-constrained minimization is in Dai and Fletcher (2005). As projection is often needed in the line search procedure, we refer to the method as the projected Barzilai-Borwein method. Many variants of the projected Barzilai-Borwein methods are available, and they differ in the following aspects:

• The selection of $\alpha_k$. We can use (9), (10), or others.

• The setting of the direction $d^k$; see the two possibilities in (11) and (12).

• The line search procedure for finding the step size $\lambda_k$.

We summarize some conclusions that have emerged from recent studies:

1. An alternating use of (9) and (10) is helpful. Grippo and Sciandrone (2002) proposed using

$$\alpha_k = \begin{cases} \dfrac{s_k^T s_k}{s_k^T y_k} & \text{if } k \text{ is odd},\\ \dfrac{s_k^T y_k}{y_k^T y_k} & \text{if } k \text{ is even}. \end{cases}$$

They showed that this setting is more efficient than using only (9) or (10).


For strictly convex quadratic objective functions, Serafini et al. (2005) proved that the $\alpha$ value by (9) is larger than that by (10). Therefore, a step by (10) may be too conservative, while (9) may result in too many line search steps. They thus design certain mechanisms to decide when to use which one.

2. Using

$$d^k = P[x^k - \alpha_k \nabla f(x^k)] - x^k \qquad (11)$$

is a competitive alternative to

$$d^k = (x^k - \alpha_k \nabla f(x^k)) - x^k = -\alpha_k \nabla f(x^k). \qquad (12)$$

If (12) is used, $P[x^k + \lambda d^k]$ is a piecewise linear function of $\lambda$. At each step of the backtracking line search, not only is the projection operation $P[x^k + \lambda d^k]$ needed, but the function value $f(P[x^k + \lambda d^k])$ must also be calculated. Birgin et al. (2000) propose using (11) instead. Because $x^k + d^k$ is then feasible, $P[x^k + \lambda d^k] = x^k + \lambda d^k$ is a linear function of $\lambda \in [0, 1]$. For a quadratic problem like (2) or (3),

$$f(x^k + \lambda d^k) = f(x^k) + \lambda \nabla f(x^k)^T d^k + \frac{\lambda^2}{2} {d^k}^T \nabla^2 f(x^k) d^k \qquad (13)$$

can be easily obtained because $\nabla f(x^k)^T d^k$ and ${d^k}^T \nabla^2 f(x^k) d^k$ are pre-calculated before trying various $\lambda$ (a MATLAB sketch of this line search appears after this list). Birgin et al. (2000) reported that for general bound-constrained problems, if the Barzilai-Borwein step size in (9) is taken, using (11) yields results comparable to (12). Interestingly, traditional projected gradient methods (e.g., Bertsekas (1999)) always use (12) with $\alpha_k = 1$. This is a subtle difference between projected gradient and Barzilai-Borwein methods because it seems that the former can use only (12), while the latter can use both. We suspect that because $\alpha_k = 1$ is a crude estimate of the step size, if (11) is used for projected gradient methods, $d^k$ may not be a good direction regardless of $\lambda$. In contrast, $\alpha_k$ from the Barzilai-Borwein method may already give a reasonable direction in (11), so we can conduct the line search along that direction.

Among existing projected Barzilai-Borwein methods for NMF, Bonettini (2011) and Zdunek and Cichocki (2008) have used (11). Han et al. (2009) have tried both and reported that using (11) is slightly better than (12) on some randomly generated data. In Section 4, we will conduct some experiments to compare the two approaches of using (11) and (12).

3. A non-monotone line search seems to be better than a monotone one. As the step sizes given by the Barzilai-Borwein method are considered useful, we should avoid modifying them in the line search procedure, which is employed only for theoretical convergence. Many works (e.g., Grippo et al. (1986); Raydan (1997); Birgin et al. (2000)) have proposed replacing $f(x^k)$ on the right-hand side of (7) with a larger value. For example, Raydan (1997) and Grippo et al. (1986) used the following value for the unconstrained and bound-constrained situations, respectively:

$$\max\{f(x^{k-i}) \mid 0 \le i \le \min(k, M-1)\}, \qquad (14)$$

where $M$ is a pre-specified positive integer. That is, $f(x^k)$ in (7) is replaced by the maximal function value of the previous $M$ iterations.

Dai and Fletcher (2005) point out that if $M$ is too small, then the reference value on the right-hand side of (7) may not be large enough. Therefore, they propose an adaptive setting. Let $f_r$ be the reference function value on the right-hand side of (7) and $f_{best}$ the best function value over all past iterations. The value $f_r$ is reduced to a candidate function value $f_c$ if the method fails to improve $f_{best}$ in at most $L$ iterations, where $L$ is a small predefined positive integer and $f_c$ is the maximal function value since $f_{best}$ was found. The strategy of updating $f_r$ is described in Algorithm 2. Dai and Fletcher (2005) initialize $f_r = \infty$, $f_{best} = f_c = f(x^1)$, and $l = 0$. Different from (14), this choice of $f_r$ allows $f(x^k) \ge f(x^1)$ in early iterations; thus, the step size is less frequently modified by the line search.

4. For NMF, solving the non-negative least squares problem (3) as a whole seems to be better than separating (3) into $m$ independent problems. The $j$th problem is

$$\min_{H_{:,j}} \; \frac{1}{2}\|V_{:,j} - W H_{:,j}\|^2 \quad \text{subject to } H_{bj} \ge 0,\ \forall b, \qquad (15)$$


Algorithm 2 Updating the reference function value $f_r$ in the adaptive non-monotone line search (Dai and Fletcher, 2005)

If $f(x^{k+1}) < f_{best}$
    $f_{best} \leftarrow f(x^{k+1})$, $f_c \leftarrow f(x^{k+1})$, $l \leftarrow 0$
Else
    $f_c \leftarrow \max\{f_c, f(x^{k+1})\}$, $l \leftarrow l + 1$
    If $l = L$
        $f_r \leftarrow f_c$, $f_c \leftarrow f(x^{k+1})$, $l \leftarrow 0$

where $V_{:,j} \in \mathbb{R}^n$ and $H_{:,j} \in \mathbb{R}^r$ are the $j$th columns of $V$ and $H$, respectively. A possible advantage of this approach is that we can tune the step size for each problem (15). However, a matrix-based implementation is slightly more complicated, especially if line search is conducted. Zdunek and Cichocki (2008, Algorithm 4) used a projected Barzilai-Borwein method to solve the $m$ independent problems (15) without line search. Han et al. (2009, Section 5) reported that on some randomly generated data, solving (3) as a whole is much faster. The same result was obtained in the extensive comparison of Bonettini (2011). We have also conducted some experiments, and the results show that solving (3) as $m$ problems is not better. We give the following possible explanation. By obtaining a step size for each problem (15), we get a better step size to reduce the function value. The algorithm is then closer to the steepest descent algorithm and therefore may suffer from slow convergence. Interestingly, the suggestion here to stay away from the steepest descent method is also the motivation of Barzilai-Borwein methods.
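As mentioned in item 2, the inner products needed by (13) make the backtracking essentially free. The MATLAB sketch below (our own illustration for the sub-problem (17), with $P[\cdot] = \max(\cdot, 0)$ and hypothetical names; not the released code) performs one such line search given the current gradient G, the pre-computed W'W, the current objective value, and a reference value for the (possibly non-monotone) decrease test.

function [Hnew, fnew, lambda] = quad_linesearch(H, G, WtW, fval, fref, alpha, beta, sigma)
% Line search along the feasible direction (11) for sub-problem (17),
% using the quadratic expansion (13); each trial of lambda costs O(1).
  d  = max(H - alpha*G, 0) - H;      % direction (11)
  g  = G(:)' * d(:);                 % <grad f(H), d>, computed once
  Wd = WtW * d;
  h  = d(:)' * Wd(:);                % <d, (W'W) d>,   computed once
  lambda = 1;
  fnew = fval + lambda*g + 0.5*lambda^2*h;             % formula (13)
  while fnew > fref + sigma*lambda*g                    % decrease test
    lambda = lambda * beta;
    fnew = fval + lambda*g + 0.5*lambda^2*h;
  end
  Hnew = H + lambda*d;
end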

2.3 The Proposed Procedure

For the selection of $\alpha_k$ in a Barzilai-Borwein method, we empirically observe that using (9) only is usually better than using (10) only. An exception occurs when the projected gradient norm $\|\nabla^P f(x)\|$ at the current iteration is larger than the previous one, where the projected gradient is defined as

$$\nabla^P f(x)_i = \begin{cases} \nabla f(x)_i & \text{if } l_i < x_i < u_i,\\ \min(0, \nabla f(x)_i) & \text{if } x_i = l_i,\\ \max(0, \nabla f(x)_i) & \text{if } x_i = u_i. \end{cases}$$

(Bonettini (2011) attributes the inferior results of Zdunek and Cichocki (2008) to their using only the first Barzilai-Borwein rule (9). However, Han et al. (2009) use only (9) as well, so we think another reason is that Zdunek and Cichocki (2008) solve $m$ independent problems.)

According to the optimality condition, $x$ is a stationary point of (4) if and only if $\|\nabla^P f(x)\| = 0$.

We argue that when the projected gradient norm increases, a conservative step size like (10) is more suitable than (9). First, a larger projected gradient norm may imply a longer direction $d^k$; in this case, a smaller step size is suitable, and taking (9) for $\alpha_k$ may cause many line search steps. Second, because we hope $\nabla^P f(x^k)$ gradually approaches zero, an increase of $\|\nabla^P f(x^k)\|$ means that the algorithm is not running well. Thus, we should update $x^k$ more carefully by taking a conservative step size. Therefore, we use (9) to obtain $\alpha_k$ when the projected gradient norm decreases and switch to (10) once the projected gradient norm rises.

Regarding the direction used in line search, following the earlier discussion, we use (11) instead of (12). To avoid reducing Barzilai-Borwein step sizes in line search, we use the non-monotone line search strategy proposed by Dai and Fletcher (2005). In our experiments, we observe that in almost all cases, λ = 1 succeeds and no line search steps are needed. The proposed procedure to solve (4) is summarized in Algorithm 3.
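A minimal MATLAB sketch (our own, with hypothetical names) of the projected gradient for the NMF sub-problems (all lower bounds zero, no upper bounds) and of the switching rule just described:

function [alpha, pgnorm] = bb_alpha_switch(X, G, pgnorm_prev, s, y, amin, amax)
% X: current (feasible) iterate, G = grad f(X), pgnorm_prev: previous
% ||grad^P f||_F, s and y: the Barzilai-Borwein differences.
  PG = G;
  PG(X <= 0) = min(G(X <= 0), 0);        % projected gradient at the bound
  pgnorm = norm(PG, 'fro');
  sy = s(:)' * y(:);
  if pgnorm > pgnorm_prev                % norm rose: conservative rule (10)
    alpha = sy / (y(:)' * y(:));
  else                                   % norm fell: rule (9)
    alpha = (s(:)' * s(:)) / sy;
  end
  alpha = min(max(alpha, amin), amax);
end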

3 Algorithmic and Implementation Issues

The overall procedure of alternately solving sub-problems is described in Algorithm 4. Because solving each sub-problem also requires an iterative procedure, we have two levels of iterations. We refer to the procedure of updating $(W^k, H^k)$ to $(W^{k+1}, H^{k+1})$ as an outer iteration, and to each step in solving a sub-problem as an inner iteration. At each outer iteration, Algorithm 5, which is a specific implementation of Algorithm 3, is used to solve sub-problems (2) and (3).

In this section, we discuss stopping conditions and other implementation issues.

Note that in Section 2.2, we mentioned that the step size given by (9) is larger than that given by (10).


Algorithm 3 A projected Barzilai-Borwein method

1. Given $0 < \beta < 1$, $0 < \sigma < 1$, an integer $L$, and $\alpha_{\max} > \alpha_{\min} > 0$. Initialize any feasible $x^1$. Set $f_{best} = f_c = f(x^1)$, $f_r = \infty$, and $\alpha_1 = 1$.
2. For $k = 1, 2, \dots$
   (a) $d^k \leftarrow P[x^k - \alpha_k \nabla f(x^k)] - x^k$
   (b) $\lambda \leftarrow 1$, $x^{k+1} = x^k + \lambda d^k$
       While $f(x^{k+1}) > f_r + \sigma \lambda \nabla f(x^k)^T d^k$
           $\lambda \leftarrow \lambda \cdot \beta$
           $x^{k+1} = x^k + \lambda d^k$
   (c) If $\|\nabla^P f(x^{k+1})\| > \|\nabla^P f(x^k)\|$
           Get $\alpha_{k+1}$ by (10)
       Else
           Get $\alpha_{k+1}$ by (9)
   (d) $\alpha_{k+1} = \min(\max(\alpha_{k+1}, \alpha_{\min}), \alpha_{\max})$
   (e) Update $f_r$ by Algorithm 2

3.1 Stopping Conditions

We adopt the stopping condition suggested by Lin (2007):

$$\|\nabla^P f(W^k, H^k)\|_F \le \epsilon\, \|\nabla^P f(W^1, H^1)\|_F. \qquad (16)$$

Note that although the gradient is a vector, we represent it as a matrix. For approximately solving the sub-problems (2) and (3), the stopping conditions are

$$\|\nabla^P_W f(W^{k+1}, H^k)\|_F \le \bar{\epsilon}_W \quad \text{and} \quad \|\nabla^P_H f(W^{k+1}, H^{k+1})\|_F \le \bar{\epsilon}_H,$$

respectively, where we set

$$\bar{\epsilon}_W = \bar{\epsilon}_H \equiv \max(10^{-3}, \epsilon)\, \|\nabla^P f(W^1, H^1)\|_F$$

in the beginning and $\epsilon$ is the tolerance in (16). If the number of inner iterations for solving (2) is zero, we decrease the stopping tolerance by

$$\bar{\epsilon}_W \leftarrow \bar{\epsilon}_W / 10.$$

For the sub-problem (3), $\bar{\epsilon}_H$ is reduced in a similar way.
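A small MATLAB sketch (our own, with hypothetical names) of this stopping and tolerance logic; pg_norm0 denotes $\|\nabla^P f(W^1, H^1)\|_F$ and eps_out the tolerance $\epsilon$ in (16):

function [stop_outer, eps_W, eps_H] = check_stopping(pg_norm, pg_norm0, ...
                                eps_out, eps_W, eps_H, inner_W, inner_H)
% pg_norm: current ||grad^P f(W^k,H^k)||_F.  inner_W / inner_H: numbers of
% inner iterations just spent on sub-problems (2) and (3).
  stop_outer = (pg_norm <= eps_out * pg_norm0);     % condition (16)
  if inner_W == 0, eps_W = eps_W / 10; end          % tighten sub-problem
  if inner_H == 0, eps_H = eps_H / 10; end          % tolerances if needed
end

Both tolerances start at max(1e-3, eps_out) * pg_norm0, as stated above.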


Algorithm 4 A projected Barzilai-Borwein method to solve the NMF problem (1)

Input: $V$, $W^1$, and $H^1$.
1. Initialize $f(W^1, H^1) = \frac{1}{2}\|V - W^1 H^1\|_F^2$, $\alpha^W_1 \leftarrow 1$, and $\alpha^H_1 \leftarrow 1$.
2. For $k = 1, 2, \dots$ (outer iterations)
   (a) If the stopping condition (16) is satisfied, terminate the procedure.
   (b) Run Algorithm 5 with
       input: $V^T$, $(H^k)^T$, $(W^k)^T$, $\alpha^W_k$, and $f(W^k, H^k)$;
       output: $W^{k+1}$, $\alpha^W_{k+1}$, and $f(W^{k+1}, H^k)$.
   (c) Run Algorithm 5 with
       input: $V$, $W^{k+1}$, $H^k$, $\alpha^H_k$, and $f(W^{k+1}, H^k)$;
       output: $H^{k+1}$, $\alpha^H_{k+1}$, and $f(W^{k+1}, H^{k+1})$.
Output: $W^k$ and $H^k$.

3.2 Implementation Details

We discuss Algorithm 5 for solving (3) in detail, where some tricks effectively reduce the computational time. Assume $W$ is fixed and $H$ is updated; the objective function of (3) can be rewritten as

$$\min_{H} \; \bar{f}(H) \equiv \frac{1}{2}\|V - WH\|_F^2 \quad \text{subject to } H_{bj} \ge 0,\ \forall b, j, \qquad (17)$$

where $V$ and $W$ are constant matrices. The objective function of (2) can be represented in a similar form.

At the first inner iteration of solving a sub-problem, the lack of information prevents us from using a Barzilai-Borwein step size. We found that a good initial step size may lead to faster convergence. Therefore, we pass the step size from the previous outer iteration to the next sub-problem as the initial step size. Notice that the step size of $\min_{W \ge 0} f(W, H^k)$ is passed to the sub-problem $\min_{W \ge 0} f(W, H^{k+1})$ rather than to $\min_{H \ge 0} f(W^{k+1}, H)$; see $\alpha^W_k$ and $\alpha^H_k$ in steps (b) and (c) of Algorithm 4, respectively.

At each iteration of Algorithm 5, the computational bottleneck is the line search and updating the function value.


Algorithm 5 Applying Algorithm 3 to solve the sub-problem (17). Note that $\langle \cdot, \cdot \rangle$ indicates the sum of the component-wise product of two matrices.

Input: $V$, $W$, $H^1$, $\alpha_1$, $\bar{f}(H^1)$.
1. Set $\beta = 0.1$, $\sigma = 0.01$, $L = 10$, $\alpha_{\min} = 10^{-30}$, $\alpha_{\max} = 10^{30}$, $f_{best} = f_c = \bar{f}(H^1)$, and $f_r = \infty$.
2. $\nabla\bar{f}(H^1) = (W^TW)H^1 - W^TV$
3. For $t = 1, 2, \dots$ (inner iterations)
   (a) If a stopping condition is satisfied, terminate the procedure.
   (b) $d \leftarrow P[H^t - \alpha_t \nabla\bar{f}(H^t)] - H^t$
   (c) $g \leftarrow \langle \nabla\bar{f}(H^t), d \rangle$
       $y \leftarrow (W^TW)d$
       $h \leftarrow \langle y, d \rangle$
   (d) $\lambda \leftarrow 1$
       $\bar{f}(H^{t+1}) = \bar{f}(H^t) + \lambda g + 0.5\lambda^2 h$
       While $\bar{f}(H^{t+1}) > f_r + \sigma\lambda g$
           $\lambda \leftarrow \lambda \cdot \beta$
           $\bar{f}(H^{t+1}) = \bar{f}(H^t) + \lambda g + 0.5\lambda^2 h$
   (e) $d \leftarrow \lambda d$, $y \leftarrow \lambda y$, $h \leftarrow \lambda^2 h$
       $H^{t+1} = H^t + d$
       $\nabla\bar{f}(H^{t+1}) = \nabla\bar{f}(H^t) + y$
   (f) If $\|\nabla^P\bar{f}(H^{t+1})\| > \|\nabla^P\bar{f}(H^t)\|$
           $\alpha_{t+1} \leftarrow \min(\max(h/\langle y, y\rangle, \alpha_{\min}), \alpha_{\max})$
       Else
           $\alpha_{t+1} \leftarrow \min(\max(\langle d, d\rangle/h, \alpha_{\min}), \alpha_{\max})$
   (g) If $\bar{f}(H^{t+1}) < f_{best}$
           $f_{best} \leftarrow \bar{f}(H^{t+1})$, $f_c \leftarrow \bar{f}(H^{t+1})$, $l \leftarrow 0$
       Else
           $f_c \leftarrow \max\{f_c, \bar{f}(H^{t+1})\}$, $l \leftarrow l + 1$
           If $l = L$
               $f_r \leftarrow f_c$, $f_c \leftarrow \bar{f}(H^{t+1})$, $l \leftarrow 0$

Output: $H^{t+1}$, $\alpha_{t+1}$, $\bar{f}(H^{t+1})$.


Because the direction in (11) is used, following (13), the line search is very cheap if $\langle \nabla\bar{f}(H^t), d \rangle$ and $\langle d, (W^TW)d \rangle$ are available, where $d$ is defined in step (b) and $\langle \cdot, \cdot \rangle$ is the sum of the component-wise product of two matrices. The cost of $(W^TW)d$ is $O(mr^2)$ if $W^TW$ has been pre-calculated at the beginning of Algorithm 5. For the gradient $\nabla\bar{f}(H^t)$ needed in $\langle \nabla\bar{f}(H^t), d \rangle$, if $W^TW$ and $W^TV$ have been pre-calculated as constants independent of $H$, then

$$\nabla\bar{f}(H^t) = (W^TW)H^t - W^TV$$

needs $O(mr^2)$ operations. Because $(W^TW)d$ has already been obtained for updating function values, using

$$\nabla\bar{f}(H^t + d) = \nabla\bar{f}(H^t) + (W^TW)d \qquad (18)$$

in step (e), we can reduce the cost to $O(mr)$.

The preceding discussion has shown how to update the function value. For the initial function value, because the two sub-problems (2) and (3) have the same form of objective function, we can reuse the last function value of the previous sub-problem. For example, in Algorithm 4, the $f(W^{k+1}, H^k)$ obtained at step (b) is directly passed to step (c).

To further save running time, we find that some values obtained for calculating function values in the line search can be reused to update the gradient and the Barzilai-Borwein step size; see steps (c), (e), and (f) of Algorithm 5. In summary, the complexity of using Algorithm 5 to solve a sub-problem is

$$O(nmr) + \#\text{iterations} \times O(mr^2),$$

where $O(nmr)$ is for calculating $W^TV$ in step 2.§

The convergence of Algorithm 5 easily follows from the proof in Dai and Fletcher (2005) because $\bar{f}(H)$ is twice continuously differentiable in $H$ and is bounded below. For Algorithm 4, if the sub-problems are exactly solved, the convergence proof has been established; see the discussion in Section 2 of Lin (2007), which cites the proof of Grippo and Sciandrone (2000). However, we do not yet have a convergence proof if the sub-problems are approximately solved.

§ We assume that the number of line search steps is no more than a constant.
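For completeness, the following MATLAB sketch (written by us from the description above; it is not the released nmf.m, and the stopping test is simplified to a tolerance on $\|\nabla^P\bar{f}\|_F$) assembles the pieces of Algorithm 5: the pre-computed $W^TW$ and $W^TV$, the direction (11), the cheap line search (13), the gradient update (18), the step-length switching rule of Section 2.3, and the reference-value update of Algorithm 2.

function [H, alpha, fval] = nnls_bb(V, W, H, alpha, fval, tol, maxiter)
% A sketch of Algorithm 5 (our own code, not the released implementation).
% It approximately solves
%     min_{H >= 0} 0.5 * || V - W*H ||_F^2
% by projected Barzilai-Borwein steps with the adaptive non-monotone line
% search of Dai and Fletcher (2005).
% Inputs : initial H, initial step length alpha, fval = 0.5*norm(V-W*H,'fro')^2,
%          tol on ||grad^P||_F, and an inner-iteration cap.
% Outputs: updated H, the last step length (passed to the next outer
%          iteration as in Algorithm 4), and the final objective value.
  beta = 0.1;  sigma = 0.01;  L = 10;
  amin = 1e-30;  amax = 1e30;
  fbest = fval;  fc = fval;  fr = inf;  l = 0;

  WtW = W' * W;                          % computed once, O(n r^2)
  WtV = W' * V;                          % computed once, O(n m r)
  G   = WtW * H - WtV;                   % gradient of 0.5*||V - W*H||_F^2
  PG  = G;  PG(H <= 0) = min(G(H <= 0), 0);
  pgnorm = norm(PG, 'fro');

  for t = 1:maxiter
    if pgnorm <= tol, break; end

    % direction (11) and the two scalars needed by (13)
    D  = max(H - alpha*G, 0) - H;
    g  = G(:)' * D(:);                   % <grad, d>
    Y  = WtW * D;                        % reused below, O(m r^2)
    h  = D(:)' * Y(:);                   % <d, (W'W) d>

    % adaptive non-monotone backtracking line search using (13)
    lambda = 1;
    fnew = fval + lambda*g + 0.5*lambda^2*h;
    while fnew > fr + sigma*lambda*g
      lambda = lambda * beta;
      fnew = fval + lambda*g + 0.5*lambda^2*h;
    end

    % update iterate and gradient via the trick (18)
    D = lambda*D;  Y = lambda*Y;  h = lambda^2*h;
    H = H + D;
    G = G + Y;
    fval = fnew;

    % Barzilai-Borwein step for the next iteration (switching rule of 2.3);
    % after the scaling above, s = D and s'y = h
    PG = G;  PG(H <= 0) = min(G(H <= 0), 0);
    pgnorm_new = norm(PG, 'fro');
    if pgnorm_new > pgnorm
      alpha = h / (Y(:)' * Y(:));        % conservative rule (10)
    else
      alpha = (D(:)' * D(:)) / h;        % rule (9)
    end
    alpha = min(max(alpha, amin), amax);
    pgnorm = pgnorm_new;

    % reference value f_r update (Algorithm 2)
    if fval < fbest
      fbest = fval;  fc = fval;  l = 0;
    else
      fc = max(fc, fval);  l = l + 1;
      if l == L,  fr = fc;  fc = fval;  l = 0;  end
    end
  end
end

In the spirit of Algorithm 4, this routine would be called alternately on the transposed problem (to update W) and on the original problem (to update H), passing along the step length and the function value as described above.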


4 Experiments

We compare the following five methods:

1. pbba: the proposed projected Barzilai-Borwein method with adaptive non-monotone line search. The implementation details are described in Algorithms 4 and 5.

2. projgrad: the projected gradient method proposed by Lin (2007). The code is available at http://www.csie.ntu.edu.tw/~cjlin/nmf/. This method differs from pbba in the following aspects:

• A projected gradient direction is used.

• The trick to maintain and update the gradient by (18) is not used.

3. blockpivot: an active-set method proposed by Kim and Park (2008b). The code is available at http://www.cc.gatech.edu/~jingu/nmf/nmf_bpas.zip.

4. pbbnls: a projected Barzilai-Borwein method proposed by Han et al. (2009). Their method is different from pbba in the following points:

• The initial $\alpha_1$ of each sub-problem is $1/\|\nabla f(H^1)\|_{\max}$, where $\|\cdot\|_{\max}$ is the largest absolute value in the matrix.

• They only consider (9) as the Barzilai-Borwein step size.

• They use (14) as the reference function value in the non-monotone line search.

• Their implementation is not as optimized as ours in Section 3.2.

The code is available at http://www.math.uconn.edu/~neumann/nmf/PBBNLS2.m.

5. cbgp: a projected Barzilai-Borwein method proposed by Bonettini (2011). Their method is different from pbba in the following points:

• They use the standard sufficient decrease condition in (7) instead of a non-monotone line search procedure.


• They choose the Barzilai-Borwein step size by the following rule:

  If $\alpha^2_{k+1}/\alpha^1_{k+1} \le \tau_k$
      $\alpha_{k+1} \leftarrow \min\{\alpha^2_j \mid j = \max\{1, k+1-M\}, \dots, k+1\}$, $\tau_{k+1} \leftarrow 0.9\,\tau_k$
  Else
      $\alpha_{k+1} \leftarrow \alpha^1_{k+1}$, $\tau_{k+1} \leftarrow 1.1\,\tau_k$,

  where $\alpha^1_{k+1}$ is calculated by (9), $\alpha^2_{k+1}$ is calculated by (10), $M$ is a pre-specified non-negative integer, and $\tau_k \in (0, 1)$. That is, they select the Barzilai-Borwein step size by comparing the ratio $\alpha^2_{k+1}/\alpha^1_{k+1}$ with a threshold $\tau_k$. Changing the threshold $\tau_k$ dynamically can avoid using the same step-size rule in too many consecutive iterations, and makes the choice of the initial $\tau_0$ less important.

The code is not publicly available, so we implement it ourselves. All implementation tricks mentioned in Section 3.2 are incorporated. We follow Bonettini (2011) and set $M = 2$ and $\tau_1 = 0.5$.

We do not consider the implementation by Zdunek and Cichocki (2008) because pbbnls and cbgp have shown better performance in earlier studies. All implementations were written in MATLAB (http://www.mathworks.com). We conduct experiments on a 64-bit machine with an Intel Xeon 2.5GHz CPU and 16GB main memory. Results of using image data and text data are presented in the following subsections. Our code for experiments is available at http://www.csie.ntu.edu.tw/~cjlin/nmf/.

4.1 Image Data

We consider six image problems:

1. CBCL face image database.

http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html.

2. ORL face image database.

http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.


Table 1: Data statistics. The column "nnz" indicates the number of non-zero elements in the matrix V.

(a) Image data

Problem       n       m     r         nnz
CBCL        361   2,429    49     876,869
ORL      10,304     400    25   4,121,478
NAT         288  10,000    72   2,880,000
UMIST    10,304     300    20   3,091,200
YALE     77,760     165    15  12,830,400
ESSEX     9,000     150    40   1,350,000

(b) Text data: RCV1

     n       m    r        nnz
10,374   1,100    6     84,881
17,899   6,524   12    333,626
25,843  10,608   24    634,852
36,786  15,544   43  1,025,997

3. NAT Natural image data set.

http://redwood.berkeley.edu/bruno/sparsenet/.

4. UMIST face image database.

http://www.sheffield.ac.uk/eee/research/iel/research/face.html.

5. YALE face image database.

http://cvc.yale.edu/projects/yalefaces/yalefaces.html.

6. ESSEX face image database.

http://cswww.essex.ac.uk/mv/allfaces/faces94.html.

The settings of the first three problems are the same as those in Hoyer (2004).

The preprocessing of UMIST and YALE is done by the same method as for ORL: first, each image is reshaped as a column vector of the matrix V. Second, each column of V is linearly scaled to values in [0, 255] and every element is divided by 10,000. Thus, n is the size of an image and m is the number of images in the database. The value r is set to the number of people in the database.
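A small MATLAB sketch (our own; imgs is a hypothetical h-by-w-by-m array of grayscale images) of this preprocessing:

function V = preprocess_images(imgs)
% Reshape each image into a column of V, scale each column linearly to
% [0, 255], and divide every element by 10,000, as described above.
  V = reshape(double(imgs), [], size(imgs, 3));
  Vmin = min(V);  Vmax = max(V);                   % per-column range
  V = (V - Vmin) ./ max(Vmax - Vmin, eps) * 255;   % columns scaled to [0, 255]
  V = V / 10000;
end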

The preprocessing of ESSEX is done by the code provided by Bonettini (2011).

We present data statistics in Table 1a.


Table 2: The average of $f(W^*, H^*)$ values using 20 initial points

Problem        pbba   blockpivot      pbbnls    projgrad        cbgp
CBCL        793.519      792.189     792.034     792.346     792.835
ORL          13.601       13.543      26.263      13.612      13.604
NAT      344085.896   344183.559   346773.46  343618.664  343987.146
UMIST         9.239        9.181      12.679       9.258       9.276
YALE         55.197       54.446     102.277      55.167      55.203
ESSEX         2.057        1.874       8.278       2.057       1.990

Since the NMF problem (1) is not convex, different methods may converge to different local minima. Under the same initial point, we compare these methods in terms of the speed of reducing the function value toward each method's own optimum:

$$\frac{f(W^k, H^k) - f(W^*, H^*)}{f(W^1, H^1) - f(W^*, H^*)}. \qquad (19)$$

We obtain $f(W^*, H^*)$ of each method by running it until the strict stopping condition (16) holds for three consecutive iterations. Here, we set $\epsilon$ in (16) to $10^{-7}$. The subtraction of $f(W^*, H^*)$ in the denominator of (19) avoids favoring a method with a larger $f(W^*, H^*)$.

Due to the non-convexity, one method may converge faster than another with one initial point but slower with a different initial point. Therefore, we compare these methods by averaging the results of 20 runs. For each problem, we generate 20 initial $(W^1, H^1)$ by the MATLAB function randn and take the absolute value to make them feasible. We then run each method 20 times to obtain the reference $(W^*, H^*)$ points. Table 2 describes the average $f(W^*, H^*)$ of each method on each problem. We can see that the $f(W^*, H^*)$ values are slightly different, and pbbnls gives much larger values on most problems. To see the convergence of a method, at each outer iteration we average the 20 values of CPU time and relative function reduction in (19). Because the number of iterations depends on the initial point, we only report results of iterations having values in all 20 experiments. Figure 1 presents the results of CPU time versus the relative function value (19).

Clearly, pbba converges faster than the other methods on the CBCL, NAT, and ESSEX data sets. The method cbgp performs almost the same as pbba in early iterations but becomes slower in the end. This slower final convergence is because, in the line search, cbgp enforces the function-value decrease by the condition in (7). We observe some line search steps for cbgp, but throughout the experiments, the adaptive non-monotone line search in pbba always succeeds at $\lambda = 1$. For pbbnls, although it is also a projected Barzilai-Borwein method, its performance is clearly worse than cbgp and pbba. Among all the differences between pbbnls and pbba listed at the beginning of this section, the value of $\alpha_1$ affects performance the most. Although pbbnls applies a non-monotone line search procedure, many line search steps are observed. We have tried setting $\alpha_1$ in pbbnls to the Barzilai-Borwein step size at the end of the previous outer iteration. The number of line search steps is reduced and the performance is much better. See some detailed discussion in Section 4.3.

The active-set method blockpivot outperforms other methods on ORL, UMIST, and YALE data sets. While blockpivot is generally very competitive, surprisingly, it is the slowest on NAT. Later we will conduct further analysis to explain this result.

The method projgrad converges slowly on most problems. Because we use the same experimental framework for pbba and projgrad, and they differ only in the step size, this result clearly indicates the superiority of the Barzilai-Borwein method. However, from the fact that projgrad performs better than pbbnls on ORL, UMIST, and YALE, we see that implementation details are equally important.

Next, we investigate how fast these methods decrease the norm of the projected gradient. Figure 2 shows the running time versus the normalized projected-gradient norm

$$\frac{\|\nabla^P f(W^k, H^k)\|_F}{\|\nabla^P f(W^1, H^1)\|_F}.$$

Different from Figure 1, blockpivot is less competitive. On ORL, UMIST, and YALE, blockpivot is only similar to pbba, while on CBCL, NAT, and ESSEX, it is worse. Another notable difference is that pbbnls shows better final convergence than projgrad. This result is consistent with Figure 4.3 in Han et al. (2009). Therefore, different evaluation criteria (e.g., function values, gradients, or others) can easily affect the conclusions of comparisons.

In Figure 1, the results on NAT are different from the others because blockpivot is the slowest. NAT has the largest r among all problems, so we suspect that different r values may affect the running speed. We experiment with different r values on ORL by choosing r = 25, 50, 75, 100, 125, and 150. Other settings are the same as those of the previous experiments. The results of time versus function values are shown in Figure 3. As r increases, blockpivot becomes slower than pbba and cbgp. We give the following explanation. When W is fixed, blockpivot applies an active-set method to the m independent problems (15). At each inner iteration, the following m linear systems are solved:

$$(W_{:,F_j}^T W_{:,F_j})\, H_{F_j, j} = W_{:,F_j}^T V_{:,j}, \qquad j = 1, \dots, m,$$

where $F_j$ is the active set of the $j$th problem. Because problems having the same active set share the same matrix $W_{:,F}^T W_{:,F}$, blockpivot groups them together and applies only one Cholesky factorization to the matrix $W_{:,F}^T W_{:,F}$. As r increases, the chance that some of the m problems (15) have the same active set decreases. Thus, blockpivot may be less competitive when r is large.
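The following MATLAB sketch (our own illustration of the grouping idea, not the blockpivot code) shows how columns of H that share an active set can be solved with a single Cholesky factorization; the logical matrix F, whose column F(:, j) marks the variables solved for in problem (15), is assumed to be given.

function H = solve_grouped_ls(V, W, F)
% For each distinct active set, factorize W(:,free)'*W(:,free) once and
% solve all columns of H that share it (assuming full column rank).
  [r, m] = size(F);
  H = zeros(r, m);
  [patterns, ~, gid] = unique(double(F'), 'rows');  % group columns by pattern
  for p = 1:size(patterns, 1)
    free = logical(patterns(p, :)');       % variables solved in this group
    if ~any(free), continue; end
    idx = (gid == p);                       % columns sharing this active set
    R = chol(W(:, free)' * W(:, free));     % one factorization per group
    B = W(:, free)' * V(:, idx);
    H(free, idx) = R \ (R' \ B);            % two triangular solves per group
  end
end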

4.2 Text Data

NMF is useful for document clustering, so we next consider the text set RCV1 (Lewis et al., 2004). This set is an archive of manually categorized newswire stories from Reuters Ltd. We use the training set prepared by Bekkerman and Scholz (2008), where the label hierarchy is reorganized and multi-labelled instances are removed. More details as well as the data set are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

We further remove classes which have fewer than five documents. Finally, we obtain a set of 15,544 instances in 43 classes. Using r = 6, 12, 24, and 43, we randomly permute the classes and select the first r classes of documents to construct the $n \times m$ matrix V, where n is the size of the vocabulary set and m is the number of documents. This setting ensures that more documents are included as r increases. The parameter r is the number of clusters that we intend to assign documents to. Because some words never appear in the selected documents and cause zero rows in V, we remove these rows before the experiments. Table 1b describes the sizes of the selected subsets. This document set is much sparser than the image sets used in Section 4.1, so it is important to store V as a sparse matrix.


Otherwise, not only does V consume a lot of memory, but the calculation of $W^TV$ is also time consuming (see step 2 of Algorithm 5; in blockpivot, $W^TV$ is also needed).
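A small MATLAB sketch (our own; the matrix below is a random stand-in with roughly the dimensions and density of the largest RCV1 subset, not the real data) of why sparse storage matters:

% Dense storage of a 36,786 x 15,544 matrix would need more than 4 GB,
% and W'*V would cost O(n*m*r) operations.  With a sparse V, the product
% costs O(r * nnz(V)) and the memory is proportional to nnz(V).
V   = sprand(36786, 15544, 0.002);   % random sparse stand-in, ~1.1M non-zeros
W   = rand(36786, 43);
WtV = W' * V;                        % fast sparse-dense product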

Similar to the previous experiments, by setting $\epsilon$ in (16) to $10^{-7}$ and averaging 20 experimental results, the relationship between relative function values and execution time is shown in Figure 4. The two methods pbba and blockpivot are the fastest. Contrary to the image problems, the difference between pbba and cbgp is now more significant. We observe that the difference mainly comes from the strategy of choosing the Barzilai-Borwein step size between (9) and (10). More precisely, if cbgp uses the rule in step (c) of Algorithm 3 to choose between (9) and (10), then its performance becomes closer to pbba. For the other methods, the results are similar to those on the image data. In Section 4.1, we pointed out that blockpivot may become less competitive as r increases. That result is for the situation where m and n are fixed. Here, both m and n also increase along with r, so the grouping strategy of blockpivot is still effective.

4.3 The Importance of $\alpha_1$

In Section 4.1 we reported that using the Barzilai-Borwein step size at the end of the previous outer iteration as the initial $\alpha_1$ can improve the performance of pbbnls. In this section, we give detailed experimental results. Figure 5 shows the relative function value versus the running time on ORL and UMIST. All the experimental settings here are the same as those in Section 4.1. In Figure 5, to see the difference at the final iterations, the x-axis is linearly scaled. The method pbbnls1 uses the new setting for $\alpha_1$. Its performance is much better than pbbnls and is very close to pbba. We observe that not only is the number of line search steps in the first inner iteration reduced, but subsequent inner iterations also require fewer line search steps. It is interesting that the step size of the first inner iteration significantly influences the overall performance of the projected Barzilai-Borwein method.

4.4 A Comparison Between (11) and (12)

Section 2.2 discusses the use of (11) or (12) as the direction at each inner iteration. Here, we conduct some comparisons on pbba, projgrad, and cbgp.



For pbba, in our experiments, $\lambda = 1$ always succeeds in the adaptive line search, so using (11) is the same as using (12). For projgrad, using (11) leads to very slow convergence and often reaches the specified maximal number of inner iterations. For cbgp, both (11) and (12) work well. However, we observe minor differences because, contrary to pbba, $\lambda = 1$ sometimes fails and several line search steps are needed. We present a comparison in Figure 6, where cbgp and cbgp1 use (11) and (12), respectively. Among the six image problems, except on CBCL and YALE, the performances of cbgp and cbgp1 are almost the same. Thus, only the results on CBCL and YALE are shown. In Figure 6, cbgp is faster than cbgp1 on CBCL, whereas the opposite occurs on YALE. The overall difference is not significant.

In summary, from our experiments, both (11) and (12) are applicable for projected Barzilai-Borwein methods, but only (12) is useful for projected gradient methods.

5 Conclusions

This paper has comprehensively studied the projected Barzilai-Borwein method for NMF problems. We investigate some important differences between variants of projected Barzilai-Borwein methods. The experimental settings in this paper are designed to compare different methods fairly. The experimental results concretely confirm that a carefully implemented projected Barzilai-Borwein method is better than projected gradient methods. We point out several important implementation techniques. Experiments also indicate that state-of-the-art active-set methods are efficient in reducing function values, but may not be better at reducing the gradient norm.

The pbba implementation proposed in this paper is available at http://www.csie.ntu.edu.tw/~cjlin/nmf/nmf.m. It replaces our earlier code implementing a projected gradient method.


References

Jonathan Barzilai and Jonathan M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8:141–148, 1988.

Ron Bekkerman and Martin Scholz. Data weaving: Scaling up the state-of-the-art in data clustering. In Proceedings of CIKM, pages 1083–1092, 2008.

Dimitri P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transactions on Automatic Control, 21:174–184, 1976.

Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA 02178-9998, second edition, 1999.

Ernesto G. Birgin, José M. Martínez, and Marcos Raydan. Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization, 10:1196–1211, 2000.

Silvia Bonettini. Inexact block coordinate descent methods with application to non-negative matrix factorization. IMA Journal of Numerical Analysis, 2011. To appear.

Paul H. Calamai and Jorge J. Moré. Projected gradient methods for linearly constrained problems. Mathematical Programming, 39:93–116, 1987.

Yu-Hong Dai and Roger Fletcher. Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numerische Mathematik, 100:21–47, 2005.

Roger Fletcher. On the Barzilai-Borwein method. Optimization and Control with Applications, 96:235–256, 2005.

L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Operations Research Letters, 26:127–136, 2000.

Luigi Grippo and Marco Sciandrone. Nonmonotone globalization techniques for the Barzilai-Borwein gradient method. Computational Optimization and Applications, 23:143–169, 2002.

Luigi Grippo, Francesco Lampariello, and Stefano Lucidi. A nonmonotone line search technique for Newton's method. SIAM Journal on Numerical Analysis, 23:707–716, 1986.

Lixing Han, Michael Neumann, and Upendra Prasad. Alternating projected Barzilai-Borwein methods for nonnegative matrix factorization. Electronic Transactions on Numerical Analysis, pages 54–82, 2009.

Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457–1469, 2004.

Dongmin Kim, Suvrit Sra, and Inderjit S. Dhillon. Fast Newton-type methods for the least squares nonnegative matrix approximation problem. Statistical Analysis and Data Mining, pages 38–51, 2008.

Hyunsoo Kim and Haesun Park. Non-negative matrix factorization based on alternating non-negativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications, 30(2):713–730, 2008a.

Jingu Kim and Haesun Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In Eighth IEEE International Conference on Data Mining, pages 353–362. IEEE, 2008b.

Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 556–562. MIT Press, 2001.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

Chih-Jen Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 19:2756–2779, 2007. URL http://www.csie.ntu.edu.tw/~cjlin/papers/pgradnmf.pdf.

Pentti Paatero and Unto Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111–126, 1994.

Marcos Raydan. The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7:26–33, 1997.

Thomas Serafini, Gaetano Zanghirati, and Luca Zanni. Gradient projection methods for quadratic programs and applications in training support vector machines. Optimization Methods and Software, 20:353–378, 2005.

Rafal Zdunek and Andrzej Cichocki. Fast nonnegative matrix factorization algorithms using projected gradient approaches for large-scale problems. Computational Intelligence and Neuroscience, 2008:1–13, 2008.

Figure 1: Averaged relative function value (19) versus execution time for the image problems: (a) CBCL, (b) ORL, (c) NAT, (d) UMIST, (e) YALE, (f) ESSEX. Both the x-axis and y-axis are log scaled.

Figure 2: Averaged norm of projected gradients versus execution time for the image problems: (a) CBCL, (b) ORL, (c) NAT, (d) UMIST, (e) YALE, (f) ESSEX. Both the x-axis and y-axis are log scaled.

Figure 3: Averaged relative function value (19) versus execution time with different r values ((a) r = 25, (b) r = 50, (c) r = 75, (d) r = 100, (e) r = 125, (f) r = 150) for the ORL data set. Both the x-axis and y-axis are log scaled.

Figure 4: Averaged relative function value (19) versus execution time for the RCV1 text data with (a) r = 6, (b) r = 12, (c) r = 24, and (d) r = 43. Both the x-axis and y-axis are log scaled.

Figure 5: Experiments on the importance of $\alpha_1$. The method pbbnls1 uses the new setting for $\alpha_1$. We report the averaged relative function value (19) versus execution time for (a) ORL and (b) UMIST. Only the y-axis is log scaled.

Figure 6: Averaged relative function value (19) versus execution time for (a) CBCL and (b) YALE. The method cbgp uses (11) to decide the direction, whereas cbgp1 uses (12). Only the y-axis is log scaled.
