
Limited-memory Common-directions Method With Subsampled Newton Directions for Large-scale Linear Classification

Jui-Nan Yen, National Taiwan University, juinanyen@gmail.com

Chih-Jen Lin, National Taiwan University, cjlin@csie.ntu.edu.tw

Abstract—The common-directions method is an optimization method recently proposed to utilize second-order information. It is especially efficient on large-scale linear classification problems, and it is competitive with state-of-the-art optimization methods like BFGS, LBFGS, and Nesterov’s accelerated gradient method.

The main idea of the method is to minimize the local quadratic approximation within the selected subspace. Regarding the selection of the subspace, the original authors only focused on the span of current and past gradient directions. In this work, we analyze the impact of subspace selection, and point out that the lack of direction diversity can be a potential weakness for using gradients as directions. To address this problem, we propose the use of subsampled Newton directions, which always possess diversity unless they are already close to the true Newton direction. Our experiments on large-scale linear classification problems show that our proposed methods are generally better than subsampled Newton methods and the original common-directions method.

I. INTRODUCTION

The common-directions method was proposed by Wang et al. [1] as an interpolation between first- and second-order methods for regularized empirical risk minimization problems.

The main idea of the method is to minimize the local quadratic approximation within the selected subspace. Their experiments on large-scale linear classification problems show that it is competitive with state-of-the-art optimization methods like BFGS [2] and Nesterov’s accelerated gradient method [3].

The limited-memory version of the common-directions method was then developed by Lee et al. [4]. Their theoretical results show that it has global linear convergence for convex problems and converges to stationary points for non-convex problems. A similar method called the subspace Newton method was later proposed by Gower et al. [5].

Regarding the selection of the subspace, Gower et al. [5] simply use some randomly chosen vectors. On the other hand, inspired by the heavy-ball method and the BFGS method, Wang et al. [1] and Lee et al. [4] considered the span of current and past gradient directions.

In this work, we assume the loss function to be twice-differentiable, Lipschitz smooth, and strictly convex. We then analyze the impact of subspace selection, and point out that the lack of direction diversity can be a potential weakness for using gradients as directions. To address this problem, we propose the use of subsampled Newton directions [6], which always possess diversity unless they are already close to the true Newton direction. Our experiments on large-scale linear classification problems show that our proposed methods are generally better than subsampled Newton methods and the original common-directions method.

The paper is organized as follows. In Section II, we introduce the common-directions method. In Section III, we analyze the impact of subspace selection and point out that the lack of direction diversity can be a potential weakness for the original common-directions method. In Section IV, we propose to use subsampled Newton directions with the common-directions method, which does not possess the same weakness. We discuss the convergence of our proposed method in Section V.

We put some other algorithmic considerations in Section VI.

Empirical comparisons are conducted in Section VII. Finally, Section VIII concludes our work.

We put the code for our experiments and the additional experiment results at https://www.csie.ntu.edu.tw/~cjlin/papers/commdir_subsampled.

II. REVIEW OF LINEAR CLASSIFICATION AND THE COMMON-DIRECTIONS METHOD

Given a set of training instances $(y_i, x_i)$, $i = 1, \ldots, l$, where $y_i$ is a label and $x_i \in \mathbb{R}^n$ is a feature vector, a supervised learning problem can be formulated as the following regularized empirical risk minimization problem

$$\min_{w} \; f(w) \equiv \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi(w; x_i, y_i), \qquad (1)$$
where $w^T w/2$ is the L2-regularization term, $\xi$ is a loss function parametrized by a weight vector $w \in \mathbb{R}^n$, and $C > 0$ is a parameter to balance the two terms.

In this work, we assume $\xi$ to be twice-differentiable, Lipschitz smooth, and strictly convex. In particular, we consider the logistic loss
$$\xi_{\text{LR}} = \log\!\left(1 + \exp(-y\, w^T x)\right)$$
for large-scale linear classification problems, where the number of instances $l$ and/or the number of features $n$ are large.
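For concreteness, here is a minimal sketch (our own illustration, not code from the paper) of the objective (1), its gradient, and a matrix-free Hessian-vector product for the logistic loss, assuming a dense numpy data matrix `X` with rows $x_i^T$ and labels `y` in $\{-1, +1\}$:

```python
import numpy as np

def f(w, X, y, C):
    """L2-regularized logistic regression objective (1)."""
    z = y * (X @ w)                          # z_i = y_i * w^T x_i
    return 0.5 * w @ w + C * np.sum(np.logaddexp(0.0, -z))

def grad(w, X, y, C):
    """Gradient of (1): w + C * sum_i (sigma(z_i) - 1) * y_i * x_i."""
    z = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(-z))
    return w + C * (X.T @ ((sigma - 1.0) * y))

def hess_vec(w, X, y, C, v):
    """Hessian-vector product (I + C * X^T D X) v without forming the Hessian,
    where D is diagonal with D_ii = sigma_i * (1 - sigma_i)."""
    z = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(-z))
    d = sigma * (1.0 - sigma)
    return v + C * (X.T @ (d * (X @ v)))
```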

The common-directions method was proposed by Wang et al. [1] as an interpolation between first- and second-order methods for solving (1). The limited-memory version was later developed by Lee et al. [4], which is the focus of this work.


Let
$$q_k(s) \equiv \frac{1}{2} s^T H_k s + g_k^T s \approx f(w_k + s) - f(w_k)$$
be the quadratic approximation at the current iterate $w_k$, where $g_k \equiv \nabla f(w_k)$ and $H_k \equiv \nabla^2 f(w_k)$. The common-directions method first chooses a set of directions $P_k = [p_1, \ldots, p_m]$, and then computes the update direction $u_k = P_k t_k$, where $t_k$ is the solution of
$$\min_t \; q_k(P_k t) = \frac{1}{2} (P_k t)^T H_k (P_k t) + g_k^T (P_k t). \qquad (2)$$
After a suitable step size $\alpha_k$ is decided by line search, the next iterate is then computed as
$$w_{k+1} = w_k + \alpha_k u_k.$$

To solve (2), we consider its first-order condition,
$$(P_k^T H_k P_k)\, t_k + P_k^T g_k = 0. \qquad (3)$$
After we compute and store $P_k^T H_k P_k$ and $P_k^T g_k$, we can then solve the linear system (3) in $O(m^3)$ time.

To make the computation of $P_k^T H_k P_k$ more efficient, Lee et al. [4] showed that the convergence results will still hold if we replace the Hessian matrix $H_k$ with any positive definite matrix $B_k$.

Furthermore, Lee et al. [4] showed that for linear classification problems, where $\xi$ in (1) can be represented as a function of $w^T x$, we can compute $P_k^T H_k P_k$ exactly and efficiently if $P_k$ consists of a search direction $\tilde{s}_k$ and $m - 1$ past directions.

The search direction $\tilde{s}_k$ can be the subsampled Newton direction as in this work, or the gradient $g_k$ as proposed by Wang et al. [1]. For the $m - 1$ past directions, they can be past search and/or past update directions. In this work, we use the Mixed strategy proposed by Lee et al. [4], where half of the past directions are past search directions and the other half are past update directions.
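To make the subspace step concrete, the following is a minimal sketch (our own illustration, not the authors' implementation) of one common-directions iteration: it builds $P_k^T H_k P_k$ with Hessian-vector products, solves (3), and applies a backtracking line search with the Armijo condition. The helpers `f`, `grad`, and `hess_vec` are the hypothetical functions sketched above; an efficient implementation would instead exploit the structure described by Lee et al. [4].

```python
def common_directions_step(w, P, X, y, C, beta=0.5, c=0.01):
    """One subspace step: solve (P^T H P) t = -P^T g for t, then line search.
    P is an n x m matrix whose columns are the chosen directions."""
    g = grad(w, X, y, C)
    HP = np.column_stack([hess_vec(w, X, y, C, P[:, j]) for j in range(P.shape[1])])
    PtHP = P.T @ HP                       # m x m matrix P^T H P
    Ptg = P.T @ g                         # m-vector P^T g
    t = np.linalg.solve(PtHP, -Ptg)       # first-order condition (3)
    u = P @ t                             # update direction u_k = P_k t_k
    # backtracking line search with the Armijo condition (Section VII)
    alpha, fw, gu = 1.0, f(w, X, y, C), g @ u
    while f(w + alpha * u, X, y, C) > fw + c * alpha * gu:
        alpha *= beta
    return w + alpha * u
```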

III. WEAKNESS FOR USING GRADIENTS AS DIRECTIONS

In this section, we analyze the original common-directions method and point out that the lack of direction diversity can be a potential weakness for using gradients as directions. We put the proofs of the theorems in the appendix.

A. Notation

For convenience, we first define some notations which will be used in our analysis. We use

$$s_k = -H_k^{-1} g_k \qquad (4)$$
to denote the Newton direction, which is the minimizer of $q_k(s)$. Furthermore, we omit the subscript $k$ when there is no confusion; in particular,
$$g_k \to g, \quad H_k \to H, \quad s_k \to s.$$

We use the operator $\operatorname{mean}_i[\,\cdot\,]$ to denote taking the average over $i$, where $i$ belongs to a finite set.

For each iteration with $g \neq 0$, we define
$$\nu(\alpha, u) \equiv \frac{q_k(\alpha u)}{q_k(s)} \qquad (5)$$
and
$$\mu(u) \equiv \max_{\alpha} \nu(\alpha, u) = \frac{\min_{\alpha} q_k(\alpha u)}{q_k(s)} \qquad (6)$$
to measure the strength of an arbitrary direction $u$ against the Newton direction $s$. Since $q_k(0) = 0$, we always have
$$\min_{\alpha} q_k(\alpha u) \leq 0.$$
Furthermore, since $s$ is the minimizer of $q_k$, which is strongly convex, and we have $g \neq 0$, it follows that
$$q_k(s) = -\frac{1}{2} g^T H^{-1} g < 0 \quad \text{and} \quad 0 \leq \mu(u) \leq 1.$$

We generalize $\mu$ to a set of directions $P_k$ as
$$\mu(P_k) \equiv \frac{\min_t q_k(P_k t)}{q_k(s)}. \qquad (7)$$

For vectors $u, v \neq 0$ and a positive definite matrix $A$, we define
$$\tau_A(u, v) \equiv \frac{|v^T A u|}{\|A^{1/2} u\|_2 \, \|A^{1/2} v\|_2} \qquad (8)$$
to measure their similarity. Due to the Cauchy inequality, we always have
$$0 \leq \tau_A(u, v) \leq 1.$$
Furthermore, since $A$ is positive definite, we have $\tau_A(u, v) = 1$ if and only if $u = v$ up to a scale factor.

We also define $\Delta_k \equiv w^* - w_k$ and $H^* \equiv \nabla^2 f(w^*)$ to prove some convergence properties, where $w^*$ is the global minimum of $f$.

B. Interpretation for ν, µ, and τ

Wang et al. [7] prove that ν is strongly related to convergence under the following assumption.

Assumption 1: The Hessian matrix $\nabla^2 f(w)$ is Lipschitz continuous with parameter $\hat{L}$, i.e.,
$$\|\nabla^2 f(w) - \nabla^2 f(w')\|_2 \leq \hat{L}\, \|w - w'\|_2,$$
and $f$ is strongly convex.

More specifically, Lemma 9 of Wang et al. [7] indicates that for an arbitrary optimization method, if the update direction $u$ and the step size $\alpha$ satisfy $\nu(\alpha, u) > \bar{\nu}$ for every iteration, then $\Delta_k^T H^* \Delta_k$ converges linearly locally with rate $(1 - \bar{\nu})/\bar{\nu}$.

To ensure $(1 - \bar{\nu})/\bar{\nu} < 1$, one must have $\bar{\nu} > 1/2$. We improve their result and give the following theorem, which has a smaller convergence rate and only requires $\bar{\nu} > 0$.

Theorem 1: Let Assumption 1 hold and $\bar{\nu} \in (0, 1)$ be a fixed constant. If at every iteration we have $\nu(\alpha, u) \geq \bar{\nu}$, then $\Delta_k^T H^* \Delta_k$ converges linearly locally with rate $1 - \bar{\nu}$.


It is worth noticing that the purpose of this theorem is to show that ν and µ are good measures for the strength of our update directions. Our proposed method does not rely on this theorem to obtain convergence guarantees, so we do not require Assumption 1.

To connect $\mu$ and $\tau$, we give the following proposition.

Proposition 1: For a vector $u$, Hessian $H$, and Newton direction $s$, we have $\tau_H(u, s)^2 = \mu(u)$.

Therefore, a direction $u$ more similar to the Newton direction $s$ under $\tau_H$ leads to a larger $\mu(u)$, and by (6), this $u$ should be a better direction.
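As a quick numerical sanity check on Proposition 1 (our own illustration, not part of the paper), the identity $\tau_H(u, s)^2 = \mu(u)$ can be verified on a random positive definite quadratic model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)               # a positive definite Hessian
g = rng.standard_normal(n)
s = -np.linalg.solve(H, g)                # Newton direction (4)
u = rng.standard_normal(n)                # an arbitrary direction

q = lambda d: 0.5 * d @ H @ d + g @ d     # quadratic model q_k
mu_u = (-(g @ u) ** 2 / (2 * u @ H @ u)) / q(s)                    # min_alpha q(alpha*u) / q(s), i.e. (6)
tau = abs(u @ H @ s) / (np.sqrt(u @ H @ u) * np.sqrt(s @ H @ s))   # tau_H(u, s), i.e. (8)
print(mu_u, tau ** 2)                     # the two printed values agree
```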

C. Effectiveness of the Common-Directions Method

To analyze the performance of the common-directions method, we demonstrate some of its most important properties in the following theorem.

Theorem 2: Let $P_k = [p_1, \ldots, p_m]$ be $m$ linearly independent directions. We have
$$\mu(P_k) \geq \max_i \mu(p_i)$$
and
$$\mu(P_k) \geq \Gamma\, \operatorname{mean}_i[\mu(p_i)], \qquad (9)$$
where
$$\Gamma = \frac{m}{1 + (m - 1)\zeta}, \qquad (10)$$
and
$$\zeta = \max_i \operatorname{mean}_{j: j \neq i}[\tau_H(p_i, p_j)]. \qquad (11)$$

The first result simply states that the common-directions method should always perform better than any of its individual directions.

The second result gives a lower bound on the usefulness of the common-directions method. We find it hard to give a meaningful upper bound of $\mu(P_k)$ due to the following example. Let $u$ be a direction orthogonal to $g_k$. We have $\min_\alpha q_k(\alpha u) = 0$, and thus $\mu(u) = 0$. For $p_1 = u + s$ and $p_2 = u - s$, we have $\mu(p_1 - p_2) = \mu(s) = 1$, while $\mu(p_1)$ and $\mu(p_2)$ can be arbitrarily small.
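A small numerical illustration of this example (ours, reusing `H`, `g`, `s`, `q`, `n`, and `rng` from the snippet after Proposition 1; the scaling factor on $u$ is an arbitrary choice):

```python
mu = lambda d: (-(g @ d) ** 2 / (2 * d @ H @ d)) / q(s)   # mu(d) via (6)

# a direction orthogonal to g, scaled to be much larger than s
u_orth = rng.standard_normal(n)
u_orth -= (u_orth @ g) / (g @ g) * g
u_orth *= 1e3 * np.linalg.norm(s) / np.linalg.norm(u_orth)

p1, p2 = u_orth + s, u_orth - s
print(mu(p1), mu(p2))    # both close to 0: each direction alone is weak
print(mu(p1 - p2))       # equals mu(2s) = 1: together they recover the Newton direction
```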

The second result states that there are three determining factors for the lower bound:

1) the average strength of the selected directions, $\operatorname{mean}_i[\mu(p_i)]$;

2) the number of directions $m$;

3) the similarity of the selected directions, $\zeta$.

Since in Theorem 2 we assume the directions to be linearly independent, and $H$ is positive definite, we have
$$0 \leq \tau_H(p_i, p_j) < 1 \quad \text{for } j \neq i.$$
Thus, we always have
$$0 \leq \zeta < 1.$$

From (9), (10), and (11), one can see that the improvement of the common-directions method over the average strength of the directions becomes larger as the selected directions [p1, . . . , pm] become less similar to each other, resulting in a decrease in ζ.

Besides, if ζ does not change much, then Γ slowly increases as m becomes larger. In other words, when the number of directions increases, the common-directions method should perform better.

D. The Lack of Direction Diversity for Gradient Directions

From Theorem 2, we can see that for a fixed number of directions, the average strength of the selected directions and their similarity determine the performance of the common-directions method. Now we will show that in some cases, the lack of direction diversity for gradient directions can make them both weak and similar to each other, thus leading to poor performance.

Assume that at some iteration, the combination of our selected directions [p1, . . . , pm] from past gradient and update directions is weak and gives us a very small update u. Since we assume the gradient to be Lipschitz continuous, the change in the gradient will also be small after we apply our update.

Consequently, our new gradient direction g, which is also our newly added search direction, will be very close to the previous gradient direction, and thus our next update will also be small. Repeating the above process for several iterations, our selected directions will now become not only weak but also very similar to each other. From Theorem 2, we can see that this will lead to a poor performance.

IV. BENEFIT OF SUBSAMPLED NEWTON DIRECTIONS

In this section, we introduce subsampled Newton directions and show that they cannot be both weak and similar to each other. Thus, we believe that subsampled Newton directions are better than gradient directions when used in the common-directions method.

A. Subsampled Newton Directions

From (4), one can see that the computation of the Newton direction requires the use of the full Hessian $H_k$. One can instead use the subsampled Hessian [6]
$$\tilde{H}_k \equiv I + \frac{C\, l}{|S_k|} \sum_{i \in S_k} \nabla^2 \xi(w; x_i, y_i)$$
to approximate the true Hessian, where $S_k \subseteq \{1, \ldots, l\}$ is a training subset. We can then derive the subsampled Newton direction $\tilde{s}_k$ by minimizing the subsampled quadratic approximation
$$\tilde{q}_k(s) = \frac{1}{2} s^T \tilde{H}_k s + g_k^T s. \qquad (12)$$
For large-scale problems, $-\tilde{H}_k^{-1} g_k$, the exact minimizer of (12), could be too expensive to compute. Furthermore, the subsampled Hessian matrix $\tilde{H}_k \in \mathbb{R}^{n \times n}$ may be too large to be stored. Thus, we use the conjugate gradient (CG) method instead to approximately minimize (12).

The conjugate gradient method is an iterative process which involves a sequence of Hessian-vector products. Past works such as Keerthi et al. [8] and Lin et al. [9] have shown that for linear classification problems, the special form of the Hessian allows us to conduct Hessian-vector products without explicitly forming the matrix.

Similarly, we can conduct the conjugate gradient method to minimize (12) without forming the subsampled Hessian, as it shares a similar form with the full Hessian matrix.
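The following is a minimal sketch (our own illustration, not LIBLINEAR's implementation) of computing a subsampled Newton direction by running a fixed number of CG iterations on (12) with matrix-free products of the subsampled Hessian; `grad` is the hypothetical helper from Section II, and `S` is a list of row indices forming the training subset $S_k$:

```python
def subsampled_newton_direction(w, X, y, C, S, max_cg=20, tol=1e-3):
    """Approximately solve tilde(H)_k d = -g_k by conjugate gradient."""
    l = X.shape[0]
    Xs, ys = X[S], y[S]
    scale = C * l / len(S)                   # the C*l/|S_k| factor in the subsampled Hessian

    def Hs_vec(v):                           # (I + scale * Xs^T D Xs) v, with D_ii = sigma_i*(1 - sigma_i)
        z = ys * (Xs @ w)
        sigma = 1.0 / (1.0 + np.exp(-z))
        diag = sigma * (1.0 - sigma)
        return v + scale * (Xs.T @ (diag * (Xs @ v)))

    g = grad(w, X, y, C)
    d, r = np.zeros_like(w), -g.copy()       # CG for Hs d = -g, starting from d = 0
    p, rs = r.copy(), r @ r
    for _ in range(max_cg):                  # limited number of CG steps, as in Byrd et al. [6]
        Hp = Hs_vec(p)
        alpha = rs / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(g):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```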

When $S_k$ is chosen uniformly and all the training samples $(y_i, x_i)$ are from the same distribution, we have
$$E[\tilde{H}_k] = H_k.$$
However, one should notice that
$$E[-\tilde{H}_k^{-1} g_k] \neq -H_k^{-1} g_k,$$
which means the subsampled Newton direction is not an unbiased estimator of the Newton direction.

Our proposal is to use subsampled Newton directions in the common-directions method. Just as the gradient descent method is a special case of the original common-directions method, the subsampled Newton method [6] is a special case of our proposed method, where the number of directions used is one. Another special case of our proposed method is the work of Wang et al. [10], where the current subsampled Newton direction is combined with the previous update direction $u_{k-1}$ to produce the current update direction $u_k$.

B. Relation Between Strength and Similarity

To show that subsampled Newton directions cannot be both weak and similar to each other, we consider the case where g barely changes, as intuitively subsampled Newton directions will be very different when the gradient g changes a lot.

The intuition behind the use of subsampled Newton directions is that even though they are not unbiased estimators, they should still be very close to the Newton direction if multiple of them are close to each other, and the Newton direction is the strongest direction in terms of µ.

To show this, we prove the following theorem.

Theorem 3: Given subsampled Hessians $\bar{H}_i$ and $\bar{H}_j$, and subsampled Newton directions $\bar{s}_i = -\bar{H}_i^{-1} g$ and $\bar{s}_j = -\bar{H}_j^{-1} g$, we have
$$\tau_{\bar{H}}(\bar{s}_i, \bar{s}_j) \leq \min\{\tau_{\bar{H}}(\bar{s}_i, \bar{s}), \tau_{\bar{H}}(\bar{s}_j, \bar{s})\} \qquad (13)$$
for all $\bar{s} = -\bar{H}^{-1} g$ with $\bar{H} = \beta \bar{H}_i + (1 - \beta) \bar{H}_j$, $0 \leq \beta \leq 1$.

This theorem states that $\bar{s}_i$ and $\bar{s}_j$ are both closer to $\bar{s}$ than to each other. That is to say, when $\bar{s}_i \approx \bar{s}_j$, we have
$$\bar{s}_i \approx \bar{s}_j \approx \bar{s}.$$
This implies that for multiple subsampled Newton directions, where $\operatorname{mean}_i[\bar{H}_i] \approx H$, we should have
$$\bar{s}_i \approx s \quad \text{if } \bar{s}_i \approx \bar{s}_j \text{ for every } i, j.$$

This means the subsampled Newton directions should be strong whenever they are similar to each other. Therefore, they do not possess the same weakness as the gradient directions.

V. CONVERGENCE

To apply results in [4], we need some conditions.

Assumption 2: The objective f is Lipschitz smooth and strongly convex.

Assumption 3: For all k, at least one of the directions in $P_k$ is a sufficient descent direction; see the explanation below.

Since we assume ξ to be Lipschitz smooth and strictly convex, and we adopt regularization, Assumption 2 holds.

Additionally, the subsampled Newton direction $\tilde{s}_k$ is always a sufficient descent direction. That is, for all $k$, we have
$$\frac{-g_k^T \tilde{s}_k}{\|g_k\|_2 \, \|\tilde{s}_k\|_2} = \frac{g_k^T \tilde{H}_k^{-1} g_k}{\|g_k\|_2 \, \|\tilde{H}_k^{-1} g_k\|_2} \geq \delta > 0, \qquad (14)$$
where $\delta$ is a fixed constant. Because by our design $\tilde{s}_k$ is included in $P_k$, Assumption 3 also holds.

Furthermore, we adopt the backtracking line search, so the following theorem holds.

Theorem 4 (Lee et al. [4], Theorem 3.2): If Assumption 2 and Assumption 3 hold, and we use the solution of the common-directions method as the update direction $u_k$ and adopt the backtracking line search, then the function value converges linearly.

This ensures our proposed method has global linear convergence.

VI. OTHER ALGORITHMIC CONSIDERATIONS

To determine the maximum number of directions m, we propose the following heuristic: we select m such that the extra cost induced by the common-directions method is O(#nnz), where #nnz is the number of non-zero elements in the data set. Since the costs to compute the gradient and the function value are both Θ(#nnz), this makes our computational cost comparable to a single iteration of most optimization methods.

The extra cost for the common-directions method is $O(m^3 + m^2 l + mn)$ time and $O(m^2 + ml + mn)$ space. As a result, we propose to choose
$$m = O\!\left(\sqrt{\#\text{nnz}/l}\right).$$

Under the assumption that $n = O(l)$, this makes both the extra time and space $O(\#\text{nnz})$. In our experiments, we pick
$$m = \begin{cases} \hat{m} & \text{if } \hat{m} \text{ is odd}, \\ \hat{m} + 1 & \text{otherwise}, \end{cases}$$
where $\hat{m} = \lfloor \sqrt{\#\text{nnz}/l} \rfloor$. In other words, we choose the closest odd number to $\sqrt{\#\text{nnz}/l}$, as the Mixed strategy mentioned in Section II requires the number of directions to be odd.
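In code, this heuristic amounts to the following (a sketch of our choice, with hypothetical argument names):

```python
import math

def choose_num_directions(nnz, l):
    """Section VI heuristic: m ~ sqrt(#nnz / l), rounded up to an odd number
    so that the Mixed strategy can split the past directions evenly."""
    m_hat = int(math.floor(math.sqrt(nnz / l)))
    return m_hat if m_hat % 2 == 1 else m_hat + 1
```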

VII. EXPERIMENTS

The binary classification data sets we used are listed in the supplementary material. All data sets except yahookr can be downloaded from the publicly available LIBSVM Data Sets (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets). We modify the publicly available software LIBLINEAR [11] to compute subsampled Newton directions and incorporate the use of the common-directions method. To decide the step size, we use the backtracking line search method with the Armijo condition. That is to say, given $c, \beta \in (0, 1)$, we find the smallest nonnegative integer $i$ such that the step size $\alpha_k = \beta^i$ satisfies
$$f(w_k + \alpha_k u_k) \leq f(w_k) + c\, \alpha_k\, g_k^T u_k.$$
In our experiments, we use $c = 0.01$ and $\beta = 0.5$.

To compute the subsampled Newton directions, we first shuffle and partition the data set into fixed training subsets.

We then use these subsets in a cyclic manner to form the subsampled Hessian matrices. This improves data locality when computing the subsampled Hessian-vector products. We follow Byrd et al. [6] to limit the number of CG steps (#CG) in each iteration. For simplicity, we do not consider other more complex inner stopping conditions for the CG procedure.

We conduct a detailed investigation by checking the relationship between the running time and the relative function-value reduction $(f(w_k) - f(w^*))/f(w^*)$, where $w^*$ is obtained by LIBLINEAR under a very strict stopping condition. LIBLINEAR uses the following stopping condition,
$$\|\nabla f(w_k)\|_2 \leq \epsilon\, \frac{\min(\#\text{pos}, \#\text{neg})}{l}\, \|\nabla f(w_0)\|_2, \qquad (15)$$
where $l$ is the total number of instances, #pos and #neg are the numbers of positive and negative instances, $w_0$ is the weight initialization, which is $0$ in our setting, and $\epsilon$ is the specified tolerance. Horizontal lines in our figures show when (15) with tolerances $10^{-1}$, $10^{-2}$ (default), and $10^{-3}$ (the bottom of the figure) are reached by LIBLINEAR; such information indicates when the training algorithm should stop.
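For reference, a small sketch (ours) of the two quantities used in this evaluation; `f` and `grad` are the hypothetical helpers from Section II, and `f_star` denotes $f(w^*)$:

```python
def relative_reduction(w_k, f_star, X, y, C):
    """Relative function-value reduction (f(w_k) - f(w*)) / f(w*)."""
    return (f(w_k, X, y, C) - f_star) / f_star

def liblinear_stop(w_k, w_0, X, y, C, eps=1e-2):
    """Stopping condition (15) with tolerance eps (10^-2 is the default)."""
    n_pos, n_neg = np.sum(y > 0), np.sum(y < 0)
    thresh = eps * min(n_pos, n_neg) / len(y) * np.linalg.norm(grad(w_0, X, y, C))
    return np.linalg.norm(grad(w_k, X, y, C)) <= thresh
```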

Regarding the regularization parameter $C$, we consider $C = C_{\text{Best}} \times \{1, 64\}$, where $C_{\text{Best}}$ for each data set is the value leading to the best cross-validation accuracy. We only show the figures for $C = C_{\text{Best}}$ due to the space limit.

A. Comparison With Other Methods

In this section, we compare our proposed method with other related optimization methods. Specifically, we compare

SubNewtonMixed: The Mixed strategy under the common-directions framework with subsampled Newton directions as search directions.

SubNewton: Subsampled Newton methods without the common-directions framework.

GradientMixed: The Mixed strategy under the common-directions framework with gradients as search directions, as proposed by Lee et al. [4].

Newton: The preconditioned full Newton solver [12] in LIBLINEAR.

Subsampled Newton directions are computed using 5% of the training data and the number of CG steps is set to be 20.

Due to the space limit, here we do not show the results of some optimization methods which seem to be less competitive in past comparisons. For the comparison between the original common-directions method and LBFGS [13], one can see the work of Lee et al. [4]. For the comparison between Newton and first-order methods like SAG [14] and SAGA [15], one can see the work of Galli et al. [12].

From Figure I, we can see that for $C = C_{\text{Best}}$, SubNewtonMixed is in general better than SubNewton and GradientMixed. The only exception is news20, which is a smaller data set. The results are similar for $C = 64\,C_{\text{Best}}$. This demonstrates the effectiveness of our proposed method.

From Figure I, we can also see that for $C = C_{\text{Best}}$, SubNewtonMixed performs better than Newton. However, we observe that for $C = 64\,C_{\text{Best}}$, SubNewtonMixed could perform slightly worse than Newton on sparse data sets like kdda and kddb. This is because when the data set is sparse and the choice of $C$ is large, the problem is more ill-conditioned and the strength of subsampled Newton directions is weaker.

The full Newton method can be useful in such cases.

To conclude, our proposed method is an improvement upon the original common-directions method. While it can be slower than Newton under specific settings, its overall performance is competitive across sparse and dense data sets and different choices of C.

VIII. CONCLUSIONS

In this work, we analyze the impact of subspace selection for the common-directions method, and we point out that the lack of direction diversity can be a potential weakness for using gradients as directions. To address this problem, we propose the use of subsampled Newton directions, which always possess diversity unless they are already close to the true Newton direction. Our experiments on large-scale linear classification problems show that our proposed methods are generally better than the original common-directions method.

REFERENCES

[1] P.-W. Wang, C.-P. Lee, and C.-J. Lin. The common-directions method for regularized empirical risk minimization. JMLR, 20:1–49, 2019.

[2] J. E. Dennis Jr and J. J. Moré. Quasi-Newton methods, motivation and theory. SIAM Review, 19(1):46–89, 1977.

[3] Y. E. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27:372–376, 1983.

[4] C.-P. Lee, P.-W. Wang, and C.-J. Lin. Limited-memory common-directions method for large-scale optimization: convergence, parallelization, and distributed optimization, 2022. Under minor revision for Mathematical Programming Computation.

[5] R. Gower, D. Kovalev, F. Lieder, and P. Richtárik. RSN: randomized subspace Newton. In NIPS, 2019.

[6] R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim., 21(3):977–995, 2011.

[7] S. Wang, F. Roosta-Khorasani, P. Xu, and M. W. Mahoney. GIANT: globally improved approximate Newton method for distributed optimization. In NIPS, 2018.

[Fig. I: Training time of logistic regression with $C = C_{\text{Best}}$. Panels: (a) epsilon normalized, (b) HIGGS, (c) rcv1 test, (d) news20, (e) webspam trigram, (f) yahoojp, (g) yahookr, (h) url combined, (i) avazu-site, (j) kdda, (k) kddb, (l) kdd12.]

[8] S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. JMLR, 6:341–361, 2005.

[9] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. JMLR, 9:627–650, 2008.

[10] C.-C. Wang, C.-H. Huang, and C.-J. Lin. Subsampled Hessian Newton methods for supervised learning. Neural Comput., 27:1766–1795, 2015.

[11] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. JMLR, 9:1871–1874, 2008.

[12] L. Galli and C.-J. Lin. Truncated Newton methods for linear classification. IEEE TNNLS, 2021. To appear.

[13] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Program., 45(1):503–528, 1989.

[14] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., 162(1-2):83–112, 2017.

[15] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.
