All combined, we summarize the overall procedures in Algorithm 4. The complexity of each Newton iteration is
O(#CG × (|O|k + (m + n)k2+ 2mF (f ) + 2nF (g)) +(#LS + 1)× (|O|k + (m + n)k2+ mF (f ) + nF (g))),
(4.28)
where the term ‘1’ in (#LS + 1) comes from the cost of∇L(θ), and #CG and #LS are numbers of CG steps and line search iterations, respectively.
So far we used F (f ) and F (g) to represent the cost of operations on nonlinear em-beddings. Here we discuss more details if automatic differentiation is employed for neu-ral networks. This setting is useful for trying different network architectures. To be-gin, transposed Jacobian-vector products are used in both the gradient evaluation and the Gauss-Newton-vector product. In fact, (4.19) and (4.27) are almost in the same form.
It is known that for such an operation, the reverse-mode automatic differentiation (i.e., back-propagation) should be used so that only one pass of the network is needed [2]. For
feedforward networks, including extensions such as convolutional networks, the cost of a backward process is within a constant factor of the forward process for computing the ob-jective value (e.g., Section 5.2 of [27]). The remaining major operation to be discussed is the Jacobian-vector product in (4.22) in each CG step. This operation can be implemented with the forward-mode automatic differentiation in just one pass [2], and the cost is also within a constant factor of the cost of forward objective evaluation.1 We conclude that each CG step takes a similar cost to that of one objective/gradient evaluation.
Neural networks are routinely trained on GPU instead of CPU for better efficiency, but the smaller memory of GPU causes difficulties for larger problems or network archi-tectures. This situation is even more serious for our case, because some other matrices besides the two networks are involved (see (4.19) and (4.27)). We develop some tech-niques to make large-scale training feasible.
In evaluating L(θ), the required P , Q, ˆPc, ˆQc, Pc, Qc, and Frobenius inner products can be directly computed by the forward and other built-in functions. However, when we use GPUs for large data sets, we face two difficulties. First, as the memory consumption of a forward process is proportional to the number of data, calculating the whole P and Q at once may be infeasible for GPUs. Second, for later computations, we need to cache P and Q inO((m + n)k) space, which may also be infeasible.
For the first difficulty, fortunately, as the computation of each row of P and Q is inde-pendent of others, we follow [26] to have a mini-batch setting. Specifically, by splitting U and V into multiple subsets, we sequentially calculate rows of P or Q corresponding to indices in each subset. For ˜Pc, ˜Qc, ˆPc, ˆQc, Pc, and Qc, each of which is the sum of m or n matrices of size k× k, we can calculate the partial results on each subset and accumulate them for the final output.
To overcome the second difficulty, we apply a heterogeneous computing scheme by storing P and Q in CPU, which has larger memory capacity than GPU. After finishing the above computation in GPU on each subset, we move the resulting partial P and Q to the CPU memory. Some subsequent sparse operations such as L+(θ) in (4.2) are
con-1See, for example, the discussion at https://www.csie.ntu.edu.tw/~cjlin/courses/optdl2020/slides/
newton_gauss_newton.pdf.
ducted in CPU. As the advantage of GPU over CPU is more significant on dense opera-tions, our setting of having sparse operations on CPU is appropriate. On the contrary, for P˜c, ˜Qc, ˆPc, ˆQc, Pc, and Qc, we store these k× k matrices in the GPU memory. Then we compute L−(θ) involving dense operations of Frobenius inner products on GPU.
To compute ∇L(θ) in (4.19), similar to L+(θ), XQ and X⊤P are computed on CPUs. Then by applying the mini-batch setting, for each subset, we feed the corresponding part of XQ, X⊤P , A, B, P , Q, ˜P and ˜Q into GPUs. With the cached ˆPc, ˆQc, Pc, and Qc, we first conduct dense operations to compute XQ + ω(AP Qc− A ˜P ˆQc) and X⊤P + ω(BQPc−B ˜Q ˆPc) on GPU. Then we call the automatic differentiation function (e.g., tf.gradients in TensorFlow) to finish the transposed Jacobian-vector products.
To compute Gd in (4.27), as we mentioned, the Jacobian-vector products can be im-plemented with the forward-mode automatic differentiation. However, for compatibility with older versions of libraries where the forward-mode automatic differentiation is not supported, we apply the trick provided in [24] to compute W and H with the reverse-mode automatic differentiation. Similar to P and Q, we apply the mini-batch setting and cache W and H in the CPU memory. Then with the cached W , H, P , and Q, we compute ZQ and Z⊤P on CPUs. Next, similar to ∇L(θ), by the mini-batch setting, for each subset, we feed the required partial matrices to compute AW Qc+ AP Hcand BHPc+ BQWcin GPU. Finally, we call the automatic differentiation function to finish the remaining transposed Jacobian-vector products.
Chapter 5
Convergence Analysis
In this chapter, we show that a convergence rate guarantee to a stationary point can be provided for our algorithm.
For an empirical risk minimization such as (2.4), [17] developed a framework that provides global convergence results for the common-directions methods described in As-sumption 5.1.
Assumption 5.1. In each iteration, a common-directions algorithm computes a direction
s by solving
min
t∈RC ∇L(θ)⊤s + 1
2s⊤Hsˆ s.t. s = XC
j=1
tjdj (5.1)
and updates θ ← θ + δs, where ∇L(θ) ∈ {d1, . . . , dC} and δ is a step size obtained by backtracking line search. Here ˆH is any positive definite matrix such that there exist M1 ≥ M2 > 0 satisfying
M1I ⪰ ˆH ⪰ M2I. (5.2)
This framework can be applied for any algorithm that forms the search direction s with the directions{d1, . . . , dC}.
In Theorem 3.2 of [17], they show the following theorem.
Theorem 5.2. If an algorithm satisfies Assumption 5.1 and uses the solutions to (5.1)
as the updated directions, then the minimum gradient norm of the iterates{θ , . . . , θ }
vanishes as the number of iterations K increases:
0min≤i≤K∥∇L(θi)∥ = O(1/√
K), (5.3)
and thus∇L(θK) also converges to zero as K approaches infinity.
We next show Algorithm 4 satisfies Assumption 5.1.
Lemma 5.3. The Gauss-Newton approximation G in (3.4) satisfies (5.2).
Proof. Given that Ψ is strongly convex, there exists c > 0 such that ∇2Ψ(θ) ⪰ cI.
Therefore, G in (3.4) is lower-bounded by M2 = λc > 0.
On the other hand, because ℓij in (4.1) is non-negative and a descent method is used, given any initial point θ0, the sequence of points generated by Algorithm 3 is confined in the compact level set
{θ | λΨ(θ) ≤ λΨ(θ0) + Xm
i=1
Xn j=1
ℓij(Yij, y(θ0; ui, vj))}. (5.4)
Therefore, as a continuous function of θ,∥∇2L(θ)∥ is upper-bounded in this compact set, which is equivalent to that G in (3.4) is upper-bounded by some finite value M1.
Therefore, G in (3.4) satisfies (5.2).
In [9], they point that the number of directions C in Assumption 5.1 is not required to be fixed. Instead, Theorem 5.2 holds as long as C in every iteration has a common upper bound. Then they show that because the number of CG steps in Algorithm 3 is bounded by the number of parameters D, Algorithm 3 can be treated as a common-directions method in (5.1) by the following lemma.
Lemma 5.4. Let s be the direction returned by Algorithm 3 and assume that it terminates
after C iterations. Then s =PC
j=1tjdj, where t1, . . . , tC and d1, . . . , dC are iterates in Step 8 and Step 13 of Algorithm 3 respectively, and t minimizes (5.1).
With Lemma 5.3-5.4, Assumption 5.1 is satisfied. Therefore Theorem 5.2 guarantees the convergence rate of (5.3) for Algorithm 4.
Chapter 6
Related Work
In this chapter, after a brief review on related works, we discuss two optimization methods that will be included in our experiments.
As we mentioned in Chapter 1, existing works of extreme similarity learning can be broadly classified into two groups respectively considering linear and nonlinear embed-ding models.
For works considering linear embedding models, the main line is to incorporate more linear models into extreme similarity learning, which can range from simple matrix fac-torization [20, 11, 31] to complex models, e.g., matrix facfac-torization with side information and (field-aware) factorization machines [3, 32, 35]. As we mentioned in Chapter 1, by taking advantage of the multi-block convexity of linear models, they propose to solve the problem by alternating least squares [20], (block) coordinate descent [11, 31, 3, 35], and alternating Newton method [32, 34].
On the other hand, for extreme similarity learning with nonlinear embedding models, because the problem is no longer multi-block convex, applying the above algorithms to such complex nonlinear models as neural networks is nontrivial. Instead, existing training algorithms that can be applied for nonlinear embedding models are extended from SG methods. In the subsequent content, we give a brief review on the sampling method and stochastic online Gramian method [16].
6.1 Sampling Method
The naive way to apply SG methods for minimizing (4.2) is to subsample a small subsetB of mn pairs, and then yield a direction s by estimating the gradient onB in O(|B|(F (f) + F (g))) time. However, a data pass iterating all mn pairs requiresO(mn(F (f) + F (g))) time. The failure of SG methods caused by theO(mn) cost has been reported in [31].
To alleviate the issue, an alternative way, referred to as the sampling method, has been proposed in [36, 5, 4, 16]. Though in [36, 5, 4], the method is only applied for linear embeddings, it has been extended for nonlinear embeddings in [16]. The main idea of the sampling method is to subsample ˆU ⊆ U and ˆV ⊆ V in each step respectively and let B = ˆU × ˆV. If ˆm = |ˆU| and ˆn = |ˆV|, by slightly modifying the derivation in (4.19), we can reduce theO( ˆmˆn(F (f ) + F (g))) computation toO( ˆmF (f ) + ˆnF (g)). The gradient onB can be calculated in O(| ˆO|k + ( ˆm + ˆn)k2+ ˆmF (f ) + ˆnF (g)), where ˆO is the subset of observed pairs falling in ˆU × ˆV. Then its complexity of each data pass including mnmˆˆn steps is
Though the O(mn) cost is not totally removed, applying large sampling sizes ˆm and ˆn can reduce the complexity. A special case is the full-batch gradient method proposed by [36]. By setting ˆV = V and ˆU = U, this method applies the exact gradient ∇L(θ) for each step, which achieves the optimal complexity per data pass.