\frac{1}{\tilde{m}} \sum_{i=1}^{\tilde{m}} \frac{1}{K} \left| \tilde{O}_{i,:} \cap \tilde{O}^{\text{sorted}}_{i,:K} \right|, \qquad (7.2)
where m̃ is the number of left entities in the test set. Because computing (7.2) costs O(m̃n), which is expensive, we do not evaluate it at every point where the objective value is printed, but only once every several such points.
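As a minimal sketch of how the measure in (7.2) could be evaluated, the following NumPy function averages, over the test left entities, the fraction of the top-K predicted right entities that are also observed. The function name `top_k_overlap`, the dense `scores` matrix, and the list-of-sets format for the observed pairs are assumptions made for illustration; they are not taken from the original implementation.

```python
import numpy as np

def top_k_overlap(scores, observed, k):
    """Average over test left entities of |observed_i ∩ top-k_i| / k, as in (7.2).

    scores:   (m_test, n) array of predicted similarities (hypothetical input).
    observed: list of sets; observed[i] holds the right-entity indices that
              appear with left entity i in the test set.
    """
    m_test, _ = scores.shape
    # Indices of the k largest scores in each row. The order inside the top k
    # does not matter because only the size of the intersection is used.
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = [len(observed[i].intersection(top_k[i].tolist())) for i in range(m_test)]
    return float(np.mean(hits)) / k

# Toy data, invented for this example only.
scores = np.array([[0.9, 0.1, 0.8, 0.3],
                   [0.2, 0.7, 0.1, 0.6]])
observed = [{0, 3}, {1, 3}]
score = top_k_overlap(scores, observed, k=2)
print(score)  # 0.75: one hit out of two for row 0, two out of two for row 1
```

Forming the full `scores` matrix is what gives the O(m̃n) cost mentioned above, which is why evaluating the measure only at every few logging points is attractive.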
7.2 A Comparison on the Convergence Speed
We first investigate the relation between running time and the relative difference in objective value, (L(θ) − L∗)/L∗, where L∗ is the lowest objective value reached among all settings. From the results presented in Figure 7.1a, we observe that Newton consistently converges faster than the other methods. This observation confirms our conjecture that Newton can take advantage of both second-order information and the lower complexity per data pass. Specifically, we first focus on Newton, fBG, and GD, which have the same complexity per data pass.
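The relative difference in objective value can be computed from logged objective values as in the sketch below. The dictionary-based log format and the method names are hypothetical, the numbers are invented for illustration, and L∗ is assumed to be positive.

```python
import numpy as np

def relative_difference(histories):
    """Map each method's objective values L(theta) to (L(theta) - L*) / L*.

    histories: dict from a method name to a sequence of objective values
               (hypothetical logging format). L* is the lowest value reached
               by any method and is assumed to be positive.
    """
    l_star = min(min(values) for values in histories.values())
    return {name: (np.asarray(values, dtype=float) - l_star) / l_star
            for name, values in histories.items()}

# Invented objective values, not results from the experiments.
histories = {"method_a": [10.0, 4.0, 2.0], "method_b": [10.0, 8.0, 5.0]}
rel = relative_difference(histories)
print(rel["method_a"])  # values 4, 1, 0
print(rel["method_b"])  # values 4, 3, 1.5
```

Because L∗ is shared by all methods, the curves are directly comparable, and the setting that attains L∗ reaches exactly zero.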
[Figure 7.2: Results of Sampling and SOGram on ml1m with the extended running time. We keep Newton for reference. Axes: relative difference in objective value versus time (s). Legend: Newton, Sampling, SOGram.]
Between Newton and GD, the key difference is that Newton additionally uses second-order information to obtain each direction. The clear advantage of Newton confirms the importance of leveraging second-order information. Comparing fBG and GD, we observe that fBG is much more efficient than GD. Although both compute the same direction s = −∇L(θ) at each step, fBG leverages partial second-order information through AdaGrad's diagonal scaling. However, the effect of the diagonal scaling is limited, so fBG is slower than Newton.
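To illustrate the difference between a plain gradient step and the same gradient rescaled by AdaGrad's diagonal scaling, the following toy sketch runs both on a two-dimensional quadratic with very different curvature per coordinate. The problem, step sizes, and function names are assumptions chosen for illustration and are unrelated to the actual fBG implementation. The toy Hessian is diagonal, which is the most favorable case for a diagonal scaling.

```python
import numpy as np

def objective(theta, curvature):
    # Toy quadratic L(theta) = 0.5 * sum_j curvature_j * theta_j^2.
    return 0.5 * float(np.sum(curvature * theta ** 2))

def gradient(theta, curvature):
    return curvature * theta

def run_gd(theta0, curvature, lr, steps):
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * gradient(theta, curvature)   # step along -grad L(theta)
    return theta

def run_adagrad(theta0, curvature, lr, steps, eps=1e-8):
    theta = theta0.copy()
    accum = np.zeros_like(theta)                   # running sum of squared gradients
    for _ in range(steps):
        g = gradient(theta, curvature)
        accum += g ** 2
        theta -= lr * g / (np.sqrt(accum) + eps)   # same -grad, rescaled per coordinate
    return theta

curvature = np.array([1.0, 100.0])                 # ill-conditioned on purpose
theta0 = np.array([1.0, 1.0])
loss_gd = objective(run_gd(theta0, curvature, lr=0.01, steps=100), curvature)
loss_adagrad = objective(run_adagrad(theta0, curvature, lr=0.5, steps=100), curvature)
print(loss_gd, loss_adagrad)
```

On this toy problem the GD step size is capped by the largest curvature, so the low-curvature coordinate moves slowly, whereas the per-coordinate scaling removes that imbalance. When the curvature has significant off-diagonal structure, a diagonal scaling captures only part of it.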
On the other hand, unlike in the training of general neural networks, where SG methods dominate in efficiency, Sampling and SOGram are slower than Newton and fBG. The reason is that, although Sampling and SOGram take more steps in each data pass, Table 6.1 shows that they have much higher complexity per data pass.
Next, we present in Figure 7.1b the relation between running time and MAP@5 evaluated on the test sets. A similar trend can be observed: a method that converges faster in objective value also improves MAP@5 faster. For Sampling and SOGram, which involve subsampling, a clear gap exists between their MAP@5 scores and those of Newton and fBG. To investigate whether the gap vanishes as the running time increases, we extend the running time of Sampling and SOGram on the smallest data set, ml1m. From the results presented in Figure 7.2, we observe that the objective value of Sampling almost stalls partway, so SOGram eventually outperforms Sampling. This observation is also reported in [16], where the authors explain that the superiority of SOGram over Sampling comes from their proposed variance reduction scheme in (6.7). However, because both methods suffer from the slow convergence of SG methods, even if we greatly extend the running time of SOGram and Sampling, the gap between them and Newton does not vanish.
Chapter 8 Conclusion
In this work, we study extreme similarity learning with nonlinear embeddings. Existing optimization methods extended from SG methods suffer from slow convergence and high complexity per data pass. The competitive performance of Newton methods on training certain types of neural networks motivates us to investigate the feasibility of applying Newton methods to this problem. However, this task turns out to be challenging. In particular, a prohibitive O(mn) cost occurs if we directly apply the Newton method. To avoid the O(mn) cost, we develop an efficient algorithm and analyze its convergence. Experiments conducted on large-scale data sets show that our proposed algorithm outperforms existing methods.
Bibliography
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1–43, 2018.
[3] I. Bayer, X. He, B. Kanagal, and S. Rendle. A generic coordinate descent framework for learning from implicit feedback. In Proceedings of the 26th International Conference on World Wide Web, pages 1341–1350, 2017.
[4] C. Chen, M. Zhang, W. Ma, Y. Liu, and S. Ma. Efficient non-sampling factorization machines for optimal context-aware recommendation. In Proceedings of The Web Conference, pages 2400–2410, 2020.
[5] C. Chen, M. Zhang, Y. Zhang, Y. Liu, and S. Ma. Efficient neural matrix factorization without sampling for recommendation. ACM Transactions on Information Systems, 38(2):1–28, 2020.
[6] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[7] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[8] A. M. Elkahky, Y. Song, and X. He. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web, pages 278–288, 2015.
[9] L. Galli and C.-J. Lin. Truncated Newton methods for linear classification. Technical report, National Taiwan University, 2020.
[10] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, et al. Array programming with NumPy. Nature, 585(7825):357–362, 2020.
[11] X. He, H. Zhang, M.-Y. Kan, and T.-S. Chua. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 549–558, 2016.
[12] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(1):409–436, 1952.
[13] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 263–272, 2008.
[14] J.-T. Huang, A. Sharma, S. Sun, L. Xia, D. Zhang, P. Pronin, J. Padmanabhan, G. Ottaviano, and L. Yang. Embedding-based retrieval in Facebook search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2553–2561, 2020.
[15] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338, 2013.
[16] W. Krichene, N. Mayoraz, S. Rendle, L. Zhang, X. Yi, L. Hong, E. Chi, and J. Anderson. Efficient training on very large corpora via Gramian estimation. In International Conference on Learning Representations, 2018.
[17] C.-P. Lee, P.-W. Wang, W. Chen, and C.-J. Lin. Limited-memory common-directions method for distributed optimization and its application on empirical risk minimization. In Proceedings of SIAM International Conference on Data Mining (SDM), 2017.
[18] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear.
[19] J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
[20] R. Pan and M. Scholz. Mind the gaps: Weighting the unknown in large-scale one-class collaborative filtering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 667–676, 2009.
[21] R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang. One-class collaborative filtering. In IEEE International Conference on Data Mining (ICDM), pages 502–511, 2008.
[22] S. Rendle, W. Krichene, L. Zhang, and J. Anderson. Neural collaborative filtering vs. matrix factorization revisited. Fourteenth ACM Conference on Recommender Systems, 2020.
[23] N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.
[24] J. Townsend. A new trick for calculating Jacobian vector products, 2017.
[25] C.-C. Wang, C.-H. Huang, and C.-J. Lin. Subsampled Hessian Newton methods for supervised learning. Neural Computation, 27(8):1766–1795, 2015.
[26] C.-C. Wang, K.-L. Tan, C.-T. Chen, Y.-H. Lin, S. S. Keerthi, D. Mahajan, S. Sun-dararajan, and C.-J. Lin. Distributed Newton methods for deep learning. Neural Computation, 30(6):1673–1724, 2018.
[27] C.-C. Wang, K. L. Tan, and C.-J. Lin. Newton methods for convolutional neural networks. ACM Transactions on Intelligent Systems and Technology, 11(2):19:1–19:30, 2020.
[28] X. Wang, R. Zhang, Y. Sun, and J. Qi. Doubly robust joint learning for recommendation on data missing not at random. In International Conference on Machine Learning, pages 6638–6647, 2019.
[29] Y. Yang, S. Yuan, D. Cer, S.-y. Kong, N. Constant, P. Pilar, H. Ge, Y.-H. Sung, B. Strope, and R. Kurzweil. Learning semantic textual similarity from conversations. In Proceedings of The Third Workshop on Representation Learning for NLP. Association for Computational Linguistics, 2018.
[30] X. Yi, J. Yang, L. Hong, D. Z. Cheng, L. Heldt, A. A. Kumthekar, Z. Zhao, L. Wei, and E. Chi. Sampling-bias-corrected neural modeling for large corpus item recommendations, 2019.
[31] H.-F. Yu, M. Bilenko, and C.-J. Lin. Selection of negative samples for one-class matrix factorization. In Proceedings of SIAM International Conference on Data Mining (SDM), 2017.
[32] H.-F. Yu, H.-Y. Huang, I. S. Dhillon, and C.-J. Lin. A unified algorithm for one-class structured matrix factorization with side information. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), 2017.
[33] H.-F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the Thirty-First International Conference on Machine Learning (ICML), pages 593–601, 2014.
[34] B. Yuan, J.-Y. Hsia, M.-Y. Yang, H. Zhu, C. Chang, Z. Dong, and C.-J. Lin. Improving ad click prediction by considering non-displayed events. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), 2019.
[35] B. Yuan, M.-Y. Yang, J.-Y. Hsia, H. Zhu, Z. Liu, Z. Dong, and C.-J. Lin. One-class field-aware factorization machines for recommender systems with implicit feedbacks. Technical report, National Taiwan University, 2019.
[36] F. Yuan, X. Xin, X. He, G. Guo, W. Zhang, T.-S. Chua, and J. M. Jose. fBGD: Learning embeddings from positive unlabeled data with BGD. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
[37] K. Zhong, Z. Song, P. Jain, and I. S. Dhillon. Provable non-linear inductive matrix completion. In Advances in Neural Information Processing Systems, pages 11439–11449, 2019.