IV. Our Approaches
4.4 Implementation Issues
FPSG uses the standard thread class in C++, implemented with pthread, for parallelization. For the Yahoo!Music data set of about 250M ratings, on a typical machine (details specified in Chapter 5.1), FPSG finishes processing all ratings once in 6 seconds and takes only about 8 minutes to converge to a reasonable RMSE. Here we describe some techniques employed in our implementation. First, we find empirically that single-precision floating-point computation does not suffer from numerical error accumulation. For the Netflix data set, using single precision runs 1.1 times faster than using double precision. Second, modern CPUs provide SSE instructions that can run several floating-point multiplications and additions concurrently. We apply SSE instructions to vector inner products and additions. For the Yahoo!Music data set, the speedup is 2.4 times. Figure 4.8¹ shows the speedup after these techniques are applied on two data sets.
¹ For experimental settings, see Chapter 5.1.
Figure 4.6: A comparison between the ordered method, the random method, and the partial random method on the set Yahoo!Music. One thread is used.
[Figure 4.7 panels: Rating Matrix; After the random shuffle; Sort row indices in each block]
Figure 4.7: An illustration of the partial random method. After the random shuffle of data, some indices (in red color) are not ordered within each block. We make the row indices ordered (in blue color) by a sorting procedure.
Figure 4.8: A comparison between two implementations of FPSG on Netflix and Yahoo!Music.
Algorithm 5 Scheduler of FPSG
procedure get_job
    Initialize an empty list b_list and ut_min = ∞        ▷ ut: number of updates
    for all b in B do
        if b is non-free then
            continue
        else
            if b.ut == ut_min then
                Add b into b_list
            else if b.ut < ut_min then
                ut_min = b.ut
                Make b_list empty and add b into b_list
            end if
        end if
    end for
    Randomly select an element denoted by b_x from b_list
    return b_x
end procedure
procedure put_job(b)
    b.ut = b.ut + 1
end procedure
Algorithm 6 Working thread of FPSG
    while true do
        get a block b from scheduler→get_job()
        process the elements of this block in order
        scheduler→put_job(b)
    end while
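The scheduler and working thread above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the FPSG implementation (which is written in C++). The names `Block`, `Scheduler`, and `run_demo` are hypothetical. We assume that a block is "free" when no block currently being processed shares its block row or block column, and we guard the scheduler with a single lock. The worker loop is bounded so that the example terminates.

```python
import math
import random
import threading


class Block:
    """One block of the gridded rating matrix (hypothetical container)."""

    def __init__(self, row, col):
        self.row, self.col = row, col  # block-row and block-column index
        self.ut = 0                    # number of updates applied so far


class Scheduler:
    def __init__(self, n_row_blocks, n_col_blocks, seed=0):
        self.blocks = [Block(i, j)
                       for i in range(n_row_blocks)
                       for j in range(n_col_blocks)]
        self.busy_rows, self.busy_cols = set(), set()
        self.lock = threading.Lock()
        self.rng = random.Random(seed)

    def get_job(self):
        """Return a random free block among those with the fewest updates."""
        with self.lock:
            candidates, ut_min = [], math.inf
            for b in self.blocks:
                # Assumption: "non-free" means sharing a row or column
                # with a block that another thread is processing.
                if b.row in self.busy_rows or b.col in self.busy_cols:
                    continue
                if b.ut == ut_min:
                    candidates.append(b)
                elif b.ut < ut_min:
                    ut_min = b.ut
                    candidates = [b]
            if not candidates:
                return None
            bx = self.rng.choice(candidates)
            self.busy_rows.add(bx.row)
            self.busy_cols.add(bx.col)
            return bx

    def put_job(self, b):
        """Record one more update of b and release its row and column."""
        with self.lock:
            b.ut += 1
            self.busy_rows.discard(b.row)
            self.busy_cols.discard(b.col)


def run_demo(n_threads=3, grid=4, jobs_per_thread=40):
    """Run bounded worker threads; return (total updates, conflicts seen)."""
    sched = Scheduler(grid, grid)
    active, conflicts = set(), [0]
    guard = threading.Lock()

    def process(b):
        with guard:
            if any(r == b.row or c == b.col for r, c in active):
                conflicts[0] += 1
            active.add((b.row, b.col))
        # The SG updates over the ratings of block b would go here.
        with guard:
            active.remove((b.row, b.col))

    def worker():
        done = 0
        while done < jobs_per_thread:
            b = sched.get_job()
            if b is None:      # no free block right now; try again
                continue
            process(b)
            sched.put_job(b)
            done += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(b.ut for b in sched.blocks), conflicts[0]
```

With a 4 × 4 grid and 3 threads, at most two block rows and two block columns are busy whenever a thread asks for a job, so a free block always exists and no thread has to wait for a specific block.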
CHAPTER V
Experiments
In this chapter, we provide the details about our experimental settings, and compare FPSG with other parallel matrix factorization algorithms mentioned in Chapter II.
5.1 Settings
Data Sets: Four data sets, MovieLens,¹ Netflix, Yahoo!Music, and Hugewiki,² are used for the experiments. For reproducibility, we use the original training/test sets in our experiments if they are available (for MovieLens, we use Part B of the original data set generated by the official script). Because the test set of Yahoo!Music is not available, we use the last four ratings of each user for testing and the remaining ratings for training. The data set Hugewiki is too large to fit in our machines, so we randomly sample one quarter of the data and split the sample into training/test sets. The statistics of each data set are listed in Table 5.1.
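The Yahoo!Music split described above can be sketched as follows. This is a minimal sketch: the function name `split_last_k` is hypothetical, and we assume that "last" refers to the order in which each user's ratings appear in the input, which the text does not specify.

```python
from collections import defaultdict


def split_last_k(ratings, k=4):
    """Split (user, item, rating) triples into training and test lists.

    The last k ratings of each user (in input order, an assumption of
    this sketch) go to the test set; the rest go to the training set.
    A user with at most k ratings contributes only to the test set.
    """
    by_user = defaultdict(list)
    for triple in ratings:
        by_user[triple[0]].append(triple)

    train, test = [], []
    for user_ratings in by_user.values():
        train.extend(user_ratings[:-k])
        test.extend(user_ratings[-k:])
    return train, test
```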
Platform: We use a server with two Intel Xeon E5-2620 2.0GHz processors and 64 GB memory. There are six cores in each processor.
Parameters: Table 5.1 lists the parameters used for each data set. The parameters k, λ_P, and λ_Q may be chosen by a validation procedure, although here we mainly borrow values from earlier works to obtain comparable results. For Netflix and Yahoo!Music, we use the parameters in (Yu et al., 2012); see the values listed in Table 5.1. Although (Yu et al., 2012) have considered MovieLens, we use a different setting of λ_P = λ_Q = 0.05 for a better RMSE. For Hugewiki, we consider the same parameters as in (Yun et al., 2014). The initial values of P and Q are chosen randomly under a uniform distribution. This setting is the same as that in (Yu et al., 2012). The learning rate γ is determined by an ad hoc parameter selection. Because we focus on the running speed rather than RMSE in this thesis, we do not apply an adaptive learning rate.

Data Set      MovieLens      Netflix    Yahoo!Music       Hugewiki
m                71,567    2,649,429      1,000,990         39,706
n                65,133       17,770        624,961     25,034,863
#Training     9,301,274   99,072,112    252,800,275    761,429,411
#Test           698,780    1,408,395      4,003,960    100,000,000
k                    40           40            100            100
λ_P                0.05         0.05              1           0.01
λ_Q                0.05         0.05              1           0.01
γ                 0.003        0.002         0.0001          0.004
Table 5.1: The statistics and parameters for each data set. Note that the Hugewiki set used here contains only one quarter of the original set.
On our platform, 12 physical cores are available, so we use 12 threads in all experiments. For FPSG, even though Chapter IV shows that (s + 1) × (s + 1) blocks are already enough for s threads, we use more blocks to ensure the randomness of the blocks that are processed simultaneously. For Netflix, Yahoo!Music, and Hugewiki, R is gridded into 32 × 32 blocks; for MovieLens, R is gridded into 16 × 16 blocks because the number of non-zeros is smaller.
Evaluation: As in most work on recommender systems, the evaluation metric is the RMSE on the test set, which is disjoint from the training set; see Eq. (1.4). In addition, the time in each figure refers to the training time.
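For concreteness, the test RMSE can be computed as in the following sketch. Eq. (1.4) is not reproduced in this chapter, so we assume the usual definition, the square root of the mean squared difference between each test rating and the inner product of the corresponding latent vectors. The function name and the data layout (indexable containers of length-k vectors) are assumptions of this sketch.

```python
import math


def rmse(test_ratings, P, Q):
    """Root mean squared error on a test set.

    test_ratings: iterable of (u, v, r) triples.
    P[u], Q[v]:   length-k latent vectors of user u and item v.
    """
    squared_error, count = 0.0, 0
    for u, v, r in test_ratings:
        prediction = sum(a * b for a, b in zip(P[u], Q[v]))
        squared_error += (r - prediction) ** 2
        count += 1
    return math.sqrt(squared_error / count)
```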
Implementations: Among the methods included for comparison, HogWild³ and CCD++⁴ are publicly available. We reimplement HogWild under the same framework as our FPSG and DSGD implementations for a fairer comparison. In the official HogWild package, the formulation includes the average value of training ratings. Even after we tried different settings, that program still failed to converge. Therefore, we present only the results of our HogWild implementation in the experiments.
The publicly available CCD++ code uses double precision. Because ours uses single precision following the discussion in Chapter 4.4, for a fair comparison, we obtained a single-precision version of CCD++ from its authors. Note that OpenMP⁵ is used in their implementation.
5.2 Comparison of Methods on Training Time versus RMSE
We first illustrate the effectiveness of our solutions for data imbalance and memory discontinuity. Then, we compare parallel matrix factorization methods including DSGD, CCD++, HogWild and our FPSG.
5.2.1 The effectiveness of addressing the locking problem
In Chapter 3.1, we mentioned that updating several blocks in a batch may suffer from the locking problem if the data is unbalanced. To verify the effectiveness of FPSG, in Figure 5.1, we compare it with a modification in which the scheduler processes a batch of independent blocks as DSGD (Algorithm 2) does. We call the modified algorithm FPSG**. Figure 5.1 shows that FPSG runs much faster than FPSG**, which we attribute to FPSG not suffering from the locking problem.
³ http://hazy.cs.wisc.edu/hazy/victor/
⁴ http://www.cs.utexas.edu/~rofuyu/libpmf/
⁵ http://openmp.org/
Figure 5.1: A comparison between FPSG** and FPSG.
5.2.2 The effectiveness of having better memory locality
We conduct experiments to investigate whether the proposed partial random method can both avoid memory discontinuity and keep good convergence. In Figure 5.2, we select rating instances in each block either in order (the partial random method) or randomly (the random method). Both methods converge to a similar RMSE, but the training time of the partial random method is clearly shorter than that of the random method.
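The data layout behind the partial random method can be sketched as follows. This is a simplified illustration rather than the FPSG code: `build_blocks` is a hypothetical name, the random shuffle is assumed to be a random permutation of user and item indices, and the grid is assumed to use equal-width index ranges. Blocks are then picked at random by the scheduler, while the ratings inside each block are visited in sorted (row-major) order so that memory accesses stay contiguous.

```python
import random


def build_blocks(ratings, m, n, n_row_blocks, n_col_blocks, seed=0):
    """Shuffle indices, grid the ratings, and sort each block by row index.

    ratings: iterable of (u, v, r) with 0 <= u < m and 0 <= v < n.
    Returns a dict mapping (block_row, block_col) to a sorted list of
    (shuffled_u, shuffled_v, r) triples.
    """
    rng = random.Random(seed)
    user_perm = list(range(m))
    item_perm = list(range(n))
    rng.shuffle(user_perm)   # random shuffle of user indices
    rng.shuffle(item_perm)   # random shuffle of item indices

    rows_per_block = -(-m // n_row_blocks)   # ceiling division
    cols_per_block = -(-n // n_col_blocks)

    blocks = {}
    for u, v, r in ratings:
        pu, pv = user_perm[u], item_perm[v]
        key = (pu // rows_per_block, pv // cols_per_block)
        blocks.setdefault(key, []).append((pu, pv, r))

    # Partial random method: order the ratings inside every block by
    # row index (ties broken by column index).
    for block in blocks.values():
        block.sort(key=lambda t: (t[0], t[1]))
    return blocks
```

The random method of Figure 5.2 would correspond to shuffling each block's list instead of sorting it.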
5.2.3 Comparison with the state-of-the-art methods
Figure 5.3 presents the test RMSE and training time of various parallel matrix factorization methods. Among the three parallel SG methods, FPSG is faster than DSGD and HogWild. We believe that this is because FPSG is designed to effectively address the issues mentioned in Chapter III. However, we must note that for DSGD, it is also easy to incorporate similar techniques (e.g., the partial random method) to improve its performance.
Figure 5.2: A comparison between the partial random method and the random method.
As shown in Figure 5.3, CCD++ is the fastest in the beginning, but becomes slower than FPSG. Because the optimization problem of matrix factorization is non-convex and CCD++ is greedier than SG, in that it accurately minimizes the objective function over certain variables at each step, we suspect that CCD++ may converge prematurely to some local minimum. In contrast, SG-based methods may be able to escape from a local minimum because of their randomness. Furthermore, for Hugewiki in Figure 5.3, CCD++ does not give a satisfactory RMSE. Note that in addition to the regularization parameter used in this experiment, (Yun et al., 2014) have applied larger parameters for Hugewiki, with which the resulting RMSE can be improved.
5.2.4 Comparison with CCD++ for Non-negative Matrix Factorization
Figure 5.3: A comparison among the state-of-the-art parallel matrix factorization methods.
other matrix factorization problems. We consider non-negative matrix factorization (NMF) that requires the non-negativity of P and Q.
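$$\min_{P,Q} \sum_{(u,v)\in R} \left( (r_{u,v} - p_u^T q_v)^2 + \lambda_P \|p_u\|^2 + \lambda_Q \|q_v\|^2 \right), \tag{5.1}$$
$$\text{subject to } P_{iu} \ge 0,\ Q_{iv} \ge 0,\ \forall i \in \{1, \cdots, k\}.$$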
It is straightforward to adapt FPSG for solving (5.1) with a simple projection (Gemulla et al., 2011), and the corresponding update rules are
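$$\begin{aligned} p_u &\leftarrow \max\left(0,\ p_u + \gamma\,(e_{u,v}\, q_v - \lambda_P\, p_u)\right), \\ q_v &\leftarrow \max\left(0,\ q_v + \gamma\,(e_{u,v}\, p_u - \lambda_Q\, q_v)\right), \end{aligned} \tag{5.2}$$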
where max(·, ·) is an element-wise maximum operation.
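The projected update (5.2) for a single rating can be sketched in Python as below. The function name and argument names are hypothetical. We assume, as is usual for SG on matrix factorization, that e_{u,v} = r_{u,v} − p_u^T q_v, and that both vectors are updated from their old values.

```python
def nmf_sg_update(p_u, q_v, r_uv, gamma, lambda_p, lambda_q):
    """One projected SG step of (5.2) for the rating r_uv.

    p_u, q_v: lists of k floats (current latent vectors).
    Returns the new (p_u, q_v); the inputs are not modified.
    """
    # Assumption: the error term is the residual of the current prediction.
    e_uv = r_uv - sum(a * b for a, b in zip(p_u, q_v))

    # Both updates read the old p_u and q_v, then project onto [0, inf).
    new_p = [max(0.0, a + gamma * (e_uv * b - lambda_p * a))
             for a, b in zip(p_u, q_v)]
    new_q = [max(0.0, b + gamma * (e_uv * a - lambda_q * b))
             for a, b in zip(p_u, q_v)]
    return new_p, new_q
```

For example, with p_u = [1.0, 0.5], q_v = [0.0, 2.0], r_uv = 3, γ = 0.1, and no regularization, the residual is 2 and the step moves p_u to approximately [1.0, 0.9] and q_v to approximately [0.2, 2.1]; a sufficiently negative residual would instead clip some coordinates to zero.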
For CCD++, a modification for NMF has been proposed in (Hsieh and Dhillon, 2011). Like (5.2), it projects negative values back to zero during the coordinate descent method. Our experimental comparison of CCD++ and FPSG is presented in Figure 5.4. Similar to Figure 5.3, FPSG outperforms CCD++ on NMF.
Figure 5.4: A comparison between CCD++ and FPSG for non-negative matrix factorization.
5.3 Speedup of FPSG
Speedup is an indicator of the effectiveness of a parallel algorithm. On a shared-memory system, it refers to the time reduction from using one core to using several cores. In this section, we compare the speedup of FPSG with that of other methods. From Figure 5.5, FPSG outperforms DSGD and HogWild. This result is expected because FPSG aims at improving some shortcomings of these two methods.
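The text does not state the formula used for speedup. A common definition, which we assume here only for illustration, is the ratio of the running time with one core, T(1), to the running time with s cores, T(s):

$$S(s) = \frac{T(1)}{T(s)}, \qquad s = 1, 2, \ldots, 12.$$

Under this definition, S(s) = s corresponds to ideal linear speedup, and the horizontal axis of Figure 5.5 (1 to 12) would be the number of cores s.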
Compared with CCD++, FPSG is better on two data sets, while CCD++ is better on the others. As Algorithm 3 and Algorithm 4 show, FPSG and CCD++ are
Figure 5.5: Speedup of different matrix factorization methods.
the other on some problems. Nevertheless, even though CCD++ gives better speedup on some occasions, its overall performance (running time and RMSE) is still worse than that of FPSG in Figure 5.3. Thus parallel SG remains a compelling method for matrix factorization.
CHAPTER VI
Discussion
We discuss some miscellaneous issues in this chapter. Chapter 6.1 demonstrates that taking advantage of data locality can further improve the proposed FPSG method. In Chapter 6.2, the selection of the number of blocks is discussed.
6.1 Data Locality and the Update Order
In our partial random method, ratings in each block are ordered. We can consider a user-oriented or an item-oriented ordering. Interestingly, these two orderings may incur different data-access costs. For example, in Figure 4.3, we consider a user-oriented setting, so for a given u,
$$R_{u,v} \text{ and } q_v, \quad \forall (u, v) \in R$$
must be accessed. While $R_{u,v}$ is a scalar, $q_v, \forall (u, v) \in R$ involve many columns of the dense matrix Q. Therefore, to go through all users, Q is needed many times.
Alternatively, if an item-oriented setting is used, for every item, $P^T$ is needed. Now if m ≫ n, P's size (k × m) is much larger than Q's (k × n). Under the user-oriented setting, it is possible that Q (or a significant portion of Q) can be stored in a higher
Order \ Data set    MovieLens    Netflix    Yahoo!Music    Hugewiki
User                     2.27      22.50         173.34     1531.14
Item                     2.91      43.26         294.19     1016.19
Table 6.1: Execution time (in seconds) of 50 iterations of FPSG
# item blocks      16      16      16      16      32      32      32      32
# user blocks      16      32      48      64      16      32      48      64
iterations         50      48      49      48      48      49      49      49
time             3.25    3.37    3.66    3.94    3.28    3.86    4.74    6.98

# item blocks      48      48      48      48      64      64      64      64
# user blocks      16      32      48      64      16      32      48      64
iterations         49      49      49      49      49      49      48      49
time             3.43    4.83    8.15   13.30    3.67    6.94   12.87   21.16
Table 6.2: The performance of FPSG on MovieLens with different numbers of blocks. The target RMSE is 0.858. Time is in seconds.
may have to be swapped out to lower-level memory several times. Thus the cost for data movements is higher. Based on this discussion, we conjecture that
$$\begin{aligned} m \gg n &\ \Rightarrow\ \text{user-oriented access should be used}, \\ m \ll n &\ \Rightarrow\ \text{item-oriented access should be used}. \end{aligned} \tag{6.1}$$
We compare the two update orders in Table 6.1. For Netflix and Yahoo!Music, the user-oriented approach is much faster. From Table 5.1, these two data sets have m ≫ n. In contrast, because n ≫ m, the item-oriented approach is much better for Hugewiki. This experiment is consistent with our conjecture in (6.1).
6.2 Number of Blocks
Recall that in FPSG, R is split into at least (s + 1) × (s + 1) blocks, where s is the number of threads. We conduct experiments to see how the number of blocks affects the performance of FPSG. The results on three data sets are listed in Table 6.2, Table 6.3, and Table 6.4. In these tables, “iterations” and “time” respectively mean
# item blocks 16 16 16 16 16 16 16
# user blocks 16 32 48 64 80 96 112
iterations 50 50 49 50 49 50 50
time 28.92 28.39 30.73 30.69 29.66 32.34 30.61
# item blocks 32 32 32 32 32 32 32
# user blocks 16 32 48 64 80 96 112
iterations 50 50 50 50 50 50 50
time 33.45 32.10 32.58 31.38 33.51 32.56 34.73
# item blocks 48 48 48 48 48 48 48
# user blocks 16 32 48 64 80 96 112
iterations 50 50 50 49 50 50 50
time 39.29 33.99 37.12 31.91 39.08 39.36 43.25
# item blocks 64 64 64 64 64 64 64
# user blocks 16 32 48 64 80 96 112
iterations 50 50 50 50 50 50 50
time 34.25 39.99 35.37 35.92 47.98 49.11 59.18
# item blocks 80 80 80 80 80 80 80
# user blocks 16 32 48 64 80 96 112
iterations 49 50 50 50 50 50 50
time 45.54 35.98 43.60 51.92 53.88 66.30 87.09
# item blocks 96 96 96 96 96 96 96
# user blocks 16 32 48 64 80 96 112
iterations 50 49 50 50 50 50 50
time 49.62 56.06 40.43 48.70 67.52 91.23 124.01
# item blocks 112 112 112 112 112 112 112
# user blocks 16 32 48 64 80 96 112
iterations 50 50 50 50 50 50 50
time 54.21 39.97 50.68 65.97 87.69 123.57 170.75
Table 6.3: The performance of FPSG on Netflix with different numbers of blocks. The target RMSE is 0.941. Time is in seconds.
# item blocks 32 32 32 32 32 32
# user blocks 32 64 96 128 160 192
iterations 50 50 50 50 50 50
time 188.13 180.21 184.57 183.51 186.66 187.02
# item blocks 64 64 64 64 64 64
# user blocks 32 64 96 128 160 192
iterations 50 50 50 50 50 50
time 172.74 182.90 218.67 196.95 189.23 211.93
# item blocks 96 96 96 96 96 96
# user blocks 32 64 96 128 160 192
iterations 50 50 50 50 50 50
time 188.57 179.78 205.17 203.49 227.86 254.71
# item blocks 128 128 128 128 128 128
# user blocks 32 64 96 128 160 192
iterations 50 50 49 50 50 50
time 192.20 188.69 218.00 242.98 379.63 565.25
# item blocks 160 160 160 160 160 160
# user blocks 32 64 96 128 160 192
iterations 50 50 50 50 50 50
time 195.24 201.39 254.64 374.44 606.47 842.62
# item blocks 192 192 192 192 192 192
# user blocks 32 64 96 128 160 192
iterations 50 50 50 50 50 50
time 196.41 217.05 246.15 563.86 842.81 1170.79
Table 6.4: The performance of FPSG on Yahoo!Music with different numbers of blocks. The target RMSE is 22.40. Time is in seconds.
the number of iterations and time used to achieve a target RMSE value.¹ We use 12 threads for the experiments.
On each data set, different numbers of blocks achieve the target RMSE in a similar number of iterations. Thus, the number of blocks does not seem to affect the convergence. However, when many blocks are used, the running time increases dramatically.
Taking MovieLens as an example, FPSG takes only 3.25 seconds if 16 × 16 blocks are used, while 21.16 seconds are required if 64 × 64 blocks are used. To explain this result, let us check what happens when the number of blocks increases. First, the overhead of getting a job increases because the selection is from a pool of more blocks. Second, the execution time per block decreases because a block contains fewer ratings. Third, the scheduler is executed more frequently because the execution time per block decreases.
The overall impact is that the scheduling becomes cost-ineffective. That is, we spend non-negligible time to select a block, but the block is quickly processed. Further, the CPU utilization may be lowered when too many blocks are used. In this situation, the scheduler takes more time to assign a block to a thread, and during this process, another thread that needs a block must wait.
The above discussion suggests that we should avoid splitting R into too many blocks.
However, whether the number of blocks is too large depends on the data set. The 64 × 64 setting is too large for MovieLens, but seems adequate for Netflix. Therefore, the selection of the number of blocks is not easy. From the experimental results, using (s + 1) × (s + 1) to 2s × 2s blocks may be a suitable choice.
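The following small Python sketch works through the arithmetic behind this discussion. It uses the MovieLens training-set size from Table 5.1 and the grid sizes from Table 6.2; the function names are hypothetical, and the average assumes that ratings are spread evenly over blocks, which holds only approximately after shuffling.

```python
def ratings_per_block(n_ratings, n_user_blocks, n_item_blocks):
    """Return (number of blocks, average number of ratings per block)."""
    n_blocks = n_user_blocks * n_item_blocks
    return n_blocks, n_ratings / n_blocks


def suggested_grid_range(s):
    """Grid sizes per dimension suggested in the text: (s + 1) to 2s."""
    return s + 1, 2 * s


MOVIELENS_TRAIN = 9_301_274  # "#Training" for MovieLens in Table 5.1

for grid in (16, 64):
    n_blocks, average = ratings_per_block(MOVIELENS_TRAIN, grid, grid)
    print(f"{grid} x {grid}: {n_blocks} blocks, "
          f"about {average:.0f} ratings per block")

low, high = suggested_grid_range(12)
print(f"12 threads: {low} x {low} to {high} x {high} blocks")
```

Going from 16 × 16 to 64 × 64 blocks multiplies the number of scheduler calls per pass over the data by 16, scans a pool that is 16 times larger on each call, and shrinks the work per call from roughly 36,000 to roughly 2,300 ratings.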
CHAPTER VII
Conclusions and Future Works
To provide a more complete SG solver for recommender systems, we will extend our algorithm to solve variants of matrix-factorization problems. In addition, to further reduce the cache-miss rate, we will investigate non-uniform splits of the rating matrix or other permutation methods such as Cuthill-McKee ordering. Very recently, a new parallel matrix factorization method, NOMAD (Yun et al., 2014), has been released. It uses an asynchronous setting to reduce the waiting time at any thread. This technique is related to our non-blocking scheduling. Another parallel solver for matrix factorization is in GraphChi (Kyrola et al., 2012), which is a framework for graph computation.¹ Their method divides R into m blocks, where each block contains the ratings of a particular user, and these blocks are updated in parallel. An important difference between our method and theirs is that they do not require the blocks being processed to be mutually independent. Therefore, the over-writing problem discussed in Chapter 2.1 may occur. We plan to conduct comparisons between our method, NOMAD, and GraphChi.
¹ In their thesis, they use the alternating least squares (ALS) method as the solver. However, an SG implementation has been included in their latest release, available at https://github.com/GraphChi/graphchi-cpp. We discuss their SG-based method here.
In conclusion, we point out some computational bottlenecks in existing parallel SG methods for shared-memory systems. We propose FPSG to address these issues and confirm its effectiveness by experiments. The comparison shows that FPSG is more efficient than state-of-the-art methods. Programs used for experiments in this thesis can be found at
http://www.csie.ntu.edu.tw/~cjlin/libmf/exps/
Further, based on this study, we develop an easy-to-use package LIBMF, available at
http://www.csie.ntu.edu.tw/~cjlin/libmf
for public use. Appendix A describes the formulation used in LIBMF.
Acknowledgement
This is a joint work with Wei-Sheng Chin and Yong Zhuang.
BIBLIOGRAPHY
R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75–79, 2007.
K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale L2-loss linear SVM. Journal of Machine Learning Research, 9:1369–1398, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.
G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! music dataset and KDD-Cup 11. In JMLR Workshop and Conference Proceedings: Proceedings of KDD Cup 2011, volume 18, pages 3–18, 2012.
R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–77, 2011.
K. B. Hall, S. Gilpin, and G. Mann. MapReduce/Bigtable for distributed optimization. In Neural Information Processing Systems Workshop on Learning on Cores, Clusters, and Clouds, 2010.
C.-J. Hsieh and I. S. Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the Seventeenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011.
J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’12), Hollywood, October 2012.
G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1231–1239, 2009.
R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured perceptron. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 456–464, 2010.
F. Niu, B. Recht, C. Ré, and S. J. Wright. HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 693–701, 2011.
I. Pilászy, D. Zibriczky, and D. Tikk. Fast ALS-based matrix factorization for explicit and implicit feedback datasets. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 71–78, 2010.
H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the IEEE International Conference on Data Mining, pages 765–774, 2012.
H. Yun, H.-F. Yu, C.-J. Hsieh, S. Vishwanathan, and I. S. Dhillon. NOMAD: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. In International Conference on Very Large Data Bases (VLDB), 2014.
Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Proceedings of the Fourth International Conference on Algorithmic Aspects in Information and Management, pages 337–348, 2008.
Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In Proceedings of the ACM Recommender Systems, 2013. URL http://www.csie.ntu.edu.tw/~cjlin/papers/libmf.pdf.
M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603, 2010.
APPENDICES
APPENDIX A. Formulation Used in LIBMF
In LIBMF, in addition to P and Q, we add user bias, item bias, and average terms,