• 沒有找到結果。

V. Experiments

5.3 Speedup of FPSG

Speedup is an indicator on the effectiveness of a parallel algorithm. On a shared memory system, it refers to the time reduction from using one core to several cores.

In this chapter, we compare the speedup of FPSG with other methods. From Figure 5.5, FPSG outperforms DSGD and HogWild. This result is expected because FPSG aims at improving some shortcomings of these two methods.

Compared with CCD++, FPSG is better on two data sets, while CCD++ is bet-ter on the others. As Algorithm 3 and Algorithm 4 show, FPSG and CCD++ are

1 2 3 4 5 6 7 8 9 10 11 12 Figure 5.5: Speedup of different matrix factorization methods.

the other on some problems. Nevertheless, even though CCD++ gives better speedup in some occasions, its overall performance (running time and RMSE) is still worse than FPSG in Figure 5.3. Thus parallel SG remains a compelling method for matrix factorization.

CHAPTER VI

Discussion

We discuss some miscellaneous issues in this chapter. Chapter 6.1 demonstrates that taking the advantage of data locality can further improve the proposed FPSG method. In Chapter 6.2, the selection of the number of blocks is discussed.

6.1 Data Locality and the Update Order

In our partial random method, ratings in each block are ordered. We can consider a user-oriented or item-oriented ordering. Interestingly, these two ways may cause different costs on the data access. For example, in Figure 4.3, we consider a user-oriented setting, so under a given u

Ru,v and qv, ∀(u, v) ∈ R

must be accessed. While Ru,v is a scalar, qv, ∀(u, v) ∈ R involve many columns of the dense matrix Q. Therefore, for going through all users, Q is needed many times.

Alternatively, if an item-oriented setting is used, for every item, PT is needed. Now if m  n, P ’s size (k × m) is much larger than Q (k × n). Under the user-oriented setting, it is possible that Q (or a significant portion of Q) can be stored in a higher

XXXX

XXXX

XX

Order Data set MovieLens Netflix Yahoo!Music Hugewiki

User 2.27 22.50 173.34 1531.14

Item 2.91 43.26 294.19 1016.19

Table 6.1: Execution time (in seconds) of 50 iterations of FPSG

# item blocks 16 16 16 16 32 32 32 32

# user blocks 16 32 48 64 16 32 48 64

iterations 50 48 49 48 48 49 49 49

time 3.25 3.37 3.66 3.94 3.28 3.86 4.74 6.98

# item blocks 48 48 48 48 64 64 64 64

# item blocks 16 32 48 64 16 32 48 64

iterations 49 49 49 49 49 49 48 49

time 3.43 4.83 8.15 13.30 3.67 6.94 12.87 21.16

Table 6.2: The performance of FPSG on MovieLens with different number of blocks.

The target RMSE is 0.858. Time is in seconds.

may have to be swapped out to lower-level memory several times. Thus the cost for data movements is higher. Based on this discussion, we conjecture that

m  n ⇒ user-oriented access should be used, m  n ⇒ item-oriented access should be used.

(6.1)

We compare the two update orders in Table 6.1. For Netflix and Yahoo!Music, the user-wise approach is much faster. From Table 5.1, these two data sets have m  n. On the contrary, because n  m, the item-oriented approach is much better for Hugewiki.

This experiment fully confirms our conclusion in (6.1).

6.2 Number of Blocks

Recall that in FPSG, R is separated to at least (s + 1) × (s + 1) blocks, where s is the number of threads. We conduct experiments to see how the number of blocks affects the performance of FPSG. The results on three data sets are listed in Table 6.2, Table 6.3, and Table 6.4. In these tables, “iterations” and “time” respectively mean

# item blocks 16 16 16 16 16 16 16

# user blocks 16 32 48 64 80 96 112

iterations 50 50 49 50 49 50 50

time 28.92 28.39 30.73 30.69 29.66 32.34 30.61

# item blocks 32 32 32 32 32 32 32

# user blocks 16 32 48 64 80 96 112

iterations 50 50 50 50 50 50 50

time 33.45 32.10 32.58 31.38 33.51 32.56 34.73

# item blocks 48 48 48 48 48 48 48

# user blocks 16 32 48 64 80 96 112

iterations 50 50 50 49 50 50 50

time 39.29 33.99 37.12 31.91 39.08 39.36 43.25

# item blocks 64 64 64 64 64 64 64

# user blocks 16 32 48 64 80 96 112

iterations 50 50 50 50 50 50 50

time 34.25 39.99 35.37 35.92 47.98 49.11 59.18

# item blocks 80 80 80 80 80 80 80

# user blocks 16 32 48 64 80 96 112

iterations 49 50 50 50 50 50 50

time 45.54 35.98 43.60 51.92 53.88 66.30 87.09

# item blocks 96 96 96 96 96 96 96

# user blocks 16 32 48 64 80 96 112

iterations 50 49 50 50 50 50 50

time 49.62 56.06 40.43 48.70 67.52 91.23 124.01

# item blocks 112 112 112 112 112 112 112

# user blocks 16 32 48 64 80 96 112

iterations 50 50 50 50 50 50 50

time 54.21 39.97 50.68 65.97 87.69 123.57 170.75

Table 6.3: The performance of FPSG on Netflix with different number of blocks. The target RMSE is 0.941. Time is in seconds.

# item blocks 32 32 32 32 32 32

# user blocks 32 64 96 128 160 192

iterations 50 50 50 50 50 50

time 188.13 180.21 184.57 183.51 186.66 187.02

# item blocks 64 64 64 64 64 64

# user blocks 32 64 96 128 160 192

iterations 50 50 50 50 50 50

time 172.74 182.90 218.67 196.95 189.23 211.93

# item blocks 96 96 96 96 96 96

# user blocks 32 64 96 128 160 192

iterations 50 50 50 50 50 50

time 188.57 179.78 205.17 203.49 227.86 254.71

# item blocks 128 128 128 128 128 128

# user blocks 32 64 96 128 160 192

iterations 50 50 49 50 50 50

time 192.20 188.69 218.00 242.98 379.63 565.25

# item blocks 160 160 160 160 160 160

# user blocks 32 64 96 128 160 192

iterations 50 50 50 50 50 50

time 195.24 201.39 254.64 374.44 606.47 842.62

# item blocks 192 192 192 192 192 192

# user blocks 32 64 96 128 160 192

iterations 50 50 50 50 50 50

time 196.41 217.05 246.15 563.86 842.81 1170.79

Table 6.4: The performance of FPSG on Yahoo!Music with different number of blocks.

The target RMSE is 22.40. Time is in seconds.

the number of iterations and time used to achieve a target RMSE value.1 We use 12 threads for the experiments.

On each data set, different numbers of blocks achieve the target RMSE in a similar number of iterations. Clearly, the number of blocks does not seem to affect the conver-gence. However, when many blocks are used, the running time dramatically increases.

Taking MovieLens as an example, FPSG takes only 3.25 seconds if 16 × 16 blocks are used, while 21.16 seconds are required if 64 × 64 blocks are used. To explain this result, let us check what happens when the number of blocks increases. First, the overhead of getting a job increases because the selection is from a pool of more blocks. Second, the execution time per block decreases as a block contains less ratings. Third, the scheduler is executed more frequently because the execution time per block decreases.

The overall impact is that the scheduling becomes cost-ineffective. That is, we spend innegligible time to select a block, but the block is quickly processed. Further, we explain that the CPU utilization may be lowered when too many blocks are used. In this situation, the scheduler takes more time to assign a block to a thread, but during this process, another thread that needs to get a block must wait.

The above discussion suggests that we should avoid splitting R to too many blocks.

However, whether the number of blocks is too many or not depends on the data set. The 64 × 64 setting is too many for MovieLens, but seems adequate for Netflix. Therefore, the selection of the number of blocks is not easy. From the experimental results, using (s + 1) × (s + 1) to 2s × 2s blocks may be a suitable choice.

CHAPTER VII

Conclusions and Future Works

To provide a more complete SG solver for recommender systems, we will extend our algorithm to solve variants of matrix-factorization problems. In addition, to further reduce the cache-miss rate, we will investigate non-uniform splits of the rating matrix or other permutation methods such as Cuthill-McKee ordering. Very recently a new parallel matrix factorization method NOMAD (Yun et al., 2014) has been released.

It uses an asynchronization setting to reduce the waiting time at any thread. This technique is related to our non-blocking scheduling. Another parallel solver for matrix factorization is in GraphChi (Kyrola et al., 2012), which is a framework for graph computation.1 Their method divides R into m blocks, where each block contains the ratings of a particular user, and these blocks are updated in parallel. An important difference between ours and theirs is that they do not require blocks being processed are mutually independent. Therefore, the over-writing problem discussed in Chapter 2.1 may occur. We plan to conduct comparisons between our method, NOMAD, and GraphChi.

In conclusion, we point out some computational bottlenecks in existing parallel SG methods for shared-memory systems. We propose FPSG to address these issues and

1In their thesis, they use alternative least square method (ALS) as the solver. However, an SG implementation has been included in their latest release available at https://github.com/GraphChi/

graphchi-cpp. We discuss their SG-based method in the context.

confirm its effectiveness by experiments. The comparison shows that FPSG is more efficient than state-of-the-art methods. Programs used for experiments in this thesis can be found at

http://www.csie.ntu.edu.tw/~cjlin/libmf/exps/

Further, based on this study, we develop an easy-to-use package LIBMF available at

http://www.csie.ntu.edu.tw/~cjlin/libmf for public use. Appendix A describes the formulation used in LIBMF.

Acknowledgement

This is a joint work with Wei-Sheng Chin and Yong Zhuang.

BIBLIOGRAPHY

R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75–79, 2007.

K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale L2-loss linear SVM. Journal of Machine Learning Research, 9:1369–1398, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.

G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! music dataset and KDD-Cup 11. In JMLR Workshop and Conference Proceedings: Proceedings of KDD Cup 2011, volume 18, pages 3–18, 2012.

R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factor-ization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 69–77, 2011.

K. B. Hall, S. Gilpin, and G. Mann. MapReduce/Bigtable for distributed op-timization. In Neural Information Processing Systems Workshop on Leaning on Cores, Clusters, and Clouds, 2010.

C.-J. Hsieh and I. S. Dhillon. Fast coordinate descent methods with variable se-lection for non-negative matrix factorization. In Proceedings of the Seventeenth ACM SIGKDD International Conference on Knowledge Discovery and Data Min-ing, 2011.

J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.

Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for rec-ommender systems. Computer, 42(8):30–37, 2009.

A. Kyrola, G. Blelloch, and C. Guestrin. Graphchi: Large-scale graph computa-tion on just a pc. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’12), Hollywood, October 2012.

G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1231–1239. 2009.

R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the struc-tured perceptron. In Proceedings of the 48th Annual Meeting of the Association of Computational Linguistics (ACL), pages 456–464, 2010.

F. Niu, B. Recht, C. R´e, and S. J. Wright. HOGWILD!: A lock-free ap-proach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Informa-tion Processing Systems 24, pages 693–701, 2011.

I. Pil´aszy, D. Zibriczky, and D. Tikk. Fast ALS-based matrix factorization for explicit and implicit feedback datasets. In Proceedings of the Fourth ACM Con-ference on Recommender Systems, pages 71–78, 2010.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. Scalable coordinate descent ap-proaches to parallel matrix factorization for recommender systems. In Proceedings of the IEEE International Conference on Data Mining, pages 765–774, 2012.

H. Yun, H.-F. Yu, C.-J. Hsieh, S. Vishwanathan, and I. S. Dhillon. Nomad:

Non-locking, stochastic multi-machine algorithm for asynchronous and decentral-ized matrix completion. In International Conference on Very Large Data Bases (VLDB), 2014.

Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collabo-rative filtering for the Netflix prize. In Proceedings of the Fourth International Conference on Algorithmic Aspects in Information and Management, pages 337–

348, 2008.

Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In Proceedings of the ACM Rec-ommender Systems, 2013. URL http://www.csie.ntu.edu.tw/~cjlin/papers/

libmf.pdf.

M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient de-scent. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603.

2010.

APPENDICES

APPENDIX A. Formulation Used in LIBMF

In LIBMF, in addition to P and Q, we add user bias, item bias, and average terms, which are useful for recommender systems. The formulation is described in (A.1).

Table A.1 shows the dimension and meaning of symbols in the formulation.

P,Q,a,bmin X

(u,v)∈R

(ru,v − pTuqv− au − bv− avg)2

+ λP||pu||2+ λQ||qv||2+ λa||a||2+ λb||b||2

(A.1)

Symbol Dimension Meaning

m, n 1 × 1 number of users and items k 1 × 1 number of latent dimensions

u, v 1 × 1 index indicates uth user and vth item

R m × n rating matrix

ru,v 1 × 1 (u, v)th rating of R

P k × m latent matrix

Q k × n latent matrix

pu, qv k × 1 uth column of P and vth column of Q a m × 1 user bias vector

b n × 1 item bias vector

au, bv 1 × 1 uth element of a and vth element of b λP, λQ 1 × 1 penalty of regularized term of P and Q λa, λb 1 × 1 penalty of regularized term of a and b avg 1 × 1 average rating in training data

Table A.1: Symbols in (A.1).

相關文件