Implementation Issues - Our Approaches - A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems

IV. Our Approaches

4.4 Implementation Issues

FPSG is parallelized with the standard C++ thread class, which is implemented on top of pthreads. For the Yahoo!Music data set of about 250M ratings, using a typical machine (details specified in Chapter 5.1), FPSG finishes processing all ratings once in 6 seconds and takes only about 8 minutes to converge to a reasonable RMSE. Here we describe some techniques employed in our implementation. First, we find empirically that single-precision floating-point computation does not suffer from numerical error accumulation. For the Netflix data set, using single precision runs 1.1 times faster than using double precision. Second, modern CPUs provide SSE instructions that perform several floating-point multiplications and additions concurrently. We apply SSE instructions to vector inner products and additions. For the Yahoo!Music data set, the speedup is 2.4 times. Figure 4.8¹ shows the speedup after these techniques are applied on two data sets.

¹ For experimental settings, see Chapter 5.1.

Figure 4.6: A comparison between the ordered method, the random method, and the partial random method on the set Yahoo!Music. One thread is used.

[Figure 4.7 panels: Rating matrix; After the random shuffle; Sort row indices in each block]

Figure 4.7: An illustration of the partial random method. After the random shuffle of data, some indices (in red) are not ordered within each block. We make the row indices ordered (in blue) by a sorting procedure.

Figure 4.8: A comparison between two implementations of FPSG in Netflix and Yahoo!Music.

Algorithm 5 Scheduler of FPSG

procedure get_job
    Initialize an empty list C and ut_min = ∞        ▷ ut: number of updates
    for all b in B do
        if b is non-free then
            continue
        else
            if b.ut == ut_min then
                Add b into C
            else if b.ut < ut_min then
                ut_min = b.ut
                Make C empty and add b into C
            end if
        end if
    end for
    Randomly select an element, denoted by b_x, from C
    return b_x
end procedure

procedure put_job(b)
    b.ut = b.ut + 1
end procedure

Algorithm 6 Working thread of FPSG

while true do
    get a block b from scheduler→get_job()
    process the elements of this block in order
    scheduler→put_job(b)
end while
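The interplay between Algorithm 5 and Algorithm 6 can be sketched in Python. This is a minimal illustration, not the FPSG implementation (which is written in C++). The names Block, Scheduler, worker, and run_demo are hypothetical, and two points are assumptions: a block is treated as free when no block currently being processed shares its block row or block column, and each worker handles a fixed number of jobs instead of looping forever. The sketch also assumes more blocks per side than threads, so that a free block always exists.

```python
import random
import threading


class Block:
    def __init__(self, row, col):
        self.row = row
        self.col = col
        self.ut = 0  # number of updates applied to this block so far


class Scheduler:
    """Illustrative scheduler; the free-block rule is an assumption."""

    def __init__(self, num_blocks, seed=0):
        self.blocks = [Block(i, j)
                       for i in range(num_blocks)
                       for j in range(num_blocks)]
        self.busy_rows = set()
        self.busy_cols = set()
        self.lock = threading.Lock()
        self.rng = random.Random(seed)

    def _is_free(self, b):
        # Assumed rule: free means no in-flight block shares a row or column.
        return b.row not in self.busy_rows and b.col not in self.busy_cols

    def get_job(self):
        with self.lock:
            ut_min = float("inf")
            candidates = []  # free blocks with the fewest updates
            for b in self.blocks:
                if not self._is_free(b):
                    continue
                if b.ut == ut_min:
                    candidates.append(b)
                elif b.ut < ut_min:
                    ut_min = b.ut
                    candidates = [b]
            b_x = self.rng.choice(candidates)
            self.busy_rows.add(b_x.row)
            self.busy_cols.add(b_x.col)
            return b_x

    def put_job(self, b):
        with self.lock:
            b.ut += 1
            self.busy_rows.discard(b.row)
            self.busy_cols.discard(b.col)


def worker(scheduler, num_jobs):
    # Algorithm 6 loops forever; the loop is bounded here so the demo ends.
    for _ in range(num_jobs):
        b = scheduler.get_job()
        # A real worker would run SG updates over the ratings of block b here.
        scheduler.put_job(b)


def run_demo(num_threads=4, num_blocks=5, jobs_per_thread=25):
    scheduler = Scheduler(num_blocks)
    threads = [threading.Thread(target=worker,
                                args=(scheduler, jobs_per_thread))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return scheduler
```

Calling run_demo() with the default arguments processes 100 jobs in total. Because a worker only ever receives a block whose row and column are not in use, two threads never update the same rows of P or columns of Q at the same time.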

CHAPTER V

Experiments

In this chapter, we provide the details about our experimental settings, and compare FPSG with other parallel matrix factorization algorithms mentioned in Chapter II.

5.1 Settings

Data Sets: Four data sets, MovieLens,¹ Netflix, Yahoo!Music, and Hugewiki,² are used for the experiments. For reproducibility, we use the original training/test sets in our experiments if they are available (for MovieLens, we use Part B of the original data set generated by the official script). Because the test set of Yahoo!Music is not available, we use the last four ratings of each user for testing and the remaining ratings as the training set. The data set Hugewiki is too large to fit in our machines, so we randomly sample one quarter of the data and split the sample into training/test sets.

The statistics of each data set are given in Table 5.1.

Platform: We use a server with two Intel Xeon E5-2620 2.0GHz processors and 64 GB memory. There are six cores in each processor.

Parameters: Table 5.1 lists the parameters used for each data set. The parameters k, λ_P, λ_Q may be chosen by a validation procedure, although here we mainly borrow values from earlier works to obtain comparable results. For Netflix and Yahoo!Music, we use the parameters in (Yu et al., 2012); see values listed in Table 5.1. Although (Yu et al., 2012) have considered MovieLens, we use a different setting of λ_P = λ_Q = 0.05 for a better RMSE. For Hugewiki, we consider the same parameters as in (Yun et al., 2014). The initial values of P and Q are chosen randomly under a uniform distribution. This setting is the same as that in (Yu et al., 2012). The learning rate is determined by an ad hoc parameter selection. Because we focus on the running speed rather than RMSE in this thesis, we do not apply an adaptive learning rate.

Data Set     MovieLens   Netflix      Yahoo!Music   Hugewiki
m            71,567      2,649,429    1,000,990     39,706
n            65,133      17,770       624,961       25,034,863
#Training    9,301,274   99,072,112   252,800,275   761,429,411
#Test        698,780     1,408,395    4,003,960     100,000,000
k            40          40           100           100
λ_P          0.05        0.05         1             0.01
λ_Q          0.05        0.05         1             0.01
γ            0.003       0.002        0.0001        0.004

Table 5.1: The statistics and parameters for each data set. Note that the Hugewiki set used here contains only one quarter of the original set.

On our platform, 12 physical cores are available, so we use 12 threads in all experiments. For FPSG, even though Chapter IV shows that (s + 1) × (s + 1) blocks are already enough for s threads, we use more blocks to ensure the randomness of blocks that are simultaneously processed. For Netflix, Yahoo!Music and Hugewiki, R is gridded into 32 × 32 blocks; for MovieLens, R is gridded into 16 × 16 blocks because the number of non-zeros is smaller.

Evaluation: As in most work on recommender systems, our evaluation metric is the RMSE on the test set, which is disjoint from the training set; see Eq. (1.4). In addition, the time in each figure refers to the training time.
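As a concrete reading of the metric, the following Python sketch computes a test RMSE. It assumes that Eq. (1.4) is the usual root-mean-square error over the test ratings, with the prediction for (u, v) given by the inner product of p_u and q_v. The function name rmse and the data layout (lists of (u, v, r) triples, factors stored per user and per item) are illustrative choices, not taken from the thesis.

```python
import math


def rmse(test_ratings, P, Q):
    """Root-mean-square error of the predictions p_u^T q_v.

    test_ratings: list of (u, v, r) triples.
    P, Q: mappings from a user/item index to a length-k list of floats.
    """
    squared_error = 0.0
    count = 0
    for u, v, r in test_ratings:
        prediction = sum(a * b for a, b in zip(P[u], Q[v]))
        squared_error += (r - prediction) ** 2
        count += 1
    return math.sqrt(squared_error / count)
```

For example, with P = {0: [1.0, 0.0]}, Q = {0: [2.0, 5.0], 1: [4.0, 1.0]} and test ratings [(0, 0, 3.0), (0, 1, 4.0)], the predictions are 2 and 4, so the errors are 1 and 0 and the RMSE is the square root of 0.5.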

Implementations: Among the methods included for comparison, HogWild³ and CCD++⁴ are publicly available. We reimplement HogWild under the same framework as our FPSG and DSGD implementations for a fairer comparison. In the official HogWild package, the formulation includes the average value of training ratings. Even after we tried different settings, that program failed to converge. Therefore, we present only results of our HogWild implementation in the experiments.

The publicly available CCD++ code uses double precision. Because ours uses single precision following the discussion in Chapter 4.4, for a fair comparison, we obtain a single-precision version of CCD++ from its authors. Note that OpenMP⁵ is used in their implementation.

5.2 Comparison of Methods on Training Time versus RMSE

We first illustrate the effectiveness of our solutions for data imbalance and memory discontinuity. Then, we compare parallel matrix factorization methods including DSGD, CCD++, HogWild and our FPSG.

5.2.1 The effectiveness of addressing the locking problem

In Chapter 3.1, we mentioned that updating several blocks in a batch may suffer from the locking problem if the data is unbalanced. To verify the effectiveness of FPSG, in Figure 5.1, we compare it with a modification where the scheduler processes a batch of independent blocks as DSGD (Algorithm 2) does. We call the modified algorithm FPSG**. FPSG clearly runs much faster than FPSG** because it does not suffer from the locking problem.

³ http://hazy.cs.wisc.edu/hazy/victor/

⁴ http://www.cs.utexas.edu/~rofuyu/libpmf/

⁵ http://openmp.org/

Figure 5.1: A comparison between FPSG** and FPSG.

5.2.2 The effectiveness of having better memory locality

We conduct experiments to investigate whether the proposed partial random method can both avoid memory discontinuity and keep good convergence. In Figure 5.2, we select rating instances in each block in order (the partial random method) or randomly (the random method). Both methods converge to a similar RMSE, but the training time of the partial random method is clearly shorter than that of the random method.
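The data preparation behind the partial random method (Figure 4.7) can be sketched as follows. This is a hedged illustration in Python, not the thesis code: the function name partial_random_blocks is hypothetical, the random shuffle is modeled as a random permutation of user and item indices (an assumption about how the shuffle is done), and blocks are formed by equal-width index ranges.

```python
import random


def partial_random_blocks(ratings, m, n, num_blocks, seed=0):
    """Grid shuffled ratings into blocks, then sort each block by row index.

    ratings: list of (u, v, r) with 0 <= u < m and 0 <= v < n.
    Returns a num_blocks x num_blocks nested list of rating lists.
    """
    rng = random.Random(seed)
    # Assumed form of the random shuffle: permute user and item indices.
    row_perm = list(range(m))
    col_perm = list(range(n))
    rng.shuffle(row_perm)
    rng.shuffle(col_perm)

    blocks = [[[] for _ in range(num_blocks)] for _ in range(num_blocks)]
    for u, v, r in ratings:
        pu, pv = row_perm[u], col_perm[v]
        # Equal-width ranges of the shuffled indices define the blocks.
        blocks[pu * num_blocks // m][pv * num_blocks // n].append((pu, pv, r))

    # Within each block, order ratings by row index so that accesses to the
    # same p_u are contiguous; block selection stays random at run time.
    for block_row in blocks:
        for block in block_row:
            block.sort(key=lambda rating: rating[0])
    return blocks
```

In this sketch the randomness comes from the index permutation and from which block the scheduler hands out, while the ratings inside a block are visited in row order.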

5.2.3 Comparison with the state-of-the-art methods

Figure 5.3 presents the test RMSE and training time of various parallel matrix factorization methods. Among the three parallel SG methods, FPSG is faster than DSGD and HogWild. We believe that this result is because FPSG is designed to effectively address issues mentioned in Chapter III. However, we must note that for DSGD, it is also easy to incorporate similar techniques (e.g., the partial random method) to improve its performance.

Figure 5.2: A comparison between the partial random method and the random method.

As shown in Figure 5.3, CCD++ is the fastest in the beginning, but becomes slower than FPSG. Because the optimization problem of matrix factorization is non-convex and CCD++ is greedier than SG, accurately minimizing the objective function over certain variables at each step, we suspect that CCD++ may converge to some local minimum prematurely. In contrast, SG-based methods may be able to escape from a local minimum because of the randomness. Furthermore, for Hugewiki in Figure 5.3, CCD++ does not give a satisfactory RMSE. Note that in addition to the regularization parameter used in this experiment, (Yun et al., 2014) have applied larger parameters for Hugewiki, with which the resulting RMSE can be improved.

5.2.4 Comparison with CCD++ for Non-negative Matrix Factorization

Figure 5.3: A comparison among the state-of-the-art parallel matrix factorization methods.

other matrix factorization problems. We consider non-negative matrix factorization (NMF) that requires the non-negativity of P and Q.

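\[
\min_{P,Q} \sum_{(u,v) \in R} \Bigl( (r_{u,v} - p_u^T q_v)^2 + \lambda_P \|p_u\|^2 + \lambda_Q \|q_v\|^2 \Bigr) \tag{5.1}
\]
\[
\text{subject to } P_{iu} \ge 0,\; Q_{iv} \ge 0,\; \forall i \in \{1, \dots, k\}.
\]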

It is straightforward to adapt FPSG to solve (5.1) with a simple projection (Gemulla et al., 2011), and the corresponding update rules are

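\[
\begin{aligned}
p_u &\leftarrow \max\bigl(0,\; p_u + \gamma (e_{u,v} q_v - \lambda_P p_u)\bigr), \\
q_v &\leftarrow \max\bigl(0,\; q_v + \gamma (e_{u,v} p_u - \lambda_Q q_v)\bigr),
\end{aligned}
\tag{5.2}
\]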

where max(·, ·) is an element-wise maximum operation.
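A single projected update of the form (5.2) can be written as a short Python function. This is a sketch under stated assumptions: e_{u,v} is taken to be the rating error r_{u,v} − p_u^T q_v, both vectors are updated from their old values, and the function name nmf_sg_update is hypothetical.

```python
def nmf_sg_update(p_u, q_v, r_uv, gamma, lambda_p, lambda_q):
    """One projected SG step for the rating r_uv, following (5.2).

    p_u, q_v: length-k lists of floats. Returns the new (p_u, q_v).
    """
    # Assumed definition of the error term: e = r - p_u^T q_v.
    e = r_uv - sum(p * q for p, q in zip(p_u, q_v))
    # Both right-hand sides use the old p_u and q_v; max(0, .) is the
    # element-wise projection onto the non-negative orthant.
    new_p = [max(0.0, p + gamma * (e * q - lambda_p * p))
             for p, q in zip(p_u, q_v)]
    new_q = [max(0.0, q + gamma * (e * p - lambda_q * q))
             for p, q in zip(p_u, q_v)]
    return new_p, new_q
```

For instance, with p_u = [1.0, 0.0], q_v = [1.0, 2.0], r_uv = 0.0, γ = 0.5 and no regularization, the unprojected second coordinate of p_u would be −1.0, and the projection sets it to 0.0.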

For CCD++, a modification for NMF has been proposed in (Hsieh and Dhillon, 2011). Like (5.2), it projects negative values back to zero during the coordinate descent method. Our experimental comparison of CCD++ and FPSG is presented in Figure 5.4. Similar to Figure 5.3, FPSG outperforms CCD++ on NMF.

Figure 5.4: A comparison between CCD++ and FPSG for non-negative matrix factorization.

5.3 Speedup of FPSG

Speedup is an indicator of the effectiveness of a parallel algorithm. On a shared memory system, it refers to the time reduction from using one core to several cores.

In this chapter, we compare the speedup of FPSG with other methods. From Figure 5.5, FPSG outperforms DSGD and HogWild. This result is expected because FPSG aims at improving some shortcomings of these two methods.

Compared with CCD++, FPSG is better on two data sets, while CCD++ is better on the others. As Algorithm 3 and Algorithm 4 show, FPSG and CCD++ are

Figure 5.5: Speedup of different matrix factorization methods.

the other on some problems. Nevertheless, even though CCD++ gives better speedup on some occasions, its overall performance (running time and RMSE) is still worse than that of FPSG in Figure 5.3. Thus parallel SG remains a compelling method for matrix factorization.

CHAPTER VI

Discussion

We discuss some miscellaneous issues in this chapter. Chapter 6.1 demonstrates that taking advantage of data locality can further improve the proposed FPSG method. In Chapter 6.2, the selection of the number of blocks is discussed.

6.1 Data Locality and the Update Order

In our partial random method, ratings in each block are ordered. We can consider a user-oriented or item-oriented ordering. Interestingly, these two ways may cause different costs on the data access. For example, in Figure 4.3, we consider a user-oriented setting, so under a given u, R_u,v and q_v, ∀(u, v) ∈ R, must be accessed. While R_u,v is a scalar, q_v, ∀(u, v) ∈ R involve many columns of the dense matrix Q. Therefore, for going through all users, Q is needed many times.

Alternatively, if an item-oriented setting is used, for every item, P^T is needed. Now if m ≫ n, P's size (k × m) is much larger than Q's (k × n). Under the user-oriented setting, it is possible that Q (or a significant portion of Q) can be stored in a higher


Order \ Data set   MovieLens   Netflix   Yahoo!Music   Hugewiki
User               2.27        22.50     173.34        1531.14
Item               2.91        43.26     294.19        1016.19

Table 6.1: Execution time (in seconds) of 50 iterations of FPSG

# item blocks   16     16     16     16     32     32     32     32
# user blocks   16     32     48     64     16     32     48     64
iterations      50     48     49     48     48     49     49     49
time            3.25   3.37   3.66   3.94   3.28   3.86   4.74   6.98

# item blocks   48     48     48     48     64     64     64     64
# user blocks   16     32     48     64     16     32     48     64
iterations      49     49     49     49     49     49     48     49
time            3.43   4.83   8.15   13.30  3.67   6.94   12.87  21.16

Table 6.2: The performance of FPSG on MovieLens with different numbers of blocks. The target RMSE is 0.858. Time is in seconds.

may have to be swapped out to lower-level memory several times. Thus the cost for data movements is higher. Based on this discussion, we conjecture that

\[
\begin{aligned}
m \gg n &\;\Rightarrow\; \text{user-oriented access should be used}, \\
m \ll n &\;\Rightarrow\; \text{item-oriented access should be used}.
\end{aligned}
\tag{6.1}
\]

We compare the two update orders in Table 6.1. For Netflix and Yahoo!Music, the user-oriented approach is much faster. From Table 5.1, these two data sets have m ≫ n. In contrast, because n ≫ m, the item-oriented approach is much better for Hugewiki.

This experiment confirms our conjecture in (6.1).
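The memory argument behind (6.1) can be made concrete with a small calculation from Table 5.1 and Table 6.1. The Python sketch below assumes single-precision storage (4 bytes per value, following Chapter 4.4) and simplifies the conditions m ≫ n and m ≪ n to m > n and m ≤ n. That simplification, and the names factor_sizes_mib and suggested_order, are assumptions made for illustration.

```python
# (m, n, k) from Table 5.1.
DATA_SETS = {
    "MovieLens": (71567, 65133, 40),
    "Netflix": (2649429, 17770, 40),
    "Yahoo!Music": (1000990, 624961, 100),
    "Hugewiki": (39706, 25034863, 100),
}

# Execution time in seconds from Table 6.1: (user order, item order).
TIMES = {
    "MovieLens": (2.27, 2.91),
    "Netflix": (22.50, 43.26),
    "Yahoo!Music": (173.34, 294.19),
    "Hugewiki": (1531.14, 1016.19),
}


def factor_sizes_mib(m, n, k, bytes_per_value=4):
    """Sizes of P (k x m) and Q (k x n) in MiB, assuming single precision."""
    mib = 2 ** 20
    return k * m * bytes_per_value / mib, k * n * bytes_per_value / mib


def suggested_order(m, n):
    """Simplified reading of (6.1): keep the smaller factor matrix hot."""
    return "user" if m > n else "item"


def faster_order(name):
    user_time, item_time = TIMES[name]
    return "user" if user_time < item_time else "item"
```

For Netflix, this gives roughly 404 MiB for P and 2.7 MiB for Q, so under the user-oriented order the matrix that is revisited for every user is the small one. On all four data sets, suggested_order agrees with the faster order in Table 6.1.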

6.2 Number of Blocks

Recall that in FPSG, R is divided into at least (s + 1) × (s + 1) blocks, where s is the number of threads. We conduct experiments to see how the number of blocks affects the performance of FPSG. The results on three data sets are listed in Table 6.2, Table 6.3, and Table 6.4. In these tables, “iterations” and “time” respectively mean

# item blocks 16 16 16 16 16 16 16