Memory Discontinuity - Problems in Parallel SG Methods for Matrix Factorization

III. Problems in Parallel SG Methods for Matrix Factorization

3.2 Memory Discontinuity

When a program accesses data in memory discontinuously, it suffers from a high cache-miss rate and performance degradation. Most SG solvers for matrix factorization including HogWild and DSGD randomly pick instances from R (or from a block of R) to be updated. We call this setting as the random method, which is illustrated in Figure 3.2. Though the random method generally enjoys good convergence, it suffers from

P^T Q

Figure 3.2: A random method to select rating instances for update.

the memory discontinuity seriously. The reason is that not only are rating instances randomly accessed, but also user/item identities become discontinuous.

The seriousness of the memory discontinuity varies in different methods. In Hog-Wild, each thread randomly picks instances among R independently, so it suffers from memory discontinuity in R, P , and Q. In contrast, for DSGD, though ratings in a block are randomly selected, as we will see in Chapter 4.2, we can easily change the update order to mitigate the memory discontinuity.

CHAPTER IV

Our Approaches

In this thesis, we propose two techniques, lock-free scheduling and partial random method, to respectively solve the locking problem mentioned in Chapter 3.1 and the memory discontinuity mentioned in Chapter 3.2. We name the new parallel SG method as fast parallel SG (FPSG). In Chapter 4.1, we discuss how FPSG flexibly assigns blocks to threads to avoid the locking problem. In Chapter 4.2, we observe that a comprehensive random selection may not be necessary, and show that randomization can be applied only among blocks instead of within blocks to maintain both the memory continuity and the fast convergence. In Chapter 4.3, we overview the complete design of FPSG. Finally, in Chapter 4.4, we introduce our implementation techniques to accelerate the computation.

4.1 Lock-Free Scheduling

We follow DSGD to grid R into blocks and design a scheduler to keep s threads busy in running a set of independent blocks. For a block b_i,j, if it is independent from all blocks being processed, then we call it as a free block. Otherwise, it is a non-free block. When a thread finishes processing a block, the scheduler assigns a new block that meets the following two criteria:

1. It is a free block.

2. Its number of past updates is the smallest among all free blocks.

The number of updates of a block indicates how many times it has been processed.

The second criterion is applied because we want to keep a similar number of updates for each block. If two or more blocks meet the above two criteria, then we randomly select one. Given s threads, we show that FPSG should grid R into at least (s + 1) × (s + 1) blocks. Take two threads as an example. Let T1 be a thread that is updating certain block and T2 be a thread that just finished updating a block and is getting a new job from the scheduler. If we grid R into 2 × 2 blocks shown in Figure 4.1a, then T2 has only one choice: the block it just processed. A similar situation happens when T1 gets its new job. Because T1 and T2 always process the same block, the remaining two blocks are never processed. In contrast, if we grid R into 3 × 3 blocks like Figure 4.1b, T2 has three choices b1,1, b1,2 and b2,1 when getting a new block.

As discussed above, because we can always assign a free block to a thread when it finishes updating the previous one, our scheduler does not suffer from the locking problem. However, for extremely unbalanced data sets, where most available ratings are in certain blocks, our scheduler is unable to keep the number of updates in all blocks balanced. In such a case blocks with many ratings are updated only very few times. A simple remedy is the random shuffling technique introduced in Chapter 3.1.

In our experience, after random shuffling, the number of ratings in the heaviest block is smaller than twice of the lightest block. We then experimentally check how serious the imbalance problem is after random shuffling. Here we define degree of imbalance (DoI) to check the number of updates in all blocks. Let UTM(t) and UTm(t) be the maximal and the minimal numbers of updates in all blocks, respectively, where t is the iteration index. (FPSG does not have the concept of iterations. Here we call every

T₁

Figure 4.1: An illustration of how the split of R to blocks affects the job scheduling.

T₁ is the thread that is updating block b_0,0. T₂ is the thread that is getting a new block from the scheduler. Blocks with “x” are dependent on block b_0,0, so they cannot be updated by T₂.

cycle of processing (s + 1)² blocks as an iteration.) DoI is defined as

DoI = UT_M(t) − UT_m(t)

t .

A small DoI indicates that the number of updates is similar across all blocks. In Figure 4.2, we show DoI for four different data sets. We can see that our scheduler reduces DoI to be close to zero in just a few iterations. For details of the data sets used in Figure 4.2, please refer to Chapter 5.1.

4.2 Partial Random Method

To achieve memory continuity, in contrast to the random method, we can consider an ordered method to sequentially select rating instances by user identities or item identities. Figure 4.3 gives an example of following the order of users. Then matrix P can be accessed continuously. Alternatively, if we follow the order of items, then the continuous access of Q can be achieved. For R, if the order of selecting rating instances is fixed, we can store R into memory with the same order to ensure its continuous access. Although the ordered method can access data in a more continuous manner, empirically we find that it is not stable. Figure 4.4 gives an example showing that under two slightly different learning rates for SG, the ordered method can be

0 200 400 600 800 1000

Figure 4.2: DoI on four data sets. R is grided into 13 × 13 blocks after being randomly shuffled and 12 threads are used.

either much faster or much slower than the random method.

The above experiment indicates that a random access of data/variables may be useful for the convergence. This property has been observed in related optimization techniques. For example, in coordinate descent methods to solve some optimization problems, Chang et al. (2008) show that a random rather than a sequential order to update variables significantly improves the convergence speed. To compromise be-tween data continuity and convergence speed, in FPSG, we propose a partial random method, which selects ratings in a block orderly but randomizes the selection of blocks.

Although our scheduling is close to deterministic by choosing blocks with the smallest numbers of accesses, the randomness can be enhanced by griding R into more blocks.

Then at any time point, some blocks have been processed by the same number of

P^T Q

Figure 4.3: Ordered method to select rating instances for update.

the partial random method works using three threads. Figure 4.6 extends the running time comparison in Figure 4.4 to include FPSG. We can see that FPSG enjoys both fast convergence and excellent RMSE. Some related methods have been investigated in (Gemulla et al., 2011), although they showed that the convergence on the ordered method in terms of training loss is worse than the random method. Their observation is opposite to our experimental results. A possible reason is that we consider RMSE on the testing set while they consider the training loss.

Some subtle implementation details must be noted. We discussed in Chapter 4.1 that FPSG applies random shuffling to avoid the unbalanced number of updates of each block. However, after applying the random shuffling and griding R in to blocks, the ratings in each block are not sorted by user (or item) identities. To apply the partial random method we must sort user identities before processing each block because an ordered method is applied within the block. We give an illustration in Figure 4.7.

In the beginning, we make the rating matrix more balanced by randomly shuffling all ratings; see the middle figure in Figure 4.7. However, user identities and item identities become not ordered, so we cannot achieve memory continuity by using the update strategy shown in Figure 4.3. Therefore, we must rearrange ratings in each block so that their row indices (i.e., user identities) are ordered; see the last figure in Figure 4.7.

0 1000 2000 3000 4000 5000

0 1000 2000 3000 4000 5000 20

Figure 4.4: A comparison between the random method and the ordered method using the Yahoo!Music data set. One thread is used.

Algorithm 4 The overall procedure of FPSG

1: randomly shuffle R

2: grid R into a set B with at least (s + 1) × (s + 1) blocks

3: sort each block by user (or item) identities

4: construct a scheduler

5: launch s working threads

6: wait until the total number of updates reaches a user-defined value

4.3 Overview of FPSG

Algorithm 4 gives the overall procedure of FPSG. Based on the discussion in Chap-ter 4.1, FPSG first randomly shuffles R to avoid data imbalance. Then it grids R into at least (s + 1) × (s + 1) blocks and applies the partial random method discussed in Chapter 4.2 by sorting each block by user (or item) identities. Finally it constructs a scheduler and launches s working threads. After the required number of iterations is reached, it notifies the scheduler to stop all working threads. The pseudo code of the scheduler and each working thread are shown in Algorithm 5 and Algorithm 6, respectively. Each working thread continuously gets a block from the scheduler by invoking get job, and the scheduler returns a block that meets criteria mentioned in Chapter 4.1. After a working thread gets a new block, it processes ratings in the block

Figure 4.5: An illustration of the partial random method. Each color indicates the block being processed by a thread. Within each block, the update sequence is ordered like that in Figure 4.3. If block (1,1) is finished first, three candidates independent of the two other running blocks (2,2) and (3,3) are (1,4), (4,1), and (4,4), which are indicated by red arrows. If these three candidates have been accessed by the same number of times, then one is randomly chosen. This example explains how we achieve the random order of blocks.

4.4 Implementation Issues

FPSG uses the standard thread class in C++ implemented by pthread to do the parallelization. For the data set Yahoo!Music of about 250M ratings, using a typical machine (details specified in Chapter 5.1), FPSG finishes processing all ratings once in 6 seconds and takes only about 8 minutes to converge to a reasonable RMSE. Here we describe some techniques employed in our implementation. First, empirically we find that using single-precision floating-point computation does not suffer from numerical error accumulation. For the data set Netflix, using single precision runs 1.1 times faster then using double precision. Second, modern CPU provides SSE instructions that can concurrently run floating-point multiplications and additions. We apply SSE instructions for vector inner products and additions. For Yahoo!Music data set, the speed up is 2.4 times. Figure 4.8¹ shows the speedup after these techniques are applied in two data sets.

1For experimental settings, see Chapter 5.1.

0 1000 2000 3000 4000 5000

0 1000 2000 3000 4000 5000 20

Figure 4.6: A comparison between the ordered method, the random method, and the partial random method on the set Yahoo!Music. One thread is used.

Rating Matrix After the random shuffle Sort row indices in each block

Figure 4.7: An illustration of the partial random method. After the random shuffle of data, some indices (in red color) are not ordered within each block. We make the row indices ordered (in blue color) by a sorting procedure.

0 50 100 150 200 250

0 500 1000 1500 2000

Figure 4.8: A comparison between two implementations of FPSG in Netflix and

Ya-Algorithm 5 Scheduler of FPSG

1: procedure get job

2: Initialize an empty list b and b.ut_min = ∞ . ut: number of updates

3: for all b in B do

4: if b is non-free then

5: continue

6: else

7: if b.ut == b.ut_min then

8: Add b into b

9: else if b.ut < b.utmin then

10: b.ut_min = b.ut

11: Make b empty and add b into b

12: end if

13: end if

14: end for

15: Randomly select an element denoted by bx from b

16: return b_x

17: end procedure

18: procedure put job(b)

19: b.ut = b.ut + 1

20: end procedure

Algorithm 6 Working thread of FPSG

1: while true do

2: get a block b from scheduler → get job()

3: process elements orderly in this block

4: scheduler → put job(b)

5: end while

CHAPTER V

Experiments

In this chapter, we provide the details about our experimental settings, and compare FPSG with other parallel matrix factorization algorithms mentioned in Chapter II.

5.1 Settings

Data Sets: Four data sets, MovieLens,¹ Netflix, Yahoo!Music, and Hugewiki,² are used for the experiments. For reproducibility, we consider the original training/test sets in our experiments if they are available (for MovieLens, we use Part B of the original data set generated by the official script). Because the test set of Yahoo!Music is not available, we consider the last four ratings of each user for testing, while the remaining ratings for training set. The data set Hugewiki is too large to fit in our machines, so we sample one quarter of the data randomly, and split them into training/test sets.

The statistics of each data set is in Table 5.1.

Platform: We use a server with two Intel Xeon E5-2620 2.0GHz processors and 64 GB memory. There are six cores in each processor.

Parameters: Table 5.1 lists the parameters used for each data set. The parameters k, λ_P, λ_Q may be chosen by a validation procedure although here we mainly borrow

Data Set MovieLens Netflix Yahoo!Music Hugewiki

m 71,567 2,649,429 1,000,990 39,706

n 65,133 17,770 624,961 25,034,863

#Training 9,301,274 99,072,112 252,800,275 761,429,411

#Test 698,780 1,408,395 4,003,960 100,000,000

k 40 40 100 100

λ_P 0.05 0.05 1 0.01

λ_Q 0.05 0.05 1 0.01

γ 0.003 0.002 0.0001 0.004

Table 5.1: The statistics and parameters for each data set. Note that the Hugewiki set used here contains only one quarter of the original set.

values from earlier works to obtain comparable results. For Netflix and Yahoo!Music, we use the parameters in (Yu et al., 2012); see values listed in Table 5.1. Although (Yu et al., 2012) have considered MovieLens, we use a different setting of λ_P = λ_Q = 0.05 for a better RMSE. For Hugewiki, we consider the same parameters as in (Yun et al., 2014). The initial values of P and Q are chosen randomly under a uniform distribution.

This setting is the same as that in (Yu et al., 2012). The learning rate is determined by an ad hoc parameter selection. Because we focus on the running speed rather than RMSE in this thesis, we do not apply an adaptive learning rate.

In our platform, 12 physical cores are available, so we use 12 threads in all exper-iments. For FPSG, even though Chapter IV shows that (s + 1) × (s + 1) blocks are already enough for s threads, we use more blocks to ensure the randomness of blocks that are simultaneously processed. For Netflix, Yahoo!Music and Hugewiki, R is grided into 32 × 32 blocks; for MovieLens, R is grided into 16 × 16 blocks because the number of non-zeros is smaller.

Evaluation: As most recommender systems do, the metric adopted as our evalu-ation is RMSE on the test set, which is disjoint with the training set; see Eq. (1.4).

In addition, the time in each figure refers to the training time.

Implementations: Among methods included for comparison, HogWild³and CCD++⁴ are publicly available. We reimplement HogWild under the same framework of our FPSG and DSGD implementations for a fairer comparison. In the official HogWild package, the formulation includes the average value of training ratings. After trying different settings, the program still fails to converge. Therefore, we present only results of our HogWild implementation in the experiments.

The publicly available CCD++ code uses double precision. Because ours uses single precision following the discussion in Chapter 4.4, for a fair comparison, we obtain a singles-precision version of CCD++ from its authors. Note that OpenMP⁵ is used in their implementation.

5.2 Comparison of Methods on Training Time versus RMSE

We first illustrate the effectiveness of our solutions for data imbalance and mem-ory discontinuity. Then, we compare parallel matrix factorization methods including DSGD, CCD++, HogWild and our FPSG.

5.2.1 The effectiveness of addressing the locking problem

In Chapter 3.1, we mentioned that updating several blocks in a batch may suffer from the locking problem if the data is unbalanced. To verify the effectiveness of FPSG, in Figure 5.1, we compare it with a modification where the scheduler processes a batch of independent blocks as DSGD (Algorithm 2) does. We call the modified algorithm as FPSG**. It can be clearly seen that FPSG runs much faster than FPSG** because it does not suffer from the locking problem.

3http://hazy.cs.wisc.edu/hazy/victor/

4http://www.cs.utexas.edu/~rofuyu/libpmf/

5http://openmp.org/

0 5 10 15 20 25 Figure 5.1: A comparison between FPSG** and FPSG.

5.2.2 The effectiveness of having better memory locality

We conduct experiments to investigate if the proposed partial random method can not only avoid memory discontinuity, but also keep good convergence. In Figure 5.2, we select rating instances in each block orderly (the partial random method) or randomly (the random method). Both methods converge to a similar RMSE, but the training time of the partial random method is obviously shorter than that of the random method.

5.2.3 Comparison with the state-of-the-art methods

Figure 5.3 presents the test RMSE and training time of various parallel matrix factorization methods. Among the three parallel SG methods, FPSG is faster than DSGD and HogWild. We believe that this result is because FPSG is designed to effec-tively address issues mentioned in Chapter III. However, we must note that for DSGD, it is also easy to incorporate similar techniques (e.g., the partial random method) to

0 50 100 150

0 1000 2000 3000 4000 5000 6000 22

0 1000 2000 3000 4000 5000

0.55

Figure 5.2: A comparison between the partial random method and the random method.

improve its performance.

As shown in Figure 5.3, CCD++ is the fastest in the beginning, but becomes slower than FPSG. Because the optimization problem of matrix factorization is non-convex and CCD++ is a more greedy setting than SG by accurately minimizing the objective function over certain variables at each step, we suspect that CCD++ may converge to some local minimum pre-maturely. On the contrary, SG-based methods may be able to escape from a local minimum because of the randomness. Furthermore, for the Hugewiki in Figure 5.3, CCD++ does not give a satisfactory RMSE. Note that in addition to the regularization parameter used in this experiment, (Yun et al., 2014) have applied larger parameters for Hugewiki. The resulting RMSE can be improved.

5.2.4 Comparison with CCD++ for Non-negative Matrix Factorization

0 20 40 60 80 100 120 140

0 500 1000 1500 2000 2500

0.92

0 500 1000 1500 2000 2500 3000 22

0 500 1000 1500 2000 2500 3000 0.55

Figure 5.3: A comparison among the state-of-the-art parallel matrix factorization methods.

other matrix factorization problems. We consider non-negative matrix factorization (NMF) that requires the non-negativity of P and Q.

minP,Q

(u,v)∈R

(r_u,v− p^T_uq_v)²+ λ_Pkp_uk²+ λ_Qkq_vk², (5.1)

subject to Piu≥ 0, Qiv ≥ 0, ∀i ∈ {1, · · · , k}.

It is straightforward to warp FPSG for solving (5.1) with a simple projection (Gemulla et al., 2011), and the corresponding update rules are

p_u ← max(0, p_u+ γ (e_u,vq_v− λ_Pp_u)) q_v ← max(0, q_v + γ (e_u,vp_u− λ_Qq_v)),

(5.2)

where max(·, ·) is an element-wise maximum operation.

For CCD++, a modification for NMF has been proposed in (Hsieh and Dhillon, 2011). Like (5.2), it projects negative values back to zero during the coordinate descent

0 2 4 6 8 10

Figure 5.4: A comparison between CCD++ and FPSG for non-negative matrix fac-torization.

method. Our experimental comparison on CCD++ and FPSG is presented in Figure 5.4. Similar to Figure 5.3, FPSG outperforms CCD++ on NMF.

5.3 Speedup of FPSG

Speedup is an indicator on the effectiveness of a parallel algorithm. On a shared memory system, it refers to the time reduction from using one core to several cores.

In this chapter, we compare the speedup of FPSG with other methods. From Figure 5.5, FPSG outperforms DSGD and HogWild. This result is expected because FPSG aims at improving some shortcomings of these two methods.

Compared with CCD++, FPSG is better on two data sets, while CCD++ is bet-ter on the others. As Algorithm 3 and Algorithm 4 show, FPSG and CCD++ are

1 2 3 4 5 6 7 8 9 10 11 12 Figure 5.5: Speedup of different matrix factorization methods.

the other on some problems. Nevertheless, even though CCD++ gives better speedup in some occasions, its overall performance (running time and RMSE) is still worse than FPSG in Figure 5.3. Thus parallel SG remains a compelling method for matrix factorization.

CHAPTER VI

Discussion

We discuss some miscellaneous issues in this chapter. Chapter 6.1 demonstrates

在文檔中在共享記憶體系統的快速平行隨機梯度下降法矩陣分解 (頁 23-0)