Speedup of FPSG - A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems

V. Experiments

5.3 Speedup of FPSG

Speedup is an indicator of the effectiveness of a parallel algorithm. On a shared-memory system, it refers to the reduction in running time when moving from one core to several cores.
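As a minimal sketch of this definition, speedup on s cores can be taken as the one-core running time divided by the s-core running time. The function name `speedup` and the numbers in the usage line are illustrative assumptions, not measurements from our experiments.

```python
def speedup(time_one_core: float, time_s_cores: float) -> float:
    """Speedup = running time on one core / running time on s cores."""
    if time_one_core <= 0 or time_s_cores <= 0:
        raise ValueError("running times must be positive")
    return time_one_core / time_s_cores


# Hypothetical timings in seconds (placeholders, not measured results).
print(speedup(100.0, 25.0))
```

A value close to the number of cores indicates near-linear scaling; a value that flattens as cores are added indicates that the extra threads spend their time waiting or moving data.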

In this section, we compare the speedup of FPSG with that of other methods. As Figure 5.5 shows, FPSG achieves better speedup than DSGD and HogWild. This result is expected because FPSG is designed to address some shortcomings of these two methods.

Compared with CCD++, FPSG is better on two data sets, while CCD++ is better on the others. As Algorithm 3 and Algorithm 4 show, FPSG and CCD++ are

Figure 5.5: Speedup of different matrix factorization methods.

the other on some problems. Nevertheless, even though CCD++ gives better speedup on some occasions, its overall performance (running time and RMSE) is still worse than that of FPSG, as shown in Figure 5.3. Thus parallel SG remains a compelling method for matrix factorization.

CHAPTER VI

Discussion

We discuss some miscellaneous issues in this chapter. Section 6.1 demonstrates that taking advantage of data locality can further improve the proposed FPSG method. Section 6.2 discusses the selection of the number of blocks.

6.1 Data Locality and the Update Order

In our partial random method, ratings in each block are ordered. We can consider a user-oriented or an item-oriented ordering. Interestingly, these two orderings may incur different data-access costs. For example, in Figure 4.3, we consider a user-oriented setting, so under a given $u$,

\[
R_{u,v} \text{ and } q_v, \quad \forall (u, v) \in R
\]

must be accessed. While $R_{u,v}$ is a scalar, the vectors $q_v$, $\forall (u, v) \in R$, involve many columns of the dense matrix $Q$. Therefore, in going through all users, $Q$ is needed many times.
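The access pattern described above can be illustrated with a small sketch. It sorts the ratings of one block by user or by item and counts how often consecutive ratings touch a different user vector or item vector. The helper names `order_block` and `switches`, the tie-breaking by the other index, and the toy block are assumptions made for illustration; they are not taken from the FPSG implementation.

```python
from typing import List, Tuple

Rating = Tuple[int, int, float]  # (user u, item v, rating R_{u,v})


def order_block(ratings: List[Rating], by: str = "user") -> List[Rating]:
    """Return the ratings of one block sorted by user or by item.

    Ties are broken by the other index (an assumption of this sketch).
    """
    if by == "user":
        return sorted(ratings, key=lambda r: (r[0], r[1]))
    if by == "item":
        return sorted(ratings, key=lambda r: (r[1], r[0]))
    raise ValueError("by must be 'user' or 'item'")


def switches(ratings: List[Rating], index: int) -> int:
    """Count consecutive rating pairs whose user (index 0) or item (index 1) differs."""
    return sum(1 for a, b in zip(ratings, ratings[1:]) if a[index] != b[index])


# Toy block with two users and three items (illustrative data only).
block = [(0, 2, 4.0), (1, 0, 3.0), (0, 0, 5.0), (1, 2, 1.0), (0, 1, 2.0)]
user_order = order_block(block, by="user")
item_order = order_block(block, by="item")

print("user order: p_u switches =", switches(user_order, 0),
      ", q_v switches =", switches(user_order, 1))
print("item order: p_u switches =", switches(item_order, 0),
      ", q_v switches =", switches(item_order, 1))
```

Under the user-oriented order the user index changes rarely while the item index changes at almost every step, which is why the columns of $Q$ are the data that must stay close to the processor in that setting.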

Alternatively, if an item-oriented setting is used, for every item, $P^T$ is needed. Now if $m \gg n$, the size of $P$ ($k \times m$) is much larger than that of $Q$ ($k \times n$). Under the user-oriented setting, it is possible that $Q$ (or a significant portion of $Q$) can be stored in a higher


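| Order | MovieLens | Netflix | Yahoo!Music | Hugewiki |
|-------|-----------|---------|-------------|----------|
| User  | 2.27      | 22.50   | 173.34      | 1531.14  |
| Item  | 2.91      | 43.26   | 294.19      | 1016.19  |

Table 6.1: Execution time (in seconds) of 50 iterations of FPSG on each data set under the user-oriented and item-oriented update orders.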

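| # item blocks | 16   | 16   | 16   | 16   | 32   | 32   | 32   | 32   |
|---------------|------|------|------|------|------|------|------|------|
| # user blocks | 16   | 32   | 48   | 64   | 16   | 32   | 48   | 64   |
| iterations    | 50   | 48   | 49   | 48   | 48   | 49   | 49   | 49   |
| time          | 3.25 | 3.37 | 3.66 | 3.94 | 3.28 | 3.86 | 4.74 | 6.98 |

| # item blocks | 48   | 48   | 48   | 48    | 64   | 64   | 64    | 64    |
|---------------|------|------|------|-------|------|------|-------|-------|
| # user blocks | 16   | 32   | 48   | 64    | 16   | 32   | 48    | 64    |
| iterations    | 49   | 49   | 49   | 49    | 49   | 49   | 48    | 49    |
| time          | 3.43 | 4.83 | 8.15 | 13.30 | 3.67 | 6.94 | 12.87 | 21.16 |

Table 6.2: The performance of FPSG on MovieLens with different numbers of blocks. The target RMSE is 0.858. Time is in seconds.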

may have to be swapped out to lower-level memory several times. Thus the cost of data movement is higher. Based on this discussion, we conjecture that

\[
\begin{aligned}
m \gg n &\;\Rightarrow\; \text{user-oriented access should be used},\\
m \ll n &\;\Rightarrow\; \text{item-oriented access should be used}.
\end{aligned}
\tag{6.1}
\]
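Rule (6.1) can be written as a small selection function. This is only a sketch: the name `choose_order` and the threshold `ratio`, which stands in for "much larger", are hypothetical and are not part of FPSG. The text gives no numerical cutoff, so the default of 2.0 is an arbitrary placeholder.

```python
def choose_order(m: int, n: int, ratio: float = 2.0) -> str:
    """Pick an update order following rule (6.1).

    m is the number of users and n the number of items.
    `ratio` is a hypothetical stand-in for "much larger than".
    """
    if m >= ratio * n:
        return "user"    # m >> n: Q (k x n) is the smaller factor matrix
    if n >= ratio * m:
        return "item"    # m << n: P (k x m) is the smaller factor matrix
    return "either"      # (6.1) makes no prediction when m and n are comparable


print(choose_order(1000, 10))   # many more users than items
print(choose_order(10, 1000))   # many more items than users
```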

We compare the two update orders in Table 6.1. For Netflix and Yahoo!Music, the user-oriented approach is much faster. From Table 5.1, these two data sets have $m \gg n$. In contrast, because $n \gg m$ for Hugewiki, the item-oriented approach is much better on that data set.

This experiment is consistent with our conjecture in (6.1).
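To make the comparison in Table 6.1 easier to read, the following sketch computes, for each data set, which order is faster and the ratio of item-oriented to user-oriented time. The timings are copied from Table 6.1; the names `times` and `faster_order` are our own for this illustration.

```python
# Execution times in seconds for 50 iterations of FPSG, copied from Table 6.1.
times = {
    "MovieLens":   {"user": 2.27,    "item": 2.91},
    "Netflix":     {"user": 22.50,   "item": 43.26},
    "Yahoo!Music": {"user": 173.34,  "item": 294.19},
    "Hugewiki":    {"user": 1531.14, "item": 1016.19},
}


def faster_order(t: dict) -> str:
    """Return the update order ("user" or "item") with the smaller time."""
    return min(t, key=t.get)


for name, t in times.items():
    ratio = t["item"] / t["user"]  # > 1 means the user-oriented order is faster
    print(f"{name}: faster = {faster_order(t)}, item/user time ratio = {ratio:.2f}")
```

A ratio above 1 favors the user-oriented order and a ratio below 1 favors the item-oriented order.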

6.2 Number of Blocks

Recall that in FPSG, R is split into at least (s + 1) × (s + 1) blocks, where s is the number of threads. We conduct experiments to see how the number of blocks affects the performance of FPSG. The results on three data sets are listed in Table 6.2, Table 6.3, and Table 6.4. In these tables, “iterations” and “time” respectively mean

# item blocks 16 16 16 16 16 16 16