
Matching Users and Items Across Domains to Improve the Recommendation Quality

Chung-Yi Li

Department of Computer Science and Information Engineering, National Taiwan University, Taiwan

r00922051@csie.ntu.edu.tw

Shou-De Lin

Department of Computer Science and Information Engineering, National Taiwan University, Taiwan

sdlin@csie.ntu.edu.tw

ABSTRACT

Given two homogeneous rating matrices with some overlapped users/items whose mappings are unknown, this paper aims at answering two questions. First, can we identify the unknown mapping between the users and/or items? Second, can we further utilize the identified mappings to improve the quality of recommendation in either domain? Our solution integrates a latent space matching procedure and a refining process based on the optimization of prediction to identify the matching. Then, we further design a transfer-based method to improve the recommendation performance. Using both synthetic and real data, we have done extensive experiments given different real life scenarios to verify the effectiveness of our models. The code and other materials are available at http://www.csie.ntu.edu.tw/~r00922051/matching/

Categories and Subject Descriptors

I.2.6 [ARTIFICIAL INTELLIGENCE]: Learning—Parameter learning

Keywords

Collaborative Filtering; Transfer Learning; Matrix Factorization

1. INTRODUCTION

Nowadays, recommendation systems are widely deployed for not only e-business services but also brick-and-mortar stores. In general, there are two possible strategies in designing a recommendation system: content-based [10] and collaborative filtering (CF) based [5] approaches. The former relies on the content and profile information of users and items to perform recommendation, while the latter relies mainly on the ratings of items given by users to identify similar users and items for recommendation. This paper mainly focuses on the analysis of CF-based models and assumes only rating data are available.

Collecting a sufficient amount of rating data has been recognized as a critical factor in designing a successful CF-based recommendation system. As CF-based recommendation systems become increasingly popular, it is not hard to imagine that more and more rating data would become available to models for different businesses.


One might wonder what kind of useful knowledge one can extract from several different rating sets of completely independent services.

For example, a newly established online song requesting service, which possesses some internal ratings about its users' preferences for songs, can take advantage of the publicly available rating data from other similar services (e.g. the Yahoo! music service). The reason is that some overlap exists between the users and/or items of the two services; only the exact mappings are unknown. We call the newly established service "the target domain" and the other service "the source domain". One would then appreciate a model that, based only on the two sets of rating data, is able to identify not only the user mappings but also the item mappings between the two domains. Such connections can then be used to enrich the understanding of the users and items in the target domain. One would be even happier if the model can exploit the likely "marginally correct" correspondence it found previously to boost the performance of a CF system built in the target domain. In our experiments, we provide 8 representative scenarios (e.g. whether the user sets are completely overlapped, partially overlapped, or completely independent, or whether the ratings are disjoint across matrices or not) to verify the usefulness of such a model.

Formally, we assume there are two homogeneous rating matrices: a target matrix R1 and an auxiliary data matrix R2. The rows and columns represent users and items, and each entry in the matrix represents the rating of an item given by a user. In these matrices, the two user sets and two item sets overlap to some extent, but we do not know how to associate the users and the items between the matrices. We want to answer the following two questions:

1. Given only R1 and R2, is it possible to find out the mapping of users and the mapping of items between them?

2. Given the noisy mapping we obtained, is it possible to transfer information from R2 to improve the rating prediction in R1?

The underlying assumption we make is that the rating behaviors of both domains are similar; we call such rating data "homogeneous". The problems sound hard because the user and item correspondences are both unknown, which prevents us from exploiting any supervised learning algorithm. If one side is already matched (e.g. if the items are matched), then one can use approaches like nearest neighbor to match the other side (e.g. users), as discussed in [13]. Unfortunately, such a method cannot be applied when the columns and rows of the rating matrices have both gone through an unknown permutation process.


Fortunately, the matching problem is not impossible to solve. Recall the underlying assumption of collaborative filtering: similar users shall rate items similarly, and similar items shall receive similar ratings from users. Viewing collaborative filtering as a matrix completion process, the above assumption actually suggests the rows/columns of the rating matrix are indeed dependent; that is, the matrix is low-rank. The low-rank assumption has been validated by experiments. Owing to this assumption, researchers have proposed a family of strategies based on matrix factorization (MF), which has proven to be one of the most successful solutions for collaborative filtering [6, 1].

Extending from the concept of CF, our key hypothesis is that if two users rated some items similarly in one domain, they shall rate similar items similarly in the other domain. The low-rank assumption on both matrices plays a key role in the solution we propose in this paper. Our idea is to find a way to transform the incomplete matrices R1 and R2 into low-rank approximations, and match users/items in the latent space. One plausible solution is to perform matrix factorization on R1 and R2 and then try to match users/items in the latent space. Figure 1 shows an illustrative example.

Figure 1: User and Item Matching on Low-Rank Matrices

At first glance, it seems non-trivial to find the relationship between R1 and R2. Assuming that both matrices are low-rank, they can be factorized into a user-latent matrix and an item-latent matrix. Then it might be possible to match users based on the two user-latent matrices, as shown in the first decomposition in Figure 1. Unfortunately, matrix factorization is not unique, so R1 and R2 can be factorized using different latent spaces, prohibiting any further matching of users or items (see the second decomposition in Figure 1).

Another plausible idea is to apply singular value decomposition (SVD) to the rating matrices R1 and R2, and then match users and items based on the singular vectors. This paper proposes a trick that allows us to obtain a singular value decomposition of an incomplete matrix. However, the solution of SVD is still not unique, i.e. sign differences may exist, and the decomposition is unstable when noise occurs. We then propose a greedy-based algorithm that solves the sign problem and searches the nearest neighbors in the latent space. Then, we propose to adjust the matching results through a process that minimizes the prediction error on the existing ratings of R1 using R2, making the matching results more resistant to noise. Experiments show that our method can accurately find the correspondence given clean low-rank data, while achieving much higher matching accuracy than other competitors if given real, noisy data. Finally, based on the discovered matching outcome, we propose a transfer learning approach that transfers the rating information from R2 to R1 to boost the performance of the recommendation system in R1. We conduct extensive experiments including a variety of scenarios to verify the effectiveness of our model.
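To make the latent-space matching idea concrete, below is a minimal sketch in Python of one plausible instantiation. It assumes the incomplete matrices have already been completed into low-rank approximations (the paper's trick for obtaining an SVD of an incomplete matrix lives in the methodology and is not reproduced here), and the sorted-profile sign heuristic is our own assumption for resolving the per-dimension sign ambiguity.

import numpy as np
from scipy.spatial import cKDTree

# Sketch: match users of R1 to users of R2 via truncated SVD factors.
def latent_match(R1_hat, R2_hat, k=50):
    U1, s1, _ = np.linalg.svd(R1_hat, full_matrices=False)
    U2, s2, _ = np.linalg.svd(R2_hat, full_matrices=False)
    F1 = U1[:, :k] * np.sqrt(s1[:k])   # user factors scaled by singular values
    F2 = U2[:, :k] * np.sqrt(s2[:k])
    # Resolve the sign ambiguity of SVD, (u_k, v_k) vs (-u_k, -v_k): flip a
    # column of F2 when its sorted value profile matches F1's column better
    # after negation (a heuristic assumption; with overlapping user sets the
    # coordinate distributions along each singular vector should agree).
    for j in range(k):
        a = np.sort(F1[:, j])
        if np.linalg.norm(a - np.sort(-F2[:, j])) < np.linalg.norm(a - np.sort(F2[:, j])):
            F2[:, j] = -F2[:, j]
    # Nearest-neighbor search in the shared latent space.
    _, nn = cKDTree(F2).query(F1)      # nn[i]: user of R2 matched to user i of R1
    return nn

Item matching is identical with the right singular vectors, and the refinement step that minimizes the prediction error on R1's observed ratings would start from this initial correspondence.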

2. RELATED WORK

2.1 User Identification

User identification has been studied under different settings by different research communities. In natural language processing, author identification [2] is formulated as a text categorization problem, in which a supervised learning model is built based on features extracted from the authors' manuscripts. In social network analysis, Narayanan and Shmatikov [14] propose to link two social networks and de-anonymize users by first identifying some seed nodes and then iteratively propagating the mapping.

Some approaches have recently been proposed to link users from different social media sites. Zafarani and Liu propose MOBIUS for cross-media user identification [20]. Based on features extracted from user names, they learn a classification model to determine whether an account belongs to a certain person. Liu et al. [9] associate users with an unsupervised approach by calculating the rareness of the names. When a rare name (such as pennystar881) occurs on different websites, it is very likely that the two accounts are owned by the same person. Yuan et al. [19] propose another unsupervised approach to link users. They find that some users may explicitly display their other accounts on the profile page or disclose the account links when they share a post across websites. They design an algorithm to automatically capture all such information to link users.

Most of these studies differ from ours because our model relies on rating data, and we do not assume any labeled pairs are available for training. The closest work to ours is [13], in which Narayanan and Shmatikov study how to de-anonymize users from rating data. In their setup the item mapping is known, which makes the task easier than the one we are trying to solve. Moreover, our model embeds a second objective: to improve the rating prediction in the target matrix.

2.2 Transfer Learning in Collaborative Filtering Given Correspondence

Many models have been proposed for transfer learning in collaborative filtering [17, 11, 18, 21, 15, 16, 4, 3]. The common goal is to transfer information across several data matrices. Most of the models assume that there is a one-to-one correspondence between users or between items across domains.

For example, in collective matrix factorization [17], a rating matrix (users by movies) and a label matrix (genres by movies) are used for transferring. The shared movie side then becomes the bridge for transferring, as they assume the latent factors of the movies are similar. Similar ideas can be applied on matrices of different time frames [18], on binary versus numerical matrices [15, 16], or on a rating matrix and a social relation matrix [11, 4].

To sum up, the above models transfer information across two or more heterogeneous data matrices under the restriction that the latent factors of the shared user side or item side are similar. Our setting is quite the opposite: the data matrices are homogeneous, while both the user and item correspondences are unknown.

2.3 Transfer Learning in Collaborative Filtering When Correspondence is Unknown

We have found only two models that transfer information between rating matrices without exploiting user and item correspondences, and we discuss them in detail here.

Codebook transfer (CBT) [7] and the rating-matrix generative model (RMGM) [8] transfer information between two rating matrices without assuming user correspondence or item correspondence. The basic assumption of the two models is of the form:

$R_1 \approx U_1 B V_1^T, \qquad R_2 \approx U_2 B V_2^T,$

where B is a shared K-by-K matrix. The major difference between these models and our model is that these models do not assume user correspondence or item correspondence between R1 and R2. Instead, they assume that R1 and R2 have a homogeneous rating pattern; they enforce constraints on the U's and V's while assuming the homogeneous pattern matrix, B, is shared across R1 and R2. However, the constraints make the optimization process complicated and limit the expressive power of the model.

In CBT, U and V are constrained to be 0-1 matrices, and there can only be one entry with value 1 in each row. After adding the constraints, $R \approx U B V^T$ can be viewed as a co-clustering process that simultaneously divides users and items into groups. Assume a user is in the i-th user group and an item is in the j-th item group; then the rating of such a user and item ($r_{ij}$ in R) is predicted to be the corresponding group rating value in B ($b_{ij}$). Thus, the formulation of CBT is actually associating groups of users (items) in R1 with groups of users (items) in R2.
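As a toy sketch of this lookup (the codebook values and group assignments below are hypothetical, purely for illustration):

import numpy as np

# Hypothetical CBT-style prediction: with hard co-clustering, a rating is
# predicted by looking up the rating of the user's group and the item's group.
B = np.array([[4.0, 2.0],
              [1.0, 5.0]])           # K x K codebook of group ratings
user_group = np.array([0, 0, 1])     # position of the single 1 in each row of U
item_group = np.array([1, 0])        # position of the single 1 in each row of V

def predict(u, i):
    return B[user_group[u], item_group[i]]   # r_ij is predicted as b_{g(u), h(i)}

print(predict(2, 0))   # user 2 is in group 1, item 0 is in group 1 -> 5.0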

However, the hard clustering constraint greatly reduces the expressive power of CBT; many low-rank matrices cannot be factored under such a constraint. Furthermore, to solve CBT, the optimization process requires the auxiliary matrix R2 to be fully observed, or it has to fill in the missing entries with the data mean before factorization. Since there are many more missing entries than observed ones in a sparse rating matrix, filling in the missing values manually can seriously bias the model.

RMGM relaxes the constraints in CBT from hard clustering to soft clustering by using a probabilistic graphical model, and it does not require R2 to be fully observed. To be more specific, the joint probability defined in RMGM is

$P(r, m^{(i)}, n^{(i)}, C_m, C_n) = P(m^{(i)}|C_m)\,P(n^{(i)}|C_n)\,P(r|C_m, C_n)\,P(C_m)\,P(C_n),$

and the rating prediction is

$\sum_r r \sum_{k_1, k_2} P(r|C_m = k_1, C_n = k_2)\,P(C_m = k_1|m^{(i)})\,P(C_n = k_2|n^{(i)}),$

where $m^{(i)}$ and $n^{(i)}$ are the user and item in the i-th domain, and $C_m$ and $C_n$ are the clusters for the user and item, respectively. If we rewrite the rating prediction as

$\sum_{k_1, k_2} P(C_m = k_1|m^{(i)})\,P(C_n = k_2|n^{(i)}) \left( \sum_r r\,P(r|C_m = k_1, C_n = k_2) \right),$

the term $\sum_r r\,P(r|C_m, C_n)$ is similar to B in CBT, and $P(C_m|m^{(i)})$ and $P(C_n|n^{(i)})$ are similar to $U_i$ and $V_i$, respectively. When the rating prediction is perfect, $R = U_i B V_i^T$.

Even though RMGM relaxes the hard clustering constraints in CBT, constraints still exist: the elements in each row of U and V must be nonnegative and sum up to 1. This again limits the expressive power of the model and complicates the optimization task.

Besides, the objective function of RMGM is the self-defined likelihood. It deviates from common evaluation measures such as root mean square error. Consider the following example:

$R = \begin{pmatrix} 1 & 1 \\ 3 & 3 \\ 2 & 2 \end{pmatrix}, \quad U = P(C_m|m^{(i)}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0.5 & 0.5 \end{pmatrix}, \quad V^T = P(C_n|n^{(i)})^T = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix},$

$B = \sum_r r\,P(r|C_m, C_n) = 1 \times \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix} + 2 \times \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} + 3 \times \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 3 & 3 \end{pmatrix}.$

When we evaluate this model by root mean square error, the error is 0 because R is exactly the product of the three matrices. However, the data likelihood defined by RMGM is also 0 because the term $P(r = 2|C_m, C_n)$ is always 0. Thus, RMGM will try to find other parameters to fit the data, but they do not necessarily minimize the prediction error. Nevertheless, RMGM is considered the state of the art and will be our main competitor in the evaluation.
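To make the example concrete, the following sketch reproduces the computation in NumPy and confirms that the soft memberships and the expected group ratings reconstruct R exactly; all values are taken from the example above.

import numpy as np

U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])                  # P(C_m = k1 | m)
V = np.array([[1.0, 0.0],
              [1.0, 0.0]])                  # P(C_n = k2 | n); V^T as above
ratings = np.array([1.0, 2.0, 3.0])
P_r = np.zeros((3, 2, 2))                   # P(r | C_m, C_n)
P_r[0] = [[1, 1], [0, 0]]                   # r = 1
P_r[2] = [[0, 0], [1, 1]]                   # r = 3

B = np.einsum('r,rkl->kl', ratings, P_r)    # sum_r r * P(r | C_m, C_n)
print(U @ B @ V.T)                          # [[1. 1.], [3. 3.], [2. 2.]], i.e. R

The reconstruction error is 0, yet P(r = 2 | C_m, C_n) is 0 everywhere, which is exactly the mismatch between likelihood and RMSE described above.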

In conclusion, past models do not try to solve the correspondence problem and instead impose extra constraints on the factorization. Our model directly solves the correspondence problem and leverages such information to transfer across matrices.

3. METHODOLOGY

Given two partially observed rating matrices R1 and R2, we make the following assumptions:

1. There are some common users and common items across R1 and R2, but we do not know the correspondence.

2. The two matrices represent the same homogeneous domain. That is, if we can link the common users/items (rows/columns) between R1 and R2, we can merge them into one single matrix (Figure 2), and such a matrix is likely to be low-rank, which allows collaborative filtering models to perform recommendation.

To find out the correspondence between R1 and R2, intuitively we want to solve

$R_1 \approx G_{user} R_2 G_{item}^T,$

where $G_{user}$ and $G_{item}$ are 0-1 matrices that represent the user and item correspondences. The symbol $\approx$ implies that the corresponding entries in $R_1$ and $G_{user} R_2 G_{item}^T$ shall be as similar as possible. The apparent challenge is that the rating matrices are partially observed, and very few entries are observed in both R1 and R2. Thus, we want to modify the above equation to deal with this problem.
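In code, a candidate correspondence can be scored directly in this form. The helper below is a hypothetical sketch (the function and mask names are our own); it illustrates both the permutation $G_{user} R_2 G_{item}^T$ and why sparsity is a problem: the score is only defined on entries observed in both matrices.

import numpy as np

# user_map[i] / item_map[j] give the row/column of R2 matched to row i /
# column j of R1, i.e. they encode the 0-1 matrices G_user and G_item.
# M1, M2 are boolean masks of the observed entries of R1, R2.
def matching_error(R1, M1, R2, M2, user_map, item_map):
    R2p = R2[np.ix_(user_map, item_map)]    # G_user R2 G_item^T
    M2p = M2[np.ix_(user_map, item_map)]    # permuted observation mask
    both = M1 & M2p                         # entries observed in both matrices
    if not both.any():                      # with sparse data this happens often,
        return np.inf                       # which is exactly the challenge above
    return np.mean((R1[both] - R2p[both]) ** 2)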

We want to exploit the low-rank property of the rating matrices to solve the problem. To do so, we replace the original rating matrix by its fully filled low-rank approximation (i.e. no missing values).

We propose the following two objective functions:

1. $\hat{R}_1 \approx G_{user} \hat{R}_2 G_{item}^T$

2. $R_1 \approx G_{user} \hat{R}_2 G_{item}^T$

We want to find G's that satisfy the above criteria. The symbol $\hat{R}$ stands for the low-rank approximation of R. In Equation 1, we hope to map the low-rank approximation of R2 to the low-rank approximation of R1. In Equation 2, we hope the observed entries of R1 match the corresponding entries of the mapped approximation $G_{user} \hat{R}_2 G_{item}^T$.


The extra regularization terms associated with β restrict the latent factors of matched users/items to be the same. Therefore, if the matching is correct, since the corresponding users or items are forced to align in their latent space, the extra information from the other domain can then be exploited.

Acknowledging the fact that our matching is not perfect, special care needs to be taken to prevent incorrect matchings from hampering the prediction performance. Thus, we add an arctan function to alleviate the influence of outliers. When the matching results become unreliable, our model degenerates into the regular matrix factorization model, as β becomes 0 after parameter selection.

The objective is still differentiable and can be solved via standard approaches such as gradient descent.
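Equations 4-7 are not reproduced here, but the sketch below shows the general shape described above under our own assumptions: a squared-error matrix factorization loss on R1's observed ratings, L2 regularization, and a β-weighted arctan penalty tying the latent factors of matched users to their (fixed) source-domain counterparts. The exact placement of the arctan is an assumption consistent with the description.

import numpy as np

# obs: list of (u, i, r) triples observed in the target matrix R1.
# P2_matched[u]: source-domain factors of the user matched to target user u.
def transfer_loss(P1, Q1, obs, lam, beta, P2_matched):
    err = sum((r - P1[u] @ Q1[i]) ** 2 for u, i, r in obs)
    reg = lam * (np.sum(P1 ** 2) + np.sum(Q1 ** 2))
    diff = np.sum((P1 - P2_matched) ** 2, axis=1)   # per-user factor distance
    robust = beta * np.sum(np.arctan(diff))         # arctan caps each penalty
    return err + reg + robust

Because arctan saturates, a badly matched user contributes a bounded penalty (at most π/2), which is one way the influence of incorrect matchings stays limited; with β tuned to 0 the loss reduces to regular matrix factorization.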

4. EXPERIMENTS

4.1 Experimental Setup

Table 1: Statistics of R

Dataset        # of Users and Items   Rating Scale     Sparsity
Low Rank       (20000, 10000)         real, [-1, 1]    5%
Yahoo! Music   (20000, 10000)         integer, 0-100   5.4%

Our goal is to link the users and items in two homogeneous rating matrices, R1 and R2, and then use the corresponding information in R2 to improve rating prediction in R1. However, we cannot find two independent datasets that provide the ground truth for user/item mappings to evaluate our model. Fortunately, we can use a real dataset and split it into two rating sets as R1 and R2. Another advantage of using such a strategy for evaluation is that we can then test different scenarios (i.e. different splitting conditions) to evaluate the usefulness of our model under a variety of different assumptions. The user and item ids of R2 are randomly permuted, and the permutation (unknown to our model) becomes the ground truth of the correspondence.

We conduct experiments on a synthetic dataset and a real dataset, the Yahoo! music dataset. The synthetic dataset is a noise-free low-rank matrix for verifying the soundness of different models. It is a rank-50 matrix, generated by the following MATLAB command:

randn(20000,50)*diag(1.1.^[1:50])*randn(50,10000)

We sample 5% of it as R and linearly scale the minimum and maximum values to -1 and 1. The Yahoo! music dataset has been used as the benchmark data in KDD Cup 2011 [1]. For both datasets, we take out a subset R and split it into R1 and R2. The statistics of R are listed in Table 1.
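For readers working outside MATLAB, a NumPy equivalent of this generation step might look as follows (a sketch; the seed and the way observed entries are sampled are our own choices):

import numpy as np

rng = np.random.default_rng(0)

# NumPy rendering of randn(20000,50)*diag(1.1.^[1:50])*randn(50,10000).
# Note: the full matrix takes roughly 1.6 GB as float64.
R = (rng.standard_normal((20000, 50))
     @ np.diag(1.1 ** np.arange(1, 51))
     @ rng.standard_normal((50, 10000)))

# Linearly scale so the minimum maps to -1 and the maximum to 1,
# then keep 5% of the entries as the observed ratings.
R = 2 * (R - R.min()) / (R.max() - R.min()) - 1
observed = rng.random(R.shape) < 0.05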

With the capability to control the mapping condition, we are now ready to test the effectiveness of our model under different assumptions. The conditions are listed below (see Figure 7 for details):

1. Disjoint Split. We assume a user will not rate the same item twice, so the same rating will not appear in both R1 and R2. A real-world scenario is ratings for durable goods. For example, assume R1 and R2 are two nearby retailers that sell laptops, and thus have similar customers and products. When a user buys a laptop in R1, he or she is unlikely to buy the same laptop again in the other store.

2. Overlap Split. We assume ratings given in R1 may or may not appear in R2. This is a very common situation. For example, given two supermarkets in the same area, a customer can buy identical or different products in both stores.

3. Contained Split. We assume R2 is more frequently visited and all rating information in R1 is also available in R2. This is an extreme case of the overlap split when the overlap ratio is maximized, and it is presumably the easiest situation for linking users and items between R1 and R2.

For the three splits, we vary the overlap ratio for ratings, but we assume that every user and item in R1 can be found in R2 and vice versa. This assumption may not hold in some cases, so we created two additional scenarios.

4. Subset Split. We assume the users and items in R1 are subsets of those in R2. This is a common situation when R1 and R2 are two stores selling the same type of goods in the same area, where R2 is a much larger and potentially cheaper store that has more products for sale.

5. Partial Split. We assume the user/item sets in R1 partially overlap with those in R2. This is also a very common situation: two stores have overlapping but not identical customers and products for sale. We want to first discover the mapping between the overlapped users/items, and then use such information to improve the rating prediction.

For the subset split and the partial split, we simply assume that their overlap ratio for ratings is similar to that of the overlap split.

Because R1 is the target domain in which we want to evaluate whether rating prediction accuracy can be boosted by knowledge transfer, we divide its data into training, validation, and testing sets. R2, on the other hand, belongs to the source domain, so we only need to divide it into training and validation sets. Validation sets are used for parameter tuning in factorization models. Each of the two validation sets takes up 2.5% of the original data R, while the testing set takes up 5%. None of the validation and testing sets overlaps with any other set, so as to ensure the sanity of our experiment. The training sets of R1 and R2 are sampled from R according to the scenarios described above (Figure 7).

4.2 Competing Models

The first experiment evaluates the matching quality. To our knowledge, no solution has yet been proposed that utilizes only incomplete rating information to perform user/item alignment, so we compare our method with other baseline models.

We define the following three baseline models. Here we only describe the user-matching procedure, as the item-matching procedure is identical.

1. Matching based on Regular Matrix Factorization. We conduct regular matrix factorization and then perform nearest neighbor matching based on the Euclidean distances between the obtained latent factors.

2. Matching based on User Mean/Item Mean. We find the nearest users for matching based on the means of their ratings. In other words, users with similar rating averages are matched.

3. Matching based on Rating Lists. We first sort the ratings of each user from large to small into a list. For any two users, we align the two rating lists and use the Euclidean distance between the two vectors as the matching criterion. In other words, two users are paired if their sorted rating lists are similar (a sketch follows below).
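A minimal sketch of this third baseline follows; zero-padding the shorter list to a common length is our own assumption, since the alignment of lists of unequal length is not spelled out above.

import numpy as np

# Each user's ratings are sorted from large to small and zero-padded
# (an assumption) to a common length before computing Euclidean distances.
def sorted_profile(ratings, length):
    v = np.zeros(length)
    s = np.sort(np.asarray(ratings))[::-1]
    v[:len(s)] = s
    return v

def rating_list_match(lists1, lists2):
    L = max(max(len(r) for r in lists1), max(len(r) for r in lists2))
    A = np.stack([sorted_profile(r, L) for r in lists1])   # users of R1
    B = np.stack([sorted_profile(r, L) for r in lists2])   # users of R2
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)   # nearest R2 user for each R1 user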

As mentioned in the introduction section, latent factor matching from simple matrix factorization does not work. The solution of matrix factorization is not unique, since

$P Q^T = P A A^{-1} Q^T \qquad (8)$

for any invertible matrix A.

Figure 7: Illustration of Splits (the training sets of R1 and R2 under the disjoint, overlap, contained, subset, and partial splits, annotated with the proportions of the ratings and of the users/items of R that each side receives)

Even when R1 = R2, we can still obtain very different P1 and P2. In fact, experimental results show that the first two baselines perform no better than random guessing, so we omit them from the result tables to save space. We found that the third baseline works much better than the first two, and therefore its results are included in our tables.

Second, we conduct an experiment to evaluate the quality boost of rating prediction in R1 after transferring information from R2. There has not yet been much work on information transfer without knowing the underlying correspondence. Hence, for rating prediction we use the rating-matrix generative model (RMGM) as the state of the art for comparison.

4.3 Implementation Details

For our factorization models, we use gradient descent and backtracking line search to solve the objective function. The dimension K is fixed to 50, and the parameters λ and β are automatically selected by observing the error rate on the validation set. After selection, λ is determined to be 0 for the noise-free low-rank dataset and 5 for the Yahoo! music dataset, and β ranges from 0 to 400. Besides, we employ data scaling and an early stopping procedure: we scale the ratings in the training set to zero mean and unit variance, and scale the values back in the prediction phase; the training process stops when the validation error starts to increase.
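A self-contained toy sketch of these training-loop conventions follows (our own construction: a fixed step size stands in for backtracking line search, and the problem sizes are shrunk for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Toy MF training with data scaling and early stopping, as described above.
n_users, n_items, K, lam, step = 100, 80, 5, 0.1, 0.001
truth = rng.standard_normal((n_users, K)) @ rng.standard_normal((K, n_items))
mask = rng.random(truth.shape) < 0.2                 # observed entries
val = mask & (rng.random(truth.shape) < 0.2)         # hold out part for validation
train = mask & ~val

mu, sigma = truth[train].mean(), truth[train].std()
R = (truth - mu) / sigma                             # zero mean, unit variance

P = 0.1 * rng.standard_normal((n_users, K))
Q = 0.1 * rng.standard_normal((n_items, K))
best = np.inf
for epoch in range(2000):
    E = (R - P @ Q.T) * train                        # residual on training entries
    P, Q = P + step * (E @ Q - lam * P), Q + step * (E.T @ P - lam * Q)
    pred = (P @ Q.T) * sigma + mu                    # scale back before evaluating
    val_rmse = np.sqrt(np.mean((truth - pred)[val] ** 2))
    if val_rmse > best:                              # stop when validation error rises
        break
    best = val_rmse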

We implement two versions of the rating-matrix generative model (RMGM) [8] for comparison; they can be solved by a standard expectation-maximization algorithm (an example implementation is provided by the authors of RMGM at https://sites.google.com/site/libin82cn/). The original RMGM uses a categorical distribution for $P(r|C_m, C_n)$. However, the categorical distribution does not reflect the ordinal relation among the rating values. Therefore, we implement another version of RMGM that uses a Gaussian distribution. For both models, the original and Gaussian RMGMs, we set the latent dimension K to 50 (we find that a larger K leads to similar performance), and we conduct the same early stopping procedure. For Gaussian RMGM, we find that the variance of the Gaussian is better set to a constant parameter tuned by observing the error rate on the validation set. We have also applied the data scaling process to Gaussian RMGM.

4.4 Results of User and Item Matching

In the matching process, the matching can output either the most likely candidate or a ranked list of candidates for each user/item in R1. Therefore, we can use accuracy as well as mean average precision (MAP) as the evaluation criteria. However, in the partial split, some users and items in R1 do not appear in R2. Thus, when evaluating the matching results in this scenario, we remove these users and items from consideration.
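As a sketch, the two criteria can be computed as below; with a single correct counterpart per entity, average precision reduces to the reciprocal rank of that counterpart (the data in the usage line is made up).

import numpy as np

# `truth` gives the correct R2 counterpart for each R1 user, or None when no
# counterpart exists (e.g. in the partial split, where such users are skipped).
def accuracy_and_map(ranked_lists, truth):
    hits, ap = [], []
    for cands, t in zip(ranked_lists, truth):
        if t is None:
            continue
        rank = cands.index(t) + 1
        hits.append(rank == 1)        # accuracy counts exact top-1 hits
        ap.append(1.0 / rank)         # AP = reciprocal rank for one relevant item
    return np.mean(hits), np.mean(ap)

print(accuracy_and_map([[3, 1, 2], [0, 2, 1]], [3, 2]))   # (0.5, 0.75)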


Table 2: Matching Result on the Low-Rank Dataset

                    Rating List   Latent Space   Refinement
Disjoint Split
  Accuracy (user)   0.000         1.000          1.000
  MAP (user)        0.003         1.000          1.000
  Accuracy (item)   0.001         1.000          1.000
  MAP (item)        0.007         1.000          1.000
Overlap Split
  Accuracy (user)   0.025         1.000          1.000
  MAP (user)        0.048         1.000          1.000
  Accuracy (item)   0.051         1.000          1.000
  MAP (item)        0.089         1.000          1.000
Contained Split
  Accuracy (user)   0.538         1.000          1.000
  MAP (user)        0.612         1.000          1.000
  Accuracy (item)   0.703         1.000          1.000
  MAP (item)        0.765         1.000          1.000
Subset Split
  Accuracy (user)   0.013         1.000          1.000
  MAP (user)        0.029         1.000          1.000
  Accuracy (item)   0.029         1.000          1.000
  MAP (item)        0.058         1.000          1.000
Partial Split
  Accuracy (user)   0.007         1.000          1.000
  MAP (user)        0.019         1.000          1.000
  Accuracy (item)   0.018         1.000          1.000
  MAP (item)        0.042         1.000          1.000


The matching results are shown in Table 2 and Table 3. The results on the noise-free low-rank dataset demonstrate the soundness of our approach: when there is no noise in the rating matrix, our latent space matching can precisely match all users and items by comparing the singular vectors, regardless of the split. On the other hand, the performance of the baseline model (rating list) improves when more ratings are shared between domains, but the results are still far from perfect.

On the Yahoo! music dataset, we see a similar trend: when more ratings are shared, the results are generally better. We observe that the contained split is easier to match than the overlap split, while both are easier than the disjoint split. The matching accuracies for the subset split and partial split are slightly worse than (though still comparable to) that of the overlap split. This is because, although the ratio of shared ratings is similar among these splits, some users and items can never be aligned in the subset split and the partial split.


Table 3: Matching Result on Yahoo! Music

                    Rating List   Latent Space   Refinement
Disjoint Split
  Accuracy (user)   0.074         0.310          0.633
  MAP (user)        0.126         0.419          0.717
  Accuracy (item)   0.019         0.204          0.325
  MAP (item)        0.047         0.325          0.463
Overlap Split
  Accuracy (user)   0.150         0.547          0.960
  MAP (user)        0.226         0.652          0.973
  Accuracy (item)   0.056         0.442          0.786
  MAP (item)        0.107         0.578          0.859
Contained Split
  Accuracy (user)   0.419         0.851          0.997
  MAP (user)        0.526         0.905          0.998
  Accuracy (item)   0.323         0.815          0.975
  MAP (item)        0.423         0.886          0.986
Subset Split
  Accuracy (user)   0.122         0.392          0.918
  MAP (user)        0.190         0.510          0.941
  Accuracy (item)   0.047         0.297          0.686
  MAP (item)        0.090         0.433          0.780
Partial Split
  Accuracy (user)   0.109         0.350          0.871
  MAP (user)        0.175         0.470          0.906
  Accuracy (item)   0.043         0.272          0.573
  MAP (item)        0.081         0.402          0.684


In all cases, the latent space matching we propose already enjoys a significant boost compared to the best baseline model, while the matching refinement through optimization leads to further great improvement over latent space matching. This is reasonable because latent space matching has some drawbacks, which are discussed in Section 3.1.4.

4.5 Results for Rating Prediction by Transferring

Table 4: Rating Prediction (RMSE) on Yahoo! Music

                   RMGM    Gaussian RMGM   MF      Proposed Approach
Disjoint Split     27.47   26.59           24.24   23.34†
Overlap Split      27.54   26.48           24.23   23.49†
Contained Split    27.58   26.53           24.29   23.92†
Subset Split       27.65   26.56           24.13   23.40†
Partial Split      27.57   26.51           24.21   23.77†

Results significantly better than the second best are marked with †.

Next, we want to evaluate whether our matching can indeed be used to improve the quality of a recommendation system in the target domain. We resort to the experiment on rating prediction and use root mean square error (RMSE) as the evaluation criterion. The hypothesis to be verified with this experiment is that, despite the existence of user/item mis-matchings, our model can still improve the prediction accuracy of the target domain after information transfer. Here we focus on the Yahoo! music dataset because on the noise-free low-rank dataset the RMSE before transferring is already close to 0.

The results are shown in Table 4. Gaussian RMGM outperforms the original RMGM, likely because actual ratings do not follow a categorical distribution. However, RMGM does not perform well in this experiment. One of the reasons is that the likelihood function optimized by RMGM does not directly reflect the objective used for evaluation (discussed in Section 2.3), while in our model minimizing the prediction error (i.e. RMSE in this experiment) is one of the direct objectives.

In all cases, our proposed approach leads to significant improvement over the original matrix factorization (MF) model, which is widely considered one of the most effective single-domain models. Interestingly, although the results in Table 3 show that the matching accuracy can be ranked as disjoint < overlap < contained split, the results in Table 4 show that in terms of rating prediction the improvement for the disjoint split is the best, while the improvement for the contained split is the worst. We believe this is because when there is less overlap of ratings, the amount of information R2 provides to R1 increases, and our solution is capable of leveraging the extra information to provide better prediction outcomes.

4.6 When Users/Items Do Not Overlap

Figure 8: Illustration of Three New Splits (the training sets of R1 and R2 under the user disjoint, item disjoint, and user-item disjoint splits; R1 and R2 receive 40% and 50% of the users and/or items of R along each disjoint dimension)

Finally, we want to test what happens when the original assumption of our model is violated: that is, what happens if the user sets or the item sets do not overlap at all. This implies the matching is always wrong, and our matching algorithm can at best identify some "similar" users/items. Thus, we create three other splits, illustrated in Figure 8.

1. User Disjoint Split. The user sets of R1 and R2 are disjoint, and the item sets are the same, though the mapping is still unknown. This is also a common scenario in the real world. For example, consider two DVD-rental stores located in different states: although they have similar items for rent, their customer bases are completely disjoint.

2. Item Disjoint Split. The user sets are identical (without knowing the mapping), while the item sets are disjoint. For example, two nearby restaurants, one serving Eastern food and the other serving Western food, may share a similar set of customers.

3. User-Item Disjoint Split. The user sets and item sets are both non-overlapping. This is conceivably the most challenging setting of all.

The results are shown in Table 5 and Table 6. Along the disjoint dimension, our algorithm can only identify "similar" entities (thus we do not report the matching accuracy for that dimension). Despite this, it is very promising to see that the rating prediction can still be improved, most likely because the correspondence can be identified along the other, overlapped dimension. If both sides are disjoint, our model cannot yield any boost, as β becomes 0 after parameter tuning and our model degenerates into the regular matrix factorization model.

In short, when the original assumption is violated, our model may still lead to some improvement, and it is at least no worse than the regular matrix factorization model. This discovery delivers an encouraging and important message for practical usage because the aforementioned scenarios are all very common in the real world.

Table 5: Matching Result on Yahoo! Music

                    Rating List   Latent Space   Refinement
User Disjoint Split
  Accuracy (item)   0.017         0.148          0.247
  MAP (item)        0.044         0.257          0.370
Item Disjoint Split
  Accuracy (user)   0.070         0.159          0.579
  MAP (user)        0.121         0.250          0.663

Table 6: Rating Prediction (RMSE) on Yahoo! Music

                           MF      Proposed Approach
User Disjoint Split        23.24   23.15†
Item Disjoint Split        23.88   23.35†
User-Item Disjoint Split   24.21   24.21

Results significantly better than the second best are marked with †.

5. CONCLUSION

We present a novel yet intuitive two-step algorithm for the challenging and seldom tackled task of identifying user/item correspondences between two homogeneous rating matrices. After the correspondences are identified, we introduce a transfer learning approach to boost the rating prediction accuracy in the target domain. Our two-stage matching algorithm not only matches users/items in the latent space, but also refines the matching based on the objective of predicting the observed values in the target domain. The refinement not only boosts the quality of the matching but also facilitates the subsequent transfer process that enhances rating prediction.

We test our model on 8 different scenarios, each corresponding to a real-world transfer setting. The results are very promising, as our model can identify the matching between user/item sets from different domains to a useful extent. More importantly, except in one extremely difficult scenario where the users and items are both completely disjoint, our model significantly boosts the rating prediction performance.

6. ACKNOWLEDGEMENTS

This work was sponsored by AOARD grant number FA2386-13-1-4045; the Ministry of Science and Technology, National Taiwan University, and Intel Corporation under grants NSC102-2911-I-002-001 and NTU103R7501; and grants 102-2923-E-002-007-MY2, 102-2221-E-002-170, and 101-2628-E-002-028-MY2.

References

[1] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! Music dataset and KDD-Cup'11. Journal of Machine Learning Research: Proceedings Track, 2012.

[2] M. Gamon. Linguistic correlates of style: authorship classification with deep linguistic analysis features. In Proc. COLING, 2004.

[3] L. Hu, J. Cao, G. Xu, L. Cao, Z. Gu, and C. Zhu. Personalized recommendation via cross-domain triadic factorization. In Proc. WWW, 2013.

[4] M. Jamali and L. Lakshmanan. HeteroMF: recommendation in heterogeneous information networks using context dependent factor models. In Proc. WWW, 2013.

[5] Y. Koren and R. Bell. Advances in collaborative filtering. In Recommender Systems Handbook, pages 145-186. Springer, 2011.

[6] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 2009.

[7] B. Li, Q. Yang, and X. Xue. Can movies and books collaborate? Cross-domain collaborative filtering for sparsity reduction. In Proc. IJCAI, 2009.

[8] B. Li, Q. Yang, and X. Xue. Transfer learning for collaborative filtering via a rating-matrix generative model. In Proc. ICML, 2009.

[9] J. Liu, F. Zhang, X. Song, Y.-I. Song, C.-Y. Lin, and H.-W. Hon. What's in a name? An unsupervised approach to link users across communities. In Proc. WSDM, 2013.

[10] P. Lops, M. de Gemmis, and G. Semeraro. Content-based recommender systems: state of the art and trends. In Recommender Systems Handbook, pages 73-105. Springer, 2011.

[11] H. Ma, H. Yang, M. R. Lyu, and I. King. SoRec: social recommendation using probabilistic matrix factorization. In Proc. CIKM, 2008.

[12] A. Mnih and R. Salakhutdinov. Probabilistic matrix factorization. In Proc. NIPS, 2008.

[13] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. In IEEE Symposium on Security and Privacy, 2008.

[14] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In IEEE Symposium on Security and Privacy, 2009.

[15] W. Pan, E. W. Xiang, N. N. Liu, and Q. Yang. Transfer learning in collaborative filtering for sparsity reduction. In Proc. AAAI, 2010.

[16] W. Pan and Q. Yang. Transfer learning in heterogeneous collaborative filtering domains. Artificial Intelligence, 2013.

[17] A. P. Singh and G. J. Gordon. Relational learning via collective matrix factorization. In Proc. SIGKDD, 2008.

[18] L. Xiong, X. Chen, T.-K. Huang, J. G. Schneider, and J. G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In Proc. SDM, 2010.

[19] N. J. Yuan, F. Zhang, D. Lian, K. Zheng, S. Yu, and X. Xie. We know how you live: exploring the spectrum of urban lifestyles. In Proc. COSN, 2013.

[20] R. Zafarani and H. Liu. Connecting users across social media sites: a behavioral-modeling approach. In Proc. KDD, 2013.

[21] Y. Zhang, B. Cao, and D.-Y. Yeung. Multi-domain collaborative filtering. In Proc. UAI, 2010.
