
Compatibility Family Learning for Item Recommendation and Generation

Yong-Siang Shih¹, Kai-Yueh Chang¹, Hsuan-Tien Lin¹,², Min Sun³

¹Appier Inc., Taipei, Taiwan, ²National Taiwan University, Taipei, Taiwan, ³National Tsing Hua University, Hsinchu, Taiwan.

{yongsiang.shih,kychang}@appier.com, htlin@csie.ntu.edu.tw, sunmin@ee.nthu.edu.tw

Abstract

Compatibility between items, such as clothes and shoes, is a major factor in customers' purchasing decisions. However, learning "compatibility" is challenging due to (1) broader notions of compatibility than those of similarity, (2) the asymmetric nature of compatibility, and (3) the small set of compatible and incompatible items that is observed. We propose an end-to-end trainable system to embed each item into a latent vector and project a query item into K compatible prototypes in the same space. These prototypes reflect the broad notions of compatibility. We refer to both the embedding and prototypes as the "Compatibility Family". In our learned space, we introduce a novel Projected Compatibility Distance (PCD) function which is differentiable and ensures diversity by aiming for at least one prototype to be close to a compatible item, whereas none of the prototypes are close to an incompatible item. We evaluate our system on a toy dataset, two Amazon product datasets, and the Polyvore outfit dataset. Our method consistently achieves state-of-the-art performance. Finally, we show that we can visualize the candidate compatible prototypes using a Metric-regularized Conditional Generative Adversarial Network (MrCGAN), where the input is a projected prototype and the output is a generated image of a compatible item. We ask human evaluators to judge the relative compatibility between our generated images and images generated by CGANs conditioned directly on query items. Our generated images are significantly preferred, with roughly twice the number of votes as others.

1 Introduction

Identifying compatible items is an important aspect of building recommendation systems. For instance, recommending matching shoes for a specific dress is important for fashion; recommending a wine to go with different dishes is important for restaurants. In addition, it is valuable to visualize what style is missing from the existing dataset so as to foresee potential matching items that could have been up-sold to the users. We believe that the generated compatible items could inspire fashion designers to create novel products and help our business clients to fulfill the needs of customers.

For items with a sufficient number of viewing or purchasing intents, it is possible to take the co-viewing (or co-purchasing) records as signals of compatibility, and simply use standard techniques for a recommendation system, such as collaborative filtering, to identify compatible items. In real-world applications, however, it is quite common that there are insufficient records to make a decent compatible recommendation; it is then critical to fully exploit the relevant content associated with items, such as the images for dresses, or the wineries for wines.

Figure 1: Notion of Compatibility (Left) vs. Similarity (Right). Left: The upper outer garment in the center (red circle) is the query item. The surrounding items are its compatible ones. The styles of both the compatible shoes and lower body garments vary. Right: The style of a similar item (bottom) is constrained.

Even when leveraging such relevant information, recommending or generating compatible items is challenging for three key reasons. First, the notion of compatibility typically goes across categories and is broader and more diverse than the notion of similarity, and it involves complex many-to-many relationships. As shown in Figure 1, compatible items are not necessarily similar and vice versa. Second, the compatibility relationship is inherently asymmetric in real-world applications. For instance, students purchase elementary textbooks before buying advanced ones, and house owners buy furniture only after their house purchases. Recommendation systems must take this asymmetry into consideration: recommending car accessories to customers who bought cars is rational, whereas recommending cars to those who bought car accessories would be improper. These two reasons make many existing methods (McAuley et al. 2015; Veit et al. 2015; Lee, Seol, and Lee 2017) less fit for compatibility learning, as they aim to learn a symmetric metric to model the item-item relationship. Third, the currently available labeled datasets of compatible and incompatible items are insufficient to train a decent image generation model.



Due to the asymmetric relationships, the generator cannot simply learn to modify the input image as most CGANs do in the similarity learning setting.

However, humans have the capability to create compatible items by associating internal concepts. For instance, fashion designers utilize their internal concepts of compatibility, e.g., style and material, to design many compatible outfits. Inspired by this, we demonstrate that extracting meaningful representations of compatibility from image content is an effective way of tackling these challenges.

We aim at recommending and generating compatible items through learning a "Compatibility Family". The family for each item contains a representation vector as the embedding of the item, and multiple compatible prototype vectors in the same space. We refer to the latent space as the "Compatibility Space". Firstly, we propose an end-to-end trainable system to learn the family for each item. The multiple prototypes in each family capture the diversity of compatibility, conquering the first challenge. Secondly, we introduce a novel Projected Compatibility Distance (PCD) function which is differentiable and ensures diversity by encouraging the following properties: (1) at least one prototype is close to a compatible item, (2) none of the prototypes is close to an incompatible item. The function captures the notion of asymmetry for compatibility, tackling the second challenge. While our paper focuses mainly on image content, this framework can also be applied to other modalities.

The learned Compatibility Family's usefulness is beyond item recommendation. We design a compatible image generator, which can be trained with only the limited labeled data given the succinct representation that has been captured in the compatibility space, bypassing the third challenge. Instead of directly generating the image of a compatible item from a query item, we first obtain a compatible prototype using our system. Then, the prototype is used to generate images of compatible items. This relieves the burden for the generator to simultaneously learn the notion of compatibility and how to generate realistic images. In contrast, existing approaches generate target images directly from source images or source-related features. We propose a novel generator referred to as the Metric-regularized Conditional Generative Adversarial Network (MrCGAN). The generator is restricted to work in a latent space similar to the compatibility space. In addition, it learns to avoid generating ambiguous samples that lie on the boundary of two clusters of samples that have conflicting relationships with some query items.

We evaluate our framework on the Fashion-MNIST dataset, two Amazon product datasets, and the Polyvore outfit dataset.

Our method consistently achieves state-of-the-art performance for compatible item recommendation. Finally, we show that we can generate images of compatible items using our learned Compatibility Family and MrCGAN. We ask human evaluators to judge the relative compatibility between our generated images and images generated by CGANs conditioned directly on query items. Our generated images are roughly 2x more likely to be voted as compatible.

The main contributions of this paper can be summarized as follows:

• Introduce an end-to-end trainable system for Compatibility Family learning to capture asymmetric relationships.

• Introduce a novel Projected Compatibility Distance to measure compatibility given limited ground truth compatible and incompatible items.

• Propose a Metric-regularized Conditional Generative Adversarial Network model to visually reveal our learned compatible prototypes.

2 Related Work

We focus on describing the related work in content-based compatible item recommendation and conditional image generation using Generative Adversarial Networks (GANs).

Content-based compatible item recommendation Many works assume a similarity learning setting that requires compatible items to stay close in a learned latent space. McAuley et al. (2015) proposed to use a Low-rank Mahalanobis Transform to map compatible items to embeddings that are close in the latent space. Veit et al. (2015) utilized the co-purchase records from Amazon.com to train a Siamese network to learn representations of items. Lee, Seol, and Lee (2017) assumed that different items in an outfit share a coherent style, and proposed to learn style representations of fashion items by maximizing the probability of item co-occurrences. In contrast, our method is designed to learn asymmetric relationships.

Several methods go beyond similarity learning. Iwata, Watanabe, and Sawada (2011) proposed to learn a topic model to find compatible tops from bottoms. He, Packer, and McAuley (2016) extended the work of McAuley et al. (2015) to compute a "query-item-only" dependent weighted sum (i.e., independent of the compatible items) of K distances between two items in K latent spaces to handle heterogeneous item recommendation. However, this means that the query item only prefers several subspaces. While this could deal with diversity across different query items (i.e., different query items have different compatible items), it is less effective for diversity across compatible items of the same query item (i.e., one query item has a diverse set of compatible items). Our model instead represents the distance between an item and a candidate as the minimum of the distance between each prototype and the candidate, allowing compatible items to locate at different locations in the same latent space. In addition, our method is end-to-end trained, and can be coupled with MrCGAN to generate images of compatible items, unlike the other methods.

Another line of research tackles outfit composition as a set problem, and attempts to predict the most compatible item in a multi-item set. Li et al. (2017) proposed learning the representation of an outfit by pooling the item features produced by a multi-modal fusion model. Prediction is made by computing a compatibility score for the outfit representation. Han et al. (2017) treated items in an outfit as a sequence and modeled the item recommendation with a bidirectional LSTM to predict the next item from the current ones. Note that these methods require multi-item sets to be given, which can potentially be a limitation in practical use.


$K$: the number of prototypes
$N$: the size of the latent vector
$X$: the set of query items
$Y$: the set of items to be recommended
$E_0(\cdot)$: the encoding function
$e^y_0$: a shorthand for $E_0(y)$
$E_k(\cdot)$: the $k$-th projection, where $k \in \{1, \ldots, K\}$
$e^x_k$: a shorthand for $E_k(x)$
$d(x, y)$: PCD between $x$ and $y$
$d_k(x, y)$: squared $L_2$ distance between $e^x_k$ and $e^y_0$
$G$: the generator
$D$: the discriminator
$Q_0(\cdot)$: the latent vector prediction from $D$
$q^y_0$: a shorthand for $Q_0(y)$
$z$: the noise input to the generator
$p_z$: the distribution of $z$

Table 1: Notation.

Generative Adversarial Networks After the introduction of Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) as a popular way to train generative models, GANs have been shown to be capable of conditional generation (Mirza and Osindero 2014), which has wide applications such as image generation by class labels (Odena, Olah, and Shlens 2017) or by texts (Reed et al. 2016), and image transformation between different domains (Isola et al. 2016; Kim et al. 2017; Zhu et al. 2017). Most works for conditional generation focused on similarity relationships. However, compatibility is represented by complex many-to-many relationships, such that the compatible item could be visually distinct from the query item and different query items could have overlapping compatible items.

The idea of regularizing GANs with a metric is related to the reconstruction loss used in autoencoders, which requires the reconstructed image to stay close to the original image; it was also applied to GANs by comparing the visual features extracted from samples to enforce the metric in the sample space (Boesen Lindbo Larsen et al. 2015; Che et al. 2016). Our model instead regularizes a subspace of the latent space (i.e., the input of the generator). The subspace does not have the restriction of having a known distribution that could be sampled, and visually distinct samples are allowed to be close as long as they have similar compatibility relationships with other items. This allows the subspace to be learned from a far more powerful architecture.

3 Our Method

We first introduce a novel Projected Compatibility Distance (PCD) function. Then, we introduce our model architecture and learning objectives. Finally, we introduce a novel Metric-regularized Conditional GAN (MrCGAN). The notation used in this paper is also summarized in Table 1.

3.1 Projected Compatibility Distance

Figure 2: Projected Compatibility Distance. A query item $x$ is respectively projected to $e^x_1$ and $e^x_2$ for its two distinct compatible items, $y_a$ and $y_b$. Thus, $d(x, y_a) \approx d_1(x, y_a)$ for $y_a$ while $d(x, y_b) \approx d_2(x, y_b)$ for $y_b$. For an incompatible item $y_c$, none of the projections are close to $e^{y_c}_0$.

Figure 3: Model Architecture. The CNNs on the left and right are identical. Only the prototypes $\{e^x_k\}_{k\in\{1,\ldots,K\}}$ from $x$ and $e^y_0$ from $y$ are considered to form Eq. (1).

PCD is proposed to measure the compatibility between two items $x$ and $y$. Each item is transformed into a latent vector by an encoding function $E_0(\cdot)$. Additionally, $K$ projections, denoted as $\{E_k(\cdot)\}_{k\in\{1,\ldots,K\}}$, are learned to directly map an item to $K$ latent vectors (i.e., prototypes) close to clusters of its compatible items. Each of the latent vectors has a size of $N$. Finally, the compatibility distance is computed as follows,

$$d(x, y) = \left\| \frac{\sum_{k=1}^{K} \exp(-d_k(x, y))\, e^x_k}{\sum_{k=1}^{K} \exp(-d_k(x, y))} - e^y_0 \right\|_2^2 \qquad (1)$$

where

$$d_k(x, y) = \left\| e^x_k - e^y_0 \right\|_2^2, \qquad (2)$$

and $e^x_k$ stands for $E_k(x)$ for readability.

The concept of PCD is illustrated in Figure 2. When at least one $e^x_k$ is close enough to the latent vector $e^y_0$ of item $y$, the distance $d(x, y)$ approaches $\min_k d_k(x, y)$.
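For concreteness, the following is a minimal NumPy sketch of Eqs. (1)-(2), assuming the K prototypes of a query item and the embedding of a candidate item are already available as arrays (the function and variable names are illustrative, not from any released implementation):

```python
import numpy as np

def pcd(prototypes_x, e_y0):
    """Projected Compatibility Distance, Eqs. (1)-(2).

    prototypes_x: array of shape (K, N), the prototypes e^x_k of a query item x.
    e_y0:         array of shape (N,),   the embedding E_0(y) of a candidate y.
    """
    d_k = np.sum((prototypes_x - e_y0) ** 2, axis=1)             # Eq. (2): squared L2
    w = np.exp(-d_k)                                             # soft weights exp(-d_k)
    blended = (w[:, None] * prototypes_x).sum(axis=0) / w.sum()  # weighted prototype
    return float(np.sum((blended - e_y0) ** 2))                  # Eq. (1)
```

When one prototype is much closer than the rest, its weight dominates and the value approaches $\min_k d_k(x, y)$, matching the behavior described above.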

3.2 Model Architecture

In our experiments, most of the layers between the $E_k(\cdot)$ are shared and only the last output layers are separated. This is achieved by using Siamese CNNs (Hadsell, Chopra, and LeCun 2006) for feature transformation. As illustrated in Figure 3, rather than learning just an embedding as in the original formulation, $K+1$ projections are learned: the item embedding $E_0(\cdot)$ and $K$ prototype projections.

We model the probability of a pair of items $(x, y)$ being compatible with a shifted-sigmoid function similar to He, Packer, and McAuley (2016), as shown below,

$$P(x, y) = \sigma_c(-d(x, y)) = \frac{1}{1 + \exp(d(x, y) - c)}, \qquad (3)$$

where $c$ is a shift parameter to be learned.


Figure 4: The training procedure of MrCGAN. The generated samples conditioned on different latent vectors are denoted as $y_{enc} = G(z, e^y_0)$ and $y_{prj} = G(z, e^x_k)$.

Learning objective Given the set of compatible pairs $R^+$ and the set of incompatible pairs $R^-$, we compute the binary cross-entropy loss, i.e.,

$$L_{ce} = -\frac{1}{|R^+|} \sum_{(x,y)\in R^+} \log\left(P(x, y)\right) - \frac{1}{|R^-|} \sum_{(x,y)\in R^-} \log\left(1 - P(x, y)\right), \qquad (4)$$

where $|\cdot|$ denotes the size of a set. We further regularize the compatibility space by minimizing the distance for compatible pairs. The total loss is as follows:

$$L = L_{ce} + \lambda_m \frac{1}{|R^+|} \sum_{(x,y)\in R^+} d(x, y), \qquad (5)$$

where $\lambda_m$ is a hyper-parameter to balance the loss terms.
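The training objective in Eqs. (3)-(5) reduces to a shifted-sigmoid cross-entropy plus a distance regularizer over positive pairs. The sketch below, assuming the PCD values for a batch of positive and negative pairs have already been computed, only illustrates this computation (the names are ours):

```python
import numpy as np

def compatibility_loss(d_pos, d_neg, c, lambda_m=0.5):
    """Eqs. (3)-(5) for one batch.

    d_pos: PCD values d(x, y) for compatible pairs (x, y) in R^+.
    d_neg: PCD values for incompatible pairs in R^-.
    c:     the learned shift parameter of the shifted sigmoid.
    """
    p_pos = 1.0 / (1.0 + np.exp(d_pos - c))        # P(x, y), Eq. (3), positive pairs
    p_neg = 1.0 / (1.0 + np.exp(d_neg - c))        # P(x, y), Eq. (3), negative pairs
    l_ce = -np.mean(np.log(p_pos)) - np.mean(np.log(1.0 - p_neg))   # Eq. (4)
    return l_ce + lambda_m * np.mean(d_pos)                          # Eq. (5)
```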

3.3 MrCGAN for Compatible Item Generation

Once the compatibility space is learned, the metric within the space can be used to regularize a CGAN. The proposed model is called Metric-regularized CGAN (MrCGAN). The learning objectives are illustrated in Figure 4.

The discriminator has two outputs: $D(y)$, the probability of the sample $y$ being real, and $Q_0(y) = q^y_0$, the predicted latent vector of $y$. The generator is conditioned on both $z$ and a latent vector from the compatibility space constructed by $E_0(\cdot)$. Given the set of query items $X$, the set of items to be recommended $Y$, and the noise input $z \in \mathbb{R}^Z$ with $z \sim p_z = \mathcal{N}(0, 1)$, we compute the MrCGAN losses as,

$$L_{real} = -\frac{1}{|Y|} \sum_{y\in Y} \log D(y), \qquad (6)$$

$$L_{enc} = -\frac{1}{|Y|} \sum_{y\in Y} \mathbb{E}_{z\sim p_z} \log\left(1 - D(G(z, e^y_0))\right), \qquad (7)$$

$$L_{prj} = -\frac{1}{K|X|} \sum_{k=1}^{K} \sum_{x\in X} \mathbb{E}_{z\sim p_z} \log\left(1 - D(G(z, e^x_k))\right). \qquad (8)$$

The discriminator learns to discriminate between real and generated images, while the generator learns to fool the discriminator with both the encoded vector $e^y_0$ and the projected prototypes $e^x_k$ as conditioning vectors. We also adopt the gradient penalty loss $L_{gp}$ of DRAGAN (Kodali et al. 2017),

$$L_{gp} = \lambda_{gp}\, \mathbb{E}_{\hat{y}\sim p_Y^{perturbed}} \left(\left\|\nabla_{\hat{y}} D(\hat{y})\right\|_2 - 1\right)^2, \qquad (9)$$

where a perturbed batch is computed from a batch sampled from $Y$ as $\mathrm{batch} + \lambda_{dra}\cdot \mathrm{batch.stddev()}\cdot U[0, 1]$, and $\lambda_{gp}$, $\lambda_{dra}$ are hyper-parameters.
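As a rough illustration of this penalty, the PyTorch-style sketch below perturbs a real batch and penalizes the deviation of the discriminator's gradient norm from 1, as in Eq. (9); it is a simplified reading of the DRAGAN penalty rather than the authors' exact implementation (names are ours):

```python
import torch

def dragan_gradient_penalty(D, real_batch, lambda_gp=0.5, lambda_dra=0.5):
    """Gradient penalty of Eq. (9) on a DRAGAN-style perturbed batch."""
    # batch + lambda_dra * batch.stddev() * U[0, 1]
    perturbed = (real_batch
                 + lambda_dra * real_batch.std() * torch.rand_like(real_batch))
    perturbed = perturbed.detach().requires_grad_(True)

    d_out = D(perturbed)
    grads = torch.autograd.grad(d_out.sum(), perturbed, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```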

In addition, MrCGAN has the following metric regularizers,

$$\Omega_c = \frac{1}{|Y|} \sum_{y\in Y} \left\| e^y_0 - q^y_0 \right\|_2^2, \qquad (10)$$

which requires the predicted $q^y_0$ to stay close to the real latent vector $e^y_0$, forcing $Q_0(\cdot)$ to approximate $E_0(\cdot)$, and

$$\Omega_{enc} = \frac{1}{|Y|} \sum_{y\in Y} M^+(e^y_0, e^y_0), \qquad (11)$$

where $m_{enc}$ is a hyper-parameter and

$$M^+(v, s) = \mathbb{E}_{z\sim p_z} \max\left(0,\; -m_{enc} + \left\| v - Q_0(G(z, s)) \right\|_2\right)^2,$$

which measures the distance between a given vector $v$ and the predicted latent vector of the generated sample conditioned on $s$, and it guides the generator to learn to align its latent space with the compatibility space. The margin $m_{enc}$ relaxes the constraint so that the generator does not collapse into a 1-to-1 decoder. Finally, the generator also learns to avoid generating incompatible items by,

$$\Omega_{prj} = \frac{1}{K|R^-|} \sum_{k=1}^{K} \sum_{(x,y)\in R^-} M^-(e^y_0, e^x_k), \qquad (12)$$

where $m_{prj}$ is a hyper-parameter and

$$M^-(v, s) = \mathbb{E}_{z\sim p_z} \max\left(0,\; m_{prj} - \left\| v - Q_0(G(z, s)) \right\|_2\right)^2,$$

which requires the distance between a given latent vector $v$ and the predicted latent vector of the generated sample conditioned on $s$ to be larger than a margin $m_{prj}$.

The total losses for $G$ and $D$ are as below,

$$L_G = -\tfrac{1}{2}(L_{enc} + L_{prj}) + \Omega_{enc} + \Omega_{prj}, \qquad (13)$$

$$L_D = L_{real} + \tfrac{1}{2}(L_{enc} + L_{prj}) + L_{gp} + \Omega_c. \qquad (14)$$
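To make the interplay of the margins concrete, below is a hedged PyTorch-style sketch of the regularizers in Eqs. (10)-(12), assuming the discriminator's latent predictions $Q_0(\cdot)$ have already been computed for the relevant real and generated samples (all tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def metric_regularizers(e_y0, q_y0, q_gen_enc, q_gen_prj, e_y0_neg,
                        m_enc=0.05, m_prj=0.2):
    """Sketch of Omega_c, Omega_enc, Omega_prj (Eqs. (10)-(12)).

    e_y0:      (B, N) real latent vectors E_0(y).
    q_y0:      (B, N) predicted latent vectors Q_0(y) for the same real items.
    q_gen_enc: (B, N) Q_0(G(z, e^y_0)) for samples conditioned on e^y_0.
    q_gen_prj: (B, N) Q_0(G(z, e^x_k)) for samples conditioned on prototypes e^x_k.
    e_y0_neg:  (B, N) latent vectors of items y incompatible with those x.
    """
    omega_c = ((e_y0 - q_y0) ** 2).sum(dim=1).mean()                 # Eq. (10)
    dist_enc = (e_y0 - q_gen_enc).norm(2, dim=1)
    omega_enc = F.relu(dist_enc - m_enc).pow(2).mean()               # Eq. (11), M^+
    dist_prj = (e_y0_neg - q_gen_prj).norm(2, dim=1)
    omega_prj = F.relu(m_prj - dist_prj).pow(2).mean()               # Eq. (12), M^-
    return omega_c, omega_enc, omega_prj
```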


In effect, the learned latent space is constructed by: (1) the $z$ space, and (2) a subspace that has a similar structure to the compatibility space. To generate compatible items for $x$, $e^x_k$ is used as the conditioning vector: $G(z, e^x_k)$. To generate items with a similar style to $y$, $e^y_0$ is used instead: $G(z, e^y_0)$.
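As a usage illustration, a trained MrCGAN generator could be queried roughly as in the sketch below, where `G`, the prototype projections, and the query image tensor are assumed to be given (all names are ours, not from any released code):

```python
import torch

def generate_compatible(G, prototype_projections, x_image, z_dim=20, n=4):
    """Condition the generator on each prototype e^x_k of a query item x
    to obtain diverse images of compatible items (one batch per prototype)."""
    images_per_prototype = []
    with torch.no_grad():
        for E_k in prototype_projections:      # the K projections E_k(.)
            e_x_k = E_k(x_image)               # shape (1, N): prototype in the space
            z = torch.randn(n, z_dim)          # noise z ~ N(0, 1)
            cond = e_x_k.expand(n, -1)         # repeat the prototype for n samples
            images_per_prototype.append(G(z, cond))
    return images_per_prototype
```

Conditioning on $E_0(y)$ of a given item $y$ instead of a prototype would, as noted above, produce items of similar style rather than compatible items.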

3.4 Implementation Details

We set $\lambda_m$ to 0 and 0.5 for the recommendation and generation experiments, respectively, set the batch size to 100, and use the Adam optimizer with $(\lambda_{lr}, \beta_1, \beta_2) = (0.001, 0.9, 0.999)$.

The validation set is used for best-epoch selection. The last layer of each discriminative model is a fully-connected layer with weight normalization (Salimans and Kingma 2016), except for the Amazon also-bought/viewed experiments, where weight normalization is not used for a fair comparison. Before the last layer, the following feature extractors are used for the different experiments: (1) Fashion-MNIST+1+2 / MNIST+1+2: multi-layer CNNs with weight normalization, (2) Amazon also-bought/viewed: none, (3) Amazon co-purchase: Inception-V1 (Szegedy et al. 2015), (4) Polyvore: Inception-V3 (Szegedy et al. 2016).

For the generation experiments, we set $\lambda_{gp}$ to 0 and apply the DCGAN (Radford, Metz, and Chintala 2015) architecture for both our model and GAN-INT-CLS (Reed et al. 2016) in MNIST+1+2. For the Amazon co-purchase and Polyvore generation experiments, we set both $\lambda_{gp}$ and $\lambda_{dra}$ to 0.5, and adopt an SRResNet-like (Ledig et al. 2016) architecture for GAN training with a different learning rate setting, i.e., $(\lambda_{lr}, \beta_1, \beta_2) = (0.0002, 0.5, 0.999)$. The model and parameter choices are inspired by Jin et al. (2017), but we remove the skip connections from the generator since they do not work well in our experiments, and we also use weight normalization for all layers. The architecture is shown in Figure 5.

In most of our experiments, the sets $X$ and $Y$ are identical. However, we create non-overlapping sets of $X$ and $Y$ by restricting the categories in the generation experiments for the Amazon co-purchase and Polyvore datasets.

The dimension of $z$ and the number of prototypes $K$ are set to 20 and 2, respectively, in all generation experiments. The rest of the parameters are as follows: (1) MNIST+1+2: $(N, m_{enc}, m_{prj}) = (20, 0.1, 0.5)$, (2) Amazon co-purchase: $(N, m_{enc}, m_{prj}) = (64, 0.05, 0.2)$, (3) Polyvore: $(N, m_{enc}, m_{prj}) = (20, 0.05, 0.3)$.

4 Experiments

We conduct experiments on several datasets and compare the performance with state-of-the-art methods for both compatible item recommendation and generation.

4.1 Recommendation Experiments

Baseline Our proposed PCD is compared with two baselines: (1) the L2 distance between the latent vectors of the Siamese model, and (2) Monomer, proposed by He, Packer, and McAuley (2016). Although Monomer was originally not trained end-to-end, we still cast it in an end-to-end setting on the Fashion-MNIST+1+2 dataset.

Model | K | Error rate | AUC
L2 | - | 41.45% ± 0.55% | 0.6178 ± 0.0034
Monomer | 2 | 23.24% ± 0.62% | 0.8533 ± 0.0045
Monomer | 3 | 22.46% ± 0.44% | 0.8598 ± 0.0021
Monomer | 4 | 21.60% ± 0.39% | 0.8651 ± 0.0052
Monomer | 5 | 22.03% ± 0.39% | 0.8623 ± 0.0030
PCD | 2 | 21.31% ± 0.77% | 0.8746 ± 0.0070
PCD | 3 | 20.66% ± 0.30% | 0.8804 ± 0.0033
PCD | 4 | 20.39% ± 0.60% | 0.8808 ± 0.0043
PCD | 5 | 20.01% ± 0.39% | 0.8830 ± 0.0032

Table 2: Performance on Fashion-MNIST+1+2.

Our model achieves superior performance in most experiments. In addition, our model has two advantages in efficiency compared to Monomer: (1) our storage is 1/K of Monomer's, since Monomer projects each item into K spaces beforehand while we only compute the K prototype projections of the query during recommendation, and (2) PCD is approximately $\min_k d_k(x, y)$, so at query time we can run K nearest-neighbor searches in parallel to get approximate results, while Monomer needs to aggregate the weighted sum of the distances in K latent spaces.
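This approximation can be sketched as K independent nearest-neighbor searches over pre-computed candidate embeddings; the brute-force NumPy version below is only illustrative (in practice an approximate nearest-neighbor index per prototype would be used):

```python
import numpy as np

def approx_top_items(prototypes_x, candidate_embeddings, top_n=10):
    """Approximate retrieval using d(x, y) ~ min_k d_k(x, y).

    prototypes_x:         (K, N) prototypes e^x_k of the query item.
    candidate_embeddings: (M, N) embeddings e^y_0 of all candidate items.
    """
    diffs = prototypes_x[:, None, :] - candidate_embeddings[None, :, :]
    d_k = (diffs ** 2).sum(axis=-1)       # (K, M) squared L2 distances
    d_min = d_k.min(axis=0)               # approximate d(x, y) per candidate
    return np.argsort(d_min)[:top_n]      # indices of the most compatible candidates
```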

Fashion-MNIST+1+2 Dataset To show our model's ability to handle asymmetric relationships, we build a toy dataset from Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017). The dataset consists of 28x28 gray-scale images of 10 classes, including T-shirt, Trouser, Pullover, etc. We create an arbitrary asymmetric relationship as follows,

$$(x, y) \in R^+ \iff C_y = (C_x + i) \bmod 10 \ \text{for some}\ i \in \{1, 2\},$$ where $C_x$ denotes the class of $x$. The other cases of $(x, y)$ belong to $R^-$.

Among the 60,000 training samples, 16,500 and 1,500 pairs are non-overlapping and randomly selected to form the training and validation sets, respectively. Besides, 10,000 pairs are created from the 10,000 testing samples for the testing set. The samples in each split are also non-overlapping.

The strategy of forming a pair is that we randomly choose a negative or a positive sample $y$ for each sample $x$; thus, $|R^+| \approx |R^-|$. We erase the class labels and only keep the pair labels during training, because the underlying factor of compatibility is generally unavailable.
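A sketch of how such balanced pairs could be sampled is given below, assuming `labels` is a NumPy array holding the class of each image (the helper is ours and only illustrates the rule above):

```python
import numpy as np

def sample_pair(labels, x_index, rng):
    """Form one Fashion-MNIST+1+2 pair for image x_index:
    positive iff C_y = (C_x + i) mod 10 for i in {1, 2}."""
    cx = labels[x_index]
    if rng.random() < 0.5:                                    # positive pair
        cy = (cx + rng.integers(1, 3)) % 10                   # i drawn from {1, 2}
        label = 1
    else:                                                     # negative pair
        cy = rng.choice([c for c in range(10) if (c - cx) % 10 not in (1, 2)])
        label = 0
    y_index = rng.choice(np.flatnonzero(labels == cy))        # any image of class cy
    return x_index, int(y_index), label
```

Drawing one pair per sample with `np.random.default_rng` in this way yields roughly balanced positives and negatives, matching $|R^+| \approx |R^-|$.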

We repeat the above setting five times and show the averaged results in Table 2. Following the settings of He, Packer, and McAuley (2016), the latent size of the L2 model equals $(K+1) \times N$; here, the L2 latent size is 60. Each model is trained for 50 epochs. The experiment shows that the L2 model performs poorly on a highly asymmetric dataset, while our model achieves the best results.

Amazon also-bought/viewed Dataset Image features of 4096 dimensions are extracted beforehand in this dataset. Following the setting of He, Packer, and McAuley (2016), the also-bought and also-viewed relationships in the Amazon dataset (McAuley et al. 2015) are positive pairs, while the negative pairs are sampled accordingly. We set our parameters K and N to be the same as Monomer's for comparison, i.e., (1) $(K, N) = (4, 20)$ for also-bought, and (2) $(K, N) = (4, 10)$ for also-viewed.


Figure 5: MrCGAN Architecture

Dataset | Graph | LMT | Monomer | PCD
Men | also bought | 9.20% | 6.48% | 6.05%
Men | also viewed | 6.78% | 6.58% | 5.97%
Women | also bought | 11.52% | 7.87% | 7.75%
Women | also viewed | 7.90% | 7.34% | 7.37%
Boys | also bought | 8.80% | 5.71% | 5.27%
Boys | also viewed | 6.72% | 5.35% | 5.03%
Girls | also bought | 8.33% | 5.78% | 5.34%
Girls | also viewed | 6.46% | 5.62% | 4.86%
Baby | also bought | 12.48% | 7.94% | 7.00%
Baby | also viewed | 11.88% | 9.25% | 8.00%
Avg. | | 9.00% | 6.79% | 6.26%

Table 3: Error rates on the Amazon also-bought/viewed dataset. LMT stands for Low-rank Mahalanobis Transform (McAuley et al. 2015).

We train 200 epochs for each model, and Table 3 shows the results. Compared to the error rates reported in He, Packer, and McAuley (2016), our model yields the best performance in most settings and the lowest error rate on average.

Amazon Co-purchase Dataset Based on the data split¹,² defined in Veit et al. (2015), we increase the validation set by randomly selecting an additional 9,996 pairs from the original training set, since its original size is too small, and accordingly decrease the training set by removing the related pairs to satisfy the non-overlapping requirement. In total, 1,824,974 pairs remain in the training set. As the ratio of positive and negative pairs is disrupted, we re-weigh each sample during training to re-balance it back to 1:16. Besides, we randomly take one direction for each pair since the "Co-purchase" relation in this dataset is symmetric.

¹ https://vision.cornell.edu/se3/projects/clothing-style/
² 7 images from the training set are missing, so 35 pairs containing these images are dropped.

Model | AUC
Veit et al. (2015) | 0.826
Veit et al. (2015) (retrain last layer) | 0.8698
Monomer | 0.8747
PCD | 0.8762

Table 4: AUC on the Amazon co-purchase dataset.


We adopt and fix the pre-trained weights from Veit et al. (2015), and replace the last embedding layer with each model's projection layer. The last layer is trained for 5 epochs in each model. As before, we set the latent size in Veit et al. (2015) to 256, equal to $(K+1) \times N$; thus, $(K, N) = (3, 64)$ for both Monomer and our model. Table 4 shows the results. Our model obtains performance superior to Veit et al. (2015) and comparable to Monomer.

Polyvore Dataset To demonstrate the ability of our model to capture the implicit nature of compatibility, we create an outfit dataset from Polyvore.com, a collection of user-generated fashion outfits. We crawl outfits under the Women's Fashion category and group the items into tops, bottoms, and shoes. Outfits are ranked by the number of likes, and the top 20% are chosen as positive.

Three datasets are constructed for different recommendation settings: (1) from tops to others, (2) from bottoms to others, and (3) from shoes to others. The statistics of the datasets are shown in Table 5. The construction procedure is as follows: (1) Items of the source and target categories are split without overlap according to the ratios 60 : 20 : 20 for the training, validation, and test sets. (2) A pair is positive if it belongs to a positive outfit. (3) The other combinations are negative and sub-sampled by choosing 4 items from the target categories for each positive pair. Duplicate pairs are dropped afterwards. This dataset is more difficult since compatibility information across categories no longer exists, i.e., both positive and negative pairs are from the same categories. The model is forced to learn the elusive compatibility between items.


Dataset | Split | # source | # target | # pairs
top to others | train | 165,969 | 201,028 | 462,176
top to others | val | 55,323 | 67,010 | 51,420
top to others | test | 55,324 | 67,010 | 52,335
bottom to others | train | 67,974 | 299,022 | 343,383
bottom to others | val | 22,659 | 99,675 | 40,015
bottom to others | test | 22,659 | 99,675 | 39,360
shoe to others | train | 133,053 | 233,944 | 454,829
shoe to others | val | 44,351 | 77,982 | 48,500
shoe to others | test | 44,352 | 77,982 | 47,100

Table 5: Polyvore dataset.

Graph | L2 | Monomer | PCD
top-to-others | 0.7165 | 0.7431 | 0.7484
shoe-to-others | 0.6988 | 0.7146 | 0.7165
bottom-to-others | 0.7450 | 0.7637 | 0.7680
Avg. | 0.7201 | 0.7405 | 0.7443

Table 6: AUC on the Polyvore dataset.


A pre-trained Inception-V3 is used to extract image features, and the last layer is trained for 50 epochs. We set $N$ to 100 for L2, and $(K, N) = (4, 20)$ for Monomer and PCD. The AUC scores are listed in Table 6, and our model still achieves the best results.

4.2 Generation Experiments

Baseline We compare our model with GAN-INT-CLS (Reed et al. 2016) in the MNIST+1+2 experiment. $e^x_0$ is used as the conditioning vector for GAN-INT-CLS, and each model uses the exact same architecture except that the discriminator for MrCGAN outputs an additional $Q_0(\cdot)$, while the discriminator for GAN-INT-CLS is conditioned on $e^x_0$. For the other experiments, we compare with pix2pix (Isola et al. 2016), which utilizes labeled image pairs for image-to-image translation, and DiscoGAN (Kim et al. 2017), which learns unsupervisedly to transform images into a different domain.

MNIST+1+2 We use the MNIST dataset to build a dataset similar to Fashion-MNIST+1+2 because the results are easier to interpret. An additional 38,500 samples are selected as unlabeled data to train the GANs. In Figure 6 we display the generation results conditioned on samples from the test set. We found that our model preserves diversity better and that the $K$ projections automatically group different modes, so the variation can be controlled by changing $k$.

Amazon Co-purchase Dataset We sample a subset from the Amazon Co-purchase dataset (Veit et al. 2015) by reducing the number of types for target items so that it is easier for GANs to work with. In particular, we keep only relationships from Clothing to Watches and Shoes in the Women's and Men's categories. We re-weigh each sample during training to balance the ratio between positives and negatives to 1:1, but for the validation set, we simply drop all excessive negative pairs. In total, there remain 9,176 : 86 : 557 positive pairs and 435,396 : 86 : 22,521 negative pairs for the training, validation, and test splits, respectively. The unlabeled training set for the GANs is selected from the training ids from Veit et al. (2015), and it consists of 226,566 and 252,476 items of the source and target categories, respectively. Each image is also resized to 64x64 before being fed into the discriminator.

Figure 6: Column x is the input. The rows show the generated images produced by varying z and k for the different methods.

Both DiscoGAN and pix2pix are inherently designed for similarity learning, and as shown in Figure 7a, they could not produce satisfying results. Moreover, our model produces diverse outputs, while the diversity of conventional image-to-image models is limited. We additionally sample images conditioned on $e^y_0$ instead of on $e^x_k$, as illustrated in Figure 7b. We find that MrCGAN has the ability to generate diverse items having a similar style to $y$.

Polyvore Top-to-others Dataset We use the top-to-others split as a demonstration for our methods. Likewise, we re-weigh each sample during training to balance positive and negative pairs to 1:1 to encourage the model to focus on positive pairs. The results are shown in Figure 7.

User study Finally, we conduct online user surveys to see whether our model can produce images that are perceived as compatible. We conduct two types of surveys: (1) Users are given a random image from the source categories and three generated images by different models in a randomized order. Users are asked to select the item that is most compatible with the source item, and if none of the items are compatible, to select the most recognizable one. (2) Users are given a random image from the source categories, a random image from the target categories, and an image generated by MrCGAN in a randomized order. Users are asked to select the item most compatible with the source item, and if none of the items are compatible, users can decline to answer. The results are shown in Figure 8; they show that MrCGAN can generate compatible and realistic images under the compatibility learning setting compared to the baselines. While the difference against random images is small in the Polyvore survey, MrCGAN is significantly preferred in the Amazon co-purchase survey.

This is also consistent with the discriminative performance.

4.3 Discussion

As shown in Table 2, a larger K gives better performance when the total embedding dimension $(K+1) \times N$ is kept equal, but the differences are small if the size is increased continuously. The regularizer controlled by $\lambda_m$ forces the distances between prototypes and a compatible item to be small.


Figure 7: Examples of generated images conditioned on (a) $e^x_k$ and (b) $e^y_0$. Panel titles: Amazon Co-purchase and Polyvore Top-to-others. (a) Each block of images represents one set of conditional generation. Top-left: conditioning image $x$. Top-right: four samples generated by MrCGAN conditioned on $e^x_k$ (in green box). Bottom-left: ground truth (in black box). Bottom-middle: DiscoGAN (Kim et al. 2017) (in red box). Bottom-right: pix2pix (Isola et al. 2016) (in blue box). (b) Each block of images represents conditional generation (in green box) based on the latent vector $e^y_0$ of the image $y$ on the top. Note that by conditioning on $e^y_0$, MrCGAN generates items having a similar style to $y$ instead of compatible items.

Figure 8: Survey results on Amazon co-purchase (panels (a, c)) and Polyvore (panels (b, d)). The bars show the total number of votes for MrCGAN, DiscoGAN, pix2pix, Random, and Decline; MrCGAN (blue) generally outperforms the others.

We found that this constraint hurts recommendation performance, but it could be helpful for generation. We recommend choosing a smaller $\lambda_m$ when the quality difference between images generated from $e^x_k$ and $e^y_0$ is not noticeable.

Across the four datasets, the recommendation performance of our model gradually gets closer to that of the others, which might be due to the disappearance of the asymmetric relationship. In Fashion-MNIST+1+2, this relationship is injected by force. The Amazon also-bought/viewed dataset then preserves buying and viewing orders, so some asymmetric relationship exists. However, only the symmetric relationship remains for the Amazon co-purchase and Polyvore datasets. This suggests our model is most suitable for asymmetric relationships but still works well under symmetric settings.

We found that a larger $m_{prj}$ reduced the diversity, but a smaller $m_{prj}$ made incompatible items more likely to appear. In practice, it works well to set $m_{prj}$ slightly larger than the average distance of positive pairs in the training set. Removing the margin $m_{enc}$ seems to decrease the diversity on simple datasets such as MNIST+1+2, but we did not tune it on the other, more complex datasets.

5 Conclusion

We propose modeling the asymmetric and many-to-many relationship of compatibility by learning a Compatibility Family of a representation and prototypes with an end-to-end system and the novel Projected Compatibility Distance function. The learned Compatibility Family achieves more accurate recommendation results when compared with the state-of-the-art Monomer method (He, Packer, and McAuley 2016) on real-world datasets. Furthermore, the learned Compatibility Family resides in a meaningful Compatibility Space and can be seamlessly coupled with our proposed MrCGAN model to generate images of compatible items. The generated images validate the capability of the Compatibility Family in modeling many-to-many relationships. Moreover, when compared with other approaches for generating compatible images, the proposed MrCGAN model is significantly more preferred in our user surveys. The recommendation and generation results justify the usefulness of the learned Compatibility Family.

Acknowledgement

The authors thank Chih-Han Yu and other colleagues at Appier as well as the anonymous reviewers for their constructive comments. We thank Chia-Yu Joey Tung for organizing the user study. The work was mostly completed during Min Sun's visit to Appier for summer research, and was part of the industrial collaboration project between National Tsing Hua University and Appier. We also thank the support from MOST project 106-2221-E-007-107.

References

Boesen Lindbo Larsen, A.; Kaae Sønderby, S.; Larochelle, H.; and Winther, O. 2015. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.

Che, T.; Li, Y.; Jacob, A. P.; Bengio, Y.; and Li, W. 2016. Mode Regularized Generative Adversarial Networks. arXiv preprint arXiv:1612.02136.

(9)

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27. Curran Associates, Inc. 2672–2680.

Hadsell, R.; Chopra, S.; and LeCun, Y. 2006. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, CVPR '06, 1735–1742. Washington, DC, USA: IEEE Computer Society.

Han, X.; Wu, Z.; Jiang, Y.-G.; and Davis, L. S. 2017. Learning fashion compatibility with bidirectional LSTMs. In Proceedings of the 2017 ACM on Multimedia Conference, 1078–1086. ACM.

He, R.; Packer, C.; and McAuley, J. 2016. Learning compatibility across categories for heterogeneous item recommendation. In International Conference on Data Mining, 937–942. IEEE.

Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2016. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004.

Iwata, T.; Watanabe, S.; and Sawada, H. 2011. Fashion coordinates recommender system using photographs from fashion magazines. In International Joint Conference on Artificial Intelligence, 2262–2267. AAAI Press.

Jin, Y.; Zhang, J.; Li, M.; Tian, Y.; Zhu, H.; and Fang, Z. 2017. Towards the Automatic Anime Characters Creation with Generative Adversarial Networks. arXiv preprint arXiv:1708.05509.

Kim, T.; Cha, M.; Kim, H.; Lee, J. K.; and Kim, J. 2017. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, 1857–1865. PMLR.

Kodali, N.; Abernethy, J.; Hays, J.; and Kira, Z. 2017. How to Train Your DRAGAN. arXiv preprint arXiv:1705.07215.

Ledig, C.; Theis, L.; Huszar, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; and Shi, W. 2016. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. arXiv preprint arXiv:1609.04802.

Lee, H.; Seol, J.; and Lee, S.-g. 2017. Style2Vec: Representation Learning for Fashion Items from Style Sets. arXiv preprint arXiv:1708.04014.

Li, Y.; Cao, L.; Zhu, J.; and Luo, J. 2017. Mining fashion outfit composition using an end-to-end deep learning approach on set data. IEEE Trans. Multimedia 19:1946–1955.

McAuley, J. J.; Targett, C.; Shi, Q.; and van den Hengel, A. 2015. Image-based recommendations on styles and substitutes. In SIGIR, 43–52. ACM.

Mirza, M., and Osindero, S. 2014. Conditional generative adver- sarial nets. CoRR abs/1411.1784.

Odena, A.; Olah, C.; and Shlens, J. 2017. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, 2642–2651. PMLR.

Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint arXiv:1511.06434.

Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; and Lee, H. 2016. Generative adversarial text-to-image synthesis. In Proceedings of The 33rd International Conference on Machine Learning, 1060–1069. PMLR.

Salimans, T., and Kingma, D. P. 2016. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc. 901–909.

Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.

Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.

Veit, A.; Kovacs, B.; Bell, S.; McAuley, J.; Bala, K.; and Belongie, S. 2015. Learning visual clothing style with heterogeneous dyadic co-occurrences. In Proceedings of the IEEE International Conference on Computer Vision, 4642–4650.

Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747.

Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593.
