Image Model - A Study on Image Recall in Image-Text Lifelogs

Chapter 4 Models

4.2 Image Model

We follow the structure of L. Wang et al. [14], which reaches state-of-the-art performance on many image-text retrieval tasks. In the learning stage (see Figure 4-2), a sentence encoder extracts the important features of a sentence as a sentence embedding, and an image encoder extracts image features as an image embedding. We then build a neural network that maps these two embeddings into a new coordinated embedding. The embedding loss constrains an image and a sentence from a corresponding pair (positive pair) to be close to each other in the new embedding, whereas an image and a sentence from a non-corresponding pair (negative pair) are pushed far away from each other.
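The two-branch mapping described above can be sketched as a forward pass. This is only an illustration under stated assumptions: the feature sizes (2048 and 4096), the hidden size, the 512-dimensional shared space, the two-layer depth, and the L2 normalization are hypothetical choices, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_branch(in_dim, hidden_dim, out_dim):
    """Random (untrained) weights for one branch: fc -> ReLU -> fc."""
    return {
        "W1": rng.normal(0.0, 0.01, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.01, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def embed(features, branch):
    """Map encoder features into the coordinated embedding space."""
    hidden = np.maximum(features @ branch["W1"] + branch["b1"], 0.0)  # fc + ReLU
    out = hidden @ branch["W2"] + branch["b2"]                        # fc
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)                             # L2-normalize rows

# Hypothetical dimensions: 2048-d image features, 4096-d sentence features.
image_branch = init_branch(2048, 1024, 512)
sentence_branch = init_branch(4096, 1024, 512)

image_features = rng.normal(size=(4, 2048))     # stand-in for image encoder output
sentence_features = rng.normal(size=(6, 4096))  # stand-in for sentence encoder output

image_emb = embed(image_features, image_branch)
sentence_emb = embed(sentence_features, sentence_branch)
```

Both branches end in the same dimensionality, so that distances between an image embedding and a sentence embedding are well defined.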

The details of training are discussed in Section 4.2.1.

After the new coordinated embedding is trained, we can input a query and a set of images and compute similarity scores (see Figure 4-3) to perform image recall.

Figure 4-2 Structure of the learning stage, which trains the sentence and image embeddings into a coordinated embedding. “fc” means fully-connected.

Figure 4-3 Structure of the image model when performing image recall.
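The recall step can be sketched as ranking images by their similarity to the query in the coordinated embedding. The sketch below assumes the similarity score is the negative Euclidean distance (the text does not state which score is used), and `recall_images` is a hypothetical helper name.

```python
import numpy as np

def recall_images(query_emb, image_embs, top_k=5):
    """Return indices and scores of the top_k images most similar to the query."""
    # Assumption: similarity = negative Euclidean distance in the shared space.
    scores = -np.linalg.norm(image_embs - query_emb, axis=1)
    order = np.argsort(-scores, kind="stable")[:top_k]
    return order, scores[order]

# Toy embeddings that are assumed to be already mapped into the shared space.
image_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query_emb = np.array([0.0, 1.0])

indices, scores = recall_images(query_emb, image_embs, top_k=2)
```

In this toy example the second image coincides with the query, so it is ranked first, followed by the third image.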

4.2.1 Embedding Loss Function

The embedding loss function is the objective function minimized in our learning stage. There are four types of positive pairs and four types of negative pairs (see Figure 4-4). The distances of these eight types of pairs are used to compute L1 to L4 (see formulas (1) to (4)), each from a different point of view. We then combine L1 to L4 to obtain the final embedding loss L5 (see formula (5)). Here X is the set of images, Y is the set of sentences, x_i ∈ X, y_i ∈ Y, m is the margin, and d is the Euclidean distance. The formulas are discussed in detail below.

Figure 4-4 Positive pairs and negative pairs used in the embedding loss function.
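Formulas (1) to (5) are not reproduced in this text. A plausible reconstruction, assuming they take the usual margin-based triplet (hinge) form of Wang et al. [14] and use the pair definitions of this section, is given below. The exact summation ranges and notation are assumptions; in each term the index j denotes a positive and k a negative for the anchor indexed by i.

```latex
\begin{align}
L_1 &= \sum_{i,j,k} \max\bigl(0,\; m + d(x_i, y_j) - d(x_i, y_k)\bigr) \tag{1}\\
L_2 &= \sum_{i,j,k} \max\bigl(0,\; m + d(y_i, x_j) - d(y_i, x_k)\bigr) \tag{2}\\
L_3 &= \sum_{i,j,k} \max\bigl(0,\; m + d(x_i, x_j) - d(x_i, x_k)\bigr) \tag{3}\\
L_4 &= \sum_{i,j,k} \max\bigl(0,\; m + d(y_i, y_j) - d(y_i, y_k)\bigr) \tag{4}\\
L_5 &= \lambda_1 L_1 + \lambda_2 L_2 + \lambda_3 L_3 + \lambda_4 L_4 \tag{5}
\end{align}
```

Each term is zero once the negative is farther from the anchor than the positive by at least the margin m.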

In (1), (x_i, y_j) is a positive image-sentence pair (see Figure 4-4), and (x_i, y_k) is a negative image-sentence pair. The embedding of an image should be close to the embeddings of its corresponding sentences and far away from the embeddings of non-corresponding sentences.

In (2), (y_i, x_j) is a positive sentence-image pair, and (y_i, x_k) is a negative sentence-image pair. The embedding of a sentence should be close to the embeddings of its corresponding images and far away from the embeddings of non-corresponding images.

In (3), (x_i, x_j) is a positive image-image pair, and (x_i, x_k) is a negative image-image pair. That is, the loss considers only the relation between images. The embedding of an image should be close to the embeddings of its corresponding images and far away from the embeddings of non-corresponding images.

In (4), (y_i, y_j) is a positive sentence-sentence pair, and (y_i, y_k) is a negative sentence-sentence pair. That is, the loss considers only the relation between sentences. The embedding of a sentence should be close to the embeddings of its corresponding sentences and far away from the embeddings of non-corresponding sentences.

In (5), the final embedding loss L5 equals the sum of all losses L_i weighted by λ_i, where i ∈ {1,2,3,4}. We set λ_1=1.5, λ_2=1, λ_3=0, λ_4=0.05 and m=0.05, following the setting of the original paper.
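The weighted combination can be sketched numerically as follows. This is a minimal sketch, not the training code of this thesis: it assumes a hinge-style triplet term with Euclidean distance, treats every non-corresponding item as a negative (no hard-negative sampling), and uses hypothetical names (`triplet_term`, `embedding_loss`, `sent_to_img`). Because each toy image has its own sentences and no two images share one, there are no image-image positives and the image-image term is set to 0, which its weight of 0 would cancel anyway.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def triplet_term(anchors, candidates, pos_mask, margin, ignore_mask=None):
    """Sum of max(0, m + d(anchor, pos) - d(anchor, neg)) over all triplets."""
    d = pairwise_dist(anchors, candidates)
    if ignore_mask is None:
        ignore_mask = np.zeros(pos_mask.shape, dtype=bool)
    total = 0.0
    for i in range(len(anchors)):
        pos = d[i, pos_mask[i] & ~ignore_mask[i]]
        neg = d[i, ~pos_mask[i] & ~ignore_mask[i]]
        # Every (positive, negative) combination for anchor i.
        total += np.maximum(0.0, margin + pos[:, None] - neg[None, :]).sum()
    return float(total)

def embedding_loss(images, sentences, sent_to_img,
                   lambdas=(1.5, 1.0, 0.0, 0.05), margin=0.05):
    """Weighted sum of the four terms; sent_to_img[s] is the image of sentence s."""
    images = np.asarray(images, dtype=float)
    sentences = np.asarray(sentences, dtype=float)
    sent_to_img = np.asarray(sent_to_img)
    img_sent = np.arange(len(images))[:, None] == sent_to_img[None, :]
    sent_sent = sent_to_img[:, None] == sent_to_img[None, :]
    self_pair = np.eye(len(sentences), dtype=bool)  # a sentence is not its own pair

    l1 = triplet_term(images, sentences, img_sent, margin)    # image -> sentence
    l2 = triplet_term(sentences, images, img_sent.T, margin)  # sentence -> image
    l3 = 0.0  # no image-image positives in this toy setup (assumption)
    l4 = triplet_term(sentences, sentences, sent_sent, margin, ignore_mask=self_pair)
    return lambdas[0] * l1 + lambdas[1] * l2 + lambdas[2] * l3 + lambdas[3] * l4

# Matching pairs coincide in the embedding: no margin violation.
aligned = embedding_loss([[0, 0], [1, 0]], [[0, 0], [1, 0]], [0, 1])
# Sentences swapped: every positive is farther than the negative.
swapped = embedding_loss([[0, 0], [1, 0]], [[1, 0], [0, 0]], [0, 1])
```

In the aligned case the loss is zero. In the swapped case each of the four cross-modal triplets violates the margin by 1.05, so the loss is 1.5 × 2.1 + 1 × 2.1 = 5.25.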

4.2.2 From Supervised Learning to Unsupervised Learning

Most image-text embedding training methods are based on supervised learning, which uses pairs of corresponding images and captions as ground truth.

However, TBN_MSCOCO, which is trained on such a dataset (MSCOCO), cannot incorporate information from the related story and does not perform well on the image recall task on Blog-travel (see Table 4-1). Therefore, we propose an unsupervised method that takes into account more information from the story text near the image.

Baseline model        Food    Accom   Q1      Q2      Q3
TBN_MSCOCO            0.091   0.083   0.062   0.074   0.049
Google_image_search   0.014   0.011   0.106   0.142   0.221

Table 4-1 The performance of TBN_MSCOCO is clearly not as good as that of Google_image_search. TBN_MSCOCO refers to the structure of Wang et al. [14].

Based on the statistics (see Figure 3-9), we consider the sentences within distance 3 of an image to be the corresponding sentences of that image (see Figure 4-5). Instead of using pairs of an image and its corresponding captions (e.g., MSCOCO), we use pairs of an image and its nearby sentences. Therefore, we do not need any caption annotation.

Figure 4-5 Example: the sentences within distance 3 of the image (red circles) are the corresponding sentences of the image.
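The pairing of an image with its nearby sentences can be sketched as follows. The representation of a blog post as an ordered list of tagged items and the helper name `nearby_sentence_pairs` are hypothetical. The sketch also assumes that distance is counted in sentences, so the first sentence before or after an image has distance 1; the text does not spell out how distance is measured.

```python
def nearby_sentence_pairs(post, max_distance=3):
    """Pair each image with the sentences at most max_distance sentences away.

    post is an ordered list of ("image", image_id) or ("sentence", text) items.
    Returns (image_id, sentence, distance) triples.
    """
    pairs = []
    for pos, (kind, value) in enumerate(post):
        if kind != "image":
            continue
        for step in (-1, 1):  # look backwards first, then forwards
            distance = 0
            idx = pos + step
            while 0 <= idx < len(post) and distance < max_distance:
                other_kind, other_value = post[idx]
                if other_kind == "sentence":  # only sentences count toward distance
                    distance += 1
                    pairs.append((value, other_value, distance))
                idx += step
    return pairs

post = [
    ("sentence", "s1"), ("sentence", "s2"), ("sentence", "s3"), ("sentence", "s4"),
    ("image", "img1"),
    ("sentence", "s5"), ("sentence", "s6"),
]
pairs = nearby_sentence_pairs(post)
```

Here "s1" is four sentences away from the image and is therefore left out, while "s2" to "s6" become positive sentences for "img1" without any caption annotation.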

In addition, we change the batch size of the model from 500 (original paper) to 100.

The original model picks the top 10 most similar images as negative examples. This works for the MSCOCO dataset because all images can be considered independent of each other. However, it does not fit blog data, because many images are taken by the same person in the same place, so we should not consider all images to be independent of each other. That is, the top 10 out of 500 images will easily include nearly identical photos, which should not be treated as negative examples. Reducing the batch size to 100 mitigates this problem, and the model can still be trained efficiently.
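The negative selection discussed above can be sketched as picking, for each anchor, the k closest non-corresponding items within the batch. The function name `hardest_negatives` and the distance-matrix layout are hypothetical, and this is an illustration of the idea rather than the sampling code of the original model.

```python
import numpy as np

def hardest_negatives(dist, pos_mask, k=10):
    """For each anchor (row), return the indices of the k closest negatives.

    dist[i, j] is the distance from anchor i to candidate j in the batch;
    pos_mask[i, j] is True when candidate j is a positive for anchor i.
    """
    result = []
    for row, pos in zip(dist, pos_mask):
        neg_idx = np.flatnonzero(~pos)                  # candidates that may be negatives
        order = np.argsort(row[neg_idx], kind="stable")[:k]
        result.append(neg_idx[order])                   # closest negatives first
    return result

# One anchor, four candidates in the batch; candidate 0 is its positive.
dist = np.array([[0.1, 0.9, 0.3, 0.5]])
pos_mask = np.array([[True, False, False, False]])
negatives = hardest_negatives(dist, pos_mask, k=2)
```

With a smaller batch there are fewer candidates to choose from, so the k closest negatives are less likely to be near-duplicate photos of the same scene.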

4.2.3 Image Encoder and Text Encoder

The original paper of L. Wang et al. [14] uses HGLMM as the sentence encoder, which is not easy to train or to integrate with the rest of our research. Since the purpose of this thesis is to find a way to deal with the image recall problem, we instead apply other state-of-the-art sentence encoder models to obtain sentence embeddings. The original paper uses VGG19 as the image encoder; we also try different image encoders. After comparing them on MSCOCO (see Figure 4-6), we choose ResNet50 as the image encoder and InferSent as the sentence encoder.

Figure 4-6 Performance of different image encoder models and sentence encoder models on the MSCOCO dataset. R@K, meaning recall at K, is the common evaluation metric for image-text retrieval tasks.
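For reference, recall at K can be computed as the fraction of queries for which at least one correct item appears among the K nearest results. The sketch below uses this common definition and a hypothetical helper `recall_at_k`; the exact evaluation protocol behind Figure 4-6 is not specified here.

```python
import numpy as np

def recall_at_k(dist, relevant, k):
    """Fraction of queries with at least one relevant item among the k nearest.

    dist[q, i] is the distance from query q to item i;
    relevant[q] is the set of item indices that are correct for query q.
    """
    hits = 0
    for q, row in enumerate(dist):
        top_k = np.argsort(row, kind="stable")[:k]
        if relevant[q] & set(top_k.tolist()):
            hits += 1
    return hits / len(dist)

# Two queries, three items; item 0 is the correct answer for both queries.
dist = np.array([[0.2, 0.1, 0.9],
                 [0.5, 0.4, 0.3]])
relevant = [{0}, {0}]

r1 = recall_at_k(dist, relevant, 1)
r2 = recall_at_k(dist, relevant, 2)
r3 = recall_at_k(dist, relevant, 3)
```

In this toy case the correct item is ranked second for the first query and third for the second query, so recall rises from 0 at K=1 to 0.5 at K=2 and 1 at K=3.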
