6. Matching and Learning for Image Retrieval
As mentioned in the previous chapter, two approaches could be potential to solve the problem of semantic gap in image retrieval: (i) image annotation for discovering the semantic contents of images and (ii) relevance feedbacks for interactively learning the user intention. This chapter presents our approach to estimating the user intention in relevance feedbacks for image retrieval. The estimation of user intension is based on either the visual-word-based image feature, which is described in Chapter 4, or the semantic-based image feature, which is proposed in Chapter 5. Moreover, we employ the estimation of user intention to define the image matching for retrieval.
In this chapter, Section 6.1 first introduces the motivation of our ideas to solve the problem of image retrieval and presents the overview of our approaches. Section 6.2 formulates our problems and notations. Then, the three components of the proposed estimation of the user intention, containing the likelihood measure, the confusion measure, and the initialization, are described in Section 6.3, 6.4, and 6.5, respectively. Therefore, the image matching and ranking, which are based on the estimation results of the user intention, is presented in Section 6.6.
6.1. Motivation and Overview
We consider the problem of category search in image retrieval. This involves grouping images into the same category that the user perceives to be semantically relevant. For example, the image set from Corel Photo, a set of image data widely
used in many researches, contains many types of semantic categories. Hence we can consider a user called “Corel Photo” who chooses relevant images to form these categories. Note that different users may assign different semantic categories within the same image set. The main challenge for category search is to estimate the user concepts, e.g. Corel Photo, from the interaction of the retrieval.
Let a query session comprise the first query and corresponding relevance feedbacks. We assume that the user does not change the requesting concepts, that is, the semantic concepts in a query session are constant. Ideally, we can view the process of obtaining relevance feedbacks as tracing the path from the first query to the retrieval goals, from which we can estimate the user intention in a retrieval task. That is to say, the task of image retrieval with relevance feedbacks can be transformed to estimate what semantic concepts the user intention involves. For this purpose, we need to design three tasks: (i) what are the representing units of the semantic concepts in the estimation, (ii) how to estimate the user intention based on the representing units, and (iii) how to match two images (i.e., compute the similarity measure of two images) based on the estimated user intention.
In order to bridge the semantic gap for image retrieval, in intuitive, the image representation should be based on semantic information, e.g., the semantic-based image features proposed in the previous chapter. However, it is still difficult to discover the semantic-based information well from images. In this work, two types of representing units in the estimation of the user intention are designed. The one is based on the semantic-based image features (in Chapter 5) that involve some semantic information of images by use of automatic image annotation. The other is based on the visual-word-based image features (in Chapter 4) that are somewhat middle-level information.
Moreover, we design an interactive estimation for the user intention for image retrieval. The interactive estimation comprises two factors: (i) the likelihood measure that analyzes which units are appropriate to represent positive examples interactively assigned by the user in relevance feedbacks, and (ii) the confusion measure that analyze the degree of the confusion between any two representing units. Note that the second factor, confusion measure, is less discussed in image retrieval. The proposed interactive scheme of the estimation is independent of the selection of the representing units. Using either visual-word-based or semantic-based image features, the proposed scheme can yield the representing features (visual-word-based or semantic-based, respectively) for the user intention in the query session.
Finally, we employ the estimated features to perform image matching and ranking. The query part includes multiple positive examples in the query session.
Then, we need to compute the similarity measure between the images of the query session and each image in the database. Our approach is to extract the feature of each image in the database and to estimate the user intention of the query session to generate the representing features, using either visual-word-based or semantic-based image features, and then their similarity measure can be computed by Euclidean distance.
6.2. Estimating User Intention
The most intuitive idea of the estimation is to observe the likelihood in the positive examples, i.e., which units are mostly included in these positive examples.
However, the semantic-based or visual-word-based image features does not often well represent images. Hence, we plan to involve the confusion possibility between two
representing units such that the estimation can cover the miss of the image representation.
Suppose we have the set of NU units, simply denoted { 1,..., }
Nu
v v
V = , that are
used to represent all images. Note that the representing units can be based on either visual-word-based or semantic-based image features. In the former, all visual words in the feature spaces can form the representing units, also denoted VVW ={v1,...,vK} simply, with NU=K. In the latter, the set of predefined semantic labels is used to be the representing units, VS ={v1,...,vH}, with NU=H.
Let ( ) { 1 ,..., }
Nt
I I t
Q = containing Nt images be the set of positive images that are specified by the user at t-th iteration in relevance feedbacks. Also, we define
)}
( ..., ), ( { )
(t u1 t u t
NU
=
u with NU dimensions as the image feature that is used for estimating the user requests at t-th iteration in the query session. ui(t) is the confidence value that the user intention contains the unit v in the query session. i Then we design the following equation, Equation (6.1), to interactively estimate the user intention.
) ) 1 ( (
)) ( ( ) (
∑
1=
−
⋅
⋅
= K
j
j ij
i t Q t M u t
u L . (6.1)
In Equation (6.1), the interactive computation contains three terms: (i) L(Q(t)) for the likelihood of positive examples at the current iteration, (ii) M for the ij confusion possibility between two units vi and vj, and (iii) uj(t−1) for the estimation of the previous iteration. Note that Mij⋅uj(t−1) means the possibility of the false estimation that decide unit i to be j at the previous iteration. For each iteration in relevance feedbacks, we can compute (t) {u1(t),...,u (t)}
NU
=
u for the estimation of
the user intention by Equation (6.1), and then apply it to matching and ranking images
in the database.
6.3. Likelihood Measure
Because the set, Q(t), of positive images consists of Nt images { 1 ,..., }
Nt
I
I , the
likelihood of Q(t) can be defined as the combination of all of individual likelihoods of images. Let L(Q(t)) be the likelihood of an individual image Ii. ))L(Q(t can be defined the features, denoted as V, extracted from image Ii:
)
| ..., , ( )
| ( )
(Ii pV Ii p v1 vN Ii
= U
L = , where Ii∈Q(t). (6.2)
That is to say, the system observes positive images by extracting their features for the likelihood measure at each iteration. Therefore, the global likelihood of Q(t) can be integrated as:
∑
∑
= ==
= t Nt
i
i t
N
i i t
I V N p
N I t Q
1 1
)
| 1 (
) 1 (
)) (
( L
L . (6.3)
Note that we set the equal weights for combining sub-likelihoods in Equation (6.3).
Other weighting methods are able to employ the combination. Then we describe the details of the likelihood based on either semantic-based or visual-word-based image features in the two subsections.
6.3.1. Likelihood measure on visual-word-based image feature
Based on visual-word-based image feature, the likelihood, L(Q(t)), of an image Ii can be defined the visual-word-based image feature extracted from the image Ii. In this case, the number of representing units is the number of visual words in the feature spaces i.e., NU=K. The details of extracting visual-word-based image feature
have been described in Chapter 4, and the likelihood L(Q(t)) can be denoted as:
} ..., , { )
| ( )
(Ii = pV Ii =vVW = v1 vK
L . (6.4)
6.3.2. Likelihood measure on semantic-based image feature
Based on semantic-based image feature, similar to the previous subsection, the likelihood of an image Ii can be defined the semantic-based image feature extracted from the image Ii. In this case, the number of representing units is the number of semantic labels for image annotation, i.e., NU =H. The details of extracting semantic- based image feature have been described in Chapter 5, and the likelihood L(Q(t)) can be denoted as:
} ..., , { )
| ( )
(Ii = pV Ii =vS = v1 vH
L . (6.5)
6.4. Confusion Measure
The confusion measure is defined as the confusion possibility which measures the degree of the confusing representation between two representing units. For example, the two semantic units “sky” and “lake” are easier to be confused in image annotation for the blue in color space, but labels “sky” and “forest” are not. The confusion information can assist the estimation of user intention in avoiding the effect of the false units in image representation.
Let M is the confusion matrix, with NU×NU, where each item Mij in the matrix means the confusion possibility that an image with unit vi is false represented to be unit vj. Thus we compute the confusion possibility for each pairwise of two units vi and vj, based on either visual-word-based or semantic-based image features, and
describe the details in the next two subsections.
6.4.1. Confusion measure on visual-word-based image feature
Given any two representing units vi to vj, the changing cost from vi to vj can be measured by the Euclidean distance, denoted as dist(vi,vj), from the centroid of vi to the centroid of vj in the visual feature space. Hence, we can compute each item Mij of the confusion matrix as
)) . , ( exp(
)) , ( exp(
2
∑ −
−
= ⋅
y
y i
j i
ij dist v v
v v
M dist (6.6)
Note that Mij ≠Mji for the asymmetric of the concepts between two units.
In implementation, we adopted several visual features, inclusive of color-size histogram and moments, Gabor texture, and SIFT descriptor, to generate the visual-word-based image feature. Our idea is to collect many types of feature spaces because we cannot know which types of features are appropriated for the user intention until the user feedbacks are reached. The more the different types of features are used for the visual-word-based image features, the better the system estimates which units can represent the user intention. Note that the confusion measure between two units is set 0 if the two units are from different types of visual features.
6.4.2. Confusion measure on semantic-based image feature
Now, we design the pairwise confusion measure between any two semantic labels. The basic idea is to analyze the labeled training data which are used for the annotation task in Chapter 5. Consider two labels Li and Lj with training data Di and Dj. Note that the intersection of Di and Dj may be not empty because an image can
include multiple labels. With loss of the generality, we can assume that the intersection set of them is empty, Di∩Dj =φ, by eliminating the duplicated parts.
Table 6-1. The algorithm of computing the confusion matrix on semantic-based image features.
Our idea is to analyze how confusion these two datasets make for the two labels.
The detailed algorithm that computes the confusion item Mij for label Li and Lj is shown in Table 6-1. For each pair of labels Li and Lj, we employ the splitting method, which is described in Chapter 5, to divide the union of the two datasets, Di+Dj, into two classes, which are used to measure the confusion possibility of the two labels.
Finally, we should normalize all items Mij by:
←
∑
k ik ij
ij M
M M . (6.7)
Figure 6-1 illustrates an example to present the design of the confusion measure.
Input: Di, Dj, for all 1≦i, j≦H
Output: a confusion matrix M with items Mij
//Mij: the confusion possibility of classifying data labeled Li to Lj
Initialization: Mij = 0 for all i and j 1. for each pairwise Di and Dj, i≠j {
2. apply K-means clustering to divide Di+Dj into two classes Dii+Dij and Djj+Dji
// Dii> Dij, Djj> Dji
// Dii: data in Di correctly classified to label Li
// Dij: data in Di incorrectly classified to label Lj
3.
i ii ii
ii D
M D
M | |
+
= ,
i ij
ij D
M |D |
= , and
j ji
ji D
M |D |
=
}
4. for each i
5. Mii =Mii/(H −1) //normalizing for Mii
Assume that H=3, |D1|=|D2|=30, and |D3|=40. The standard K-means clustering is used to classify the three pairs of datasets shown in Figure 6-1(a), and the confusion matrix computed by line 3 of the algorithm is shown in Figure 6-1(b). The final confusion matrix computed by Equation (6.7) is shown in Figure 6-1(c).
(a) Datasets and classification results using K-means clustering.
(b) The original confuse matrix of the classification.
(c) The normalized confusion matrix of the semantic labels.
Figure 6-1. Illustration of computing the transition function with H=3.
6.5. Initialization
Equation (6.1) needs an initialization for that the recursive computation. The simplest method for initialization is the equal prior, that is to say, all units have the same possibility to represent the user intention. Our implementation adopted the equal prior for the experiments in Chapter 7. Besides, other methods can be used to decide the prior probability, for example, a common image retrieval approach can be used to in rough decide which units are appropriate for the query.
6.6. Image Matching
Given a set of positive examples, denoted as ( ) { 1,..., }
Nt
I I t
Q = , at t-th iteration of relevance feedbacks in a query session, and given an image I in the database, the estimated vector for user intention, (t) {u1(t),...,u (t)}
NU
=
u described as in Section
6.2, is denoted by the notation Q(t) { 1Q(t),..., QN(t)} u U
= u
u , and the estimated user intention for image I is denoted by I { 1I,..., NI }
u U
= u
u in this section. Therefore, we need to define a similarity measure between them in order to image match and rank. In our work, we design the image similarity measure on one of two bases (not use both them at the same time): one on visual-word-based image feature, proposed in Chapter 4, which is for visual feature spaces, and the other on semantic-based image feature, designed in Chapter 5, which is for higher-level concept. Then, the similarity measure between I and Q(t), denoted as Sim(I,Q(t)), is defined as:
2 / 1 2 1
) ( )
( ) ( ( ) )
, ( distance ))
( ,
(
∑
=
−
=
−
= NU
i
t Q i I i t
Q
I u u
t Q I
Sim u u . (6.8)