Content-based Image Retrieval
5.2 The MINN for CBIR
5.2.1 Instance Extraction
As noted in the last section, the prototype of the image query system may lose the spatial information of the image. This is mainly caused by the way instances are extracted: although each instance consists of 5 subregions, each composed of 2x2 pixels, it only considers the subregions neighboring the point that the user selected (see Fig. 5.1).
In this subsection, we propose a new method of generating instances that takes the spatial information of the image into account. We still avoid precise image segmentation and instead look for a new way to index the image. Rather than using the color histogram of the whole image, which would again lose the spatial information of the image, we propose a histogram named the Weighted Color Histogram.
When the user selects a point on an image, it usually means that the point is important from a certain viewpoint or to a certain extent, that is, it carries significant information. The degree of importance decreases gradually with distance: the farther a pixel is from the selected point, the lower its importance. We assume that the importance decays according to some distribution, e.g. a Gaussian distribution. After the user decides where to select an instance from the image, a Gaussian-like mask is adopted to create a masked image, and the weighted color histogram of the masked image is calculated.
The idea of the Gaussian-like mask comes from kernel smoothing, in which smoothing parameters control the effective size of the local neighborhood and the class of regular functions fitted locally [59] (see Fig. 5.3). The local neighborhood is specified by a kernel function Kλ(x0, x), which assigns weights to points x in a region around x0; the weights decrease exponentially with the squared Euclidean distance from x0. In our method of image indexing, we adopt the Gaussian kernel, a weight function based on the Gaussian density function. The parameter λ corresponds to the variance of the Gaussian density, is set to λ = 0.2 in [59], and controls the width of the neighborhood:
$$K_\lambda(x_0, x) = \frac{1}{\lambda} \exp\left[-\frac{\|x - x_0\|^2}{2\lambda}\right] \tag{5.1}$$
To apply the Gaussian kernel to image representation, it must be extended from 1-D to 2-D as follows. Given a point m = (mxi, myi) selected by the user, the Gaussian-like mask located at m is:
Gm(x, y) = exp
where $\sigma_{x_i}^2$ and $\sigma_{y_i}^2$ are the variances of the Gaussian-like mask; in our system they are set to $\sigma_{x_i}^2 = 0.05$ and $\sigma_{y_i}^2 = 0.05$.
Figure 5.3: The diagram of the k-nearest neighbor (k = 30) and kernel-weighted functions
The algorithm for calculating the Weighted Color Histogram (WCH) is described below:
1. The user selects a point (mxi, myi) on the image that he/she considers to match the concept.
2. Calculate the weight of each pixel in the image relative to (mxi, myi) using Eq. (5.2), which gives the Gaussian-like mask matrix.
3. Calculate the histogram of each component in the Lab color space. Instead of incrementing a bin by one for each pixel as usual, increment it by the value of the Gaussian-like mask matrix from step 2 at the pixel position (x, y), i.e. by the importance of the pixel as a neighbor of the selected point.
The procedure is depicted in Fig. 5.4.
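The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the mask is taken to be the usual separable Gaussian exp(-(dx²/(2σx²) + dy²/(2σy²))), pixel coordinates are normalized to [0, 1] so that variances of 0.05 are meaningful, and the image channels are assumed to be scaled to [0, 1]. The function names (`gaussian_mask`, `weighted_histogram`, `weighted_color_histogram`) are hypothetical and not taken from the system described here.

```python
import numpy as np

def gaussian_mask(height, width, center, var_x=0.05, var_y=0.05):
    """Gaussian-like weight for every pixel, peaking at the selected point.

    Assumptions: coordinates are normalized to [0, 1] and the mask has the
    form exp(-(dx^2 / (2 var_x) + dy^2 / (2 var_y))).
    """
    cx, cy = center  # selected point in normalized (x, y) coordinates
    ys, xs = np.mgrid[0:height, 0:width]
    xs = xs / max(width - 1, 1)
    ys = ys / max(height - 1, 1)
    return np.exp(-((xs - cx) ** 2 / (2 * var_x) + (ys - cy) ** 2 / (2 * var_y)))

def weighted_histogram(channel, mask, bins=256, value_range=(0.0, 1.0)):
    """Histogram in which each pixel adds its mask weight instead of 1."""
    hist, _ = np.histogram(channel.ravel(), bins=bins, range=value_range,
                           weights=mask.ravel())
    return hist

def weighted_color_histogram(image, center, bins=256):
    """One weighted histogram per channel of an (H, W, 3) image in [0, 1]."""
    mask = gaussian_mask(image.shape[0], image.shape[1], center)
    return np.stack([weighted_histogram(image[..., c], mask, bins)
                     for c in range(image.shape[2])])

# Toy usage: random data stands in for a Lab image scaled to [0, 1].
rng = np.random.default_rng(0)
image = rng.random((32, 48, 3))
wch = weighted_color_histogram(image, center=(0.5, 0.5))  # shape (3, 256)
```

Because every pixel contributes its mask weight, each channel histogram sums to the total mass of the mask rather than to the number of pixels, so pixels near the selected point dominate the feature.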
After extracting the color information from the image, we extract the texture information. Texture characteristics are very commonly used in image retrieval systems. Texture is defined as being specified by the statistical distribution of the spatial dependencies of gray-level properties, and human perception is sensitive to the repetitiveness, the directionality and the granularity of an image. Following the same idea as the Weighted Color Histogram introduced above, we propose the Weighted Texture Histogram (WTH) as the texture feature of an instance.
Figure 5.4: The schematic diagram of the Weighted Color Histogram
Since the Gabor representation is optimal [60] in the sense of minimizing the uncertainty in the space and frequency domains, the Gabor wavelet decomposition [61] is used to extract texture features from the image at multiple scales and orientations. Before an image is decomposed, a Gabor filter set is created from a two-dimensional Gaussian-modulated complex sinusoid function
$$g(x, y) = \left(\frac{1}{2\pi\sigma_x\sigma_y}\right) \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right) + j\omega x\right] \tag{5.3}$$
In Eq. (5.3), the two parameters σx and σy determine the size of the Gaussian envelope along the respective axes. By choosing the number of dilations T and the number of rotations K of the rectilinear coordinates in g(x, y), a Gabor filter set Gf = {gtk : 1 ≤ t ≤ T, 1 ≤ k ≤ K} is created from
$$g_{tk}(x, y) = a^{-t} g(x', y'), \quad a > 1,$$
in which
$$x' = a^{-t}(x\cos\theta + y\sin\theta), \quad y' = a^{-t}(-x\sin\theta + y\cos\theta),$$
and θ is the angle of rotation, θ = kπ/K. With a Gabor filter gtk, the filter response of the image I is calculated by the convolution
$$I_{tk} = I * g_{tk},$$
and its spectrogram is calculated as
$$S_{tk}(x) = |I_{tk}(x)|^2.$$
Figure 5.5: The schematic diagram of the Weighted Texture Histogram
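As an illustration of the filter set and spectrogram defined above, the following NumPy sketch builds the T × K kernels and computes one texture plane per kernel. The numerical values (kernel size, a = 2, σx = σy = 2, ω = 1.5) are assumptions chosen only to make the example run; they are not the settings of our system. For brevity the convolution is circular and computed through the FFT, which is also a simplification.

```python
import numpy as np

def gabor_mother(x, y, sigma_x, sigma_y, omega):
    """Gaussian-modulated complex sinusoid of Eq. (5.3)."""
    norm = 1.0 / (2.0 * np.pi * sigma_x * sigma_y)
    envelope = np.exp(-0.5 * (x ** 2 / sigma_x ** 2 + y ** 2 / sigma_y ** 2))
    return norm * envelope * np.exp(1j * omega * x)

def gabor_bank(size=15, scales=3, orientations=4, a=2.0,
               sigma_x=2.0, sigma_y=2.0, omega=1.5):
    """Kernels g_tk for t = 1..scales and k = 1..orientations (values assumed)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    bank = {}
    for t in range(1, scales + 1):
        for k in range(1, orientations + 1):
            theta = k * np.pi / orientations
            # Dilated and rotated coordinates x', y'.
            xr = a ** (-t) * (x * np.cos(theta) + y * np.sin(theta))
            yr = a ** (-t) * (-x * np.sin(theta) + y * np.cos(theta))
            bank[(t, k)] = a ** (-t) * gabor_mother(xr, yr, sigma_x, sigma_y, omega)
    return bank

def spectrogram(image, kernel):
    """|I * g|^2, using circular convolution computed through the FFT."""
    response = np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel, s=image.shape))
    # Shift back so that the kernel centre lines up with each pixel.
    half = (kernel.shape[0] // 2, kernel.shape[1] // 2)
    response = np.roll(response, shift=(-half[0], -half[1]), axis=(0, 1))
    return np.abs(response) ** 2

# Toy usage: random data stands in for a gray-level image.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
planes = {key: spectrogram(image, g) for key, g in gabor_bank().items()}
```

With scale = 3 and orientation = 4 this produces 12 texture planes, each of the same size as the input image.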
In our system, a set of Gabor filters (scale = 3, orientation = 4) is used to create a set of texture planes from an image, and a Weighted Texture Histogram is calculated for each texture plane. The algorithm for calculating the Weighted Texture Histogram (WTH) is described below:
1. The user selects a point (mxi, myi) on the image that he/she considers to match the concept.
2. Apply the Gabor texture algorithm to the image with the two parameters scale and orientation (i.e. scale = 3, orientation = 4), generating scale × orientation texture planes.
3. Treat each texture plane as a gray-level image and calculate the weight of each pixel in the plane relative to (mxi, myi) using Eq. (5.2), which gives the Gaussian-like mask matrix.
4. Calculate the histogram of each texture plane. Instead of incrementing a bin by one for each pixel as usual, increment it by the value of the Gaussian-like mask matrix from step 3 at the pixel position (x, y), i.e. by the importance of the pixel as a neighbor of the selected point.
The procedure is depicted in Fig. 5.5.
After calculating the Weighted Texture Histograms of the image, we obtain scale × orientation texture histograms of 256 dimensions, one for each plane, and quantize each histogram from 256 dimensions to 64 dimensions. The texture feature around a pixel x is then represented by scale × orientation histograms of 64 dimensions. An example of the Gabor decomposition of an image is shown in Fig. 5.6.
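The quantization step can be illustrated as follows. The text does not state how the 256 bins are reduced to 64, so this sketch assumes the simplest scheme, in which every 4 adjacent bins are merged by summation; the function name `quantize_histogram` is hypothetical.

```python
import numpy as np

def quantize_histogram(hist, target_bins=64):
    """Merge adjacent bins by summation (assumed scheme, e.g. 256 -> 64)."""
    hist = np.asarray(hist, dtype=float)
    if hist.shape[-1] % target_bins != 0:
        raise ValueError("number of bins must be a multiple of target_bins")
    group = hist.shape[-1] // target_bins
    # Split the last axis into (target_bins, group) and sum within each group.
    return hist.reshape(*hist.shape[:-1], target_bins, group).sum(axis=-1)

# Toy usage: 12 planes (scale = 3, orientation = 4) with 256 bins each.
histograms = np.arange(12 * 256, dtype=float).reshape(12, 256)
feature = quantize_histogram(histograms)  # shape (12, 64)
```

Summation preserves the total weight of each histogram, so the relative importance of the planes is unchanged by the quantization.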
Because no image segmentation is performed, there is a price to be paid for maintaining the information from the images. In the next subsection, we use mixtures of several Gaussians to approximate the features formed by the Weighted Color Histogram and the Weighted Texture Histogram.
5.2.2 Image Retrieval
Since the proposed features for image retrieval are no longer just individual points but also include the variance and prior probability of points, a new component density function is derived.
Figure 5.6: The Example of the Weighted Texture Histogram
As described in Subsection 3.3.1, an image I is given together with a set of i.i.d. patterns X = {x(t); t = 1, 2, · · · , N}: a number of masked images are first selected from the image, and a set of i.i.d. patterns, the so-called instances, is then extracted from the masked images. Each pattern x(t) = {Ptr, Θtr; r = 1, 2, · · · , Rt} is the set of parameters of the mixture of Rt Gaussian distributions used to approximate the color histogram of the corresponding masked image. In each pattern x(t), Ptr denotes the prior probability of cluster r, and Θtr represents the parameter set {µtr, σ²tr} of cluster r, as below:
$$x(t) = \sum_{r=1}^{R_t} P_r\, p_r(t|\Theta_r) = \sum_{r=1}^{R_t} P_r\, p_r(t|\mu_{tr}, \sigma_{tr}^2) \tag{5.4}$$
First, we use the proposed WB algorithm to determine the number of components Rt of each pattern x(t). Following the same method as in Subsection 4.3 and according to Eq. (4.6),
$$M_m(t) = \sum_{r=1}^{m} P_r\, p_r(t|\Theta_r).$$
We again consider the three models m = 2, 3 and 4, i.e. M2(t), M3(t) and M4(t). The EM algorithm is then used to estimate the parameters
$$\Theta^{(m)} = (\hat{P}_1, \cdots, \hat{P}_m, \hat{\Theta}_1, \cdots, \hat{\Theta}_m)$$
of each Mm(t), where m = 2, 3 and 4. The estimate is denoted $\hat{\Theta}^{(m)}$, and a data set $B^{(m)} = (\hat{y}_1, \cdots, \hat{y}_n)$, n = 300, is generated from $\hat{M}_m(t) = \sum_{r=1}^{m} \hat{P}_r\, p_r(t|\hat{\Theta}_r)$ for each model.
Finally, the residual Ei of the ith observation, defined in Eq. (4.7), is used to calculate each Qi, defined in Eq. (4.8). For each model, µboot and σboot are computed, and then the best resampling parameters in x(t) can be determined.
Next, we assume that the probability density function (abbreviated as p.d.f.) of the conceptual class ωi, which describes the relative likelihood of observing a given instance x(t) and with which the MINN approximates the conceptual image of the user, can be represented as a linear combination of new component densities g(x(t)|Γj) in the form
$$\Pr(x(t)) = \sum_{j=1}^{M} \tau_j\, g(x(t)|\Gamma_j), \tag{5.5}$$
where τj is the weight of prototype j, Γj represents the parameter set $\{\Lambda_j, \epsilon_j^2\}$ of prototype j, $\epsilon_j^2$ is the variance of the jth prototype, and Λj is the parameter set of the prototype in component j. In the traditional mixture density model, prototypes are points in a feature space, and {Λj; j = 1, 2, · · · , M} are the means of the components. Since the features of the proposed image content retrieval system are parameters of mixture Gaussian distributions, we assume that the prototypes are mixture Gaussian distributions, too.
Suppose that each prototype comprises Rj Gaussian distributions. Then Λj can be considered a parameter set $\{\mu_{ji}, \sigma_{ji}^2, P_{ji}; i = 1, 2, \cdots, R_j\}$ describing the mixture Gaussian distribution of the corresponding prototype, where $\mu_{ji}$, $\sigma_{ji}^2$ and $P_{ji}$ are the mean, the variance, and the prior probability of cluster i of the prototype, respectively.
The new component density g(x(t)|Γj) is defined as g(x(t)|Γj) = exp
where D(x(t), Λj) is a function measuring the distance between x(t) and Λj, which is described below.
Suppose there are two mixture Gaussian distributions $p(t) = \sum_{i=1}^{R_p} P_{pi}\, p(t|\Theta_{pi})$ and $q(t) = \sum_{j=1}^{R_q} P_{qj}\, q(t|\Theta_{qj})$, where $\Theta_{pi}$ represents the parameter set $\{\mu_{pi}, \sigma_{pi}^2\}$ of cluster i in p(t), $\Theta_{qj}$ represents the parameter set $\{\mu_{qj}, \sigma_{qj}^2\}$ of cluster j in q(t), and $p(t|\Theta_{pi})$ and $q(t|\Theta_{qj})$ are the Gaussian components of p(t) and q(t), respectively. Suitable values of Rp and Rq can be determined by the proposed WB algorithm of Subsection 4.3, as described in the paragraphs above. Once Rp and Rq have been determined with the WB algorithm, we consider the similarity measure between the two distributions in terms of their parameters.
Let $\Lambda_p = \{\Lambda_i^{(p)}; i = 1, 2, \cdots, R_p\}$ denote the parameter set of p(t) and $\Lambda_q = \{\Lambda_j^{(q)}; j = 1, 2, \cdots, R_q\}$ denote the parameter set of q(t), where $\Lambda_i^{(p)} = \{P_{pi}, \Theta_{pi}\}$ and $\Lambda_j^{(q)} = \{P_{qj}, \Theta_{qj}\}$.
Define the relation between $\Lambda_i^{(p)}$ and $\Lambda_j^{(q)}$ as the function G(Θpi, Θqj) = 1
When one of the distributions degenerates to a point, G(Θpi, Θqj) reduces to a Gaussian likelihood function.
The distance between p(t) and q(t) can be calculated as [63]:
$$\int_{-\infty}^{\infty} [p(t) - q(t)]^2\,dt. \tag{5.9}$$
Since t is a dummy parameter in Eq.(5.9), we can derive the following distance function between p(t) and q(t).
The derivation of (5.10) is as follows:
$$D(\Lambda_p, \Lambda_q) = \int_{-\infty}^{\infty} [p(t) - q(t)]^2\,dt =$$
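Because p(t) and q(t) are both Gaussian mixtures, the integral of [p(t) − q(t)]² can be evaluated in closed form. The sketch below expands the square into three cross terms and uses the standard identity that the integral over t of N(t; µ1, σ1²) N(t; µ2, σ2²) equals N(µ1; µ2, σ1² + σ2²). It is an illustration based on that identity only and is not claimed to reproduce the exact form of Eq. (5.10); the function names and the (priors, means, variances) tuple layout are assumptions.

```python
import numpy as np

def gaussian_overlap(mu1, var1, mu2, var2):
    """Integral over t of the product of two 1-D Gaussian densities."""
    var = var1 + var2
    return np.exp(-(mu1 - mu2) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def cross_term(p, q):
    """Integral of p(t) q(t) dt for mixtures given as (priors, means, variances)."""
    wp, mp, vp = (np.asarray(a, dtype=float) for a in p)
    wq, mq, vq = (np.asarray(a, dtype=float) for a in q)
    overlap = gaussian_overlap(mp[:, None], vp[:, None], mq[None, :], vq[None, :])
    return float(wp @ overlap @ wq)

def l2_distance(p, q):
    """Integral of [p(t) - q(t)]^2 dt, expanded into three cross terms."""
    return cross_term(p, p) - 2.0 * cross_term(p, q) + cross_term(q, q)

# Toy usage: a two-component mixture against a single Gaussian.
p = ([0.4, 0.6], [0.2, 0.7], [0.01, 0.02])
q = ([1.0], [0.5], [0.05])
d_pq = l2_distance(p, q)
```

When one component's variance tends to zero, `gaussian_overlap` tends to the Gaussian density of the other component evaluated at that point, which agrees with the remark above about a distribution degenerating to a point.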
5.2.3 Learning Rules for the Image Content Retrieval System
In the image content retrieval system, the reinforced and antireinforced learning rules described in Subsection 3.3.2 can also be applied to the energy function (3.11) with the new component density g(x(t)|Γj). As before, Γj represents the parameter set $\{\Lambda_j, \epsilon_j^2\}$ of component j, and Λj can be considered a parameter set $\{\mu_{ji}, \sigma_{ji}^2, P_{ji}; i = 1, 2, \cdots, R_j\}$ describing the mixture Gaussian distribution of the corresponding prototype, where $\mu_{ji}$, $\sigma_{ji}^2$ and $P_{ji}$ are the mean, the variance, and the prior probability of cluster i of the prototype, respectively. For convenience, we rewrite the energy function here:
E(Xi, Γj) = − ln
By taking the partial derivative of E with respect to the parameters of the conceptual image class, we obtain the probability of a component j given an input pattern bkl, which is defined as
$$g(\Gamma_j|b_{kl}) = \frac{g(b_{kl}|\Gamma_j)}{g(b_{kl})}. \tag{5.16}$$
Because of the constraints $\sum_{j=1}^{M} G_j = 1$ and $\sum_{i=1}^{R_j} P_{ji} = 1$, the EM procedure is applied to set the prior probability τj of component j and Pji of cluster i in component j as
Gnewj =
where Wjm is a equation
To evaluate the performance of the MINN-based image retrieval, 10 categories of pictures (Africa, Beach, Building, Buses, Dinosaurs, Elephants, Flowers, Horses, Mountains, and Food) are selected [6] from the COREL Gallery 1,000,000. Each category contains 100 images from the Gallery data set. The mean and variance of the precision rate are computed as follows. Given a query image q belonging to a category C, a retrieved image is considered a match if it belongs to category C. Suppose the first N retrievals contain $n_q$ matched candidates; then the precision rate for the query image q is defined as $Precision(q) = n_q / N$, and the average precision rate of category C is computed as $P_C = \frac{1}{W} \sum_{q \in C} Precision(q)$, where W is the number of images in category C.
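The two definitions above translate directly into code. The following is a minimal sketch; the function names and the toy rankings are made up for illustration and are not experimental data.

```python
def precision_at_n(ranked_labels, query_label, n):
    """Precision(q) = n_q / N: share of the first N retrievals in the query's category."""
    if n <= 0:
        raise ValueError("n must be positive")
    matches = sum(1 for label in ranked_labels[:n] if label == query_label)
    return matches / n

def category_precision(rankings, category, n):
    """P_C: mean of Precision(q) over the W query images of one category."""
    scores = [precision_at_n(ranked, category, n) for ranked in rankings]
    return sum(scores) / len(scores)

# Toy usage: two queries from the "bus" category, N = 4.
rankings = [
    ["bus", "bus", "horse", "bus"],
    ["bus", "food", "food", "bus"],
]
p_bus = category_precision(rankings, "bus", n=4)
```

Here the two queries have precision 3/4 and 2/4, so the category average is 0.625.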
For each category, one subnet of the MINN is trained to represent it. The training data of the MINN are generated as follows. For each subnet, the 100 images of the corresponding category are first clustered by k-means to form the positive example set of the training set, which captures the user's concept and is used to train a prototype model. For each image in the database, five points are randomly selected to simulate the user's clicks on that image; the WCH and WTH features are then extracted as described in Subsection 5.2.1, and the EM algorithm is applied to obtain the parameters of the mixture Gaussian distribution.
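The last step, fitting a mixture of Gaussians to a weighted histogram with EM, can be sketched as below. This is a simplified illustration, not the implementation used in the experiments: each bin is treated as a sample located at its bin centre and weighted by its mass, the number of components is fixed by the caller rather than chosen by the WB algorithm, and the initialization at evenly spaced quantiles is an assumption. The name `fit_mixture_to_histogram` is hypothetical.

```python
import numpy as np

def fit_mixture_to_histogram(hist, n_components, n_iter=200):
    """EM for a 1-D Gaussian mixture fitted to histogram bins (assumed setup)."""
    hist = np.asarray(hist, dtype=float)
    centers = (np.arange(hist.size) + 0.5) / hist.size   # bin centres in [0, 1]
    mass = hist / hist.sum()                              # weight of each bin
    # Initial means at evenly spaced quantiles of the histogram.
    quantiles = (np.arange(n_components) + 0.5) / n_components
    means = np.interp(quantiles, np.cumsum(mass), centers)
    total_var = np.sum(mass * (centers - np.sum(mass * centers)) ** 2)
    variances = np.full(n_components, total_var + 1e-6)
    priors = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each bin centre.
        diff = centers[:, None] - means[None, :]
        dens = np.exp(-diff ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
        resp = priors * dens
        resp /= resp.sum(axis=1, keepdims=True) + 1e-300
        # M-step: update parameters, weighting every bin by its mass.
        weighted = mass[:, None] * resp
        comp_mass = weighted.sum(axis=0) + 1e-12
        means = (weighted * centers[:, None]).sum(axis=0) / comp_mass
        diff = centers[:, None] - means[None, :]
        variances = (weighted * diff ** 2).sum(axis=0) / comp_mass + 1e-6
        priors = comp_mass / comp_mass.sum()
    return priors, means, variances

# Toy usage: a synthetic bimodal 64-bin histogram.
centers = (np.arange(64) + 0.5) / 64
hist = (0.5 * np.exp(-(centers - 0.25) ** 2 / (2 * 0.05 ** 2))
        + 0.5 * np.exp(-(centers - 0.75) ** 2 / (2 * 0.05 ** 2)))
priors, means, variances = fit_mixture_to_histogram(hist, n_components=2)
```

The returned priors, means and variances play the role of the parameters Ptr, µtr and σ²tr of one pattern x(t).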
After the prototype model has been trained, each image in the database is compared to the prototype model, and a ranked list is returned. To demonstrate the performance of the MINN-based image retrieval with relevance feedback, the MINN system is retrained automatically, using the first 10 images in the ranked list that are in the same category as the query image as the positive image set and 5 unmatched images as the negative image set. We compare the proposed MINN-based image retrieval with two leading image retrieval methods, IRM [1] and CLUE [39]. Both methods use the same number of test images and the same categories from the COREL data set as the proposed system. The experimental results are shown in Table 5.1. The first row is the result of the IRM method, the second row that of the CLUE method, the third row that of the MINN without relevance feedback, and the fourth row that of the MINN with relevance feedback.
Table 5.1: The average precision rates of four retrieval methods, (1) IRM, (2) CLUE, (3) MINN without relevance feedback, and (4) MINN with relevance feedback (RF), on the different categories of images.
Category       Africa   Beach   Building   Buses   Dinosaurs   Elephants
IRM            0.475    0.325   0.33       0.36    0.981       0.4
CLUE           0.49     0.34    0.35       0.62    0.98        0.29
MINN           0.39     0.21    0.28       0.35    0.86        0.37
MINN with RF   0.62     0.33    0.55       0.49    0.99        0.42

Category       Flowers  Horses  Mountains  Food    Average
IRM            0.406    0.719   0.342      0.34    0.468
CLUE           0.75     0.7     0.28       0.59    0.538
MINN           0.39     0.63    0.29       0.26    0.403
MINN with RF   0.63     0.84    0.47       0.58    0.593
Table 5.1 shows that, without segmentation and using only the color histogram and texture as image features, the MINN without relevance feedback performs slightly worse than the IRM method. With the relevance feedback mechanism, however, the average precision rate of the MINN improves from 40.3% to 59.3%, which exceeds the 46.8% of the IRM method and the 53.8% of the CLUE method. This indicates that the MINN with relevance feedback may match the user's intent better than the CLUE or IRM method. For reference, Figs. 5.7 to 5.16 illustrate the results retrieved by the MINN method for the 10 classes, respectively. Only the top-20 results of each class are shown, and the numerals below each image are the rank, the index number and the similarity score, respectively.
Each figure also gives the number of matches. Even though the match rate is not very good for certain classes, several of the retrieved images for those classes have the same color distribution as the query image and look very similar to it to the human eye.
Figure 5.7: The retrieval results of the 'africa' class: 8 matches out of 20.
Figure 5.8: The retrieval results of the 'beach' class: although there is only 1 match out of 20, several images with interleaved red and white regions are similar to the query image.
Figure 5.9: The retrieval results of the 'building' class: 9 matches out of 20. Interestingly, several images showing the leg of an elephant are similar to the query image.
Figure 5.10: The retrieval results of the 'bus' class: 15 matches out of 20. The main reason for the high match rate is that the color of the bus dominates the color histogram of the image.
Figure 5.11: The retrieval results of the 'dinosaur' class: 19 matches out of 20. The main reason for the high match rate is that the color of the background dominates the color histogram of the image.
Figure 5.12: The retrieval results of the 'elephant' class: 8 matches out of 20. Several other images still have the same color distribution as the query image.
Figure 5.13: The retrieval results of the 'flowers' class: 6 matches out of 20. Several other images still have a color distribution similar to that of the query image.
Figure 5.14: The retrieval results of the 'horses' class: 12 matches out of 20. Interestingly, several images of elephants are similar to the query image of horses.
Figure 5.15: The retrieval results of the 'mountains' class: 3 matches out of 20. Several other images still have a color distribution similar to that of the query image.
Figure 5.16: The retrieval results of the 'food' class: 5 matches out of 20. Several other images still have a texture distribution similar to that of the query image.