The MINN for CBIR - Content-based Image Retrieval


5.2 The MINN for CBIR

5.2.1 Instance Extraction

As noted in the last section, the prototype of the image query system may lose spatial information about the image. This is mainly caused by the way instances are extracted: although each instance consists of 5 subregions, and each subregion is composed of 2x2 pixels, the method considers only the subregions neighboring the point that the user selected (see Fig. 5.1).

In this subsection, we propose a new method for generating instances from the image that takes the spatial information of the image into account. Instead of performing precise image segmentation, we look for a new way to index the image; and instead of using the color histogram of the whole image, which would again lose the spatial information about the image, we propose a histogram named the Weighted Color Histogram.

When the user selects a point on an image, it usually means that the point is important from a certain viewpoint or to a certain extent, in that it carries significant information, and that this importance decreases gradually with distance: the farther a pixel is from the point, the lower its importance. We assume that the importance decays according to some distribution, e.g. a Gaussian distribution. After the user has decided where to select an instance from the image, a Gaussian-like mask is adopted to create a masked image, and the weighted color histogram of the masked image is calculated.

The idea of the Gaussian-like mask comes from smoothing parameters, which control the effective size of the local neighborhood, and from the class of regular functions fitted locally [59] (see Fig. 5.3). The local neighborhood is specified by a kernel function K_λ(x₀, x), which assigns weights to points x in a region around x₀ that decrease exponentially with their squared Euclidean distance from x₀. In our image indexing method, we adopt the Gaussian kernel as a weight function based on the Gaussian density function; the parameter λ corresponds to the variance of the Gaussian density, which is set to λ = 0.2 in [59], and controls the width of the neighborhood:

$$K_\lambda(x_0, x) = \frac{1}{\lambda} \exp\left[-\frac{\|x - x_0\|^2}{2\lambda}\right] \tag{5.1}$$
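As a minimal sketch of how the kernel of Eq. (5.1) assigns weights, the following Python fragment evaluates it for a few points around a query point. The function name `gaussian_kernel` and the sample points are hypothetical and chosen only for illustration; λ = 0.2 is the value quoted from [59].

```python
import numpy as np

def gaussian_kernel(x0, x, lam=0.2):
    """Eq. (5.1): kernel weight of each row of x relative to the point x0."""
    x0 = np.asarray(x0, dtype=float)
    x = np.asarray(x, dtype=float)
    # Squared Euclidean distance of every point to x0.
    sq_dist = np.sum((x - x0) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * lam)) / lam

# Illustrative points at increasing distance from the query point (0, 0).
points = np.array([[0.0, 0.0], [0.3, 0.4], [1.0, 1.0]])
weights = gaussian_kernel([0.0, 0.0], points)
```

The weight is largest at the query point itself (1/λ) and falls off with the squared distance, which is the behaviour the Gaussian-like mask relies on.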

To apply the Gaussian kernel to image representation, it must be extended from 1-D to 2-D as below. Given a point m = (m_x_i, m_y_i) selected by the user, the Gaussian-like mask located at m is:

G_m(x, y) = exp

where σ²_x_i and σ²_y_i are the variances of the Gaussian-like mask; in our system both are set to σ²_x_i = 0.05 and σ²_y_i = 0.05.

Figure 5.3: The k-nearest-neighbor (k = 30) and kernel-weighted functions

The algorithm for calculating the Weighted Color Histogram (WCH) is described below:

1. The user selects one point (m_x_i, m_y_i) on the image that he/she considers to match the concept.

2. Calculate the weight of every pixel in the image relative to (m_x_i, m_y_i) using Eq. (5.2), and obtain the Gaussian-like mask matrix.

3. Calculate the histogram of each component in the Lab color space. When accumulating a histogram, instead of incrementing the bin of the pixel at (x, y) by one as usual, increment it by the value at (x, y) of the Gaussian-like mask matrix from step 2, i.e. the weight of that neighbor.

4. The procedure is depicted in Fig. 5.4.
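The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the system's implementation: the exact form of the mask in Eq. (5.2) is assumed to be an unnormalized 2-D Gaussian, pixel coordinates are assumed to be normalized to [0, 1] (so that a variance of 0.05 is meaningful), the Lab channels are assumed to be rescaled to [0, 1], and 256 bins per channel are assumed. All function names are hypothetical.

```python
import numpy as np

def gaussian_mask(height, width, mx, my, var_x=0.05, var_y=0.05):
    """Gaussian-like mask centred on the selected point (mx, my).

    ASSUMPTION: coordinates are normalised to [0, 1] and the mask is
    exp(-((x - mx)^2 / (2 var_x) + (y - my)^2 / (2 var_y))).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    xn = xs / max(width - 1, 1)
    yn = ys / max(height - 1, 1)
    return np.exp(-((xn - mx) ** 2 / (2 * var_x) + (yn - my) ** 2 / (2 * var_y)))

def weighted_histogram(channel, mask, bins=256, value_range=(0.0, 1.0)):
    """Histogram in which each pixel adds its mask weight instead of 1."""
    hist, _ = np.histogram(channel.ravel(), bins=bins, range=value_range,
                           weights=mask.ravel())
    return hist

def weighted_color_histogram(lab_image, mx, my, bins=256):
    """One weighted histogram per channel of an (H, W, 3) image."""
    height, width, _ = lab_image.shape
    mask = gaussian_mask(height, width, mx, my)
    return np.stack([weighted_histogram(lab_image[..., c], mask, bins)
                     for c in range(3)])

# Random stand-in for a Lab image whose channels were rescaled to [0, 1].
rng = np.random.default_rng(0)
lab = rng.random((32, 48, 3))
wch = weighted_color_histogram(lab, mx=0.5, my=0.5)
```

Because every pixel contributes its mask weight, each channel histogram sums to the total mask weight rather than to the number of pixels.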

After extracting the color information from the image, we extract the texture information. Texture characteristics are very commonly used in image retrieval systems, and texture is defined as being specified by the statistical distribution of the spatial dependencies of gray-level properties. Human perception is sensitive to the repetitiveness, the directionality and the granularity of an image. Following the same idea as the Weighted Color Histogram introduced above, we propose the Weighted Texture Histogram (WTH) as the texture feature of an instance.

Figure 5.4: The schematic diagram of the Weighted Color Histogram

Since the Gabor representation is optimal [60] in the sense of minimizing the uncertainty in the space and frequency domains, the Gabor wavelet decomposition [61] is used to extract texture features from the image at multiple scales and orientations. Before an image is decomposed, a Gabor filter set is created from a two-dimensional Gaussian-modulated complex sinusoid function

$$g(x, y) = \left(\frac{1}{2\pi\sigma_x\sigma_y}\right) \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right) + j\omega x\right] \tag{5.3}$$

In Eq. (5.3), the two parameters σ_x and σ_y determine the size of the Gaussian envelope along the respective axes. By choosing the number of dilations T and the number of rotations K of the rectilinear coordinates in g(x, y), a Gabor filter set G_f = {g_tk : 1 ≤ t ≤ T, 1 ≤ k ≤ K} is created from

$$g_{tk}(x, y) = a^{-t} g(x', y'), \quad a > 1,$$

in which

$$x' = a^{-t}(x\cos\theta + y\sin\theta), \quad y' = a^{-t}(-x\sin\theta + y\cos\theta),$$

and θ = kπ/K is the angle of the rotation. With a Gabor filter g_tk, the filter response of the image I can be calculated by the convolution

$$I_{tk} = I * g_{tk},$$

and its spectrogram is calculated as

$$S_{tk}(x) = |I_{tk}(x)|^2.$$

Figure 5.5: The schematic diagram of the Weighted Texture Histogram
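The construction of the filter set and of the spectrograms can be sketched as below. The numerical values of a, σ_x, σ_y, ω and the kernel size are not given in the text and are assumptions made only so that the example runs; the function names are hypothetical, and the convolution is carried out as a circular FFT convolution for brevity.

```python
import numpy as np

def gabor_mother(x, y, sigma_x, sigma_y, omega):
    """Mother function g(x, y) of Eq. (5.3)."""
    norm = 1.0 / (2.0 * np.pi * sigma_x * sigma_y)
    envelope = -0.5 * (x ** 2 / sigma_x ** 2 + y ** 2 / sigma_y ** 2)
    return norm * np.exp(envelope + 1j * omega * x)

def gabor_bank(T=3, K=4, size=15, a=2.0, sigma_x=2.0, sigma_y=2.0, omega=1.0):
    """Filter set {g_tk}; a, sigma_x, sigma_y, omega and size are assumed values."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    bank = {}
    for t in range(1, T + 1):
        for k in range(1, K + 1):
            theta = k * np.pi / K
            # Dilated and rotated coordinates x', y'.
            xr = a ** (-t) * (x * np.cos(theta) + y * np.sin(theta))
            yr = a ** (-t) * (-x * np.sin(theta) + y * np.cos(theta))
            bank[(t, k)] = a ** (-t) * gabor_mother(xr, yr, sigma_x, sigma_y, omega)
    return bank

def spectrogram(image, kernel):
    """S_tk = |I * g_tk|^2, computed with a circular (FFT) convolution."""
    shape = image.shape
    response = np.fft.ifft2(np.fft.fft2(image, s=shape) * np.fft.fft2(kernel, s=shape))
    return np.abs(response) ** 2

# Random stand-in for a gray-level image.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
bank = gabor_bank()
planes = {tk: spectrogram(image, g) for tk, g in bank.items()}
```

With T = 3 scales and K = 4 orientations this yields 12 texture planes, one per filter, each the same size as the input image.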

In our system, a set of Gabor filters (scale = 3, orientation = 4) is used to create a set of texture planes from an image, and a Weighted Texture Histogram is calculated for each texture plane. The algorithm for calculating the Weighted Texture Histogram (WTH) is described below:

1. The user selects one point (m_x_i, m_y_i) on the image that he/she considers to match the concept.

2. Apply the Gabor texture algorithm to the image with the two chosen parameters scale and orientation (e.g. scale = 3, orientation = 4), generating scale × orientation texture planes.

3. Treat each texture plane as a gray-level image, calculate the weight of every pixel of each texture image relative to (m_x_i, m_y_i) using Eq. (5.2), and obtain the Gaussian-like mask matrix.

4. Calculate the histogram of each texture plane. When accumulating a histogram, instead of incrementing the bin of the pixel at (x, y) by one as usual, increment it by the value at (x, y) of the Gaussian-like mask matrix from step 3, i.e. the weight of that neighbor.

5. The procedure is depicted in Fig. 5.5.

After calculating the Weighted Texture Histograms of the image, we obtain scale × orientation texture histograms of 256 dimensions, one per plane, and quantize each histogram from 256 dimensions to 64 dimensions. The texture feature around a pixel x is then represented by scale × orientation 64-dimensional histograms. An example of the Gabor decomposition of an image is shown in Fig. 5.6.
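The text does not state how the 256-bin histograms are reduced to 64 bins. One simple possibility, shown here purely as an assumption, is to merge every four adjacent bins by summation, which preserves the total weight. The function name `quantize_histogram` is hypothetical.

```python
import numpy as np

def quantize_histogram(hist, out_bins=64):
    """Merge adjacent bins by summation (ASSUMED quantization scheme)."""
    hist = np.asarray(hist, dtype=float)
    if hist.size % out_bins != 0:
        raise ValueError("histogram length must be a multiple of out_bins")
    # Each output bin is the sum of hist.size // out_bins consecutive input bins.
    return hist.reshape(out_bins, -1).sum(axis=1)

# Illustrative 256-bin histogram whose bin i holds the value i.
hist256 = np.arange(256, dtype=float)
hist64 = quantize_histogram(hist256)
```

Applied to each of the scale × orientation texture planes, this would give the 64-dimensional histograms that make up the texture feature.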

Since image segmentation is not performed, there is a price to be paid for maintaining the information from the images. In the next subsection, we use mixtures of several Gaussians to approximate the feature formed by the Weighted Color Histogram and the Weighted Texture Histogram.

5.2.2 Image Retrieval

Since the proposed features for image retrieval are no longer just individual points but also include the variance and prior probability of the points, a new component density function is derived.

Figure 5.6: An example of the Weighted Texture Histogram

Subsection 3.3.1 considered an image I and a set of i.i.d. patterns X = {x(t); t = 1, 2, …, N}. Here this means that a number of masked images are first selected from the image, and a set of i.i.d. patterns, the so-called instances, is then extracted from the masked images. These instances are denoted by X = {x(t); t = 1, 2, …, N}, where each pattern x(t) = {P_tr, Θ_tr; r = 1, 2, …, R_t} is the set of parameters of the mixture of R_t Gaussian distributions used to approximate the color histogram of the corresponding masked image. In each pattern x(t), P_tr denotes the prior probability of cluster r, and Θ_tr represents the parameter set {µ_tr, σ²_tr} of cluster r, as below:

$$x(t) = \sum_{r=1}^{R_t} P_r\, p_r(t \mid \Theta_r) = \sum_{r=1}^{R_t} P_r\, p_r(t \mid \mu_{tr}, \sigma_{tr}^2) \tag{5.4}$$

First, we use the proposed WB algorithm to determine the number of components R_t of each pattern x(t). The method is the same as in Subsection 4.3 and follows Eq. (4.6):

$$M_m(t) = \sum_{r=1}^{m} P_r\, p_r(t \mid \Theta_r).$$

We again consider the three models m = 2, 3 and 4, namely M₂(t), M₃(t) and M₄(t). The EM algorithm is used to estimate the parameters

$$\hat{\Theta}(m) = (\hat{P}_1, \cdots, \hat{P}_m, \hat{\Theta}_1, \cdots, \hat{\Theta}_m)$$

of each M_m(t), m = 2, 3 and 4, and a data set B(m) = (ŷ_1, …, ŷ_n), n = 300, is generated from

$$\hat{M}_m(t) = \sum_{r=1}^{m} \hat{P}_r\, p_r(t \mid \hat{\Theta}_r)$$

for each model.

Finally, the residual E_i of the i-th observation, defined in Eq. (4.7), is used to calculate each Q_i, defined in Eq. (4.8). For each model, µ_boot and σ_boot are computed, and the best resampling parameters in x(t) can then be determined.

Next, we assume that the probability density function (abbreviated as p.d.f.) of the conceptual class ω_i, which describes the relative likelihood of a given instance x(t) and with which the MINN approximates the conceptual image of the user, can be represented as a linear combination of new component densities g(x(t)|Γ_j) in the form

$$P_r(x(t)) = \sum_{j=1}^{M} \tau_j\, g(x(t) \mid \Gamma_j), \tag{5.5}$$

where τ_j is the weight of prototype j, Γ_j represents the parameter set {Λ_j, ε²_j} of prototype j, ε²_j is the variance of the j-th prototype, and Λ_j is the parameter set of the prototype in component j. In the traditional mixture density model, prototypes are points in a feature space, and {Λ_j; j = 1, 2, …, M} are the means of the components. Since the features of the proposed image content retrieval system are parameters of mixture Gaussian distributions, we assume that the prototypes are mixture Gaussian distributions, too.

Suppose that each prototype is composed of R_j Gaussian distributions. Then Λ_j can be considered as a parameter set {µ_ji, σ²_ji, P_ji; i = 1, 2, …, R_j} describing the mixture Gaussian distribution of the corresponding prototype, where µ_ji, σ²_ji and P_ji are the mean, the variance, and the prior probability of cluster i of the prototype, respectively.

The new component density g(x(t)|Γ_j) is defined as g(x(t)|Γ_j) = exp

where D(x(t), Λ_j) is a function measuring the distance between x(t) and Λ_j, which is described as follows.

Suppose there are two mixture Gaussian distributions p(t) = Σ_{i=1}^{R_p} P_pi p(t|Θ_pi) and q(t) = Σ_{j=1}^{R_q} P_qj p(t|Θ_qj), where Θ_pi represents the parameter set {µ_pi, σ²_pi} of cluster i in p(t), Θ_qj represents the parameter set {µ_qj, σ²_qj} of cluster j in q(t), and p(t|Θ_pi) and p(t|Θ_qj) are the Gaussian components of p(t) and q(t), respectively. Suitable values of R_p and R_q can be determined by the proposed WB algorithm, which is presented in Subsection 4.3 and summarized in the paragraphs above. Once R_p and R_q have been determined with the WB algorithm, we consider the similarity measure between the two distributions in terms of their parameters.

Let Λ_p = {Λ^(p)_i; i = 1, 2, …, R_p} denote the parameter set of p(t) and Λ_q = {Λ^(q)_j; j = 1, 2, …, R_q} denote the parameter set of q(t), where Λ^(p)_i = {P_pi, Θ_pi} and Λ^(q)_j = {P_qj, Θ_qj}.

Define the relation between Λ^(p)_i and Λ^(q)_j as the function G(Θ_pi, Θ_qj) = 1

When one of the distributions degenerates to a point, G(Θ_pi, Θ_qj) reduces to a Gaussian likelihood function.

The distance between p(t) and q(t) can be calculated as [63]:

$$\int_{-\infty}^{\infty} [p(t) - q(t)]^2 \, dt. \tag{5.9}$$

Since t is a dummy parameter in Eq.(5.9), we can derive the following distance function between p(t) and q(t).

The derivation of (5.10) is as follows:

$$D(\Lambda_p, \Lambda_q) = \int_{-\infty}^{\infty} [p(t) - q(t)]^2 \, dt$$
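The closed form of Eq. (5.10) is not reproduced in this text. As a hedged illustration of how the integral of Eq. (5.9) can be evaluated from the mixture parameters alone, the sketch below expands the square and uses the standard identity that the integral of the product of two Gaussian densities N(t; µ_a, σ²_a) and N(t; µ_b, σ²_b) equals a Gaussian in µ_a − µ_b with variance σ²_a + σ²_b. The resulting expression is therefore an assumption about the general shape of the distance, not necessarily the author's exact formula; all function names and the example parameters are hypothetical.

```python
import numpy as np

def gaussian_overlap(mu_a, var_a, mu_b, var_b):
    """Integral over t of N(t; mu_a, var_a) * N(t; mu_b, var_b)."""
    var = var_a + var_b
    return np.exp(-(mu_a - mu_b) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def cross_term(w_a, mu_a, var_a, w_b, mu_b, var_b):
    """Sum over all component pairs of w_a[i] * w_b[j] * overlap(i, j)."""
    overlap = gaussian_overlap(mu_a[:, None], var_a[:, None],
                               mu_b[None, :], var_b[None, :])
    return w_a @ overlap @ w_b

def mixture_l2_distance(p, q):
    """Integral of [p(t) - q(t)]^2; p and q are (weights, means, variances)."""
    return cross_term(*p, *p) - 2.0 * cross_term(*p, *q) + cross_term(*q, *q)

def mixture_pdf(t, weights, means, variances):
    """Density of a 1-D Gaussian mixture evaluated at the points t."""
    comps = np.exp(-(t[:, None] - means) ** 2 / (2.0 * variances))
    comps = comps / np.sqrt(2.0 * np.pi * variances)
    return comps @ weights

# Two illustrative two-component mixtures (priors, means, variances).
p = (np.array([0.6, 0.4]), np.array([0.2, 0.7]), np.array([0.01, 0.02]))
q = (np.array([0.5, 0.5]), np.array([0.3, 0.8]), np.array([0.02, 0.01]))

closed = mixture_l2_distance(p, q)

# Numerical check of the same integral with a fine Riemann sum.
t = np.linspace(-2.0, 3.0, 200001)
dt = t[1] - t[0]
numeric = np.sum((mixture_pdf(t, *p) - mixture_pdf(t, *q)) ** 2) * dt
```

The closed-form value agrees with the numerical integral, and the distance of a mixture to itself is zero, which is what a distance between parameter sets should satisfy.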

5.2.3 Learning Rules for the Image Content Retrieval System

In the image content retrieval system, the reinforced and antireinforced learning rules described in Subsection 3.3.2 can also be applied to the energy function (3.11) with the new component density g(x(t)|Γ_j). Here Γ_j represents the parameter set {Λ_j, ε²_j} of component j, and Λ_j can be considered as a parameter set {µ_ji, σ²_ji, P_ji; i = 1, 2, …, R_j} describing the mixture Gaussian distribution of the corresponding prototype, where µ_ji, σ²_ji and P_ji are the mean, the variance, and the prior probability of cluster i of the prototype, respectively. For convenience, we rewrite the energy function here:

E(Xi, Γj) = − ln

By taking the partial derivative of E with respect to the parameters of the conceptual image class, we obtain the probability of component j given the input pattern b_kl, which is defined as

$$g(\Gamma_j \mid b_{kl}) = \frac{g(b_{kl} \mid \Gamma_j)}{g(b_{kl})}. \tag{5.16}$$

Given the constraints Σ_{j=1}^{M} G_j = 1 and Σ_{i=1}^{R_j} P_ji = 1, the EM procedure is applied to set the prior probability τ_j of component j and P_ji of cluster i in component j as

G^new_j =

where W_jm is a equation

To evaluate the performance of MINN-based image retrieval, 10 categories of pictures (Africa, Beach, Building, Buses, Dinosaurs, Elephants, Flowers, Horses, Mountains, and Food) are selected [6] from the COREL Gallery 1,000,000. Each category contains 100 images from the Gallery data set. The mean and variance of the precision rate are computed as follows. Given a query image q belonging to a category C, a retrieved image is considered a match if it belongs to category C. Suppose the first N retrievals contain n_q matched candidates; the precision rate for the query image q is then defined as Precision(q) = n_q / N, and the average precision rate of category C is computed as P_C = (1/W) Σ_{q∈C} Precision(q), where W is the number of images in category C.
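The two quantities defined above can be computed as in the following minimal sketch. The function names and the toy ranked lists are hypothetical; each ranked list simply holds the category labels of the retrieved images in rank order.

```python
def precision(retrieved_categories, query_category, n):
    """Precision(q) = n_q / N over the first N retrieved images."""
    top = retrieved_categories[:n]
    matches = sum(1 for category in top if category == query_category)
    return matches / n

def average_precision_rate(ranked_lists, category, n):
    """P_C = (1 / W) * sum of Precision(q) over the W queries of category C."""
    rates = [precision(ranked, category, n) for ranked in ranked_lists]
    return sum(rates) / len(rates)

# Toy example: two "Beach" queries, first N = 4 retrievals each.
ranked_lists = [
    ["Beach", "Africa", "Beach", "Food"],
    ["Beach", "Beach", "Beach", "Horses"],
]
p_beach = average_precision_rate(ranked_lists, "Beach", n=4)
```

In the toy example the two queries have precision 2/4 and 3/4, so the average precision rate of the category is 0.625.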

For each category, one subnet of the MINN is trained to represent it. The training data of the MINN are generated as follows. For each subnet, the 100 images of its category are first clustered by k-means and used as the positive example set of the training set, in order to capture the user's concept and to train a prototype model. For each image in the database, five points are randomly selected to simulate the user's clicks on that image; the WCH and WTH features are then extracted as described in Subsection 5.2.1, and the EM algorithm is applied to obtain the parameters of the mixture Gaussian distribution.

After the prototype model has been trained, each image in the database is compared with the prototype model, and a ranked list is returned. To demonstrate the performance of MINN-based image retrieval with relevance feedback, the MINN system is retrained automatically, using the first 10 images in the ranked list that are in the same category as the query image as the positive image set and 5 unmatched images as the negative image set. We compared the proposed MINN-based image retrieval with two leading image retrieval methods, IRM [1] and CLUE [39]. Both methods use the same number of test images and categories from the COREL data set as the proposed system. The experimental results are shown in Table 5.1. The first row is the result of the IRM method, the second row that of the CLUE method, the third row that of the MINN without relevance feedback, and the fourth row that of the MINN with relevance feedback.

Table 5.1: The average precision rates of four different retrieval results, from (1) IRM, (2) CLUE, (3) MINN without relevance feedback, and (4) MINN with relevance feedback (RF), on the different categories of images.

Category       Africa   Beach   Building   Buses   Dinosaurs   Elephants
IRM            0.475    0.325   0.33       0.36    0.981       0.4
CLUE           0.49     0.34    0.35       0.62    0.98        0.29
MINN           0.39     0.21    0.28       0.35    0.86        0.37
MINN with RF   0.62     0.33    0.55       0.49    0.99        0.42

Category       Flowers  Horses  Mountains  Food    Average
IRM            0.406    0.719   0.342      0.34    0.468
CLUE           0.75     0.7     0.28       0.59    0.538
MINN           0.39     0.63    0.29       0.26    0.403
MINN with RF   0.63     0.84    0.47       0.58    0.593

Table 5.1 shows that, without segmentation and using only the color histogram and texture as image features, the MINN without relevance feedback performs slightly worse than the IRM method. With the relevance feedback mechanism, however, the average precision rate of the MINN improves from 40.3% to 59.3%, which exceeds the 46.8% of the IRM method and the 53.8% of the CLUE method. This indicates that the MINN with relevance feedback may match the user's intent better than the CLUE or IRM method. For reference, Figs. 5.7 to 5.16 illustrate the results retrieved by the MINN method for the 10 classes. Only the top 20 results of each class are shown, and the numerals below each image are its rank, index number and similarity score, respectively.

Each figure also reports the number of matches. Even though the match rate is not very good for certain classes, several of the images retrieved for those classes have the same color distribution as the query image and look very similar to it to the human eye.

Figure 5.7: The retrieval results of the 'africa' class: 8 matches out of 20.

Figure 5.8: The retrieval results of the 'beach' class: although there is only 1 match out of 20, several retrieved images with interleaved red and white are similar to the query image.

Figure 5.9: The retrieval results of the 'building' class: 9 matches out of 20. Interestingly, several images showing the legs of an elephant are similar to the query image.

Figure 5.10: The retrieval results of the 'bus' class: 15 matches out of 20. The main reason for the high match rate is that the color of the bus dominates the color histogram of the image.

Figure 5.11: The retrieval results of the 'dinosaur' class: 19 matches out of 20. The main reason for the high match rate is that the color of the background dominates the color histogram of the image.

Figure 5.12: The retrieval results of the 'elephant' class: 8 matches out of 20. Several of the other images still have the same color distribution as the query image.

Figure 5.13: The retrieval results of the 'flowers' class: 6 matches out of 20. Several of the other images still have a color distribution similar to that of the query image.

Figure 5.14: The retrieval results of the 'horses' class: 12 matches out of 20. Interestingly, several images of elephants are similar to the query image of horses.

Figure 5.15: The retrieval results of the 'mountains' class: 3 matches out of 20. Several of the other images still have a color distribution similar to that of the query image.

Figure 5.16: The retrieval results of the 'food' class: 5 matches out of 20. Several of the other images still have a texture distribution similar to that of the query image.
