Visual Keyword Generation - Image Pre-processing and Feature Extraction

5.2 Image Pre-processing and Feature Extraction

5.2.3 Visual Keyword Generation

Similar to the (text) keywords that represent the key information of a document, the visual keyword is proposed to describe the key visual characteristics of an image. In general, an image can be characterized by a few objects, each of which is usually composed of one or a few nearly homogeneous regions. For each of these regions, a set of visual features such as color, texture, and shape can be extracted to represent the region. With these visual features, the visual keyword is defined as follows.

Definition 5.2.1 (Visual Keyword). Given a homogeneous region $i$ in an image, the visual keyword $\omega_i$ is a triple of Gaussian mixture models (GMMs) $\{G^s_i, G^c_i, G^t_i\}$ that formulates the spatial, color, and texture features of the region $i$.

The 2D GMM $G^s_i$ approximates the spatial features (location and shape) of the region $i$ through its means and covariance matrices. The other GMMs, $G^c_i$ and $G^t_i$, formulate the average and variation of the color and texture features over the region $i$ through their means and covariance matrices, respectively.
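As a rough picture of Definition 5.2.1 as a data structure, the sketch below stores a visual keyword as three mixtures, each a list of clusters holding a prior, a mean, and a covariance matrix. The class names (`GaussianCluster`, `GMM`, `VisualKeyword`) and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GaussianCluster:
    """One Gaussian cluster: prior probability, mean vector, covariance matrix."""
    prior: float
    mean: List[float]
    cov: List[List[float]]


@dataclass
class GMM:
    """A Gaussian mixture model stored as a list of clusters."""
    clusters: List[GaussianCluster]

    def priors_sum_to_one(self, tol: float = 1e-9) -> bool:
        # The prior probabilities of a mixture must sum to 1.
        return abs(sum(c.prior for c in self.clusters) - 1.0) <= tol


@dataclass
class VisualKeyword:
    """The triple {G^s_i, G^c_i, G^t_i} for one homogeneous region i."""
    spatial: GMM   # 2D GMM over pixel coordinates
    color: GMM     # GMM over color feature vectors
    texture: GMM   # GMM over texture feature vectors


# Illustrative values only: a region covered by a single elliptic cluster.
keyword = VisualKeyword(
    spatial=GMM([GaussianCluster(1.0, [40.0, 25.0], [[90.0, 10.0], [10.0, 30.0]])]),
    color=GMM([GaussianCluster(1.0, [0.8, 0.8, 0.9],
                               [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]])]),
    texture=GMM([GaussianCluster(1.0, [0.2], [[0.05]])]),
)
```

Keeping the three mixtures as separate fields mirrors the definition; a region with several clusters would simply carry longer cluster lists.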

The visual keyword representation has two key properties: (1) a precise segmentation or skillful sketch of the region is no longer needed, and (2) approximating a region by mixture Gaussian distributions makes the search for similar regions more flexible and robust. An example of the visual keyword is shown in Fig. 5.5. The three visual keywords, shown as the elliptic regions ω1, ω2, and ω3, are created to cover the two sails and the boat body. As we can see, the shape and the location of these regions can be formulated by three 2D mixture Gaussian distributions, respectively.

Figure 5.5: An example of the visual keyword. The visual keywords ω1, ω2 and ω3 are used to represent the sailboat in the image.

A visual keyword is generated to formulate the spatial, color, and texture features of a homogeneous region in two steps: (1) the spatial modelling and (2) the color and texture modelling.

Spatial Modelling

For illustration purposes, we often use an elliptic region to depict a 2D Gaussian distribution. The shape of an elliptic region can be altered by changing the parameters (the mean, the covariance matrix, and the prior probability) of its corresponding 2D Gaussian distribution. Thus, an arbitrarily shaped region can be approximated by the union of several elliptic regions.
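To make the link between a 2D Gaussian and its elliptic region concrete, the following sketch computes the center, semi-axes, and orientation of a contour ellipse from the mean and covariance matrix. The function name `ellipse_from_gaussian` and the choice of the $k$-sigma contour (default $k = 2$) are assumptions made for illustration; the text does not state which contour is drawn.

```python
import math


def ellipse_from_gaussian(mean, cov, k=2.0):
    """Return (center, semi_major, semi_minor, angle) of the k-sigma ellipse.

    cov is a symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]]; angle is the
    direction of the major axis in radians, measured from the x-axis.
    """
    sxx, sxy, syy = cov[0][0], cov[0][1], cov[1][1]
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = 0.5 * (sxx + syy)
    radius = math.hypot(0.5 * (sxx - syy), sxy)
    lam_major = half_trace + radius
    lam_minor = max(half_trace - radius, 0.0)  # guard against rounding below zero
    # Orientation of the eigenvector that belongs to the larger eigenvalue.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return tuple(mean), k * math.sqrt(lam_major), k * math.sqrt(lam_minor), angle


# Changing the covariance changes the shape and orientation of the ellipse.
wide = ellipse_from_gaussian((10.0, 20.0), [[4.0, 0.0], [0.0, 1.0]])
tall = ellipse_from_gaussian((10.0, 20.0), [[1.0, 0.0], [0.0, 4.0]])
```

The mean moves the ellipse, while the eigenvalues and eigenvectors of the covariance matrix set its axis lengths and rotation.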

In the following, we will present the methods and procedures of adjusting the parameter values of a 2D mixture Gaussian distribution.

For a given homogeneous region $A_i = \{x(l) : l = 1, 2, \ldots, L\}$ and its corresponding Homogeneous Region Array (HRA) $H_i$, where $L$ is the number of pixels in $A_i$, suppose a 2D mixture Gaussian distribution $p_s(x(l) \mid \omega_i)$ formulates the spatial feature of $A_i$. Define $p_s(x(l) \mid \theta_{s,r_i}, \omega_i)$ as a Gaussian cluster of $p_s(x(l) \mid \omega_i)$, i.e.,

$$p_s(x(l) \mid \omega_i) = \sum_{r_i=1}^{N_i} P_s(\theta_{s,r_i} \mid \omega_i)\, p_s(x(l) \mid \theta_{s,r_i}, \omega_i), \tag{5.2.5}$$

where $\theta_{s,r_i}$ represents the parameter set $\{\mu_{s,r_i}, \Sigma_{s,r_i}\}$, and $P_s(\theta_{s,r_i} \mid \omega_i)$ denotes the prior probability of the cluster $r_i$. By definition, $\sum_{r_i=1}^{N_i} P_s(\theta_{s,r_i} \mid \omega_i) = 1$, where $N_i$ is the number of clusters in $p_s(x(l) \mid \omega_i)$. Suppose the cluster $r_i$ is a 2D Gaussian distribution:

$$p_s(x(l) \mid \theta_{s,r_i}, \omega_i) = \frac{\exp\left\{-\frac{1}{2}\,(x(l) - \mu_{s,r_i})^T\, \Sigma_{s,r_i}^{-1}\, (x(l) - \mu_{s,r_i})\right\}}{2\pi\,|\Sigma_{s,r_i}|^{1/2}}. \tag{5.2.6}$$
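The two formulas above can be evaluated directly. The sketch below is a minimal, standard-library version of the cluster density (5.2.6) and the mixture (5.2.5); the function names `gaussian_2d` and `mixture_2d` and the sample parameters are hypothetical.

```python
import math


def gaussian_2d(x, mean, cov):
    """2D Gaussian density of (5.2.6) at the pixel position x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), using the 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def mixture_2d(x, priors, means, covs):
    """Mixture density of (5.2.5): prior-weighted sum of the cluster densities."""
    return sum(P * gaussian_2d(x, m, S) for P, m, S in zip(priors, means, covs))


# Illustrative two-cluster spatial model (values are made up).
priors = [0.3, 0.7]
means = [(0.0, 0.0), (5.0, 5.0)]
covs = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.5], [0.5, 1.0]]]
density_at_origin = mixture_2d((0.0, 0.0), priors, means, covs)
```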

The dissimilarity between the HRA $H_i$ and the 2D mixture Gaussian distribution $p_s(x(l) \mid \omega_i)$ can be measured by the cross-entropy function

$$E = -\sum_{l=1}^{L} H_i(x(l)) \ln p_s(x(l) \mid \omega_i), \tag{5.2.7}$$

which is regarded as an error function between the region $A_i$ and the visual keyword $\omega_i$. By applying the EM algorithm, (5.2.7) is minimized by updating the parameters of the 2D mixture Gaussian distribution at each epoch $j$.
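The update equations themselves do not appear in the text here, so the sketch below should be read as an assumption rather than the author's formulas: it uses the standard EM updates for a Gaussian mixture in which each pixel is weighted by its HRA value $H_i(x(l))$, treated as a nonnegative weight. All function names and sample values are hypothetical.

```python
import math


def gauss2d(x, mu, cov):
    """2D Gaussian density of (5.2.6)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def cross_entropy(pixels, hra, priors, mus, covs):
    """Error function (5.2.7); hra[l] plays the role of H_i(x(l))."""
    total = 0.0
    for x, h in zip(pixels, hra):
        mix = sum(P * gauss2d(x, m, S) for P, m, S in zip(priors, mus, covs))
        total -= h * math.log(mix)
    return total


def em_epoch(pixels, hra, priors, mus, covs):
    """One weighted EM epoch (assumed standard form, not taken from the source)."""
    # E-step: posterior probability of each cluster for each pixel.
    resp = []
    for x in pixels:
        joint = [P * gauss2d(x, m, S) for P, m, S in zip(priors, mus, covs)]
        norm = sum(joint)
        resp.append([j / norm for j in joint])
    # M-step: re-estimate each cluster from HRA-weighted responsibilities.
    total_weight = sum(hra)
    new_priors, new_mus, new_covs = [], [], []
    for k in range(len(priors)):
        w = [h * r[k] for h, r in zip(hra, resp)]
        wk = sum(w)
        mx = sum(wi * x[0] for wi, x in zip(w, pixels)) / wk
        my = sum(wi * x[1] for wi, x in zip(w, pixels)) / wk
        sxx = sum(wi * (x[0] - mx) ** 2 for wi, x in zip(w, pixels)) / wk
        syy = sum(wi * (x[1] - my) ** 2 for wi, x in zip(w, pixels)) / wk
        sxy = sum(wi * (x[0] - mx) * (x[1] - my) for wi, x in zip(w, pixels)) / wk
        new_priors.append(wk / total_weight)
        new_mus.append((mx, my))
        new_covs.append([[sxx, sxy], [sxy, syy]])
    return new_priors, new_mus, new_covs


# Toy region made of two pixel groups, with every HRA value set to 1.
pixels = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (11, 11)]
hra = [1.0] * len(pixels)
params = ([0.5, 0.5], [(0.0, 1.0), (9.0, 10.0)],
          [[[4.0, 0.0], [0.0, 4.0]], [[4.0, 0.0], [0.0, 4.0]]])
e_before = cross_entropy(pixels, hra, *params)
params = em_epoch(pixels, hra, *params)
e_after = cross_entropy(pixels, hra, *params)
```

Under these updates the error of (5.2.7) does not increase from one epoch to the next, which is what makes a threshold on it a usable stopping rule. A practical version would also guard against clusters whose covariance collapses to a singular matrix.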

The EM iteration continues until (5.2.7) becomes less than a given threshold. Fig. 5.6 illustrates the spatial modelling of a sailboat. The original image with a reference point (the black dot) is depicted in Fig. 5.6(a), and the corresponding homogeneous region and its spatial model, a 2D GMM composed of two Gaussian clusters, are depicted in Fig. 5.6(b) and (c), respectively.

Figure 5.6: The spatial feature modelling of a sail region. (a) The original image with a reference point shown as a black dot on a sail. (b) The corresponding homogeneous region of the reference point. (c) The modelling results of a 2D mixture Gaussian approximation on the sail region. The pictures in (b) and (c) are shown as gray-level images.

Color and Texture Modelling

After the spatial modelling is done, the homogeneous region $A_i$ is approximated by the 2D mixture Gaussian distribution $p_s(x(l) \mid \omega_i)$. Suppose $p_s(x(l) \mid \omega_i)$ consists of $N_i$ Gaussian clusters; then $A_i$ can be divided into $N_i$ elliptic regions, $\{a_{r_1}, a_{r_2}, \ldots, a_{r_{N_i}}\}$, each of which corresponds to a Gaussian cluster in $p_s(x(l) \mid \omega_i)$. For each elliptic region, its color and texture features are modelled by one of the Gaussian clusters in $G^c_i$ and $G^t_i$, respectively. In the following, only the color modelling is illustrated. The notations and formulas for texture modelling can be obtained by replacing the subscript $c$ with $t$.
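The text does not spell out how the pixels of the region are assigned to the elliptic regions. One plausible rule, assumed here purely for illustration, is to give each pixel to the spatial cluster with the largest prior-weighted density at that pixel. The function name `partition_region` and the sample parameters are hypothetical.

```python
import math


def gauss2d(x, mu, cov):
    """2D Gaussian density of a spatial cluster."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def partition_region(pixels, priors, mus, covs):
    """Split the region into one pixel list per spatial cluster.

    Assumed rule: a pixel joins the cluster whose prior-weighted density is
    largest at that pixel (ties go to the lower cluster index).
    """
    regions = [[] for _ in priors]
    for x in pixels:
        scores = [P * gauss2d(x, m, S) for P, m, S in zip(priors, mus, covs)]
        regions[scores.index(max(scores))].append(x)
    return regions


# Toy region with two well-separated pixel groups and a two-cluster model.
pixels = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (11, 11)]
regions = partition_region(
    pixels,
    priors=[0.5, 0.5],
    mus=[(0.5, 0.5), (10.5, 10.5)],
    covs=[[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]],
)
```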

Suppose a Gaussian distribution $p_c(c_{x(l)} \mid \theta_{c,r_i}, \omega_i)$ is used to approximate the color (texture) feature distribution in an elliptic region $a_{r_i}$. Let $c_{x(l)}$ be a $D_c$-dimensional color (texture) feature vector at a pixel $x(l)$; then the visual keyword $\omega_i$ models the color (texture) with a GMM $p_c(c_{x(l)} \mid \omega_i)$ composed of $N_i$ Gaussian clusters:

$$p_c(c_{x(l)} \mid \omega_i) = \sum_{r_i=1}^{N_i} P_c(\theta_{c,r_i} \mid \omega_i)\, p_c(c_{x(l)} \mid \theta_{c,r_i}, \omega_i), \tag{5.2.8}$$

where $\theta_{c,r_i}$ represents the parameter set $\{\mu_{c,r_i}, \Sigma_{c,r_i}\}$, and $P_c(\theta_{c,r_i} \mid \omega_i)$ denotes the prior probability of the cluster $r_i$. Suppose $p_c(c_{x(l)} \mid \theta_{c,r_i}, \omega_i)$ is a Gaussian distribution, and let $N(a_{r_i})$ denote the number of pixels in the elliptic region $a_{r_i}$.
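The expressions that define this Gaussian are not shown here. A common choice, assumed for this sketch and not confirmed by the text, is to take the sample mean and the sample covariance of the feature vectors over the pixels of the elliptic region, normalized by the pixel count $N(a_{r_i})$. The function name `region_gaussian` and the feature values are hypothetical.

```python
def region_gaussian(features):
    """Sample mean and covariance of the feature vectors in one elliptic region.

    features holds one D-dimensional vector per pixel of the region, so
    len(features) plays the role of N(a_ri). Dividing by N (rather than
    N - 1) is an assumption of this sketch.
    """
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    cov = [
        [sum((f[p] - mean[p]) * (f[q] - mean[q]) for f in features) / n
         for q in range(dim)]
        for p in range(dim)
    ]
    return mean, cov


# Illustrative RGB-like color vectors for the pixels of one elliptic region.
colors = [(0.80, 0.82, 0.90), (0.78, 0.80, 0.92), (0.82, 0.81, 0.88), (0.80, 0.79, 0.90)]
mu_c, sigma_c = region_gaussian(colors)
```

The same function applies unchanged to texture feature vectors, in line with the remark that the texture formulas follow by replacing the subscript.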

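After modelling the spatial, color, and texture features of an elliptic region $a_{r_i}$, the distributions $p_s(x(l) \mid \theta_{s,r_i}, \omega_i)$, $p_c(c_{x(l)} \mid \theta_{c,r_i}, \omega_i)$, and $p_t(t_{x(l)} \mid \theta_{t,r_i}, \omega_i)$ can be merged into a Gaussian cluster as

$$p(z(l) \mid \theta_{r_i}, \omega_i) = p_s(x(l) \mid \theta_{s,r_i}, \omega_i)\, p_c(c_{x(l)} \mid \theta_{c,r_i}, \omega_i)\, p_t(t_{x(l)} \mid \theta_{t,r_i}, \omega_i), \tag{5.2.12}$$

where $z(l) = (x(l), c_{x(l)}, t_{x(l)})^T$ and $\theta_{r_i} = (\theta_{s,r_i}, \theta_{c,r_i}, \theta_{t,r_i})$. Then, for the homogeneous region $A_i$, the visual keyword $\omega_i$ formulates its spatial, color, and texture features by a unified GMM:

$$p(z(l) \mid \omega_i) = \sum_{r_i=1}^{N_i} P(\theta_{r_i} \mid \omega_i)\, p(z(l) \mid \theta_{r_i}, \omega_i),$$

where $P(\theta_{r_i} \mid \omega_i)$ denotes the prior probability of the cluster $r_i$.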
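The product in (5.2.12) is itself a Gaussian in $z(l)$: multiplying independent Gaussian factors is the same as using one Gaussian whose mean concatenates the three means and whose covariance is block-diagonal. The sketch below checks this numerically in log space. The helper names (`log_gauss`, `block_diag`) and all parameter values are hypothetical.

```python
import math


def log_gauss(v, mu, cov):
    """Log-density of a multivariate Gaussian via a Cholesky factorization."""
    n = len(v)
    low = [[0.0] * n for _ in range(n)]  # lower-triangular factor, cov = low low^T
    for i in range(n):
        for j in range(i + 1):
            s = cov[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    # Forward substitution: solve low y = (v - mu).
    diff = [a - b for a, b in zip(v, mu)]
    y = []
    for i in range(n):
        y.append((diff[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (sum(t * t for t in y) + log_det + n * math.log(2.0 * math.pi))


def block_diag(*blocks):
    """Place square matrices along the diagonal of a larger zero matrix."""
    n = sum(len(b) for b in blocks)
    out = [[0.0] * n for _ in range(n)]
    off = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, val in enumerate(row):
                out[off + i][off + j] = val
        off += len(b)
    return out


# Made-up spatial (2D), color (3D), and texture (1D) parameters of one cluster.
x, mu_s, cov_s = (3.0, 4.0), (2.0, 5.0), [[4.0, 1.0], [1.0, 3.0]]
c, mu_c, cov_c = (0.5, 0.6, 0.7), (0.4, 0.6, 0.8), [[0.02, 0.005, 0.0],
                                                    [0.005, 0.03, 0.0],
                                                    [0.0, 0.0, 0.01]]
t, mu_t, cov_t = (1.2,), (1.0,), [[0.25]]

# Log of the product in (5.2.12): the three log-densities add up.
log_merged = log_gauss(x, mu_s, cov_s) + log_gauss(c, mu_c, cov_c) + log_gauss(t, mu_t, cov_t)
# The same value from a single Gaussian over z = (x, c, t).
log_joint = log_gauss(x + c + t, mu_s + mu_c + mu_t, block_diag(cov_s, cov_c, cov_t))
```

Working with log-densities avoids numerical underflow when the feature dimension grows, at the cost of summing instead of multiplying.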

Since a visual keyword takes the form of a mixture Gaussian distribution, its difference from another visual keyword can be measured by (3.1.9), which is described in Section 3.1. Where the spatial relations among regions are concerned, the visual string is generated, as presented in the following section.

From the document 複合式高斯類神經網路之研究 (A Study of Compound Gaussian Neural Networks), pages 66-72