
Multiple Instance Learning Methods

3.3 The Multiple-Instance Neural Network

The Multiple-Instance Neural Network (MINN) is a probabilistic variant of the decision-based modular neural network [50] for classification. Each subnet of the MINN is designed to represent one class of a multiple-instance problem. Given positive and negative example images of a concept supplied by the user, the MINN updates the parameters of the subnets to formulate the concept. The updating rule applies reinforced learning to the subnet corresponding to the positive images and anti-reinforced learning to the subnet corresponding to the negative images.

3.3.1 The Discriminant Functions of MINN

Consider an image I and a set of i.i.d. patterns X = {x(t); t = 1, 2, · · · , N}, where each element x(t) is a feature vector extracted from one instance in the image I. Assume that the density of a concept class ω_i is a linear combination of the component densities p(x(t)|ω_i, Θ_r_i), in the form

$$p(x(t)\mid\omega_i) = \sum_{r_i=1}^{R_i} P_{r_i}\, p(x(t)\mid\omega_i,\Theta_{r_i}), \tag{3.7}$$

where P_r_i is the prior probability of cluster r_i, and Θ_r_i represents the parameter set {µ_r_i, Σ_r_i}. By definition, Σ_{r_i=1}^{R_i} P_r_i = 1, where R_i is the number of clusters in ω_i. The discriminant function of the MINN is defined as:

φ(X, Ω_i, k) = ln(− ln( … (3.8)

The parameter k is a user-determined parameter called the bounded number, which is the number of instances to be related to the class ω_i. If k is set to one, only the instance nearest to the concept class ω_i is considered. If k equals the total number of instances in the given image, all instances in the image are considered. A smaller value of φ(X, Ω_i, k) means that the given image I is more similar to the class ω_i.

In the most general formulation, the density function p(x(t)|ω_i, Θ_r_i) in (3.7) should be approximated by a distribution with a full-rank covariance matrix. However, for applications that deal with high-dimensional data but only a finite number of training patterns, considerations of training performance and storage space discourage such full-matrix modelling. A natural simplifying assumption is that the features are uncorrelated and of unequal importance. That is, suppose that p(x(t)|ω_i, Θ_r_i) is a D-dimensional distribution with uncorrelated features. To ensure that the negative log-likelihood in Eq. (3.8) is positive, the component density function is defined as

$$p(x(t)\mid\omega_i,\Theta_{r_i}) = \exp\left\{-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d(t)-\mu_{r_i d})^2}{\sigma_{r_i d}^2}\right\}. \tag{3.9}$$

Figure 3.2: The schematic diagram of the proposed Multiple-Instance Neural Network

As shown in Fig. 3.2, the MINN contains M subnets, which are used to represent an M-category multiple-instance learning problem. Inside each subnet, an elliptic basis function (EBF) serves as the basis function for each cluster r_i:

$$\phi(x(t),\omega_i,\Theta_{r_i}) = -\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d(t)-\mu_{r_i d})^2}{\sigma_{r_i d}^2}. \tag{3.10}$$

After passing through an exponential activation function, exp{ϕ(x(t), ω_i, Θ_r_i)} can be viewed as the distribution described in Eq. (3.9).
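As a rough illustration of Eq. (3.10) and its exponential activation, the following sketch evaluates the EBF for a single cluster with NumPy. The function names `ebf` and `component_density` and the sample values are assumptions made for this example, not part of the original system.

```python
import numpy as np

def ebf(x, mu, sigma2):
    """Elliptic basis function of Eq. (3.10) for one cluster.

    x, mu and sigma2 are length-D arrays: the feature vector, the cluster
    mean, and the per-dimension variances (uncorrelated features).
    """
    x, mu, sigma2 = (np.asarray(a, dtype=float) for a in (x, mu, sigma2))
    return -0.5 * np.sum((x - mu) ** 2 / sigma2)

def component_density(x, mu, sigma2):
    """Exponential activation of the EBF; the value never exceeds 1."""
    return np.exp(ebf(x, mu, sigma2))

# Hypothetical two-dimensional cluster used only for illustration.
mu = np.array([0.2, 0.8])
sigma2 = np.array([0.04, 0.04])
at_mean = component_density(mu, mu, sigma2)              # equals 1 at the mean
one_sigma = component_density([0.4, 0.8], mu, sigma2)    # about exp(-0.5)
print(at_mean, one_sigma)
```

Because the activation is not normalized, its value is at most 1, so its negative logarithm is non-negative, which matches the motivation given for the component density.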

3.3.2 The Energy Function of MINN

In multiple-instance learning problems, the best-matched class falls within a region that is near the intersection of the positive images and far away from the negative images. In other words, the best-matched class falls within the area that contains most instances of the positive images and very few instances of the negative images.

If there is only one image in X_i, the energy function reduces to (3.8), where the bounded number k of Eq. (3.8) is set to the number of instances in the image.

In order to test how the MINN and the proposed energy function (3.11) deal with multiple-instance learning problems, we generated the following data set. There are two classes in the data set. For each class, ten bags are generated, each with 100 instances. The concepts of class 1 and class 2 are Gaussian distributions with the same variance, 0.04. The mean of class 1 is (0.2, 0.8), and the mean of class 2 is (0.8, 0.2) in the feature space. In each class, five bags are labeled positive and the rest are labeled negative. In each positive bag, 20 instances are drawn at random from the distribution of the concept class and 80 instances are drawn uniformly at random. In each negative bag, 100 instances are drawn uniformly at random such that none of them falls within the designated concept class. The distribution of the instances from each bag is shown in Figure 3.3, where the instances from the positive and negative bags are denoted by ‘+’ and ‘x’, respectively. The trajectories of the predicted positions of the class concepts during the training phase of the MINN are also shown in Fig. 3.3.
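The data set described above could be generated along the following lines. This is a minimal sketch: the unit-square sampling domain, the isotropic Gaussian, and the rejection radius of two standard deviations used to decide whether a point "falls within" the concept are assumptions of this example, since the text does not state them.

```python
import numpy as np

STD = 0.2  # square root of the stated variance 0.04

def make_positive_bag(mean, rng, n_concept=20, n_noise=80):
    """20 instances from the concept Gaussian plus 80 uniform instances."""
    concept = rng.normal(loc=mean, scale=STD, size=(n_concept, 2))
    noise = rng.uniform(0.0, 1.0, size=(n_noise, 2))  # ASSUMPTION: unit square
    return np.vstack([concept, noise])

def make_negative_bag(mean, rng, n=100, radius=2 * STD):
    """Uniform instances, rejecting any that fall inside the concept region.

    ASSUMPTION: the concept region is a disc of `radius` around the mean.
    """
    kept = []
    while len(kept) < n:
        p = rng.uniform(0.0, 1.0, size=2)
        if np.linalg.norm(p - mean) > radius:
            kept.append(p)
    return np.array(kept)

rng = np.random.default_rng(0)
mean1 = np.array([0.2, 0.8])  # concept mean of class 1
positive_bags = [make_positive_bag(mean1, rng) for _ in range(5)]
negative_bags = [make_negative_bag(mean1, rng) for _ in range(5)]
```

Class 2 is obtained in the same way with the mean (0.8, 0.2).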

The change in the energy during the training phase is shown in Fig. 3.4. The energy function decreased after each iteration in the training phase of the MINN.


Figure 3.3: The artificial data set (a) Class 1, and (b) Class 2. The instances from the positive and negative bags are denoted as ‘+’ and ‘x’ respectively. The trajectory of the predicted points of the class concepts during the training phase are denoted as ‘.’.


Figure 3.4: The class energy decreased monotonically during the training phase of the MINN. The bold line denotes class 1 and the dotted line denotes class 2.

3.3.3 The Training Phase

It can be seen that when E(X_i, Ω_i) is minimized with respect to Ω_i, the class ω_i is located where the instances of the images are densest. Conversely, when E(X_i, Ω_i) is maximized with respect to Ω_i, the class ω_i is located where there are very few instances of the images.

Given a set of positive training images X⁺_i and a set of negative training images X⁻_i of the corresponding class, the following reinforced and anti-reinforced learning rules are applied to the corresponding subnet.

Reinforced Learning:

Ω^(m+1)_i = Ω^(m)_i − η∇E(X⁺_i , Ω_i), (3.12)

Anti-reinforced Learning:

Ω^(m+1)_i = Ω^(m)_i + η∇E(X⁻_i , Ω_i). (3.13)
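The two rules differ only in the sign of the gradient step, which a short sketch can make concrete. The energy used here is a toy quadratic stand-in chosen for illustration, not the energy function of the MINN, and the names `reinforced_step`, `antireinforced_step`, and `toy_grad` are hypothetical.

```python
import numpy as np

def reinforced_step(omega, grad_E, X_pos, eta=0.5):
    """Eq. (3.12): move down the energy gradient of the positive images."""
    return omega - eta * grad_E(X_pos, omega)

def antireinforced_step(omega, grad_E, X_neg, eta=0.5):
    """Eq. (3.13): move up the energy gradient of the negative images."""
    return omega + eta * grad_E(X_neg, omega)

def toy_grad(X, omega):
    """Gradient of the stand-in energy 0.5 * mean ||x - omega||^2.

    ASSUMPTION: this replaces the MINN energy purely for demonstration.
    """
    return omega - X.mean(axis=0)

omega = np.array([0.5, 0.5])
X_pos = np.array([[0.2, 0.8], [0.3, 0.7]])
X_neg = np.array([[0.9, 0.1], [0.8, 0.2]])
omega = reinforced_step(omega, toy_grad, X_pos)      # pulled toward the positives
omega = antireinforced_step(omega, toy_grad, X_neg)  # pushed away from the negatives
print(omega)
```

With this toy energy, the first step moves the parameter toward the mean of the positive instances and the second step moves it away from the mean of the negative instances.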

In (3.12) and (3.13), η is a user-defined learning rate with 0 < η ≤ 1; we set η = 0.5 in this dissertation. The gradient vectors ∇E are computed as follows:

∂E …

… is the posterior probability of the cluster r_i in concept class ω_i given x_ib(t).

As for the conditional prior probability P_r_i, since the EM algorithm automatically satisfies the probabilistic constraints Σ_{r_i=1}^{R_i} P_r_i = 1 and P_r_i ≥ 0, it is applied to update the …

Threshold Updating: The threshold value of the MINN can also be learned by the reinforced and anti-reinforced learning rules. For example, given a class ω_i, if a positive image of ω_i is misclassified, the threshold T_i needs to be increased because T_i is so small that this positive image is rejected. On the other hand, if a negative image is misclassified, T_i should be reduced because T_i is so large that this negative image is accepted. An adaptive learning rule for training the threshold T_i is proposed as follows:

T_i^(m+1) = T_i^(m)+ γ∆T_i^(m), (3.18)

where γ is a user-defined learning rate with 0 < γ ≤ 1, which can be set to γ = 0.5. ∆T_i^(m) is defined as Ep^(m) − En^(m), where Ep^(m) is the misclassification rate of the positive images in the m-th iteration and En^(m) is the misclassification rate of the negative images in the m-th iteration.
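A single application of the threshold rule (3.18) can be sketched as follows. The function name `update_threshold` and the error rates in the usage lines are illustrative assumptions.

```python
def update_threshold(T, err_pos, err_neg, gamma=0.5):
    """One step of Eq. (3.18): T <- T + gamma * (Ep - En).

    err_pos and err_neg are the misclassification rates of the positive
    and negative images in the current iteration.
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return T + gamma * (err_pos - err_neg)

# Positives are rejected more often than negatives are accepted, so T grows.
T_up = update_threshold(1.0, err_pos=0.4, err_neg=0.1)
# The opposite situation lowers the threshold.
T_down = update_threshold(1.0, err_pos=0.1, err_neg=0.4)
print(T_up, T_down)
```

When the two error rates are equal, the threshold is left unchanged, so the rule settles where positive and negative errors balance.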

3.3.4 The Testing Phase

As shown in Fig. 3.2, an unlabeled image is input to the MINN, which then labels it with one or several matched classes. First, the unlabeled image is applied to all subnets in the MINN. In each subnet, the discriminant function (Eq. 3.8) is computed. The result is then compared with the threshold T_i. Finally, the i-th element of the result vector V is set to 1 if the value of the discriminant function is smaller than T_i, which implies that the given image belongs to concept class i. Otherwise, if the value is larger than T_i, the i-th element of V is set to 0, which implies that the given image does not belong to concept class i. From the output vector, one can read off which concept classes the given image belongs to.
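The thresholding step of the testing phase amounts to an element-wise comparison. The sketch below assumes that the discriminant values of all subnets have already been computed and are passed in as an array; the name `label_image` and the numbers are made up for illustration. The text does not say how equality with the threshold is handled, so this example treats it as a rejection.

```python
import numpy as np

def label_image(phi, thresholds):
    """Return the 0/1 result vector V with V[i] = 1 when phi[i] < T[i]."""
    phi = np.asarray(phi, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    return (phi < thresholds).astype(int)

# Hypothetical discriminant values and thresholds for three subnets.
V = label_image(phi=[0.8, 2.5, 1.1], thresholds=[1.0, 2.0, 1.5])
matched = np.flatnonzero(V)  # indices of the matched concept classes
print(V, matched)            # [1 0 1] [0 2]
```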

3.3.5 Image Indexing with Histogram Approximation

Suppose images are digitized as 24-bit RGB, that is, 8 bits (256 linear levels of brightness) for each of the red, green, and blue components. To better match human visual perception, we calculate the histogram of each component in the Lab color space. The three histograms corresponding to the three color components L, a, and b of the masked image are then combined into a histogram vector H = [h₀, h₁, · · · , h_{256∗3−1}]^T. The n-th element of H is evaluated as

$$H(n) = \sum_{\{(x,y)\,:\,I_c(x,y)\,=\,n \bmod 256\}} G_m(x,y), \tag{3.19}$$

where G_m(x, y) is the Gaussian-like mask image, which will be defined in (5.2), I_c(x, y) is the intensity value of the c-th color channel at position (x, y) in the masked image, and c is the quotient of n divided by 256.
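A mask-weighted histogram of this kind can be computed with `np.bincount`. The sketch below assumes that the Lab channels have already been quantized to integers in 0..255 and that the mask is supplied as an array of non-negative weights; the mask definition in (5.2) is not reproduced here, and the function name `masked_histogram` is hypothetical.

```python
import numpy as np

def masked_histogram(lab_image, mask):
    """Histogram vector of Eq. (3.19).

    lab_image: (rows, cols, 3) integer array, each channel in 0..255
               (ASSUMPTION: Lab values already scaled to 256 levels).
    mask:      (rows, cols) non-negative weights G_m(x, y).
    Returns a length-768 vector; entry n belongs to channel n // 256 and
    intensity n % 256.
    """
    lab_image = np.asarray(lab_image)
    weights = np.asarray(mask, dtype=float).ravel()
    parts = [
        np.bincount(lab_image[..., c].ravel(), weights=weights, minlength=256)
        for c in range(3)
    ]
    return np.concatenate(parts)

# Tiny 2x2 example: one pixel has values (10, 128, 255), the rest are zero.
img = np.zeros((2, 2, 3), dtype=np.int64)
img[0, 0] = [10, 128, 255]
mask = np.array([[0.5, 1.0], [1.0, 1.0]])
H = masked_histogram(img, mask)
print(H.shape, H[10], H[256 + 128], H[512 + 255])
```

Each pixel contributes its mask weight rather than a count of one, so pixels that the mask de-emphasizes have less influence on the feature.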

Instead of using color histograms directly as image features, the proposed system uses the parameters of mixture density functions that approximate the color histograms of each image, so as to reduce the feature dimension and speed up the query response time. In information theory, the cross entropy [51] between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, and the cross entropy of two distributions over the same probability space can be used to measure their similarity.

Define p(t|Θ_r) as a one-dimensional Gaussian distribution, where Θ_r represents the parameter set {µ_r, σ_r²} for a cluster r, and µ_r and σ_r² are the mean and the variance of a …

Let p(t|Θ_r) be one of the Gaussian distributions that comprise P(t), and let P_r denote the prior probability of cluster r. Then P(t) is a Gaussian mixture distribution, which can be expressed as

$$P(t) = \sum_{r=1}^{R} P_r\, p(t\mid\Theta_r), \tag{3.21}$$

where R is the number of clusters in P(t). By definition, Σ_{r=1}^{R} P_r = 1. We can take P_r = 1/R and set σ_r² = 0.05.

Since the color histogram of an image can be regarded as a one-dimensional vector, given a color histogram H(n), where 0 ≤ n ≤ N, and a mixture of Gaussian distributions P(t), the similarity between H(n) and P(t) is measured by their cross-entropy

$$-\sum_{n=0}^{N} H(n)\ln P(n). \tag{3.22}$$
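To make Eqs. (3.21) and (3.22) concrete, the sketch below evaluates a Gaussian mixture on the histogram bins and computes the cross-entropy. The normalized Gaussian density, the small constant `eps` that guards the logarithm, and the example histogram are assumptions of this illustration.

```python
import numpy as np

def gaussian(t, mu, var):
    """Normalized one-dimensional Gaussian density (ASSUMED component form)."""
    return np.exp(-0.5 * (t - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mixture(t, priors, mus, variances):
    """P(t) of Eq. (3.21), evaluated for every entry of the array t."""
    priors, mus, variances = (np.asarray(a, dtype=float)
                              for a in (priors, mus, variances))
    t = np.asarray(t, dtype=float)[:, None]      # one row per bin
    return (priors * gaussian(t, mus, variances)).sum(axis=1)

def cross_entropy(H, P, eps=1e-300):
    """Eq. (3.22): -sum_n H(n) ln P(n); eps avoids taking ln(0)."""
    return -np.sum(H * np.log(np.maximum(P, eps)))

n = np.arange(256)
H = gaussian(n, 100.0, 25.0)
H = H / H.sum()                                   # hypothetical unit-mass histogram
ce_good = cross_entropy(H, mixture(n, [1.0], [100.0], [25.0]))
ce_bad = cross_entropy(H, mixture(n, [1.0], [150.0], [25.0]))
print(ce_good < ce_bad)                           # True: the matched model scores lower
```

A mixture whose mass sits where the histogram has mass yields the lower cross-entropy, which is why minimizing (3.22) drives P(t) toward H(n).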

Cross-entropy minimization is frequently used in optimization. In Eq. (3.22), the cross-entropy attains its minimum value when P(n) equals H(n) for n = 0, 1, · · · , N. Hence, the EM algorithm is applied to adjust the parameters Θ_r and P_r of each cluster r in P(t) to minimize Eq. (3.22), so that P(t) approximates H(n). The updating equations for the parameters of cluster r of the mixture model P(t) are

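$$\mu_r^{new} = \frac{\sum_{n=0}^{N} H(n)\,p^{old}(\Theta_r\mid n)\,n}{\sum_{n=0}^{N} H(n)\,p^{old}(\Theta_r\mid n)}, \tag{3.23}$$

$$(\sigma_r^{new})^2 = \frac{\sum_{n=0}^{N} H(n)\,p^{old}(\Theta_r\mid n)\,(n-\mu_r^{new})^2}{\sum_{n=0}^{N} H(n)\,p^{old}(\Theta_r\mid n)}, \tag{3.24}$$

$$P_r^{new} = \frac{\sum_{n=0}^{N} H(n)\,p^{old}(\Theta_r\mid n)}{\sum_{n=0}^{N} H(n)}, \tag{3.25}$$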

where p(Θ_r|n) is the posterior probability of cluster r given n, which is defined as

$$p(\Theta_r\mid n) = \frac{P_r\,p(n\mid\Theta_r)}{\sum_{r'=1}^{R} P_{r'}\,p(n\mid\Theta_{r'})}. \tag{3.26}$$

Each EM iteration consists of two steps: the Expectation (E) step and the Maximization (M) step. The M step maximizes a likelihood function that is refined in each iteration by the E step. In each iteration, we first compute the posterior probabilities of the clusters using Eq. (3.26) in the E step, and then calculate the new parameters of the model using Eqs. (3.23), (3.24), and (3.25) in the M step.
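One EM iteration over a histogram can be sketched as below. This is an illustration under stated assumptions rather than the system's implementation: the components are taken to be normalized Gaussians, the posterior is computed as the prior-weighted likelihood normalized over the clusters, and the variance floor and small constants are numerical guards added for this example. The bimodal histogram and the initial values are made up.

```python
import numpy as np

def em_step(H, priors, mus, variances):
    """One EM iteration fitting a Gaussian mixture to the histogram H(n)."""
    n = np.arange(len(H), dtype=float)[:, None]            # shape (N+1, 1)
    # E step, Eq. (3.26): posterior of each cluster for every bin n.
    lik = np.exp(-0.5 * (n - mus) ** 2 / variances) / np.sqrt(2 * np.pi * variances)
    joint = priors * lik
    post = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-300)
    # M step, Eqs. (3.23)-(3.25): histogram-weighted re-estimation.
    w = H[:, None] * post                                   # H(n) p(Theta_r | n)
    mass = np.maximum(w.sum(axis=0), 1e-300)
    new_mus = (w * n).sum(axis=0) / mass
    new_vars = (w * (n - new_mus) ** 2).sum(axis=0) / mass
    new_vars = np.maximum(new_vars, 1e-6)                   # guard, not in the text
    new_priors = w.sum(axis=0) / H.sum()
    return new_priors, new_mus, new_vars

# Hypothetical bimodal histogram with modes near bins 60 and 180.
n = np.arange(256)
H = np.exp(-0.5 * (n - 60) ** 2 / 100) + np.exp(-0.5 * (n - 180) ** 2 / 100)
priors = np.array([0.5, 0.5])
mus = np.array([40.0, 200.0])
variances = np.array([400.0, 400.0])
for _ in range(50):
    priors, mus, variances = em_step(H, priors, mus, variances)
print(priors, mus, variances)  # means should settle near the two modes
```

After fitting, each color channel is summarized by 3R numbers (priors, means, and variances) instead of 256 histogram bins, which is the dimensionality reduction the section aims for.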

Chapter 4

From the document: A Study of Image Retrieval Using Multiple-Instance Neural Networks (pages 29-38)