Multiple Instance Learning Methods
3.3 The Multiple-Instance Neural Network
The Multiple-Instance Neural Network (MINN) is a probabilistic variant of the decision-based modular neural network [50] for classification. Each subnet of the MINN is designed to represent one class of a multiple-instance problem. Given positive and negative example images of a concept from the user, the MINN updates the parameters of the corresponding subnet to formulate the concept. The updating rule applies reinforced learning to the subnet for the positive images and anti-reinforced learning to the subnet for the negative images.
3.3.1 The Discriminant Functions of MINN
Consider an image I and a set of i.i.d. patterns X = {x(t); t = 1, 2, · · · , N}, where each element x(t) is a feature vector extracted from one instance in the image I. The density of a concept class ωi is assumed to be a linear combination of component densities p(x(t)|ωi, Θri) in the form
\[ p(x(t)|\omega_i) = \sum_{r_i=1}^{R_i} P_{r_i}\, p(x(t)|\omega_i, \Theta_{r_i}), \tag{3.7} \]
where Pri is the prior probability of a cluster ri, and Θri represents the parameter set {µri, Σri}. By definition, $\sum_{r_i=1}^{R_i} P_{r_i} = 1$, where Ri is the number of clusters in ωi. The discriminant function of the MINN model is defined as:
φ(X, Ωi, k) = ln(− ln(
The parameter k is a user-determined parameter called the bounded number, which is the number of instances to be related to the class ωi. If k is set to one, only the instance nearest to the concept class ωi is considered. If k equals the total number of instances in the given image, all instances in the image are considered. A smaller value of φ(X, Ωi, k) means that the given image I is more similar to the class ωi.
In the most general formulation, the density function p(x(t)|ωi, Θri) in (3.7) should be approximated by a distribution with a full-rank covariance matrix. However, for applications that deal with high-dimensional data but only a finite number of training patterns, training performance and storage requirements discourage such matrix modelling. A natural simplifying assumption is that the features are uncorrelated and of unequal importance. That is, p(x(t)|ωi, Θri) is taken to be a D-dimensional distribution with uncorrelated features. To ensure that the negative log-likelihood in Eq.(3.8) is positive, the component density function is defined as
\[ p(x(t)|\omega_i, \Theta_{r_i}) = \exp\left\{ -\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d(t) - \mu_{r_i d})^2}{\sigma_{r_i d}^2} \right\}. \tag{3.9} \]
Figure 3.2: The schematic diagram of the proposed Multiple-Instance Neural Network
As shown in Fig. 3.2, the MINN contains M subnets, which are used to represent an M-category multiple-instance learning problem. Inside each subnet, an elliptic basis function (EBF) serves as the basis function for each cluster ri:
\[ \varphi(x(t), \omega_i, \Theta_{r_i}) = -\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d(t) - \mu_{r_i d})^2}{\sigma_{r_i d}^2}. \tag{3.10} \]
After passing through an exponential activation function, exp{ϕ(x(t), ωi, Θri)} can be viewed as the distribution described in Eq.(3.9).
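The EBF of Eq.(3.10) and its exponential activation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`ebf`, `component_density`, `mixture_density`) are hypothetical, each cluster is described by a mean vector and a vector of per-dimension variances, and the class density is taken to be the prior-weighted sum of the component densities described at the start of this section.

```python
import math

def ebf(x, mu, var):
    """Elliptic basis function: -1/2 * sum_d (x_d - mu_d)^2 / var_d."""
    return -0.5 * sum((xd - md) ** 2 / vd for xd, md, vd in zip(x, mu, var))

def component_density(x, mu, var):
    """Exponential activation of the EBF (an unnormalized density, at most 1)."""
    return math.exp(ebf(x, mu, var))

def mixture_density(x, priors, mus, variances):
    """Prior-weighted sum of the component densities of one concept class."""
    return sum(P * component_density(x, m, v)
               for P, m, v in zip(priors, mus, variances))

# Illustrative values only: two clusters with diagonal variances.
p = mixture_density((0.4, 0.8),
                    priors=[0.5, 0.5],
                    mus=[(0.2, 0.8), (0.8, 0.2)],
                    variances=[(0.04, 0.04), (0.04, 0.04)])
```

Because the exponent is never positive, each component density is at most 1, so the negative logarithm of the mixture value is never negative when the priors sum to one.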
3.3.2 The Energy Function of MINN
In multiple-instance learning problems, the best-matched class falls within a region that is near the intersection of the positive images and far away from the negative images. In other words, the best-matched class falls within the area that contains most instances of the positive images and very few instances of the negative images.
If there is only one image in Xi, the energy function reduces to (3.8), where the bounded number k of Eq.(3.8) is set to the number of instances in the image.
To test how the MINN and the proposed energy function (3.11) deal with multiple-instance learning problems, we generated the following data set. There are two classes in the data set. In each class, ten bags are generated, each with 100 instances. The concepts of class 1 and class 2 are Gaussian distributions with the same variance, 0.04. The mean of class 1 is (0.2, 0.8), and the mean of class 2 is (0.8, 0.2) in the feature space. In each class, five bags are labeled positive and the rest are labeled negative. In each positive bag, 20 instances are drawn at random from the distribution of the concept class and 80 instances are generated uniformly at random. In each negative bag, 100 instances are generated uniformly at random, such that none of them falls within the designated concept class. The distribution of the instances from each bag is shown in Figure 3.3, where the instances from the positive and negative bags are denoted as ‘+’ and ‘x’, respectively. The trajectories of the predicted positions of the class concepts during the training phase of the MINN are also shown in Fig. 3.3.
The change of the energy during the training phase is shown in Fig. 3.4. The energy function decreases after each iteration in the training phase of the MINN.
Figure 3.3: The artificial data set: (a) class 1 and (b) class 2. The instances from the positive and negative bags are denoted as ‘+’ and ‘x’, respectively. The trajectories of the predicted points of the class concepts during the training phase are denoted as ‘.’.
Figure 3.4: The class energy decreases monotonically during the training phase of the MINN. The bold line denotes class 1 and the dotted line denotes class 2.
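The artificial data set described in this section could be generated along the following lines. This is a minimal sketch, not the code used for the experiments. It assumes the uniform instances are drawn from the unit square, and it interprets "within the designated concept class" as lying inside a disc of radius `exclusion_radius` around the concept mean. That radius, like the function names `make_bag` and `make_class`, is a hypothetical choice made only for illustration.

```python
import math
import random

def make_bag(mean, positive, rng, n_instances=100, n_concept=20,
             std=0.2, exclusion_radius=0.2):
    """Generate one bag of 2-D instances (std = sqrt(0.04) = 0.2)."""
    bag = []
    if positive:
        # 20 instances drawn from the Gaussian concept distribution ...
        for _ in range(n_concept):
            bag.append((rng.gauss(mean[0], std), rng.gauss(mean[1], std)))
        # ... and the remaining 80 drawn uniformly from the unit square.
        while len(bag) < n_instances:
            bag.append((rng.random(), rng.random()))
    else:
        # Uniform instances, rejecting any that fall inside the assumed
        # concept region (a disc around the mean; hypothetical choice).
        while len(bag) < n_instances:
            point = (rng.random(), rng.random())
            if math.dist(point, mean) > exclusion_radius:
                bag.append(point)
    return bag

def make_class(mean, rng, n_bags=10, n_positive=5):
    """Return a list of (bag, is_positive) pairs for one class."""
    return [(make_bag(mean, b < n_positive, rng), b < n_positive)
            for b in range(n_bags)]

rng = random.Random(0)
class_1 = make_class((0.2, 0.8), rng)
class_2 = make_class((0.8, 0.2), rng)
```

A fixed seed is used so that the sketch is repeatable; any other rejection rule for the negative bags would fit the description equally well.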
3.3.3 The Training Phase
It can be seen that, by minimizing E(Xi, Ωi) with respect to Ωi, the class ωi is placed where most instances of the images are located. Conversely, by maximizing E(Xi, Ωi) with respect to Ωi, the class ωi is placed where very few instances of the images are located.
Given a set of positive training images X+i and a set of negative training images X−i of the corresponding class, the following reinforced and anti-reinforced learning rules are applied to the corresponding subnet.
Reinforced Learning:
Ω(m+1)i = Ω(m)i − η∇E(X+i , Ωi), (3.12)
Anti-reinforced Learning:
Ω(m+1)i = Ω(m)i + η∇E(X−i , Ωi). (3.13)
In (3.12) and (3.13), η is a user-defined learning rate with 0 < η ≤ 1; we set η = 0.5 in this dissertation. ∇E denotes the gradient vectors, which are computed as follows:
∂E
is a posterior probability of the cluster ri in concept class ωi given xib(t)
As to the conditional prior probability Pri, since the EM algorithm automatically satisfies the probabilistic constraints $\sum_{r_i=1}^{R_i} P_{r_i} = 1$ and Pri ≥ 0, it is applied to update the
Threshold Updating: The threshold value of the MINN can also be learned by the reinforced and anti-reinforced learning rules. For example, given a class ωi, if a positive image of ωi is misclassified, the threshold Ti needs to be increased because Ti is so small that this positive image is rejected. On the other hand, if a negative image is misclassified, Ti should be reduced because Ti is so large that this negative image is accepted. An adaptive learning rule to train the threshold Ti is proposed as follows:
Ti(m+1) = Ti(m)+ γ∆Ti(m), (3.18)
where γ is a user-defined learning rate with 0 < γ ≤ 1, which can be set to γ = 0.5. ∆Ti(m) is defined as Ep(m) − En(m), where Ep(m) is the misclassification rate of the positive images in the mth iteration and En(m) is the misclassification rate of the negative images in the mth iteration.
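The three update rules of the training phase, Eqs.(3.12), (3.13), and (3.18), share the same simple form and can be sketched as follows. The sketch treats the parameter set Ωi as a flat list of numbers and takes the gradient as an input. The function `toy_gradient` is a hypothetical stand-in based on a quadratic energy, used only to make the example runnable; it is not the gradient of the MINN energy function.

```python
def reinforced_step(params, grad, eta=0.5):
    """Eq.(3.12): move the parameters down the energy gradient."""
    return [p - eta * g for p, g in zip(params, grad)]

def anti_reinforced_step(params, grad, eta=0.5):
    """Eq.(3.13): move the parameters up the energy gradient."""
    return [p + eta * g for p, g in zip(params, grad)]

def update_threshold(T, pos_error_rate, neg_error_rate, gamma=0.5):
    """Eq.(3.18): T <- T + gamma * (Ep - En)."""
    return T + gamma * (pos_error_rate - neg_error_rate)

def toy_gradient(params, center):
    """Gradient of the toy energy 0.5 * ||params - center||^2 (assumption)."""
    return [p - c for p, c in zip(params, center)]

center = [1.0, 1.0]   # hypothetical location with many positive instances
params = [0.0, 0.0]
closer = reinforced_step(params, toy_gradient(params, center))
farther = anti_reinforced_step(params, toy_gradient(params, center))
raised = update_threshold(1.0, pos_error_rate=0.4, neg_error_rate=0.0)
```

With the toy energy, the reinforced step moves the parameters toward `center` and the anti-reinforced step moves them away from it. The threshold rises when positive images are misclassified more often than negative ones and falls in the opposite case.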
3.3.4 The Testing Phase
As shown in Fig. 3.2, an unlabeled image is input to the MINN, and the MINN labels it with one or several matched classes. First, the unlabeled image is applied to all subnets in the MINN. In each subnet, the discriminant function (Eq. 3.8) is computed. The result is then compared with the threshold Ti. Finally, the ith element of the retrieval result vector V is set to 1 if the value of the discriminant function is smaller than Ti, which implies that the given image belongs to the concept class i. Otherwise, the ith element of V is set to 0, which implies that the given image does not belong to the concept class i. From the output vector, one can read off which concept classes the given image belongs to.
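The thresholding step of the testing phase can be sketched as below. The sketch assumes the discriminant values of all subnets have already been computed and are passed in as a list. The function name `label_image` is hypothetical, and a value exactly equal to the threshold is treated as a rejection, which is an assumption of this sketch.

```python
def label_image(phi_values, thresholds):
    """Return V, where V[i] = 1 if phi_i < T_i and 0 otherwise."""
    return [1 if phi < T else 0 for phi, T in zip(phi_values, thresholds)]

# Illustrative values only: three subnets sharing the same threshold.
V = label_image([0.3, 2.0, 1.0], [1.0, 1.0, 1.0])
matched_classes = [i for i, v in enumerate(V) if v == 1]
```

Since each subnet is thresholded independently, an image may be assigned to several classes or to none.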
3.3.5 Image Indexing with Histogram Approximation
Suppose images are digitized as 24-bit RGB, that is, 8 bits or 256 linear levels of brightness for each of the red, green, and blue components. To better match human visual perception, we calculate the histogram of each component in the Lab color space. The three histograms corresponding to the three color components L, a, and b of the masked image are then combined into a histogram vector H = [h0, h1, · · · , h256∗3−1]T. The nth element of H is evaluated as
\[ H(n) = \sum_{\{(x, y)\,:\, I_c(x, y) = n \bmod 256\}} G_m(x, y), \tag{3.19} \]
where Gm(x, y) is the Gaussian-like mask image, which will be defined in (5.2), Ic(x, y) is the intensity value of the cth color channel at position (x, y) in the masked image, and c is the quotient of n divided by 256.
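The mask-weighted histogram of Eq.(3.19) can be sketched as follows. The sketch assumes that the image has already been converted to the Lab color space and that each channel has been quantized to integer levels 0 to 255; neither step is shown here. The function name `masked_histogram` and the nested-list image layout are hypothetical.

```python
def masked_histogram(channels, mask, levels=256):
    """Accumulate mask weights into one histogram per channel.

    channels: list of C images, each a list of rows of integer levels
    mask:     image of the same shape holding the weights G_m(x, y)
    Returns H of length C * levels, with H[c * levels + v] equal to the
    sum of mask weights over pixels whose level in channel c is v.
    """
    H = [0.0] * (levels * len(channels))
    for c, channel in enumerate(channels):
        for row_values, row_weights in zip(channel, mask):
            for value, weight in zip(row_values, row_weights):
                H[c * levels + value] += weight
    return H

# Tiny illustrative example: 2x2 image, 3 channels, 4 levels per channel.
channels = [[[0, 1], [1, 3]],
            [[2, 2], [0, 0]],
            [[3, 3], [3, 3]]]
mask = [[1.0, 0.5],
        [0.5, 0.25]]
H = masked_histogram(channels, mask, levels=4)
```

The index `c * levels + value` matches the convention in the text: the remainder of n divided by the number of levels gives the intensity, and the quotient gives the channel.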
Instead of using color histograms as image features, the proposed system uses the parameters of mixture density functions that approximate the color histograms of each image. This reduces the dimension of the features and speeds up the response time of querying. In information theory, the cross-entropy [51] between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, and the cross-entropy of two distributions over the same probability space can be used to measure their similarity.
Define p(t|Θr) as a one-dimensional Gaussian distribution, where Θr represents the parameter set {µr, σr2} for a cluster r, and µr and σr2 are the mean and the variance of a
Let p(t|Θr) be one of the Gaussian distributions that comprise P(t), and let Pr denote the prior probability of the cluster r. Then P(t) is a Gaussian mixture distribution, which can be expressed as
\[ P(t) = \sum_{r=1}^{R} P_r\, p(t|\Theta_r), \tag{3.21} \]
where R is the number of clusters in P(t). By definition, $\sum_{r=1}^{R} P_r = 1$. We can take Pr = 1/R and set σr2 = 0.05.
Since the color histogram of an image can be regarded as a one-dimensional vector, given a color histogram H(n), where 0 ≤ n ≤ N, and a mixture of Gaussian distributions P(t), the similarity between H(n) and P(t) is measured by their cross-entropy
\[ -\sum_{n=0}^{N} H(n) \ln P(n). \tag{3.22} \]
Cross-entropy minimization is frequently used in optimization. The cross-entropy in Eq.(3.22) attains its minimum value when P(n) equals H(n) for n = 0, 1, · · · , N. Hence, the EM algorithm is applied to adjust the parameters Θr and Pr of each cluster r in P(t) to minimize Eq.(3.22), so that P(t) approximates H(n). The updating equations for the parameters of cluster r in the mixture model P(t) are
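\[ \mu_r^{new} = \frac{\sum_{n=0}^{N} H(n)\, p^{old}(\Theta_r|n)\, n}{\sum_{n=0}^{N} H(n)\, p^{old}(\Theta_r|n)}, \tag{3.23} \]
\[ (\sigma_r^{new})^2 = \frac{\sum_{n=0}^{N} H(n)\, p^{old}(\Theta_r|n)\, (n - \mu_r^{new})^2}{\sum_{n=0}^{N} H(n)\, p^{old}(\Theta_r|n)}, \tag{3.24} \]
\[ P_r^{new} = \frac{\sum_{n=0}^{N} H(n)\, p^{old}(\Theta_r|n)}{\sum_{n=0}^{N} H(n)}, \tag{3.25} \]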
where p(Θr|n) is the posterior probability of the cluster r given n, which is defined as
\[ p(\Theta_r|n) = \frac{P_r\, p(n|\Theta_r)}{\sum_{s=1}^{R} P_s\, p(n|\Theta_s)}. \tag{3.26} \]
Each EM iteration consists of two steps: the Expectation (E) step and the Maximization (M) step. The M step maximizes a likelihood function that is further refined in each iteration by the E step. In each iteration, we first compute the posterior probabilities of the clusters using Eq.(3.26) in the E step, and then calculate the new parameters of the model using Eqs.(3.23), (3.24), and (3.25) in the M step.
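One EM iteration for fitting a Gaussian mixture to a histogram, together with the cross-entropy of Eq.(3.22), can be sketched as follows. Several details are assumptions made for this sketch and are not taken from the text: the component density is the normalized one-dimensional Gaussian, the posterior is computed in the usual Bayes form with the priors as weights, a small variance floor `min_var` guards against collapse, and the histogram and starting values in the example are synthetic.

```python
import math

def gaussian(n, mu, var):
    """Normalized one-dimensional Gaussian density evaluated at bin n."""
    return math.exp(-0.5 * (n - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def cross_entropy(hist, mus, variances, priors, eps=1e-300):
    """Eq.(3.22): -sum_n H(n) ln P(n); eps avoids log(0) (assumption)."""
    total = 0.0
    for n, h in enumerate(hist):
        p = sum(P * gaussian(n, m, v) for P, m, v in zip(priors, mus, variances))
        total -= h * math.log(max(p, eps))
    return total

def em_step(hist, mus, variances, priors, min_var=1e-3):
    """One E step (posteriors) followed by one M step (Eqs. 3.23-3.25)."""
    R, bins = len(mus), range(len(hist))
    # E step: posterior of each cluster r given bin n.
    post = []
    for n in bins:
        joint = [priors[r] * gaussian(n, mus[r], variances[r]) for r in range(R)]
        total = sum(joint)
        # If every component underflows, fall back to the priors (assumption).
        post.append([j / total for j in joint] if total > 0.0 else list(priors))
    # M step: histogram-weighted updates of mean, variance, and prior.
    mass = sum(hist)
    new_mus, new_vars, new_priors = [], [], []
    for r in range(R):
        w = [hist[n] * post[n][r] for n in bins]
        w_sum = sum(w)
        if w_sum <= 0.0:
            # Cluster received no weight: keep its shape, give it zero prior.
            new_mus.append(mus[r])
            new_vars.append(variances[r])
            new_priors.append(0.0)
            continue
        mu = sum(w[n] * n for n in bins) / w_sum
        var = sum(w[n] * (n - mu) ** 2 for n in bins) / w_sum
        new_mus.append(mu)
        new_vars.append(max(var, min_var))
        new_priors.append(w_sum / mass)
    return new_mus, new_vars, new_priors

# Synthetic two-peak histogram and hypothetical starting values.
hist = [math.exp(-0.5 * (n - 16) ** 2 / 9.0)
        + 0.5 * math.exp(-0.5 * (n - 48) ** 2 / 16.0) for n in range(64)]
mus, variances, priors = [10.0, 50.0], [25.0, 25.0], [0.5, 0.5]
history = [cross_entropy(hist, mus, variances, priors)]
for _ in range(30):
    mus, variances, priors = em_step(hist, mus, variances, priors)
    history.append(cross_entropy(hist, mus, variances, priors))
```

The list `history` records the cross-entropy after every iteration, which makes it easy to check that the fit improves. After fitting, only the 3R numbers (µr, σr2, Pr) need to be stored per histogram instead of all of its bins.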