
The proposed resistant learning method, bipartite majority learning (BML), focuses on the binary classification problem, and an SLFN is used as the learning model. The SLFN architecture is defined in (2) to (4). Table 1 describes the notation. $a_i(x)$ is the output value of the $i$th hidden node, and $f(x)$ is the output value of the SLFN. The activation function, tanh, is the hyperbolic tangent. The SLFN can serve as a bipartite classifier by setting a threshold [12]: if the output value of an observation is greater than or equal to the threshold, the observation is assigned to class 1; otherwise, to class 2. Figure 1 shows the tensor graph of the SLFN.

Table 1: Table of Notations

Notation      Description
$x_c$         $x \equiv (x_1, \ldots, x_m)^T$; $x_c$ is the $c$th input observation.
$y_c$         The desired output corresponding to the $c$th input observation $x_c$.
$m$           The dimension of the input observation $x$.
$p$           The number of adopted hidden nodes.
$w^H_{i0}$    The bias value $\theta$ of the $i$th hidden node.
$w^H_{ij}$    The weight between $x_j$ and the $i$th hidden node.
$w^O_0$       The bias value $\theta$ of the output node.
$w^O_i$       The weight between the $i$th hidden node and the output node.

$f(x) \equiv w^O_0 + \sum_{i=1}^{p} w^O_i\, a_i(x)$   (2)

$a_i(x) \equiv \tanh\!\left(w^H_{i0} + \sum_{j=1}^{m} w^H_{ij}\, x_j\right)$   (3)

$\tanh(x) \equiv \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$   (4)

Figure 1: The tensor graph of SLFN.

Algorithm 1 Define the SLFN tensor graph in Python code

1: x ← tf.placeholder(tf.float64)
2: y ← tf.placeholder(tf.float64)
3: ht ← tf.Variable(h_t)
4: hw ← tf.Variable(h_w)
5: hidden_layer ← tf.tanh(tf.add(tf.matmul(x, hw), ht))
6: ot ← tf.Variable(o_t)
7: ow ← tf.Variable(o_w)
8: y′ ← tf.add(tf.matmul(hidden_layer, ow), ot)
9: sr ← tf.reduce_sum(tf.square(y′ − y))
10: train ← tf.train.GradientDescentOptimizer(eta).minimize(sr)

Algorithm 1 is the corresponding Python code. In TensorFlow, we define the computational relationships between tensors; the connected tensors form a data-flow graph, which is the tensor graph (i.e., Figure 1). The variables x and y are tensors of type tf.placeholder that hold the input observations and the desired outputs. The variables hw (hidden-layer weights, $w^H_{ij}$), ht (hidden-layer biases, $w^H_{i0}$), ow (output-layer weights, $w^O_i$), and ot (output-layer bias, $w^O_0$) are tensors of type tf.Variable, which are modified by the optimizer. The structure of the SLFN, producing the output tensor y′, is defined by these variable tensors and certain tensor operations: tf.matmul performs matrix multiplication, tf.add performs matrix addition, and tf.square squares every element of a tensor. tf.reduce_sum then calculates the sum of all squared residuals. Finally, the optimizer, tf.train.GradientDescentOptimizer, applies the gradient descent method to modify the variable tensors of neuron weights.
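For completeness, the following minimal sketch fleshes out Algorithm 1 with concrete shapes and a session-based training loop, using the TF 1.x API; the data, shapes, learning rate eta, and iteration count are placeholders assumed for illustration, not values from this study.

import numpy as np
import tensorflow as tf  # TF 1.x API, matching Algorithm 1

# Hypothetical data: N observations of dimension m, labels +1 (C1) / -1 (C2).
N, m, p, eta = 100, 4, 1, 0.01
X = np.random.randn(N, m)
Y = np.random.choice([1.0, -1.0], size=(N, 1))

x = tf.placeholder(tf.float64, shape=[None, m])
y = tf.placeholder(tf.float64, shape=[None, 1])           # desired outputs
hw = tf.Variable(np.random.randn(m, p))                   # hidden-layer weights w^H_ij
ht = tf.Variable(np.random.randn(1, p))                   # hidden-layer biases  w^H_i0
hidden_layer = tf.tanh(tf.add(tf.matmul(x, hw), ht))      # eq. (3)
ow = tf.Variable(np.random.randn(p, 1))                   # output-layer weights w^O_i
ot = tf.Variable(np.random.randn(1, 1))                   # output-layer bias    w^O_0
y_out = tf.add(tf.matmul(hidden_layer, ow), ot)           # eq. (2), f(x)
sr = tf.reduce_sum(tf.square(y_out - y))                  # sum of squared residuals
train = tf.train.GradientDescentOptimizer(eta).minimize(sr)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train, feed_dict={x: X, y: Y})
    outputs = sess.run(y_out, feed_dict={x: X})           # f(x_c) for every observation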

In the supervised learning scenario for binary classification, we must assign appropriate labels to the training set. In general, the desired outputs are set to [1, 0] and [0, 1] when the SLFN has two output nodes. In this study, however, the SLFN has only one output node, and the desired outputs of the observations are assigned dynamically by a specific method. The learning goal of the SLFN is to discern the majority of the two classes of data. We adopt the linearly separating condition (condition L) [12] to distinguish the two classes of observations. The α in (5) is the minimum output value over the observations in class 1 ($C_1$), and the β in (6) is the maximum output value over the observations in class 2 ($C_2$). If α > β over all the considered observations, condition L in (7) is satisfied, and the two classes of observations can be separated by the threshold $\frac{\alpha+\beta}{2}$. In practice, we label $C_1$ as {1} and $C_2$ as {−1} at the beginning of the training process.

$\alpha = \min_{y_c \in C_1} f(x_c)$   (5)

$\beta = \max_{y_c \in C_2} f(x_c)$   (6)

The Linearly Separating Condition: $\alpha > \beta$   (7)
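As a concrete illustration, the following sketch (assuming the classes are encoded as +1 for $C_1$ and −1 for $C_2$, as described above) checks condition L and derives the separating threshold from the SLFN outputs.

import numpy as np

def condition_L(f_values, labels):
    """Check the linearly separating condition on SLFN outputs.

    f_values: NumPy array of f(x_c) for the considered observations.
    labels:   NumPy array with 1 for class C1 and -1 for class C2,
              the labels assigned at the start of training.
    Returns (satisfied, threshold).
    """
    alpha = f_values[labels == 1].min()    # minimum output over C1, eq. (5)
    beta = f_values[labels == -1].max()    # maximum output over C2, eq. (6)
    return alpha > beta, (alpha + beta) / 2.0

def classify(f_value, threshold):
    # Outputs greater than or equal to the threshold are assigned to C1.
    return 1 if f_value >= threshold else -1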

Since we adopt condition L as the learning goal of the SLFN, training can be faster than with the envelope method. The envelope method proposed by Huang et al. [9] requires the squared error between $y_c$ and $f(x_c)$ to be less than two times the standard deviation. Condition L is less restrictive than the envelope method but more appropriate for bipartite classification.

Table 2 presents the proposed bipartite majority learning algorithm. Assume there are N observations and that γ is the majority rate, with γN > m + 1. The BML algorithm terminates when more than γN observations are correctly classified and condition L is satisfied.

Let S(N) be the set of N observations. Let the nth stage be the stage of handling n reference observations (i.e., S(n)), where γN ≥ n > m + 1. Let $\hat{S}(n)$ be the set of observations that are classified correctly under condition L at the end of the nth stage. Then the acceptable SLFN estimate leads to a set $\{(x_c, y_c)\}$ for which a threshold can be found that separates the two classes of observations for all $c \in S(n)$. Meanwhile, $|\hat{S}(n)| \geq n$ since $S(n) \subseteq \hat{S}(n)$. To put it another way, at the end of the nth stage, the acceptable SLFN estimate presents a fitting function f for which a threshold can be found that correctly classifies at least n observations in $\{(x_c, y_c) : c \in \hat{S}(n)\}$.

Table 2: The bipartite majority learning algorithm

Step 1 Randomly obtain the initial m + 1 reference observations, with the two classes each accounting for half of the m + 1 observations.
       Let S(m + 1) be the set of these observations.
       Set up an acceptable SLFN estimate with one hidden node regarding the reference observations $(x_c, y_c)$ for all $c \in S(m + 1)$.
       Set n = m + 2.
Step 2 If n > γN, STOP.
Step 3 Present the n − 1 reference observations $(x_c, y_c)$ that yield the largest distance between $C_1$ and $C_2$.
       Then select another observation $(x_k, y_k)$ so that the value of α − β is the largest.
       Let S(n) be the set of observations selected at stage n.
Step 4 If the n reference observations satisfy condition L, go to Step 7.
Step 5 Set $\tilde{w} = w$.
Step 6 Apply the gradient descent algorithm to adjust the weights w until one of the following cases occurs:
       (1) If the n reference observations satisfy condition L, go to Step 7.
       (2) If the n observations cannot satisfy condition L, restore the weights by setting $w = \tilde{w}$ and apply the resistant learning mechanism, adding extra hidden nodes to obtain an acceptable SLFN.
Step 7 n + 1 → n; go to Step 2.

The proposed BML executes two procedures: (i) the ordering procedure, implemented by Step 3, which determines the input sequence of reference observations, and (ii) the modeling procedure, implemented by Step 6, which adjusts the weights of the SLFN to minimize the sum of squared residuals $\sum_{c=1}^{N}(e_c)^2$. If the gradient descent mechanism cannot tune the weights to find an acceptable SLFN, the weights are restored and the number of hidden nodes adopted in the SLFN is adjusted. Finally, all n observations in S(n) at the nth stage satisfy condition L. The overall control flow is sketched below, and the detailed operations are explained as follows.
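The following Python skeleton is a structural outline only, not the authors' implementation; the callables init_slfn, order_observations, satisfies_L, gradient_descent, and cram are hypothetical placeholders standing in for Steps 1, 3, 4, 6, and 6-2 of Table 2.

def bml_outline(train_data, m, gamma, init_slfn, order_observations,
                satisfies_L, gradient_descent, cram):
    """Structural sketch of Table 2; every callable is a hypothetical placeholder."""
    N = len(train_data)
    slfn, S = init_slfn(train_data, m)                 # Step 1: m+1 reference observations
    n = m + 2
    while n <= gamma * N:                              # Step 2: stop when n > gamma*N
        S = order_observations(slfn, train_data, n)    # Step 3: ordering procedure
        if not satisfies_L(slfn, S):                   # Step 4
            saved_weights = slfn.copy()                # Step 5: remember current weights
            slfn = gradient_descent(slfn, S)           # Step 6: modeling procedure
            if not satisfies_L(slfn, S):               # Step 6-2: restore and cram
                slfn = cram(saved_weights, S)
        n += 1                                         # Step 7
    return slfn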

(Step 1) It first randomly chooses m + 1 observations from the N training data. Then it calculates the initial weights of the neural network from these m + 1 reference observations. The initial weights are given by formulas (8) to (11). We first calculate $w^O_0$ and $w^O_1$ in (8) and (9) from all of the reference observations. Next, we calculate $w^H_{i0}$ and $w^H_{ij}$. Since there are m + 1 hidden-weight variables, we can use the m + 1 reference observations to obtain a set of m + 1 simultaneous equations. We then solve these m + 1 simultaneous equations using matrices [37] to get the desired hidden-weight values, so that $f(x_c) = y_c$ for all $c \in S(m + 1)$.

Algorithm 2 shows how we use the TensorFlow API to define the operations of equations (8) to (11).

Algorithm 2 Calculate the first SLFN weights in Python code

7: xc ← sess.run(tf.concat([s_x, h_t_vector]))
8: answer ← sess.run(tf.matrix_solve_ls(xc, yc))
9: h_w ← answer[:m]
10: h_t ← answer[m:]

The purpose of setting the weights in this way is to ensure that the initial neural network satisfies $e_c = 0$ for all $c \in S(m + 1)$. That is, the initial SLFN perfectly represents the correspondence between $x_c$ and $y_c$ for the m + 1 reference observations.
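The NumPy sketch below illustrates one way the Step 1 linear solve could look for a single hidden node; the arctanh transformation of the targets and the assumption that $w^O_0$ and $w^O_1$ have already been chosen are illustrative assumptions rather than the authors' equations (8) to (11).

import numpy as np

def initial_hidden_weights(X_ref, y_ref, w0_O, w1_O):
    """Hedged sketch of the Step 1 linear solve for a single hidden node.

    X_ref: (m+1, m) reference observations; y_ref: (m+1,) desired outputs.
    w0_O, w1_O: output-layer bias and weight, assumed already chosen so that
    (y_ref - w0_O) / w1_O lies strictly inside (-1, 1).
    Returns the hidden weights w^H_{1j} and the hidden bias w^H_{10}.
    """
    # tanh(w^H_{10} + sum_j w^H_{1j} x_j) must equal (y_c - w^O_0) / w^O_1 for
    # every reference observation, so invert tanh on the right-hand side.
    t = np.arctanh((y_ref - w0_O) / w1_O)
    # Augment X with a column of ones for the bias, as in Algorithm 2, and
    # solve the resulting (m+1)-by-(m+1) system in the least-squares sense.
    A = np.hstack([X_ref, np.ones((X_ref.shape[0], 1))])
    solution, *_ = np.linalg.lstsq(A, t, rcond=None)
    return solution[:-1], solution[-1]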

(Step 2) This is the termination condition of the system. We set the majority rate γ to 95%, which guarantees that the SLFN correctly discerns more than 95% of the observations in the training set.

(Step 3) The BML first computes all possible values of α − β over sets of n − 1 observations drawn from the N observations. It then selects a set of n − 1 observations with the maximal value of α − β. To find these n − 1 observations, we first sort the values $f(x_c)$ within $C_1$ and within $C_2$, respectively. Then we take the i largest values of $f(x_c)$ in $C_1$ and the (n − 1 − i) smallest values of $f(x_c)$ in $C_2$, for i ∈ [1, n − 2], to calculate all possible values of α − β. The time complexity for obtaining these n − 1 observations is O(N log N), since the time complexity of sorting is O(N log N) and the time complexity of calculating all possible α − β values is O(n), with N > n. Compared with the training process, this step does not significantly reduce the efficiency of learning. After selecting the n − 1 observations, the BML picks another observation $(x_k, y_k)$ so that the value of α − β remains the largest. The purpose of this selection mechanism is to select the n observations that are most likely to be classified under condition L. The (n − 1)th stage asserts that there is at least one set of n − 1 observations for which α − β > 0. Although the n − 1 observations selected in the nth stage are not necessarily equal to S(n − 1), this selection mechanism ensures that if the observation $(x_k, y_k)$ is excluded, the n − 1 observations satisfy condition L.
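A possible NumPy realization of this ordering procedure is sketched below; the function name and the brute-force search over the extra observation $(x_k, y_k)$ are illustrative assumptions rather than the authors' exact implementation.

import numpy as np

def select_reference_observations(f_values, labels, n):
    """Hedged sketch of the Step 3 ordering procedure (not the authors' code).

    f_values: NumPy array of SLFN outputs f(x_c) for all N observations.
    labels:   NumPy array with 1 for C1 and -1 for C2.
    n:        number of reference observations wanted at the current stage.
    Returns the indices of the selected n observations (n-1 best plus one extra).
    """
    idx_c1 = np.where(labels == 1)[0]
    idx_c2 = np.where(labels == -1)[0]
    # Sort C1 outputs in descending order and C2 outputs in ascending order.
    c1_sorted = idx_c1[np.argsort(-f_values[idx_c1])]
    c2_sorted = idx_c2[np.argsort(f_values[idx_c2])]

    best_gap, best_subset = -np.inf, None
    # Take the i largest C1 outputs and the (n-1-i) smallest C2 outputs,
    # i in [1, n-2], and keep the split with the largest alpha - beta.
    for i in range(1, n - 1):
        j = (n - 1) - i
        if i > len(c1_sorted) or j > len(c2_sorted):
            continue
        alpha = f_values[c1_sorted[i - 1]]   # smallest of the i chosen C1 outputs
        beta = f_values[c2_sorted[j - 1]]    # largest of the j chosen C2 outputs
        if alpha - beta > best_gap:
            best_gap = alpha - beta
            best_subset = np.concatenate([c1_sorted[:i], c2_sorted[:j]])
    if best_subset is None:
        raise ValueError("each class needs at least one observation")

    # Pick one more observation k so that alpha - beta over the n observations
    # stays as large as possible.
    remaining = np.setdiff1d(np.arange(len(f_values)), best_subset)
    gaps = []
    for k in remaining:
        subset = np.append(best_subset, k)
        alpha = f_values[subset[labels[subset] == 1]].min()
        beta = f_values[subset[labels[subset] == -1]].max()
        gaps.append(alpha - beta)
    k = remaining[int(np.argmax(gaps))]
    return np.append(best_subset, k)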

(Step 4) It checks whether the n selected reference observations satisfy condition L. If so, the nth stage has found at least one set of n observations with α − β > 0, and the algorithm moves to the next stage. If not, the BML attempts to find an acceptable SLFN for the chosen observations S(n).

(Step 5) It saves the current weights of the SLFN for the resistant learning procedure. At the end of the (n − 1)th stage, the observations in $\{(x_c, y_c) : c \in S(n - 1)\}$ satisfy condition L. The resistant learning procedure can cram a new observation (i.e., $(x_k, y_k)$) by adding two hidden nodes without affecting the outputs of the other observations. Adjusting the weights with the gradient descent mechanism in Step 6 might make the n − 1 observations picked at stage n violate condition L. Therefore, the current state of the neural network needs to be stored temporarily so that it can be restored in Step 6-2.
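Continuing the TF 1.x sketch shown after Algorithm 1 (so sess, hw, ht, ow, and ot are assumed to be the objects defined there), one possible way to store and restore the variable tensors is the following; it is an assumption for illustration, not necessarily the authors' implementation.

# Step 5: read the current values of the variable tensors.
saved_weights = sess.run([hw, ht, ow, ot])

# ... Step 6: gradient descent may modify hw, ht, ow, ot ...

# Step 6-2: write the saved values back if condition L cannot be satisfied.
for var, value in zip([hw, ht, ow, ot], saved_weights):
    sess.run(tf.assign(var, value))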

(Step 6) We apply the gradient descent mechanism to find an acceptable SLFN. For the purpose of nominal supervised learning, the learning targets are given dynamically rather than as fixed values. Although we assign the desired outputs $y_c$ for the $C_1$ and $C_2$ observations at the beginning, the difference between the outputs of the two classes of observations is what really matters. Let $\bar{S}(n)$ be the subset of S(n) with $S(n) = \{k\} \cup \bar{S}(n)$. We first calculate $\max(f(x_{C_1}))$ and $\min(f(x_{C_2}))$ over all $c \in \bar{S}(n)$; the supervised learning target is then set to $\max(f(x_{C_1}))$ for all $(x_c, y_c) \in C_1$ and to $\min(f(x_{C_2}))$ for all $(x_c, y_c) \in C_2$. After setting the learning targets of $\bar{S}(n)$, we compute the values α and β of $\bar{S}(n)$. Then, if $y_k \in C_1$, the learning target of $x_k$ is set to α; otherwise ($y_k \in C_2$), the learning target of $x_k$ is set to β. Finally, the gradient descent mechanism is applied to adjust the weights to find an acceptable SLFN.
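The target assignment described above might look as follows in NumPy; the function name and array conventions are assumptions for illustration, and both classes are assumed to be represented in $\bar{S}(n)$.

import numpy as np

def dynamic_targets(f_values, labels, subset_idx, k):
    """Hedged sketch of the Step 6 target assignment (illustrative only).

    subset_idx: index array of the n-1 observations in S_bar(n); k: the extra index.
    labels use 1 for C1 and -1 for C2.
    Returns the learning targets for the observations in subset_idx plus x_k.
    """
    c1 = subset_idx[labels[subset_idx] == 1]
    c2 = subset_idx[labels[subset_idx] == -1]
    target_c1 = f_values[c1].max()   # max f(x) over C1 members of S_bar(n)
    target_c2 = f_values[c2].min()   # min f(x) over C2 members of S_bar(n)
    alpha = f_values[c1].min()       # alpha of S_bar(n), eq. (5)
    beta = f_values[c2].max()        # beta of S_bar(n), eq. (6)

    targets = np.empty(len(subset_idx) + 1)
    targets[:-1] = np.where(labels[subset_idx] == 1, target_c1, target_c2)
    targets[-1] = alpha if labels[k] == 1 else beta   # pull x_k toward the threshold
    return targets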

(Step 6.1) If an acceptable SLFN is found, we move to the next stage. However, we might encounter a local optimum caused either by the implementation of the gradient descent mechanism or by an SLFN model that does not have enough hidden nodes. Both situations lead to an unacceptable SLFN estimate regarding the n reference observations. Therefore, we adopt the resistant procedure proposed by Tsaih and Cheng [7] to cope with the observation $x_k$.

(Step 6.2) We apply the resistant learning procedure. First, restore the weights $\tilde{w}$ stored in Step 5. Then we add two hidden nodes to move the output value of $x_k$ to its learning target; this also means that the observation $x_k$ becomes closer to the threshold than the other observations of the same class. The outputs of the other observations, $c \in \bar{S}(n)$, are not significantly affected by the resistant learning procedure. The hidden weights of the newly added hidden nodes are defined in (12) to (16).

$w^H_{p-1,0} = \zeta - \lambda\,\alpha^T x_k$   (12)

$w^H_{p,0} = \zeta + \lambda\,\alpha^T x_k$   (13)

$w^H_{p-1} = \lambda\,\alpha^T$   (14)

$w^H_{p} = -\lambda\,\alpha^T$   (15)

$w^O_{p-1} = w^O_p = \dfrac{\bigl|\, y'_k - w^O_0 - \sum_{i=1}^{q} w^O_i\, a_i(x_k) \bigr|}{2\tanh(\zeta)}$   (16)

Here ζ is a small constant set to 0.05, and λ is a large constant set to $10^5$. $\alpha^T$ is an m-dimensional vector of unit length that satisfies condition (17).

$\alpha^T (x_k - x_c) \neq 0 \quad \forall\, c \in S(n) - \{k\}$   (17)

By adding the two hidden nodes to the hidden layer, the SLFN satisfies condition L for the n reference observations S(n). The output value $f(x_k)$ will be very close to α if $y_k \in C_1$; otherwise ($y_k \in C_2$), the output value $f(x_k)$ will be very close to β.
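A NumPy sketch of this cramming step is given below. It follows equations (12) to (16) under two illustrative assumptions: the unit vector (the $\alpha^T$ of condition (17), not the α of equation (5)) is drawn at random until (17) holds, and the signed residual $y'_k - f(x_k)$ is used directly instead of its absolute value so that $f(x_k)$ lands exactly on the learning target.

import numpy as np

def cram_new_nodes(x_k, residual, X_others, zeta=0.05, lam=1e5):
    """Hedged sketch of the cramming step, following eqs. (12) to (16).

    x_k:      the crammed observation, shape (m,).
    residual: y'_k - f(x_k) under the current SLFN (signed, see lead-in).
    X_others: the other reference observations, shape (n-1, m), used only to
              verify condition (17) for the chosen unit direction.
    Returns the hidden weights, hidden biases, and output weights of the two
    new hidden nodes.
    """
    m = x_k.shape[0]
    # Draw a random unit direction until condition (17) holds; a single random
    # draw satisfies it with probability one.
    while True:
        direction = np.random.randn(m)
        direction /= np.linalg.norm(direction)
        if np.all(np.abs((x_k - X_others) @ direction) > 1e-12):
            break

    w_h = np.vstack([lam * direction, -lam * direction])    # eqs. (14), (15)
    b_h = np.array([zeta - lam * direction @ x_k,            # eq. (12)
                    zeta + lam * direction @ x_k])           # eq. (13)
    # Both new nodes output tanh(zeta) at x_k, so equal output weights of
    # residual / (2 tanh(zeta)) shift f(x_k) onto the learning target, while
    # at the other observations the two saturated nodes cancel each other out.
    w_o = np.full(2, residual / (2.0 * np.tanh(zeta)))       # cf. eq. (16)
    return w_h, b_h, w_o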

(Step 7) We increase n by 1 and repeat Steps 2 to 7. Most machine learning methods use all the training data as the basis for adjusting weights. The BML mechanism, in order to prevent anomalous observations from affecting SLFN learning and to pick appropriate majority observations, starts with a small amount of data and gradually increases the amount of selected data. The advantage of this approach is that, since we do not necessarily know in advance which training data belong to the majority and which are anomalies, anomalous data may well be selected when the first m + 1 observations are acquired. However, as the number of selected observations n increases, the selection mechanism in Step 3 picks those observations that are most easily classified under condition L. Since the n observations selected in the nth stage are the most suitable for the current SLFN, they do not necessarily include the n − 1 observations selected in the (n − 1)th stage. Through this dynamic selection method, we can select appropriate majority data and avoid the impact of anomalies on the effectiveness of learning. Although increasing n by only 1 at a time is slower than training the SLFN with all the data at once, we found in our experiments that the SLFN does not need to be retrained at every stage, because the selected observations often already satisfy condition L. The BML mechanism can therefore quickly move to the next stage when most of the training data can be classified correctly.
