
CHAPTER 1 INTRODUCTION

1.4 ORGANIZATION OF DISSERTATION

The dissertation is arranged as follows.

Chapter 1 introduces the motivation, related work, approach, and organization of the dissertation.

Chapter 2 provides the fundamental material used in the dissertation. These foundations include the neural fuzzy network, the genetic algorithm, standard and cooperative PSO, the mean shift procedure, and CMA-ES.

In Chapter 3, the proposed QPSO, TSR-EA, SD-CPSO and MS-CMA-ES are described.

In Chapter 4, two types of computer simulations, reinforcement learning control tasks and multi-funnel optimization functions, are performed to verify the performance of the proposed algorithms. QPSO and TSR-EA are applied to two reinforcement learning control tasks: the cart-pole balancing system and two-pole inverted pendulum control. SD-CPSO and MS-CMA-ES are applied to real-valued function optimization tasks.

In Chapter 5, the conclusions, contributions, and future work of the dissertation are discussed.

CHAPTER 2

FOUNDATIONS

This chapter introduces the background material and literature related to the major components of this research: the neuro-fuzzy controller, the genetic algorithm, particle swarm optimization, and the evolution strategy with covariance matrix adaptation. Section 2.1 discusses the neuro-fuzzy controller. Section 2.2 introduces the genetic algorithm (GA). Section 2.3 discusses particle swarm optimization (PSO) and some of its improvements. The final section covers the background knowledge related to the proposed mean shift-based evolution strategy with covariance matrix adaptation (MS-CMA-ES), namely kernel density estimation, the mean shift procedure, and the standard CMA-ES.

2.1 Neural Fuzzy System

In general, there are three typical types of neural fuzzy system (NFS): the TSK-type [34], the Mamdani-type [16], and the singleton-type. It has been shown in [69] and [70] that the TSK-type NFS offers better network size and learning accuracy than the Mamdani-type and singleton-type NFS. Thus, in this dissertation, only the TSK-type NFS is introduced, and it is applied to reinforcement learning tasks.

A TSK-type NFS employs different implication and aggregation methods from a standard Mamdani fuzzy model. Instead of using fuzzy sets, the conclusion part of a rule is a linear combination of the crisp inputs.

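$$
\begin{aligned}
&\text{IF } x_1 \text{ is } A_{1j}(m_{1j}, \sigma_{1j}) \text{ and } x_2 \text{ is } A_{2j}(m_{2j}, \sigma_{2j}) \ldots \text{ and } x_n \text{ is } A_{nj}(m_{nj}, \sigma_{nj}) \\
&\text{THEN } y' = w_{0j} + w_{1j} x_1 + \ldots + w_{nj} x_n.
\end{aligned} \tag{2.1}
$$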

The structure of a TSK-type NFS is shown in Fig. 2.1. It is a five-layer network. In a TSK-type NFS, the firing strength of a fuzzy rule is calculated by performing an “AND” operation on the truth values of each variable with respect to its corresponding fuzzy set. The functions of the nodes in each layer are described as follows:

Layer 1 (input node): Each node in this layer is called an input linguistic node and corresponds to one linguistic variable. These nodes only pass the input signal to the next layer:

$$u_i^{(1)} = x_i. \tag{2.2}$$

Layer 2 (membership function node): Each node in this layer acts as a Gaussian membership function, and its output value specifies the degree to which the given input value belongs to a fuzzy set. Thus, the membership value in layer 2 can be calculated by:

$$u_{ij}^{(2)} = \exp\left(-\frac{\left[u_i^{(1)} - m_{ij}\right]^2}{\sigma_{ij}^2}\right), \tag{2.3}$$

where $m_{ij}$ and $\sigma_{ij}$ are the mean and the width of the Gaussian membership function of the jth term of the ith input variable $x_i$, respectively. In this dissertation, the Gaussian membership function is adopted because it can serve as a universal approximator of any nonlinear function on a compact set [69].

Layer 3 (rule node): The nodes in this layer perform precondition matching of fuzzy rules. In the TSK-type NFS, the firing strength of a fuzzy rule is calculated by performing the following “AND” operation:

$$u_j^{(3)} = \prod_i u_{ij}^{(2)}. \tag{2.4}$$

Layer 4 (consequent node): Each node in this layer calculates the consequent value. Each consequent value (a linear combination of the crisp inputs) is weighted by the firing strength of the fuzzy rule and can be written as:

$$u_j^{(4)} = u_j^{(3)} \left(w_{0j} + \sum_{i=1}^{n} w_{ij} x_i\right), \tag{2.5}$$

where the summation is the consequent part and $w_{ij}$ are its corresponding parameters.

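Layer 5 (output node): The node in this layer computes the output signal. The output node integrates all the links connected to it and acts as a defuzzifier with:

$$u^{(5)} = \frac{\sum_{j=1}^{R} u_j^{(4)}}{\sum_{j=1}^{R} u_j^{(3)}} = \frac{\sum_{j=1}^{R} u_j^{(3)} \left(w_{0j} + \sum_{i=1}^{n} w_{ij} x_i\right)}{\sum_{j=1}^{R} u_j^{(3)}}, \tag{2.6}$$

where $u^{(5)}$ is the output of the fifth layer, $w_{ij}$ is the weight of the ith input dimension in the jth rule node, and R is the number of fuzzy rules.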
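The five layers described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this dissertation: the function name `tsk_forward` and the nested-list parameter layout are hypothetical, and the Gaussian membership function is assumed to take the form exp(-(x - m)^2 / σ^2).

```python
import math


def tsk_forward(x, means, sigmas, weights):
    """Forward pass of a TSK-type NFS with R rules and n inputs.

    means[j][i], sigmas[j][i]: centre and width of the Gaussian membership
    function of input i in rule j.
    weights[j]: [w_0j, w_1j, ..., w_nj], the linear consequent of rule j.
    """
    firing = []    # layer 3 outputs, one firing strength per rule
    weighted = []  # layer 4 outputs, one weighted consequent per rule
    for m_j, s_j, w_j in zip(means, sigmas, weights):
        # Layer 2: Gaussian membership degree of every input (assumed form).
        memberships = [math.exp(-((xi - m) ** 2) / (s ** 2))
                       for xi, m, s in zip(x, m_j, s_j)]
        # Layer 3: the product acts as the "AND" and gives the firing strength.
        strength = math.prod(memberships)
        # Layer 4: linear consequent weighted by the firing strength.
        consequent = w_j[0] + sum(w * xi for w, xi in zip(w_j[1:], x))
        firing.append(strength)
        weighted.append(strength * consequent)
    # Layer 5: weighted-average defuzzification.
    return sum(weighted) / sum(firing)
```

With a single rule the firing strength cancels in layer 5, so the output reduces to the linear consequent of that rule. The sketch does not guard against all firing strengths underflowing to zero, which a practical controller would need to handle.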

Figure 2.1: Structure of TSK-type NFS.

2.2 Genetic Algorithm

Genetic algorithms (GAs) are search algorithms inspired by the mechanics of natural selection, genetics, and evolution. It is widely accepted that the evolution of living beings is a process that operates on chromosomes, the organic devices that encode the structure of living beings.

The flowchart of the learning process is shown in Fig. 2.2, where Nc is the population size and G denotes the Gth generation. The learning process of GAs involves three major steps: reproduction, crossover, and mutation. Reproduction [71]-[73] is a process in which individual strings are copied according to their fitness values. This operator is an artificial version of natural selection. In GAs, a high fitness value denotes a good fit. A well-known method for the reproduction step is roulette-wheel selection [73] (see Fig. 2.3). In Fig. 2.3, the intermediate population P’ is generated from identical copies of chromosomes sampled by spinning the roulette wheel a sufficient number of times.

Figure 2.2: Flowchart of the genetic algorithm.

Figure 2.3: The roulette wheel selection.
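Roulette-wheel selection as described above can be sketched as follows. This is an illustrative example only: the function name `roulette_wheel_select` is hypothetical, and the sketch assumes that all fitness values are non-negative and that at least one is positive.

```python
import random


def roulette_wheel_select(population, fitness, n_select, rng=random):
    """Sample n_select individuals with probability proportional to fitness.

    Assumes non-negative fitness values with a positive sum.
    """
    total = sum(fitness)
    selected = []
    for _ in range(n_select):
        spin = rng.random() * total  # where the wheel stops, in [0, total)
        cumulative = 0.0
        chosen = population[-1]      # fallback against floating-point round-off
        for individual, fit in zip(population, fitness):
            cumulative += fit
            if spin < cumulative:
                chosen = individual
                break
        selected.append(chosen)
    return selected
```

Each individual owns a slice of the wheel whose width equals its fitness, so fitter chromosomes are copied into the intermediate population more often, while weak ones still keep a small chance of surviving.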

Although the reproduction step directs the search toward the best existing individuals, it cannot create any new individuals; this is the role of the crossover step [74]-[78]. In nature, an offspring has two parents and inherits genes from both. The main operator working on the parents is the crossover operator, which is applied to a selected pair with a given crossover rate.

Figure 2.4 illustrates how the crossover works. Crossover produces two offspring from their parents by exchanging chromosomal genes on either side of a crossover point generated randomly.

Figure 2.4: Crossover operator.

Although reproduction and crossover produce many new strings, they do not introduce any new information into the population at the site of an individual; this is the role of the mutation step [79]-[85]. Mutation randomly alters the allele of a gene and is applied with a given mutation rate. Figure 2.5 illustrates how mutation works. When an offspring is mutated, one of its randomly selected genes is changed to a new value.

Figure 2.5: Mutation operator.
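The crossover and mutation operators described above can be sketched as follows. The function names, the real-valued gene encoding, and the uniform resampling range `[low, high]` used for a mutated gene are assumptions made for illustration; the source does not specify them.

```python
import random


def one_point_crossover(parent_a, parent_b, crossover_rate, rng=random):
    """Exchange the genes on either side of a random crossover point."""
    if len(parent_a) < 2 or rng.random() >= crossover_rate:
        # No crossover: the offspring are copies of the parents.
        return list(parent_a), list(parent_b)
    point = rng.randint(1, len(parent_a) - 1)  # both sides stay non-empty
    child_a = list(parent_a[:point]) + list(parent_b[point:])
    child_b = list(parent_b[:point]) + list(parent_a[point:])
    return child_a, child_b


def mutate(chromosome, mutation_rate, low, high, rng=random):
    """With probability mutation_rate, replace one random gene by a new value."""
    child = list(chromosome)
    if rng.random() < mutation_rate:
        site = rng.randrange(len(child))      # randomly selected gene
        child[site] = rng.uniform(low, high)  # assumed resampling range
    return child
```

Crossover only rearranges genes that already exist in the two parents, whereas mutation is the only operator here that can introduce a gene value absent from the whole population.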

Since GAs search many points in the space simultaneously, they are less likely to become trapped in local minima than single-solution methods. The advantages of GAs are that 1) some individuals have a better chance of coming close to the global optimum, and 2) the genetic operators allow the GA to search for the optimal solution. For these reasons, GAs are suitable for searching the parameter space of a neuro-fuzzy controller. A neuro-fuzzy controller trained with a gradient-descent-based learning algorithm may reach a local minimum very quickly but never find the global solution; GAs address this problem by sampling the parameter space of neuro-fuzzy controllers and recombining those that perform best on the control problem.

2.3 Particle Swarm Optimization

In this section, we introduce PSO. The standard PSO is introduced in Section 2.3.1 and the CPSO in Section 2.3.2.

2.3.1 Standard Particle Swarm Optimization

PSO was first introduced by Kennedy and Eberhart in 1995 [8]. It is one of the most powerful methods for solving global optimization problems. The algorithm searches for an optimal point in a multi-dimensional space by adjusting the trajectories of its particles. Each particle updates its position and velocity based on its own previous best position and the previous best position found by the other particles, denoted by $y_i$ and $\hat{y}$, respectively. A simple demonstration of how PSO learning proceeds is shown in Fig. 2.6.

Figure 2.6: Diagram of the PSO learning mechanism.

The position $x_{i,d}$ and velocity $v_{i,d}$ of the d-th dimension of the i-th particle are updated as follows:

$$
\begin{aligned}
v_{i,d}(t+1) &= v_{i,d}(t) + c_1 \cdot rand_1 \cdot \left(y_{i,d}(t) - x_{i,d}(t)\right) + c_2 \cdot rand_2 \cdot \left(\hat{y}_d(t) - x_{i,d}(t)\right), \\
x_i(t+1) &= x_i(t) + v_i(t+1),
\end{aligned} \tag{2.7}
$$

where $y_i$ represents the previous best position of the i-th particle, that is, the position yielding its best performance so far; $c_1$ and $c_2$ denote the acceleration constants that weight the pull of each particle toward $y_i$ and $\hat{y}$, respectively; and $rand_1$ and $rand_2$ are two random numbers in the range [0, 1].

Let s denote the swarm size and f() denote the fitness function evaluating the performance yielded by a particle. After Eq. (2.7) is executed, the personal best position y of each particle is updated as follows:

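$$
y_i(t+1) =
\begin{cases}
y_i(t), & \text{if } f(x_i(t+1)) \ge f(y_i(t)), \\
x_i(t+1), & \text{if } f(x_i(t+1)) < f(y_i(t)),
\end{cases} \tag{2.8}
$$

and the global best position is found by:

$$\hat{y}(t+1) = \arg\min_{y_i} f(y_i(t+1)), \quad 1 \le i \le s. \tag{2.9}$$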

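In 2002, Clerc [12] established the convergence of PSO by using a constriction factor, which greatly enhances the applicability of PSO. The implementation of the constriction factor is shown in Eqs. (2.10)-(2.12):

$$v_{i,d}(t+1) = \chi \left[ v_{i,d}(t) + c_1 \cdot rand_1 \cdot \left(y_{i,d}(t) - x_{i,d}(t)\right) + c_2 \cdot rand_2 \cdot \left(\hat{y}_d(t) - x_{i,d}(t)\right) \right], \tag{2.10}$$

$$\chi = \frac{2}{\left| 2 - \varphi - \sqrt{\varphi^2 - 4\varphi} \right|}, \tag{2.11}$$

$$\varphi = c_1 + c_2, \quad \varphi > 4. \tag{2.12}$$

The flowchart of the PSO is shown in Fig. 2.7.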
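The update rules of this section can be combined into a short global-best PSO for minimization. The sketch below is illustrative only: the function name `pso_minimize`, the zero initial velocities, the uniform initialization inside `bounds`, and the default values c1 = c2 = 2.05 with a constriction factor of about 0.7298 are assumptions (commonly used settings), not values taken from this dissertation. Setting `chi=1.0` gives the unconstricted velocity update of Eq. (2.7).

```python
import random


def pso_minimize(f, dim, bounds, swarm_size=20, iterations=200,
                 c1=2.05, c2=2.05, chi=0.7298, seed=0):
    """Global-best PSO with a constriction factor chi (assumed defaults)."""
    rng = random.Random(seed)
    low, high = bounds
    x = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(swarm_size)]
    v = [[0.0] * dim for _ in range(swarm_size)]  # assumed zero start velocity
    y = [p[:] for p in x]                         # personal best positions
    fy = [f(p) for p in y]                        # personal best fitness values
    g = min(range(swarm_size), key=fy.__getitem__)
    y_hat = y[g][:]                               # global best position
    for _ in range(iterations):
        for i in range(swarm_size):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update, scaled by the constriction factor.
                v[i][d] = chi * (v[i][d]
                                 + c1 * r1 * (y[i][d] - x[i][d])
                                 + c2 * r2 * (y_hat[d] - x[i][d]))
                x[i][d] += v[i][d]                # position update
            fx = f(x[i])
            if fx < fy[i]:                        # personal best update
                y[i], fy[i] = x[i][:], fx
        g = min(range(swarm_size), key=fy.__getitem__)
        y_hat = y[g][:]                           # global best update
    return y_hat, fy[g]
```

For example, `pso_minimize(lambda p: sum(c * c for c in p), dim=3, bounds=(-5.0, 5.0))` searches for the minimum of the sphere function. The global best is refreshed once per iteration, after all personal bests have been updated, which matches the synchronous form of the update equations.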

2.3.2 Cooperative Particle Swarm Optimization

The CPSO [9] is one of the most significant improvements to the original PSO. Van den Bergh presented a family of CPSOs, including CPSO-S, CPSO-S$_K$, CPSO-H, and CPSO-H$_K$. CPSO-H$_K$ is a hybrid of PSO and CPSO-S$_K$ and was proposed to address the issue of “pseudominima.”

Figure 2.7: Flowchart of the PSO.

The concept of CPSO-S is that, instead of trying to find an optimal n-dimensional vector, the vector is split into n parts so that each of n swarms optimizes a 1-D vector. CPSO-S$_K$ is a variant of CPSO-S in which the vector is split into K parts rather than n, where $K \le n$. K also represents the number of swarms. Each of the K swarms acts as a PSO optimizer. The main difference between the PSO and the CPSO is that the fitness of a single particle of the CPSO has to be evaluated through the global best particles of the other swarms. Let $P_j$ denote the j-th swarm and $P_j \cdot x_i$ the i-th particle in swarm j. The concept of the CPSO can be illustrated as follows:

Figure 2.8: Schematic diagram of the CPSO.

The fitness of $P_j \cdot x_i$ is defined as:

$$f(P_j \cdot x_i) = f(P_1 \cdot \hat{y}, \ldots, P_{j-1} \cdot \hat{y}, P_j \cdot x_i, \ldots, P_K \cdot \hat{y}). \tag{2.13}$$

The CPSO applies cooperative behavior to improve the ability of PSO to find the global optimum in a high-dimensional space. This is achieved by employing multiple swarms to explore the subspaces of the search space separately, which alleviates the curse of dimensionality. However, there is no absolute criterion stating that the CPSO is superior to the PSO, since independent changes made by different swarms on correlated variables deteriorate its performance. In addition, in one generation of an n-dimensional CPSO-S operation, the computational cost is n times larger than that of a PSO operation.
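The fitness evaluation of Eq. (2.13) can be sketched as follows: a particle of swarm j is scored by inserting it into a full-length vector whose remaining parts are the global best sub-vectors of the other swarms. The function name `cpso_fitness` and the list-of-lists layout are hypothetical and chosen only for illustration.

```python
def cpso_fitness(f, swarm_bests, j, candidate):
    """Evaluate `candidate` of swarm j in the context of the other swarms.

    swarm_bests[k] is the global best sub-vector of swarm k (a list of floats);
    `candidate` replaces the j-th sub-vector before the full vector is scored.
    """
    context = []
    for k, best in enumerate(swarm_bests):
        # Use the candidate for swarm j and the global best everywhere else.
        context.extend(candidate if k == j else best)
    return f(context)
```

Because every particle evaluation needs one call to the full objective function, a CPSO-S with n one-dimensional swarms performs n times as many function evaluations per generation as a single swarm of the same size, which is the cost noted above.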

2.4 Evolution Strategy with Covariance Matrix Adaptation

In this section, we introduce some background knowledge related to the proposed MS-CMA-ES. The standard CMA-ES is introduced in section 2.4.1, kernel density estimation is introduced in section 2.4.2 and mean shift procedure is introduced in section 2.4.3.

2.4.1 Standard CMA-ES

In the standard CMA-ES, a population of new search points is generated by sampling a multivariate normal distribution $\mathcal{N}$ with mean $m \in \mathbb{R}^n$ and covariance matrix $C \in \mathbb{R}^{n \times n}$. The equation for sampling new search points, for each generation number g = 0, 1, 2, …, reads

$$x_i^{(g+1)} \sim m^{(g)} + \sigma^{(g)} \mathcal{N}\left(0, C^{(g)}\right) \quad \text{for } i = 1, \ldots, \lambda, \tag{2.14}$$

where ~ denotes that the left- and right-hand sides have the same distribution; $\sigma^{(g)}$ denotes the overall standard deviation (the step-size) at generation g; and λ is the sample size. The new mean $m^{(g+1)}$ of the search distribution is a weighted average of the μ points selected from the λ samples $x_1^{(g+1)}, x_2^{(g+1)}, \ldots, x_\lambda^{(g+1)}$:

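$$m^{(g+1)} = \sum_{i=1}^{\mu} w_i \, x_{i:\lambda}^{(g+1)}, \tag{2.15}$$

$$\sum_{i=1}^{\mu} w_i = 1, \quad w_1 \ge w_2 \ge \cdots \ge w_\mu > 0. \tag{2.16}$$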

The index i:λ denotes the i-th ranked individual, and

$$f\left(x_{1:\lambda}^{(g+1)}\right) \le f\left(x_{2:\lambda}^{(g+1)}\right) \le \cdots \le f\left(x_{\lambda:\lambda}^{(g+1)}\right), \tag{2.17}$$

where $f(\cdot)$ is the objective function to be minimized. The new covariance matrix $C^{(g+1)}$ is formed by a combination of the rank-μ and rank-one updates [13]:

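$$C^{(g+1)} = (1 - c_{cov}) \, C^{(g)} + \frac{c_{cov}}{\mu_{cov}} \, p_c^{(g+1)} {p_c^{(g+1)}}^{T} + c_{cov} \left(1 - \frac{1}{\mu_{cov}}\right) \sum_{i=1}^{\mu} w_i \, y_{i:\lambda}^{(g+1)} {y_{i:\lambda}^{(g+1)}}^{T}, \tag{2.18}$$

where $\mu_{cov} \ge 1$ is the weighting between the rank-μ update and the rank-one update; $c_{cov} \in [0, 1]$ is the learning rate for the covariance matrix update; and

$$y_{i:\lambda}^{(g+1)} = \left(x_{i:\lambda}^{(g+1)} - m^{(g)}\right) / \sigma^{(g)} \tag{2.19}$$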

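is a modified formula used to compute the estimated covariance matrix for the selected samples. The evolution path $p_c^{(g+1)}$ for the rank-one update is described as follows:

$$p_c^{(g+1)} = (1 - c_c) \, p_c^{(g)} + \sqrt{c_c (2 - c_c) \, \mu_{eff}} \; \frac{m^{(g+1)} - m^{(g)}}{\sigma^{(g)}}, \tag{2.20}$$

where $c_c \le 1$ is the learning rate for the evolution path and

$$\mu_{eff} = \left(\sum_{i=1}^{\mu} w_i^2\right)^{-1} \tag{2.21}$$

denotes the variance effective selection mass. The new step-size $\sigma^{(g+1)}$ is updated according to

$$\sigma^{(g+1)} = \sigma^{(g)} \exp\left(\frac{c_\sigma}{d_\sigma} \left(\frac{\left\| p_\sigma^{(g+1)} \right\|}{E\left\| \mathcal{N}(0, I) \right\|} - 1\right)\right), \tag{2.22}$$

$$p_\sigma^{(g+1)} = (1 - c_\sigma) \, p_\sigma^{(g)} + \sqrt{c_\sigma (2 - c_\sigma) \, \mu_{eff}} \; {C^{(g)}}^{-\frac{1}{2}} \, \frac{m^{(g+1)} - m^{(g)}}{\sigma^{(g)}}, \tag{2.23}$$

where $p_\sigma$ is the conjugate evolution path, $c_\sigma$ is its learning rate, and $d_\sigma$ is a damping parameter. The expectation of the Euclidean norm of a $\mathcal{N}(0, I)$ distributed random vector reads

$$E\left\| \mathcal{N}(0, I) \right\| = \sqrt{2} \, \Gamma\left(\frac{n+1}{2}\right) / \Gamma\left(\frac{n}{2}\right) \approx \sqrt{n} + O(1/n), \tag{2.24}$$

where Γ(·) denotes the gamma function and O(·) represents high-order terms.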
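The sampling, ranking, and recombination steps of one CMA-ES generation can be sketched as follows. The sketch deliberately omits the covariance matrix and step-size adaptation, so it is not a complete CMA-ES. The function names are hypothetical, and the log-decreasing recombination weights are an assumption (a common choice in CMA-ES implementations); the source does not state which weights are used.

```python
import math
import random


def cholesky(C):
    """Lower-triangular A with A A^T = C, for a symmetric positive definite C."""
    n = len(C)
    A = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(A[r][k] * A[c][k] for k in range(c))
            if r == c:
                A[r][c] = math.sqrt(C[r][r] - s)
            else:
                A[r][c] = (C[r][c] - s) / A[c][c]
    return A


def sample_and_recombine(f, m, sigma, C, lam, mu, rng):
    """Sample lam points around m, rank them by f, and return the new mean."""
    n = len(m)
    A = cholesky(C)
    # Assumed weights: positive, decreasing, normalised to sum to one.
    raw = [math.log(mu + 0.5) - math.log(i + 1) for i in range(mu)]
    w = [r / sum(raw) for r in raw]
    points = []
    for _ in range(lam):
        # x = m + sigma * A z with z ~ N(0, I), so that A z ~ N(0, C).
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        points.append([m[r] + sigma * sum(A[r][k] * z[k] for k in range(r + 1))
                       for r in range(n)])
    points.sort(key=f)  # rank the samples for minimisation, best first
    # The new mean is the weighted average of the mu best points.
    new_mean = [sum(w[i] * points[i][d] for i in range(mu)) for d in range(n)]
    return new_mean, points, w
```

Sampling through the Cholesky factor is one standard way to draw from a normal distribution with a given covariance matrix; an eigendecomposition of C would serve the same purpose and additionally provides the inverse square root needed for the conjugate evolution path.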

2.4.2 Kernel Density Estimation

In parametric model estimation, the distribution of the data points must be assumed to follow a certain model. Empirical evidence has shown that large differences tend to exist between models based on parametric estimation and real-world physical models.

To overcome this drawback, Rosenblatt and Parzen proposed a non-parametric approach, the kernel density estimator [66], to estimate the unknown p.d.f. of a random variable. The kernel density estimator does not require prior knowledge of how the data are distributed; instead, it analyzes the characteristics of the distribution from the data themselves. Hence, it is highly valuable in both statistical theory and application.

In the proposed MS-CMA-ES, sampled search points in the search space are considered as data in the feature space. This is intuitive, since the location of the search points tends to be the most prominent feature in function optimization problems. The rationale behind the density estimation-based clustering approach is that the feature space can be regarded as the empirical p.d.f. Because the search points are sampled from a normal distribution with an adjusted mean and an adapted covariance matrix, and are further selected according to their fitness, dense regions in the search space correspond to local maxima of the p.d.f., in other words, the modes of the unknown density. Consider n points $x_i$, i = 1, …, n, in the d-dimensional space $\mathbb{R}^d$. The multivariate kernel density estimator with kernel $K_h(x)$ and a symmetric, positive definite bandwidth matrix H, computed at the point x, is given by

$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i), \tag{2.25}$$

where $K_h(x)$ is a d-variate kernel function satisfying

$$\int_{-\infty}^{\infty} K_h(x) \, dx = 1. \tag{2.27}$$

Generally, kernel functions are symmetric, unimodal probability density functions. The uniform, normal, and Epanechnikov kernels are the most commonly used. It has been proven that, under certain regularity conditions, the kernel density estimator converges to the true density function as the sample size increases [86]. Although the choice of kernel function affects the result, its effect is small compared with that of the bandwidth, so research focuses more on bandwidth selection [87].

Theoretically, the selection of the bandwidth is based on the mean integrated squared error (MISE) between the kernel density estimate and the true density function. However, the computation of the MISE is too complicated. In practice, the effect of bandwidth selection on performance is analyzed by computing the asymptotic mean integrated squared error (AMISE) from a large number of samples. Recently, many studies have used the plug-in method and the cross-validation method to determine the optimal bandwidth, so that the selection of the bandwidth no longer depends on a prior guess of the true density function [86, 87]. In addition to the aforementioned fixed bandwidth mechanism, the variable bandwidth mechanism, in which the bandwidth varies with the sample position, is also widely adopted in practice [88, 89], because it is very difficult for the fixed bandwidth mechanism to properly address multimodal density functions, especially when the density of each peak varies greatly. However, its analysis is more complicated than that of the fixed bandwidth mechanism. In practice, the use of the variable bandwidth mechanism is mostly based on rules of thumb [86]. If the variable bandwidth mechanism is adopted, the bandwidth matrix H in the kernel density estimator of Eq. (2.25) is allowed to vary with the sample position, and it remains a symmetric, positive definite matrix.
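A fixed-bandwidth kernel density estimator can be sketched as follows. The example assumes a Gaussian kernel and the simplest bandwidth matrix H = h²I, so that a single scalar h controls the smoothing; the function name `gaussian_kde` is hypothetical, and neither choice is prescribed by the text above.

```python
import math


def gaussian_kde(points, h):
    """Return x -> f_hat(x), a fixed-bandwidth Gaussian kernel density estimate.

    Assumes the bandwidth matrix H = h^2 I, i.e. one scalar bandwidth h.
    """
    n = len(points)
    d = len(points[0])
    # Normalisation so that every kernel, and hence f_hat, integrates to one.
    norm = 1.0 / (n * (math.sqrt(2.0 * math.pi) * h) ** d)

    def f_hat(x):
        total = 0.0
        for p in points:
            sq_dist = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
            total += math.exp(-0.5 * sq_dist / (h * h))
        return norm * total

    return f_hat
```

A small h produces one narrow bump per sample, whereas a large h merges neighbouring bumps into a single mode. This is why the bandwidth, rather than the kernel shape, dominates the quality of the estimate.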

2.4.3 Mean Shift Procedure

The mean shift procedure is a versatile tool for feature space analysis and is applicable to many kinds of tasks [90-92]. In [65], the authors successfully extended this algorithm to computer vision applications, which attracted considerable attention. The mean shift procedure is an iterative algorithm based on kernel density estimation that repeatedly updates the mean shift vectors of data points according to the gradient of the kernel function.

Although the mean shift algorithm is very simple in form, it is efficient and stable in practice. Its most classic application is mean shift-based clustering. Given a good estimate of the bandwidth, mean shift-based clustering is a good alternative to algorithms that require the number of clusters to be preset, such as K-means. In the proposed MS-CMA-ES, search points in the search space are considered as data in the feature space. This is intuitive, since the location of the search points tends to be the most prominent feature in function optimization problems. Because the sampled search points in MS-CMA-ES are further selected according to their fitness, dense regions in the search space correspond to local maxima of the p.d.f., in other words, the modes of the unknown density.

Consider the density estimation kernel $K_h(x)$ introduced in Section 2.4.2. If the profile notation [21] is employed, the kernel $K_h(x)$ can also be written as

$$K_h(x) = c_{k,d} \, k_h\left(\|x\|^2\right), \tag{2.30}$$

where $k_h(x)$ is the profile of the radially symmetric kernel $K_h(x)$, and $c_{k,d}$ is the normalization constant that makes $K_h(x)$ integrate to one. If we define

$$g_h(x) = -k_h'(x), \tag{2.31}$$

the d-variate kernel $G_h(x)$ can also be written as

$$G_h(x) = c_{g,d} \, g_h\left(\|x\|^2\right), \tag{2.32}$$

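and similarly, $c_{g,d}$ denotes the normalization constant. The density estimation kernel $K_h(x)$ is also called the shadow kernel of $G_h(x)$ [65]. Consider n points $x_i$, i = 1, …, n, in the d-dimensional space $\mathbb{R}^d$; the mean shift vector at x is given by

$$m_{h,G}(x) = \frac{\sum_{i=1}^{n} x_i \, g_h\left(\|x - x_i\|^2\right)}{\sum_{i=1}^{n} g_h\left(\|x - x_i\|^2\right)} - x. \tag{2.33}$$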

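Intrinsically, the mean shift procedure can be viewed as a mode-seeking method [59], which determines the modes of the p.d.f. estimated with kernel $K_h(x)$. Let $\{y_j\}_{j=1,2,\ldots}$ denote the sequence of successive search locations of kernel $G_h$; from Eq. (2.33), it has the form

$$y_{j+1} = \frac{\sum_{i=1}^{n} x_i \, g_h\left(\|y_j - x_i\|^2\right)}{\sum_{i=1}^{n} g_h\left(\|y_j - x_i\|^2\right)}, \quad j = 1, 2, \ldots, \tag{2.34}$$

where $y_1$ is the initial search location. The corresponding sequence of density estimates computed with kernel $K_h$ is given by

$$\left\{\hat{f}_{h,K}(j)\right\}_{j=1,2,\ldots} = \left\{\hat{f}_{h,K}(y_j)\right\}_{j=1,2,\ldots}. \tag{2.35}$$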

In [59], the authors proved that once the search location $y_j$ gets sufficiently close to a mode of the estimated density function $\hat{f}_{h,K}$, it converges to that mode. The set of all locations that converge to the same mode is defined as the basin of attraction of that mode. The general steps of applying the mean shift procedure are listed as follows:

Step 1: Uniformly generate an appropriate number of initial search points.

Step 2: Run the mean shift procedure, sequentially or in parallel, until the search points converge.

Step 3: Each convergence point defines a mode, and the set of initial locations that converge to that mode defines its basin of attraction.
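The three steps above can be sketched as follows. The example assumes a Gaussian kernel with a scalar bandwidth h, for which the weight of each data point is proportional to exp(-‖y - x_i‖² / (2h²)). The function names, the stopping tolerance, and the `merge_radius` used to decide that two convergence points are the same mode are hypothetical choices made for illustration.

```python
import math


def mean_shift_mode(start, points, h, tol=1e-8, max_iter=1000):
    """Iterate the mean shift update from `start` until the shift is below tol."""
    y = list(start)
    for _ in range(max_iter):
        # Gaussian weights of all data points seen from the current location.
        weights = [math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(y, p)) / (h * h))
                   for p in points]
        total = sum(weights)
        if total == 0.0:  # start lies too far from every data point
            return y
        # The next location is the weighted mean of the data.
        new_y = [sum(w * p[d] for w, p in zip(weights, points)) / total
                 for d in range(len(y))]
        shift = math.dist(new_y, y)
        y = new_y
        if shift < tol:
            break
    return y


def cluster_by_mode(starts, points, h, merge_radius):
    """Run mean shift from every start and group starts by convergence point."""
    modes, labels = [], []
    for s in starts:
        mode = mean_shift_mode(s, points, h)
        for idx, known in enumerate(modes):
            if math.dist(mode, known) < merge_radius:
                labels.append(idx)  # same basin of attraction
                break
        else:
            modes.append(mode)      # a new mode has been found
            labels.append(len(modes) - 1)
    return modes, labels
```

The starts that share a label form the basin of attraction of the corresponding mode. The number of clusters is not preset here; it follows from the bandwidth h, in contrast to K-means.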

CHAPTER 3

EVOLUTIONARY ALGORITHMS

In this chapter, the four proposed algorithms are discussed. Section 3.1 introduces a Q-value based particle swarm optimization and the concept of using Lyapunov design principles for constructing safe reinforcement learning agents. Section 3.2 introduces the proposed two-strategy reinforcement (TSR) learning mechanism and the group-based symbiotic evolution (GSE), which enables the learning agent to evaluate fuzzy rules locally. Section 3.3 discusses a separability detection approach to cooperative particle swarm optimization (SD-CPSO) that places correlated variables into the same swarm. Section 3.4 introduces the proposed mean shift-based evolution strategy with covariance matrix adaptation (MS-CMA-ES). Mean shift clustering cannot be applied directly to the sample points generated by CMA-ES, because the adopted mean shift clustering requires independent and identically distributed samples to perform density estimation. Several previous works, such as importance sampling [93, 94] and bandwidth estimation [86, 87], are also discussed in that section.

3.1 Q-value based PSO

The complete learning algorithm of QPSO is described in this section. The architecture is shown in Fig. 3.1. The whole learning process can be roughly divided into two parts: the Q-value part and the PSO operation part. The learning strategy for the Q-values of particles is detailed in Section 3.1.1, while the PSO operation and the flowchart of QPSO are described in Section 3.1.2.

Figure 3.1: Architecture of QPSO.

3.1.1 Learning Q-values of Particles

In QPSO learning, if there are s particles in the swarm, s trials are taken in one generation. In each trial, the agent applies an action to the environment by selecting a particle based on its Q-value. Every time a particle is selected, its Q-value is updated based on the system’s reward. If the i-th particle is selected, its Q-value $q_i$ is updated as

fitness values for PSO evolution.

3.1.2 Q-value based PSO

The PSO operation used in QPSO consists of two major steps: swarm initialization and Q-value based PSO evolution. These two steps are described step by step as follows.

‧ Swarm initialization:

The particle swarm is composed of particles that encode the parameters of an NFS. Each particle encodes the means and deviations of the Gaussian membership functions and the weights of the output action strength. The number of fuzzy rules determines the length
