CHAPTER 2 FOUNDATIONS
2.4 E VOLUTION S TRATEGY WITH C OVARIANCE M ATRIX A DAPTATION
2.4.3 Mean Shift Procedure
Mean shift procedure is a very versatile tool for feature space analysis and it is applicable to many field of tasks [90-92]. In the previous research [65], authors successfully extend this algorithm to computer vision applications, and have attracted huge attention. Mean shift procedure is an iterative algorithm based on kernel density estimation, which continually updates the mean shift vectors of data points according to the gradient of kernel function.
Although the mean shift algorithm is very simple in form, but in practice there is a high efficiency and stability. The most classic application is the mean shift-based clustering algorithm. If we can have a good estimation of bandwidth, mean shift-based clustering algorithm would be a nice alternative relative to algorithms that the number of clusters needs to be pre-set, such as K-means algorithm. In the proposed MS-CMA-ES, search points in the search space are considered as data in the feature space. It is very intuitive since location of search points tends to be the phenomenal feature in function optimization problems. Due to the fact that, in the MS-CMA-ES, sampled search points are further selected according to their fitness, dense regions in the search space correspond to local maxima of the p.d.f.; in other words, the modes of the unknown density.
Consider the density estimation kernel Kh(x) introduced earlier this section. If the profile notation [21] is employed, the kernel Kh(x) can also be written as
= 2
where kh(x) is a radially symmetric kernel defined as the profile of the Kh(x), and ck,d is the normalization constant which makes Kh(x) integrate to one. If we define
( )
g x
h = −k x
h′( ), (2.31) the d-variate kernel Gh(x) can also be written as2
( ) , ( )
h g d h
G x
=c g x
, (2.32)and similarly, cg,d denotes the normalization constant. The density estimation kernel Kh(x) is also called the shadow kernel of Gh(x) [65]. Consider n points xi, i=1,…, n, in the
d-dimensional space
d, the mean shift vector at x is given by1
Intrinsically, mean shift procedure can be viewed as a mode seeking method [59], which determines the modes of p.d.f. estimated by kernel Kh(x). Denote {yj}j=1,2,… the sequence of successive search locations of kernel Gh, from Eq. (2.33) it has the form
1
and y1 is the initial search location. The corresponding sequence of density estimates computed with kernel Kh is given by
{ f
ˆh K, ( )j }
=f y
ˆh( ) jj
=1, 2, . (2.35)In the previous research [59], authors have proven that once search location yj gets sufficiently close to a mode of estimated density function ˆ,
f
h K, it converges to it, and the set of all locations converge to the same mode is defined as the basin of attraction of that mode. The general steps of applying mean shift procedure is listed as follows:Step 1: Uniformly generate appropriate number of initial search points.
Step 2: Sequentially or parallelly run the mean shift procedure until the search points converge.
Step 3: Each convergence point defines a mode and each initial location converges to that mode defines the basin of attraction of that mode.
CHAPTER 3
EVOLUTIONARY ALGORITHMS
In this chapter, the proposed four algorithms are discussed. In section 3.1, a Q-valued based particle swarm optimization and the concept of using Lyapunov design principles for constructing safe reinforcement learning agents are introduced. In section 3.2, the proposed two-strategy reinforcement (TSR) learning mechanism and the group-based symbiotic evolution (GSE) which enables the learning agent to evaluate the fuzzy rule locally are introduced. In section 3.3, a separability detection approach to cooperative particle swarm optimization (SD-CPSO) for placing correlated variables into the same swarm is discussed. In section, 3.4, the proposed mean shift based evolutionary strategy with covariance matrix adaptation (MS-CMA-ES) is introduced. We cannot directly apply mean shift clustering to the sampled points generated by CMA-ES because the adopted mean shift clustering requires independent identity distribution of samples to perform density estimations. Several previous works such as importance sampling [93,94] and bandwidth estimation [86,87] are also discussed in this section.
3.1 Q-value based PSO
Thorough learning algorithm of QPSO is described in this section. The architecture is shown in Fig. 3.1. The whole learning process can be roughly divided into two parts: the Q-value and PSO operation part. The learning strategy for Q-values of particles is detailed in section 3.1.1 while the PSO operation and the flowchart of QPSO are described in section 3.1.2.
Figure 3.1: Architecture of QPSO.
3.1.1 Learning Q-values of Particles
In QPSO learning, if there are s particles in the swarm, s trials are taken in one generation.
The agent applies in each trial an action to the environment by selecting a particle based on its Q-value. Every time a particle is selected, the Q-value of the selected particle is updated based on the system’s reward. If the -th particle is selected, its Q-value q
i
i is updated asfitness values for PSO evolution.
3.2.2 Q-value based PSO
The PSO operation used in QPSO consists of two major steps: swarm initialization and Q-valued base PSO evolution. Details of these two steps are described step-by step as follows.
‧ Swarm initialization:
The particle swarm is composed of particles encoded by the parameters on a NFS. Each particle is encoded by the mean, deviation of Gaussian membership functions and the weightings for output action strength. The number of fuzzy rules determines the length of each particle. After the number of rules is set, the initial particles are generated according to the following equations:
[ ] [
min max]
p represents the i-th particle in the swarm; N represents the input dimension; R
irepresents the number of fuzzy rules; D represents the size of each particle, usually D equals to (N+1)R in most of cases where the dimension of output variable is 1;
[
mmin,mmax]
,[ σ
min,σ
max]
and[
wmin,wmax]
are the predefined ranges. The above equations result in the coding scheme between a neural fuzzy system and a particle shown in Fig. 3.2.Figure 3.2: Coding scheme between a particle and a TSK-type NFS in QPSO.
‧ Q-value based PSO evolution:
The Q-values derived in Eq. (3.3) are used as the fitness values for PSO evolution. The Q-value of each particle determines the performance of a particle for controlling the system. In the proposed QPSO, the Q-value of each particle indicates how soon a particle can guide the system’s state to reach the set of goal states. The learning processes proceed to new generation until a predefined stop criterion is met. The block diagram of whole learning process in QPSO is shown in Fig. 3.3.
3.2 Two-strategy Reinforcement Evolutionary Algorithm
The proposed two-strategy reinforcement evolutionary algorithm (TSR-EA) is introduced in this section. Two major modifications are proposed in this algorithm: a two-strategy
reinforcement signal design and the group-based symbiotic evolution (GSE). Details of these two operations are described as follows:
3.2.1 Two-strategy Reinforcement Signal Design
The TSR-EA is constructed on a TSK-type NFS model. The NFS model acts as a control network to determine a proper action according to the current input vector (environment state).
The feedback signal is the reinforcement fitness value that functions as a performance measurement. The reinforcement learning architecture adopted in the TSR-EA is the time-step reinforcement architecture [95]-[97]. In this architecture, the only available feedback is a reinforcement signal that notifies the model only when a failure occurs. This architecture is straightforward and easy to implement. However, its fitness function can only indicates how long can the controller work well instead of measuring how soon the system can enter the desired state, which is also very important. Most reinforcement learning algorithms offer no guarantee on stabilizing a system around a certain operating point, or keeping the state of a system within a certain range. In this dissertation, the proposed QPSO described in section 3.1 can meet the aforementioned goals by adopting the concept of safe reinforcement learning agents based on Lyapunov design principles proposed Perkins and Barto [57]. Using the concept proposed in [57], the QPSO can guide the state of a system to reach and remain in a desired set of goal states by constraining the action choices of the agents. Actions constrained by Lyapunov design principles cause the system to descend on an appropriate Lyapunov function. The feedback reinforcement signal of in the QPSO is the time step that indicates how soon the system enters the desired set of goal states. The QPSO provides not only
reliable initial learning performance but also accurate learning result. However, in order to apply Lyapunov design principles, we have to identify the Lyapunov function of a control plant in advance, which refers to the requirement of more information about the state of the control plant. For some real-world applications, some states are difficult or expensive to obtain. As a result, in the TSR-EA method, we proposed the TSR design so that our method can enjoy the convenience brought by the standard reinforcement learning architecture on one hand, and the accurate learning performance on the other. The TSR learning signal design for determining the fitness value of each learning trial is described as follows.
The proposed two strategies are judgment and evaluation. The judgment strategy measures the fitness value of a learning trial that fails to maintain the system’s state in a desired operating range, whereas the evaluation strategy measures the fitness value of a learning trial that works the system well in the original successful range, but fails under a stricter successful range is applied. At first, for each different control task, a corresponding operating range Original_Range is predefined. Then, we shrink the original successful operating range as the control time step increases, as defined in Eq. (3.7).
( )
Thres TimeStep A t Thres TimeStep A t Thres TimeStep A Thres TimeStep otherwise
where A is a parameter that simply prevents the modified range from becoming zero. This equation provides guidance to the controller to meet the control goal sooner. The
Original_Range and Strict_Range are both considered as stopping criteria. If a learning trial
fails because the system state falls beyond the Original_Range, this learning trial is then considered as failing under a “looser” constraint. Hence, a smaller fitness value is obtained from this learning trial. On the contrary, if a learning trial fails for the system state deviatingfrom the Strict_Range, this learning trial is then considered as failing under a “stricter”
constraint, and a relatively larger fitness value is obtained from this learning trial. The determining fitness values in both strategies are detailed as follows:
•Strategy 1. Judgment strategy:
If the system fails at time step t deviating from the original successful operating range, then
where is a predefined parameter. A learning trial is deemed unsuccessful if it is unable to meet the control goal before Thres_TimeStep.
_
Thres TimeStep
•Strategy 2. Evaluation strategy
Under the condition that the controller successfully maintains the system’s state in the original successful operating range, the fitness value is calculated by the following two cases.
Case 1 represents the system works well under the original successful operating range but falling beyond the range defined in Eq. (3.7). Case 2 represents the controller successfully controlling the system.
Case 1. If the system enters the set of goal states at time step t
1 but falls beyond the strict successful range defined in Eq. (3.7) at time step t2, then2 1
1
_ 1( ) .
Fitness Value t t
=t − (3.9)
Case 2. If the system enters the set of goal states at time step t
1 and stabilizes the system forStable_TimeSteps, then
1
_ _ +(
Stable TimeSteps
_Fitness Value Stable TimeSteps
=
t
) . (3.10)The reinforcement fitness value evaluates how soon the plant can meet the desired set of goal states and how long the controller maintains the plant within it. The advantage of the proposed TSR-EA method is that it provides a relatively accurate learning performance compared with standard time-step reinforcement architecture.
3.2.2 Group-based Symbiotic Evolution
In this section, the idea of GSE is introduced. Unlike traditional GA that uses each individual in a population as a full solution to a problem, GSE assumes that each individual in a population represents only a partial solution to a problem. In a standard evolution algorithm, a single individual is responsible for the overall performance, with a fitness value assigned to that individual according to its performance. In the GSE, in order to calculate the fitness of an individual (a partial solution), we have to combine the current individual with other “global best” individuals of other groups to form a context vector first. A context vector stands for a complete solution and can be used to evaluate the fitness value. This idea is adopted from the CPSO introduced earlier. Let xj denote the j-th chromosome and Pj‧xi represents the i-th chromosome in the group j. Then the fitness of Pj‧xi is defined as:
f P x
( ji i)=f P y
( 1i …, ,P
j−1iy P x
, . , ,j i …P y
Ki . (3.11) ) As shown in [96-100], partial solutions can be characterized as specializations. The specialization property ensures diversity, which prevents a population from converging to suboptimal solutions. A single partial solution cannot “take over” a population since there must be other specializations present. Unlike the standard evolutionary approach, which always causes a given population to converge, hopefully at the global optimum, the symbiotic evolution finds solutions in different, unconverted populations. With the fitness assignment performed by GSE, and the local property of a fuzzy rule, GSE and the fuzzy system design can complement each other.The structure of the GSE is shown in Fig. 3.4, where Ncs is the number of complete solutions the GSE will select individuals to form in one generation, R denotes the number of fuzzy rules in a NFS.
Figure 3.4: Structure of a chromosome in the GSE.
The coding structure of a chromosome is shown in Fig. 3.5, which describes that where mij
and σij represent a Gaussian membership function with mean and deviation, respectively, and
w
j is the weight of the jth rule node and n denotes the input dimension.Figure 3.5: Coding structure of a chromosome in the TSR-EA.
3.3 Mean Shift-Based Evolution Strategy with Covariance Matrix Adaptation
Evolution strategy with covariance matrix adaptation (CMA-ES) is very effective in optimization of unimodal functions, but inferior to other algorithms that emphasize the global search ability, such as particle swarm optimization (PSO) or differential evolution (DE), in optimization of multi-funnel functions. Enhancing the global search ability of CMA-ES has becoming urgent goals of many scholars within the field. In this dissertation, we propose a
mean shift based CMA-ES (MS-CMA-ES). The framework of proposed method is constructed on CMA-ES. In the traditional CMA-ES, new search points are sampled from normal distribution; however, in the MS-CMA-ES, new search points are sampled from mixture normal model. The introduced mean shift procedure is a clustering method, which allows us to apply multiple CMA-ES instances to explore multiple search directions in parallel according to the its clustering result. In the MS-CMA-ES, the mean shift procedure is also used to compute the mean vector of the mixture normal distribution. During the mean shift procedure, each search point is “shifted” toward their corresponding local optima area of the p.d.f. until all search points converge. The converge points represents new mean vectors of the mixture normal model; in other words, the initial sampling locations of the MS-CMA-ES.
In this dissertation, we mainly focus on studying how to apply mean shift based clustering approach in optimization of complex objective functions, detecting their modes, and try to preserve the advantage of CMA-ES that converge rapidly in optimization of single-funnel functions. In the chapter we will detail the architecture of our method and its learning process.
In section 3.3.1, we will describe our motivation, which is followed by a block diagram learning process. The detail of each block will be described in section 3.3.2 through 3.3.4.
3.3.1 Motivation
Before introducing the proposed MS-CMA-ES, we observe a drawback that CMA-ES may encounter as shown in Fig. 3.6:
(a) (b)
Figure 3.6: Computer simulation result of CMA-ES with initial search location at (0, 0).
Figure 3.6 is an computer simulation result of an optimization problem with two local optimal solution. The upper right region is a suboptimal region with steeper gradient toward it and the lower left region is the global optimal region with a smoother trend. As shown in Fig. 3.6, white lines represents the locus of average of search points and darker background color stands for higher fitness value. The initial search location is at (0,0) which is in the middle of two local optimal solutions. The ideal case is that the locus wanders between two local optimal regions then converge to the lower left global optimal region as shown in Fig. 3.6(a).
However, we found by simulation that most of times the locus only temporary wanders and converge to the upper right region as shown in Fig. 3.6(b). The search direction of CMA-ES cannot continually expand to two optimal regions and determine the real optimal solution according to their converge points.
In this dissertation we think this drawback is due to the fact that the sampling distribution is limited to normal distribution. In statistical learning [66], simple normal model is not enough to deal with complex problems, an advanced alternative is the mixture model.
Mixture normal model can effectively approximate the p.d.f. of multimodal functions, and reveal their important characteristics: number and location of the modes. In this dissertation,
we think the aforementioned drawback can be relieved if the distribution of search points are sampled by a mixture normal model.
Recently there are researches attempt to turn CMA-ES search pattern into multiple region search style; for example, authors [26] incorporates particle swarm optimization to enhance the global search ability of CMA-ES. In this dissertation, we adopt the different concept by altering the sampling model. After sampling the search points, all samples are clustering by mean shift procedure. A new cluster represents a new CMA-ES instance. By perspective of mixture model, a new cluster stands for a new component of the mixture.
Observed from the computer simulation result, the proposed mechanism can alleviate the deficiency that CMA-ES cannot search multiple directions in parallel. The block diagram of the proposed MS-CMA-ES is shown in the following figure and the detail of each block will be introduced in the subsequent sections.
Figure 3.7: The block diagram of the MS-CMA-ES.
3.3.2 Sampling from a Mixture Model
In the proposed MS-CMA-ES, the sampling of new search points is given by
( 1) ( ) ( ) ( ) ( )
mix( , , , ) for 1, ,
g g g g g
x
i + ∼N m σ C α i
=λ
, (3.12)where Nmix denotes a mixture Gaussian distribution with its p.d.f. pM(•) reads
( )
G(•; θ) denotes a multivariate Gaussian function parameterized by θ; k denotes the index of
denotes the total number of samples
. (3.13)
component of mixture model; K(g) is the total number of components at generation g; λ
( )
generation and is determined by the result of clustering.
reliable density estimator. Let us denote f the unknown p.d.f. of the search space and
(g)
(g)(1),…, C (g)(K(g
σ
( f search step size σ(g)≡{σ (g)(1),…, σ(g)(K(g))}; α (g) ixture weighting α(g)≡ α(g)(1),…, α(g)(K(g))}.3.3.3 Mean Shift-based Clustering
The total number of components K(g) is variable at each
In this dissertation, we don’t directly apply clustering method to the sampled points because the adopted mean shift based clustering method requires independent identity distribution of samples to perform density estimations. In our method, search points are sampled from a Gaussian mixture model. Due to the absence of independent identity distribution of samples, we introduce the importance sampling method [93, 94] to find a more
ˆfH the kernel density estimation with bandwidth matrix H. The kernel density estimation of f after
, introducing importance sampling method is given by
λ
which is almost identical with traditional kernel density estimation besides the importance
weighting ωi
where wi is the fitness weighting of xi; pM(•) is the p.d.f. of mixture normal distribution shown in Eq. (3.1w). The sequence of mean shift-based clustering of each search point after introducing importance sampling is given by
1 matrix for xm.
H
m is another important parameter needs to be determined. In general, when doing kernel density estimation, literatures process adequate, at least 50 to 100, amount of samples. In such cases, the selection of bandwidth matrix can be achieved based on the analysis of asymptotic mean integrated square error (AMISE) [87]. The proposed MS-CMA-ES is mainly constructed on traditional CMA-ES that only generates few samples at each generation; therefore, we cannot cite AMISE based bandwidth selection methods which have richer research results. In this dissertation, the selection of bandwidth matrix is according to Theorem 3.1, the analysis of MISE [87], which is more applicant to cases with1 matrix for xm.