
Chapter 1 Introduction

1.5 Overview of Dissertation

This dissertation consists of six chapters. In Chapter 1, the introduction covers the motivation, a review of previous work, the research goal, the approach, and an overview of this dissertation.

In Chapter 2, the foundation for the four components of the proposed ISRL-SAEAs is laid by providing background material on the neuro-fuzzy controller, reinforcement learning, Lyapunov stability, and evolutionary algorithms.

In Chapter 3, the first part of the proposed ISRL-SAEAs, the self-adaptive evolutionary algorithms (SAEAs), is presented. The SAEAs consist of a hybrid evolutionary algorithm (HEA), self-adaptive groups' cooperation based symbiotic evolution (SAGC-SE), and self-adaptive groups based symbiotic evolution using the FP-growth algorithm (SAG-SEFA). These three algorithms are introduced in this chapter.

In Chapter 4, the second part of the proposed ISRL-SAEAs, the improved safe reinforcement learning (ISRL), is presented. The ISRL consists of a novel reinforcement signal design and a Lyapunov stability analysis. Both of these components are introduced in this chapter.

In Chapter 5, to demonstrate the performance of the ISRL-SAEAs on temporal problems, two examples and performance comparisons with other models are presented.

The examples in this chapter consist of the inverted pendulum control system and the tandem pendulum control system.

In Chapter 6, the contributions are summarized and some promising directions for future research are outlined.


Chapter 2 Foundations

The background material and literature review related to the major components of the research purpose outlined above (the neuro-fuzzy controller, reinforcement learning, Lyapunov stability, and evolutionary algorithms) are introduced in this chapter. The concept of the neuro-fuzzy controller is discussed in the first section. The reinforcement learning scheme is introduced in Section 2.2. In Section 2.3, the Lyapunov stability theory on which the improved safe reinforcement learning (ISRL) is based is discussed. The final section focuses on genetic algorithms, cooperative coevolution, and symbiotic evolution, the methods on which the proposed self-adaptive evolutionary algorithms (SAEAs) are based.

2.1 Neuro-Fuzzy Controller

Neuro-fuzzy modeling is known as a powerful tool ([1]-[14]) that can facilitate the effective development of models by combining information from different sources, such as empirical models, heuristics, and data. Neuro-fuzzy models describe systems by means of fuzzy if-then rules represented in a network structure, to which learning algorithms known from the area of artificial neural networks can be applied.

A neuro-fuzzy controller is a knowledge-based system characterized by a set of rules, which model the relationship between control inputs and outputs. The reasoning process is defined by means of the employed aggregation operators, the fuzzy connectives, and the inference method. The fuzzy knowledge base contains the definitions of the fuzzy sets stored in the fuzzy database and a collection of fuzzy rules, which constitute the fuzzy rule base.


Fuzzy rules are defined by their antecedents and consequents, which relate an observed input state to a desired output. Two typical types of neuro-fuzzy controllers are the Mamdani-type and the TSK-type neuro-fuzzy controllers.

For Mamdani-type neuro-fuzzy controllers ([1]), the minimum fuzzy implication is used in fuzzy reasoning. These neuro-fuzzy controllers employ the inference method proposed by Mamdani, in which the consequent parts are defined by fuzzy sets. A Mamdani-type fuzzy rule has the form:

IF $x_1$ is $A_{1j}(m_{1j}, \sigma_{1j})$ and $x_2$ is $A_{2j}(m_{2j}, \sigma_{2j})$ ... and $x_n$ is $A_{nj}(m_{nj}, \sigma_{nj})$

THEN $y'$ is $B_j(m_j, \sigma_j)$    (2.1)

where $m_{ij}$ and $\sigma_{ij}$ represent the mean and the deviation of a Gaussian membership function for the $i$th input dimension and the $j$th rule node. The consequence $B_j$ of the $j$th rule is aggregated into one fuzzy set for the output variable $y$. The crisp output is obtained through defuzzification, which calculates the centroid of the output fuzzy set.
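For reference, the centroid defuzzification mentioned above can be written in the standard form below; this is a sketch, and the symbol $\mu_B(y)$, denoting the membership function of the aggregated output fuzzy set, is introduced only for this illustration:

$$y = \frac{\int y\, \mu_B(y)\, dy}{\int \mu_B(y)\, dy}.$$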

Besides the more common fuzzy inference method proposed by Mamdani, Takagi, Sugeno, and Kang introduced a modified inference scheme ([5]). The first two parts of the fuzzy inference process, fuzzifying the inputs and applying the fuzzy operator, are exactly the same. A Takagi-Sugeno-Kang (TSK) type fuzzy model employs different implication and aggregation methods than the standard Mamdani type. For TSK-type neuro-fuzzy controllers ([5]), the consequence of each rule is a function of the input variables. The generally adopted function is a linear combination of the input variables plus a constant term. A TSK-type fuzzy rule has the form:

IF $x_1$ is $A_{1j}(m_{1j}, \sigma_{1j})$ and $x_2$ is $A_{2j}(m_{2j}, \sigma_{2j})$ ... and $x_n$ is $A_{nj}(m_{nj}, \sigma_{nj})$

THEN $y' = w_{0j} + w_{1j}x_1 + \cdots + w_{nj}x_n$    (2.2)

where $w_{0j}$ represents the first (constant) parameter of the linear combination of input variables for the $j$th rule node, and $w_{ij}$ represents the $i$th parameter, associated with the $i$th input variable.


Since the consequence of a rule is crisp, the defuzzification step becomes obsolete in the TSK inference scheme. Instead, the model output is computed as the weighted average of the crisp rule outputs, which is computationally less expensive than calculating the center of gravity.

Recently, many researchers ([5], [35], and [44]) have shown that a TSK-type neuro-fuzzy controller achieves superior performance in network size and learning accuracy compared with a Mamdani-type neuro-fuzzy controller. For this reason, a TSK-type neuro-fuzzy controller (TNFC) is adopted in this dissertation to solve various dynamic problems. The proposed SAEAs are therefore used to tune the free parameters of the TNFC.

The structure of a TNFC is shown in Fig. 2.1, where n and R are, respectively, the number of input dimensions and the number of rules. It is a five-layer network structure. The functions of the nodes in each layer are described as follows:

Layer 1 (Input Node): No function is performed in this layer. The node only transmits input values to layer 2.

Layer 2 (Membership Function Node): Nodes in this layer correspond to one linguistic label of the input variables in layer 1; that is, the membership value specifying the degree to which an input value belongs to a fuzzy set ([3]-[4]) is calculated in this layer. In this dissertation, the Gaussian membership function is adopted. Therefore, for an external input $x_i$, the following Gaussian membership function is used:

$$u_{ij}^{(2)} = \exp\left( -\frac{\left(x_i - m_{ij}\right)^2}{\sigma_{ij}^2} \right)$$

where $m_{ij}$ and $\sigma_{ij}$ are, respectively, the mean and the deviation of the Gaussian membership function of the $j$th term of the $i$th input variable $x_i$.

Layer 3 (Rule Node): The output of each node in this layer is determined by the fuzzy AND operation. Here, the product operation is utilized to determine the firing strength of each rule.

The function of each rule is

$$u_j^{(3)} = \prod_i u_{ij}^{(2)}$$    (2.5)

Layer 4 (Consequent Node): Nodes in this layer are called consequent nodes. The input to a node in layer 4 is the output derived from layer 3, and the other inputs are the input variables from layer 1, as depicted in Fig. 2.1. The function of a node in this layer combines the firing strength from layer 3 with a linear combination of the input variables, where the summation is over all the inputs and $w_{ij}$ are the corresponding parameters of the consequent part.

Layer 5 (Output Node): Each node in this layer corresponds to a single output variable. The node integrates all the actions recommended by layers 3 and 4 and acts as a defuzzifier, computing the output as the weighted average over the R fuzzy rules, where R is the number of fuzzy rules.
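For concreteness, a common TSK formulation of the layer 4 and layer 5 node functions, consistent with Eqs. (2.2) and (2.5) and with the weighted-average defuzzification described above, can be sketched as follows; this is a standard form rather than a quotation of the TNFC equations:

$$u_j^{(4)} = u_j^{(3)}\left( w_{0j} + \sum_{i=1}^{n} w_{ij}\, x_i \right), \qquad y = u^{(5)} = \frac{\sum_{j=1}^{R} u_j^{(4)}}{\sum_{j=1}^{R} u_j^{(3)}}.$$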



Figure 2.1: Structure of the TSK-type neuro-fuzzy controller.
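To make the layer-by-layer description concrete, the following Python sketch computes the TNFC output for a single input vector, assuming Gaussian membership functions, product firing strengths, and weighted-average defuzzification; the function name and the parameter arrays are hypothetical placeholders.

```python
import numpy as np

def tnfc_forward(x, means, sigmas, weights):
    """Sketch of the five-layer TNFC forward pass for one input vector.

    x       : shape (n,)     -- layer 1 simply passes the inputs through
    means   : shape (R, n)   -- Gaussian means m_ij (layer 2)
    sigmas  : shape (R, n)   -- Gaussian deviations sigma_ij (layer 2)
    weights : shape (R, n+1) -- TSK consequent parameters [w_0j, w_1j, ..., w_nj]
    """
    # Layer 2: Gaussian membership degrees u_ij^(2)
    u2 = np.exp(-((x - means) ** 2) / (sigmas ** 2))
    # Layer 3: firing strength of each rule via the product (fuzzy AND), u_j^(3)
    u3 = np.prod(u2, axis=1)
    # Layer 4: TSK consequent value of each rule, weighted by its firing strength
    u4 = u3 * (weights[:, 0] + weights[:, 1:] @ x)
    # Layer 5: weighted-average defuzzification over the R rules
    return u4.sum() / u3.sum()

# Hypothetical usage with n = 2 inputs and R = 3 rules
rng = np.random.default_rng(0)
y = tnfc_forward(
    x=np.array([0.5, -0.2]),
    means=rng.normal(size=(3, 2)),
    sigmas=np.full((3, 2), 1.0),
    weights=rng.normal(size=(3, 3)),
)
print(y)
```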

2.2 Reinforcement Learning

Unlike the supervised learning problem, in which the correct "target" output values are given for each input pattern, the reinforcement learning problem provides only very simple "evaluative" or "critical" information rather than "instructive" information. A reinforcement learning algorithm determines a sequence of decisions that maximizes a reinforcement signal. At each time step, the agent in state $s_t \in S$ chooses an action $a_t \in A$ that transfers the environment to the state $s_{t+1}$ and returns a numerical reward $r_t$ to the agent. Lacking knowledge of how to solve the problem, the agent must explore the environment with a trial-and-error learning strategy. Unlike supervised learning, the desired output in each state is not known in advance in reinforcement learning. In such a trial-and-error learning strategy, an action that performs well in the current state may perform badly in future states, and vice versa.


Problems of this kind have traditionally been solved with dynamic programming ([61]). These methods are similar to reinforcement learning ([62]). The necessary components of a reinforcement learning method are shown in Fig. 2.2. The agent consists of a value function and a strategy. The value function represents how much reward can be expected from each state if the best known strategy is performed. The strategy determines how to choose suitable actions, based on the value function, to apply to the environment. As shown in Fig. 2.2, at time step $t$, the agent selects an action $a_t$. The action is applied to the environment, causing a state transition from $s_t$ to $s_{t+1}$, and a reward $r_t$ is received. The goal of a reinforcement learning method is to find the optimal value function for a given environment.

Several reinforcement learning algorithms, such as Q-learning ([63]-[64]) and Sarsa ([65]), have been proposed for computing the value function. These methods are developed based on the temporal difference learning algorithm. In the temporal difference learning method, the value function of each state, $V(s_t)$, is updated using the value function of the next state, $V(s_{t+1})$. The update of the value function of each state is as follows.

$$V(s_t) = V(s_t) + \alpha\left[ r_t + \lambda V(s_{t+1}) - V(s_t) \right]$$    (2.8)

where $V(s_t)$ increases by the reward $r_t$ plus the difference between the discounted value of the next state, $\lambda V(s_{t+1})$, and $V(s_t)$; $\alpha$ is the learning rate between 0 and 1; and $\lambda$ is the discount factor between 0 and 1.
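As a minimal illustration of Eq. (2.8), the following Python sketch performs one temporal-difference update of a tabular value function; the dictionary-based value table and the helper name are hypothetical.

```python
def td0_update(V, s_t, s_next, r_t, alpha=0.1, lam=0.9):
    """One temporal-difference update of the state-value table V (Eq. 2.8).

    V     : dict mapping states to value estimates
    alpha : learning rate in (0, 1)
    lam   : discount factor in (0, 1)
    """
    td_error = r_t + lam * V.get(s_next, 0.0) - V.get(s_t, 0.0)
    V[s_t] = V.get(s_t, 0.0) + alpha * td_error
    return V

# Hypothetical usage: the value of state "A" moves toward r_t + lam * V("B")
V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", "B", r_t=0.5)
print(V["A"])  # 0.14 = 0.1 * (0.5 + 0.9 * 1.0 - 0.0)
```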


Figure 2.2: Reinforcement learning method.

In early research, these reinforcement learning algorithms were proposed for simple environments. Recently, reinforcement learning algorithms have focused on larger, high-dimensional environments. These reinforcement learning algorithms are developed based on neural networks ([66]-[67]), radial basis functions ([68]), and neuro-fuzzy networks ([18]-[20]).

More recently, several studies have proposed time-step reinforcement architectures that provide an easier way to implement reinforcement learning compared with temporal difference learning architectures ([18]-[20]). In a time-step reinforcement architecture, the only available feedback is a reinforcement signal that notifies the model only when a failure occurs. An accumulator accumulates the number of time steps before a failure occurs. The goal of the time-step reinforcement method is to maximize the value function V. The fitness function is defined by:

V = TIME-STEP    (2.9)

where TIME-STEP represents how long the experiment remains a "success". Equation (2.9) reflects the fact that a long run of time steps before a failure occurs means that the controller can control the plant well. For example, in an evolutionary algorithm, Eq. (2.9) implies that more time steps before a failure occurs correspond to a higher fitness.
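A minimal sketch of this time-step evaluation is given below; the `controller` and `env` interfaces (reset/step returning a failure flag) are hypothetical and stand in for whatever plant simulation is used.

```python
def time_step_fitness(controller, env, max_steps=100_000):
    """Fitness of a controller as the number of time steps before failure (Eq. 2.9).

    `controller` maps a state to an action; `env` is a hypothetical plant
    interface exposing reset() and step(action) -> (next_state, failed).
    """
    state = env.reset()
    steps = 0
    while steps < max_steps:
        state, failed = env.step(controller(state))
        if failed:
            break      # reinforcement signal: a failure has occurred
        steps += 1     # accumulator of successful time steps
    return steps       # V = TIME-STEP
```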

2.3 Lyapunov Stability

According to [69], we have the following definition of stability.

Definition 2.3.1 ([69]): Consider the system $\dot{x} = f(x)$ with an equilibrium point at $x = 0$. The equilibrium point $x = 0$ is stable if, for each $\varepsilon > 0$, there exists $\delta > 0$ such that $\|x(0)\| < \delta \Rightarrow \|x(t)\| < \varepsilon$ for all $t \ge 0$; it is asymptotically stable if it is stable and there exists $\gamma > 0$ such that $\|x(0)\| < \gamma \Rightarrow \lim_{t \to \infty} x(t) = 0$.

The Lyapunov stability theorems ([69]) are introduced as follows.

Theorem 2.3.1 ([69]): For the system in Eq. 2.13, let $V : D \to R$ be a continuously differentiable function such that $V(0) = 0$ and $V(x) > 0$ in $D \setminus \{0\}$, where $D \subset R^n$ is a domain containing $x = 0$. Then the following statements hold:

1. If $\dot{V}(x) \le 0$ in $D$, then $x = 0$ is stable, where $\dot{V}(x)$ is defined by

$$\dot{V}(x) = \frac{\partial V}{\partial x} f(x).$$    (2.14)

2. If $D = R^n$, $\dot{V}(x) < 0$ in $D \setminus \{0\}$, and $V(x)$ is radially unbounded, that is, $V(x) \to \infty$ as $\|x\| \to \infty$, then the equilibrium point $x = 0$ is globally asymptotically stable.

3. If $\dot{V}(x) < 0$ in $D \setminus \{0\}$, then $x = 0$ is asymptotically stable.
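As a simple illustration of Theorem 2.3.1 (an example chosen here for exposition, not one of the systems studied in this dissertation), consider the scalar system $\dot{x} = -x^{3}$ with the candidate Lyapunov function $V(x) = \tfrac{1}{2}x^{2}$. Then

$$\dot{V}(x) = \frac{\partial V}{\partial x} f(x) = x \cdot (-x^{3}) = -x^{4} < 0 \quad \text{for } x \neq 0,$$

and $V(x) \to \infty$ as $|x| \to \infty$, so by statement 2 the equilibrium point $x = 0$ is globally asymptotically stable.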

In reinforcement learning, the most well-known algorithm is the actor-critic architecture of Barto and his colleagues ([17]), which consists of a control network and a critic network. However, Barto's architecture is complicated and not easy to implement. Therefore, several studies have proposed time-step reinforcement architectures to improve Barto's architecture ([18]-[20]). In a time-step reinforcement architecture, the only available feedback is a reinforcement signal that notifies the model only when a failure occurs. An accumulator accumulates the number of time steps before a failure occurs. Even though the time-step reinforcement architecture is easier to implement than Barto's architecture, it only measures the number of time steps before a failure occurs; in other words, it only evaluates how long the controller works well rather than how soon the system can enter the desired state, which is also very important. In [32], Perkins and Barto proposed safe reinforcement learning based on Lyapunov function design. Once the system's Lyapunov function is identified, Lyapunov-based manipulations of the control laws allow the architecture to drive the system to reach and remain in a predefined desired state with probability 1. The time step at which the system enters the desired state then indicates how soon the system becomes stable. In this dissertation, the Lyapunov stability theorem is used to design the reinforcement signal; therefore, the improved safe reinforcement learning (ISRL) is based on the Lyapunov stability theorem. The details of the ISRL can be found in Chapter 4.
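As a rough illustration of the general idea, and not of the ISRL reinforcement-signal design presented in Chapter 4, a reinforcement signal could simply report whether a candidate Lyapunov function decreases along each state transition; the function and argument names below are hypothetical.

```python
def lyapunov_reinforcement_signal(V, s_prev, s_next, tol=1e-6):
    """Illustrative sketch: reward descent of a candidate Lyapunov function V.

    Returns +1 when V decreases along the transition (the system is moving
    toward the desired state) and -1 otherwise.
    """
    return 1 if V(s_next) < V(s_prev) - tol else -1
```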
