CHAPTER 4 SIMULATIONS
4.1 REINFORCEMENT LEARNING TASKS
Two computer simulations are discussed in this section. The first simulation is the cart-pole balance control and the second simulation is the control of a double-link inverted
pendulum system.
Example 1: Control of a cart-pole balancing system
Figure 4.1: Single-link inverted pendulum system.
Figure 4.1 depicts the cart-pole balancing system. The bottom of the pole is hinged to a cart that travels along a finite-length track to its right or left. Both the cart and pole can move only on the vertical plane; that is, each has only one degree of freedom. The only control action is F, which is the amount of force (in Newtons) applied to the cart to move it left or right. The system fails when the cart runs into the bounds of its track (the distance is 2.4 m from the center to each bound of the track) or when the pole deviates more than 90 degrees.
Using Lagrange’s method, the model of the cart-pole balancing system can be obtained as follows:
$$(M + m)\ddot{x} + mL\left(\ddot{\theta}\cos\theta - \dot{\theta}^{2}\sin\theta\right) = F, \tag{4.1}$$
$$\ddot{x}\cos\theta + L\ddot{\theta} - g\sin\theta = 0, \tag{4.2}$$
where L = 0.5 m is the length of the pole, M = 1.0 kg is the mass of the cart, m = 0.1 kg is the mass of the pole, and g = 9.8 m/s² is the acceleration due to gravity.
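As a minimal sketch, Eqs. (4.1) and (4.2) can be solved for the two accelerations and stepped forward in time. The function names, the explicit Euler integrator and the step size dt = 0.02 s are assumptions made for illustration only; the text does not state which integrator or step size was used.

```python
import math

# Parameter values quoted in the text.
M, m, L, g = 1.0, 0.1, 0.5, 9.8

def accelerations(theta, theta_dot, force):
    """Solve Eqs. (4.1)-(4.2) for (x_ddot, theta_ddot)."""
    s, c = math.sin(theta), math.cos(theta)
    # Substituting theta_ddot from Eq. (4.2) into Eq. (4.1) gives x_ddot.
    x_ddot = (force + m * L * theta_dot ** 2 * s - m * g * s * c) / (M + m * s * s)
    theta_ddot = (g * s - x_ddot * c) / L
    return x_ddot, theta_ddot

def euler_step(state, force, dt=0.02):
    """One explicit Euler step; dt is an assumed value, not from the text."""
    x, x_dot, theta, theta_dot = state
    x_ddot, theta_ddot = accelerations(theta, theta_dot, force)
    return (x + dt * x_dot, x_dot + dt * x_ddot,
            theta + dt * theta_dot, theta_dot + dt * theta_ddot)
```

Calling `euler_step((0.0, 0.0, 0.05, 0.0), force=1.0)` advances a slightly tilted pole by one step; a smaller dt or a higher-order integrator may be preferable for long runs.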
The ranges $[m_{\min}, m_{\max}]$, $[\sigma_{\min}, \sigma_{\max}]$ and $[w_{\min}, w_{\max}]$ are set as [0, 2], [0, 2] and [-30, 30], respectively. By letting $q = (x, \theta)^{T}$, we can rewrite Eqs. (4.1) and (4.2) into general dynamic forms as follows:
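$$D(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) = \tau, \tag{4.3}$$
where, after multiplying Eq. (4.2) by $mL$ so that $D(q)$ is symmetric,
$$D(q) = \begin{bmatrix} M + m & mL\cos\theta \\ mL\cos\theta & mL^{2} \end{bmatrix},\quad C(q,\dot{q}) = \begin{bmatrix} 0 & -mL\dot{\theta}\sin\theta \\ 0 & 0 \end{bmatrix},\quad G(q) = \begin{bmatrix} 0 \\ -mgL\sin\theta \end{bmatrix},\quad \tau = \begin{bmatrix} F \\ 0 \end{bmatrix}.$$
The total mechanical energy of the system can be derived from: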
$$E(q,\dot{q}) = \tfrac{1}{2}\dot{q}^{T}D(q)\dot{q} + P(q), \tag{4.8}$$
where $P(q) = mgL\cos\theta$ is the potential energy and $G(q) = \partial P(q)/\partial q$. The purpose of this control task is to determine the sequence of forces applied to the cart that balances the pole upright and keeps the cart as stationary as possible. Hence, we define a goal set comprising near-upright and near-stationary states as
$$G_{1} = \{(q,\dot{q}) : \|(\dot{x},\theta,\dot{\theta})\| \le 0.001\}.$$
When the state of the cart-pole balancing system is in $G_{1}$, according to Eq. (4.8), the total mechanical energy $E$ of the system is $mgL$, denoted $E_{top}$. We define a Lyapunov function
$$L(q,\dot{q}) = \tfrac{1}{2}\left(E(q,\dot{q}) - E_{top}\right)^{2}.$$
The purpose of this control problem can be transformed from “balancing the pole upright and keeping the cart as stationary as possible” to “guiding the system’s mechanical energy $E(q,\dot{q})$ to reach $E_{top}$ and maintaining it near $E_{top}$ as long as possible;” that is, achieving $L \to 0$. To achieve this goal, we have to make sure that the Lyapunov function of the system decreases at all time steps. The time derivative of $L(q,\dot{q})$ is
$$\dot{L}(q,\dot{q}) = \left(E(q,\dot{q}) - E_{top}\right)\dot{E}(q,\dot{q}), \tag{4.9}$$
and the time derivative of $E$ with respect to time is
$$\dot{E}(q,\dot{q}) = \dot{q}^{T}D(q)\ddot{q} + \tfrac{1}{2}\dot{q}^{T}\dot{D}(q)\dot{q} + \dot{q}^{T}G(q) = \dot{q}^{T}\tau = \dot{x}F, \tag{4.10}$$
which shows that the derivative of $E$ is proportional to the product of the speed of the cart and the input force. The time derivative of $L(q,\dot{q})$ with respect to time can be obtained by combining Eqs. (4.9) and (4.10), which reads
$$\dot{L}(q,\dot{q}) = -\left(E_{top} - E(q,\dot{q})\right)\dot{x}F, \tag{4.11}$$
from which we can see that, to make sure the Lyapunov function $L(q,\dot{q})$ of the system decreases at all time steps, the direction of the control force has to be coherent with the sign of $\left(E_{top} - E(q,\dot{q})\right)\dot{x}$. Hence, for the QPSO, following [57], a Lyapunov-based control law for the learning agent can be derived as follows:
$$F = \operatorname{sgn}\!\left(\left(E_{top} - E(q,\dot{q})\right)\dot{x}\right)u, \tag{4.12}$$
where $\operatorname{sgn}(x) = 1$ if $x \ge 0$ and $-1$ otherwise, and $u$ is the output force of the NFS, limited to [-10, 10]. Initial parameters of the QPSO and TSR-EA for controlling the cart-pole balancing system are listed in the following two tables:
Table 4.1: The initial parameters of the QPSO for cart-pole balancing system.
Parameters                          Value     Parameters   Value
$[\sigma_{\min}, \sigma_{\max}]$    [0, 2]    $c_{1}$      2.01
Table 4.2: The initial parameters of the TSR-EA for cart-pole balancing system.
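The Lyapunov-based control law of Eq. (4.12) can be sketched as below. This is only an illustration under stated assumptions: the energy expression is the kinetic-plus-potential form implied by Eqs. (4.1) and (4.2), the function names are hypothetical, and the magnitude |u| is used so that the sign of the force is set entirely by the Lyapunov term, which the text does not state explicitly.

```python
import math

M, m, L, g = 1.0, 0.1, 0.5, 9.8   # values quoted in the text
E_TOP = m * g * L                 # energy at the upright, stationary state
F_MAX = 10.0                      # force bound quoted in the text

def sgn(value):
    """Sign convention of the text: +1 for value >= 0, otherwise -1."""
    return 1.0 if value >= 0 else -1.0

def energy(x_dot, theta, theta_dot):
    """Total mechanical energy implied by Eqs. (4.1)-(4.2)."""
    kinetic = (0.5 * (M + m) * x_dot ** 2
               + m * L * math.cos(theta) * x_dot * theta_dot
               + 0.5 * m * L ** 2 * theta_dot ** 2)
    return kinetic + m * g * L * math.cos(theta)

def lyapunov_force(x_dot, theta, theta_dot, u):
    """Eq. (4.12); taking |u| and clipping to F_MAX is an assumption."""
    direction = sgn((E_TOP - energy(x_dot, theta, theta_dot)) * x_dot)
    return direction * min(abs(u), F_MAX)

def lyapunov_rate(x_dot, theta, theta_dot, force):
    """Eq. (4.11): dL/dt = -(E_top - E) * x_dot * F."""
    return -(E_TOP - energy(x_dot, theta, theta_dot)) * x_dot * force
```

With this choice of sign, `lyapunov_rate` evaluated at the force returned by `lyapunov_force` is never positive, which is the property the Lyapunov analysis asks for.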
To verify the performance of the QPSO, the TD and GA based reinforcement learning (TDGAR) [53], the on-line clustering and Q-value based GA reinforcement learning (CQGAF) [54] and the recurrent wavelet-based NFS with a reinforcement group cooperation-based symbiotic evolution (R-GCSE) algorithm [55] are applied to the same control task. In the TDGAR, there are five hidden nodes and five rules in the critic network and the action network. The population size is set as 200 and the maximum perturbation is set as 0.0005. In the CQGAF, after trial-and-error tests, the final average number of rules from 50 runs was 6 by using the on-line clustering algorithm. The population size is set as 50. The parameters for Q-learning are set as α = 0.01 and γ = 0.9. In the R-GCSE, the population size is set as 50 and the mutation rate is set as 0.1.
The control goal defined here is “bringing the plant’s state to G1 within 1,000 time steps.” The original successful region Original_Range bounds the pendulum angle θ and restricts the cart position to −2.4 m ≤ x ≤ 2.4 m. The initial state of the plant is set within Original_Range. A trial ends when the control goal is met or a failure occurs. For the QPSO, TDGAR, CQGAF and R-GCSE, a failure learning trial occurs if the cart or the pendulum deviates beyond the Original_Range. For the TSR-EA, a failure learning trial occurs if the cart or the pendulum deviates beyond the Original_Range or the strict successful region defined in Eq. (3.7). The output force is constrained to −10 N ≤ F ≤ 10 N. Each algorithm is executed 50 times to compute the average. The performances of all these compared methods are shown in Table 4.3, from which we can see that the QPSO and TSR-EA have a superior control rate and require less CPU time. This could be due to the incorporation of the Lyapunov design principles in the QPSO and to the proposed TSR mechanism, which provides a more distinguishable performance index to the individuals and can accelerate their evolution process.
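The trial protocol described above (stop on reaching the goal set, stop on leaving the successful region) can be sketched as a small loop. The names `in_goal_set`, `out_of_range` and `run_trial` are hypothetical, the Euclidean norm is an assumed choice for the goal-set test, the 90 degree angle bound is taken from the failure condition in the system description, and returning "timeout" when the step budget runs out is an assumption about how an unfinished trial is scored.

```python
import math

def in_goal_set(state, tol=0.001):
    """Near-upright, near-stationary test on (x_dot, theta, theta_dot)."""
    _, x_dot, theta, theta_dot = state
    return math.sqrt(x_dot ** 2 + theta ** 2 + theta_dot ** 2) <= tol

def out_of_range(state, x_max=2.4, theta_max=math.pi / 2):
    """Failure test: cart off the track or pole beyond the angle bound."""
    x, _, theta, _ = state
    return abs(x) > x_max or abs(theta) > theta_max

def run_trial(step, policy, state, max_steps=1000):
    """Run one trial; `step` and `policy` are user-supplied callables."""
    for t in range(max_steps):
        if in_goal_set(state):
            return "goal", t
        if out_of_range(state):
            return "failure", t
        state = step(state, policy(state))
    return "timeout", max_steps
```

Here `step(state, action)` stands for one simulation step of the plant and `policy(state)` for the controller being evaluated; both are placeholders to be supplied by the experiment.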
Table 4.3: Summary Statistics of Example 1.
Methods                                         QPSO      TSR-EA     TDGAR      CQGAF      R-GCSE
% of learning trials meeting the control goal   100       96         68         74         88
Average time to goal                            9.8±0.7   12.2±0.3   80.2±9.1   33.6±2.7   58.9±6.8
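Summary rows of this kind can be computed from per-run records as sketched below. The record layout and the function name `summarize` are hypothetical, and reading the "±" entries as a sample standard deviation over successful runs is an assumption; the text does not say which spread measure was reported.

```python
from statistics import mean, stdev

def summarize(trials):
    """trials: list of (met_goal, time_to_goal) pairs, one per run."""
    times = [time for met_goal, time in trials if met_goal]
    success_rate = 100.0 * len(times) / len(trials)
    if not times:
        # No successful run: the time statistics are undefined.
        return success_rate, None, None
    spread = stdev(times) if len(times) > 1 else 0.0
    return success_rate, mean(times), spread
```

For example, `summarize([(True, 10.0), (True, 12.0), (False, 0.0), (True, 14.0)])` reports a 75% success rate with a mean time of 12.0 and a spread of 2.0.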
The testing results, which consist of the pendulum angle, pendulum angular velocity (in degrees/second), and cart velocity (in meters/second) of the TSR-EA, TDGAR, CQGAF and R-GCSE, are shown in Figs. 4.2-4.5 as follows. Each line in Figs. 4.2-4.5 represents a single run.
Figure 4.2: 50 control results of the cart-pole balancing system using the TSR-EA in Example 1. (a) Angle of the pendulum. (b) Angular velocity of the pendulum. (c) Velocity of the cart.
Figure 4.3: 50 control results of the cart-pole balancing system using the TDGAR in Example 1. (a) Angle of the pendulum. (b) Angular velocity of the pendulum. (c) Velocity of the cart.
Figure 4.4: 50 control results of the cart-pole balancing system using the CQGAF in Example 1. (a) Angle of the pendulum. (b) Angular velocity of the pendulum. (c) Velocity of the cart.
Figure 4.5: 50 control results of the cart-pole balancing system using the R-GCSE in Example 1. (a) Angle of the pendulum. (b) Angular velocity of the pendulum. (c) Velocity of the cart.
Furthermore, we complicate our control goal to “bringing the plant’s state to G1 within 5,000 time steps, and maintaining the state within G1 for 100,000 time steps,” and the original successful region Original_Range of the variables is modified to −90° ≤ θ ≤ 90° and −2.4 m ≤ x ≤ 2.4 m. The initial state of the plant is set within Original_Range. A trial ends when the control goal is met or a failure occurs. For the QPSO, TSR-EA, TDGAR, CQGAF and R-GCSE, a failure learning trial occurs if the cart or the pendulum deviates beyond the Original_Range. Each algorithm is again executed 50 times to compute the average. The performances of all these compared methods are shown in Table 4.4.
Table 4.4: Summary Statistics of Example 1 under a difficult control goal.
Methods QPSO TSR-EA TDGAR CQGAF R-GCSE
% of first 10% trials
From Table 4.4 we can see that the QPSO has the highest successful control rate. The superiority can be seen especially in the first 10% of learning trials, where the learning agents are not fully trained yet. The QPSO is able to deliver a safe, reliable control result during initial learning, which is crucially important in many applications. The testing results of the QPSO are shown in Fig. 4.6 and Fig. 4.7. Each line in Fig. 4.6 and Fig. 4.7 represents a single run that starts from an increased range of initial states. Figure 4.6 shows the results of the first 1,000 of 100,000 control time steps while Fig. 4.7 shows the last 1,000. From Fig. 4.6 we can see that, with the aid of Lyapunov design, the QPSO is able to control the single-link inverted pendulum system well under different initial conditions. Trajectories shown in Fig. 4.7 verify the ability of the QPSO to maintain the environment in G1.
Figure 4.6: Control results of the first 1,000 time steps of 50 runs of the cart-pole balancing system using the QPSO. (a) Angle of the pendulum. (b) Angular velocity of the pendulum. (c) Velocity of the cart.
Figure 4.7: Control results of the last 1,000 time steps of 50 runs of the cart-pole balancing system using the QPSO. (a) Angle of the pendulum. (b) Angular velocity of the pendulum. (c) Velocity of the cart.
Example 2: Control of a double-link inverted pendulum system
Figure 4.8: Double-link inverted pendulum system.
Consider the double-link inverted pendulum system: $m_{1}$ is the mass of link 1, $m_{2}$ is the mass of link 2, $\theta_{1}$ is the angle that link 1 makes with the vertical, $\theta_{2}$ is the angle that link 2 makes with link 1, $l_{1}$ and $l_{2}$ are the lengths of links 1 and 2, $l_{c1}$ is the distance to the center of mass of link 1, $l_{c2}$ is the distance to the center of mass of link 2, $I_{1}$ and $I_{2}$ are the moments of inertia of link 1 and link 2 about their centroids, and $\tau_{1}$ is the only control torque, applied to the joint of link 1. We introduce the following five parameter equations:
$$p_{1} = m_{1}l_{c1}^{2} + m_{2}l_{1}^{2} + I_{1},\quad p_{2} = m_{2}l_{c2}^{2} + I_{2},\quad p_{3} = m_{2}l_{1}l_{c2},\quad p_{4} = m_{1}l_{c1} + m_{2}l_{1},\quad p_{5} = m_{2}l_{c2}.$$
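The model of the system can be obtained by using Lagrange’s method:
$$D(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) = \tau,$$
where $q = (q_{1}, q_{2})^{T} = (\theta_{1}, \theta_{2})^{T}$, $\tau = (\tau_{1}, 0)^{T}$, and
$$D(q) = \begin{bmatrix} p_{1} + p_{2} + 2p_{3}\cos q_{2} & p_{2} + p_{3}\cos q_{2} \\ p_{2} + p_{3}\cos q_{2} & p_{2} \end{bmatrix},\quad C(q,\dot{q}) = p_{3}\sin q_{2}\begin{bmatrix} -\dot{q}_{2} & -\dot{q}_{2} - \dot{q}_{1} \\ \dot{q}_{1} & 0 \end{bmatrix},\quad G(q) = \begin{bmatrix} -p_{4}g\sin q_{1} - p_{5}g\sin(q_{1} + q_{2}) \\ -p_{5}g\sin(q_{1} + q_{2}) \end{bmatrix}.$$
With the angles measured from the upright vertical, the potential energy of the double-link inverted pendulum system can be defined as
$$P(q) = p_{4}g\cos q_{1} + p_{5}g\cos(q_{1} + q_{2}), \tag{4.19}$$
and the total mechanical energy of the system is given by
$$E(q,\dot{q}) = \tfrac{1}{2}\dot{q}^{T}D(q)\dot{q} + P(q).$$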
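The energy of the double-link system can be evaluated as in the sketch below. Several things here are assumptions rather than facts from the text: the usual five-parameter form (p1 to p5) for a two-link arm, angles measured from the upright vertical so that cosines appear in the potential, and uniform-rod values for the centre-of-mass distances and inertias, which the text does not list.

```python
import math

def link_parameters(m1, m2, l1, lc1, lc2, I1, I2):
    """Five lumped parameters of the two-link model (assumed standard form)."""
    p1 = m1 * lc1 ** 2 + m2 * l1 ** 2 + I1
    p2 = m2 * lc2 ** 2 + I2
    p3 = m2 * l1 * lc2
    p4 = m1 * lc1 + m2 * l1
    p5 = m2 * lc2
    return p1, p2, p3, p4, p5

def energy(q1, q1_dot, q2, q2_dot, p, g=9.8):
    """Kinetic energy 0.5 * qdot^T D(q) qdot plus potential energy P(q)."""
    p1, p2, p3, p4, p5 = p
    d11 = p1 + p2 + 2.0 * p3 * math.cos(q2)
    d12 = p2 + p3 * math.cos(q2)
    kinetic = 0.5 * d11 * q1_dot ** 2 + d12 * q1_dot * q2_dot + 0.5 * p2 * q2_dot ** 2
    potential = p4 * g * math.cos(q1) + p5 * g * math.cos(q1 + q2)
    return kinetic + potential

# Hypothetical uniform rods: centre of mass at mid-length, I = m * l^2 / 12.
P = link_parameters(m1=1.0, m2=2.0, l1=1.0, lc1=0.5, lc2=1.0,
                    I1=1.0 / 12.0, I2=2.0 * 2.0 ** 2 / 12.0)
E_TOP = (P[3] + P[4]) * 9.8  # energy at the upright rest state
```

At the upright rest state `energy(0, 0, 0, 0, P)` equals `E_TOP`, and at the hanging rest state it equals `-E_TOP`, which gives a quick sanity check on the sign conventions.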
The control objective is to stabilize the system around its top position, i.e. $(q_{1},\dot{q}_{1},q_{2},\dot{q}_{2}) = (0,0,0,0)$. Hence, another goal set is defined by
$$G_{2} = \{(q_{1},\dot{q}_{1},q_{2},\dot{q}_{2}) : \|(q_{1},\dot{q}_{1},q_{2},\dot{q}_{2})\| \le 0.01\}. \tag{4.20}$$
When the state of the double-link inverted pendulum system is in $G_{2}$, the total mechanical energy $E$ of the system is given by
$$E(0,0,0,0) = E_{top} = (p_{4} + p_{5})g. \tag{4.21}$$
By defining a Lyapunov function
$$L(q,\dot{q}) = \tfrac{1}{2}\left(E(q,\dot{q}) - E_{top}\right)^{2},$$
the control objective can be considered either as guiding the system state into $G_{2}$ or as achieving $L(q,\dot{q}) \to 0$. The action selection of the QPSO is to make sure that the Lyapunov function of the system decreases at all time steps. The time derivative of $L(q,\dot{q})$ is
$$\dot{L}(q,\dot{q}) = \left(E(q,\dot{q}) - E_{top}\right)\dot{E}(q,\dot{q}),$$
where the time derivative of $E$ with respect to time is
$$\dot{E}(q,\dot{q}) = \dot{q}^{T}D(q)\ddot{q} + \tfrac{1}{2}\dot{q}^{T}\dot{D}(q)\dot{q} + \dot{q}^{T}G(q) = \dot{q}^{T}\tau = \dot{q}_{1}\tau_{1},$$
which shows that the derivative of $E$ is proportional to the product of the angular velocity of the first pole and the control torque. The time derivative of $L(q,\dot{q})$ with respect to time is derived as follows:
$$\dot{L}(q,\dot{q}) = -\left(E_{top} - E(q,\dot{q})\right)\dot{q}_{1}\tau_{1}.$$
To make the Lyapunov function decrease at all time steps, the direction of the control torque is assigned to be coherent with the sign of $\left(E_{top} - E(q,\dot{q})\right)\dot{q}_{1}$. A Lyapunov-based control law for the QPSO can be derived as follows:
$$\tau_{1} = \operatorname{sgn}\!\left(\left(E_{top} - E(q,\dot{q})\right)\dot{q}_{1}\right)z, \tag{4.25}$$
where $z$ is the output of the NFS, limited to [-10, 10]. The double-link inverted pendulum system parameters are $l_{1}$ = 1 m, $l_{2}$ = 2 m, $m_{1}$ = 1 kg, $m_{2}$ = 2 kg and g = 9.8 m/s². In designing the NFS, the four controller inputs are normalized between 0 and 1, and the output $z$ is limited between -10 and 10. Initial parameters of the QPSO and TSR-EA for controlling the two-pole inverted pendulum system are listed in the following two tables:
Table 4.5: The initial parameters of the QPSO for two-pole inverted pendulum system.
Parameters Value Parameters Value
$[\sigma_{\min}, \sigma_{\max}]$    [0, 2]    $c_{1}$      2.01
Table 4.6: The initial parameters of the TSR-EA for two-pole inverted pendulum system.
Parameters Value Parameters Value
$[\sigma_{\min}, \sigma_{\max}]$    [0, 2]       Thres_TimeStep    5000
$[m_{\min}, m_{\max}]$              [0, 2]       Crossover Rate    0.5
$[w_{\min}, w_{\max}]$              [-30, 30]    Mutation Rate     0.2
R                                   7            s                 50
A                                   10           N cs              350
In the TDGAR, there are five hidden nodes and five rules in the critic network and the action network. The population size is set as 300 and the maximum perturbation is set as 0.0005. In the CQGAF, after trial-and-error tests, the final average number of rules from 50 runs was 8 by using the on-line clustering algorithm. The population size is set as 50. The parameters for Q-learning are set as α = 0.01 and γ = 0.9. In the R-GCSE, the population size is set as 50 and the mutation rate is set as 0.1.
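The torque law of Eq. (4.25), together with the input normalization and output limit described for the NFS, can be sketched as follows. The function names are hypothetical, the normalization bounds are left as arguments because the text does not list them, and using the magnitude |z| so that the Lyapunov term alone fixes the torque direction is an assumption.

```python
def sgn(value):
    """Sign convention of the text: +1 for value >= 0, otherwise -1."""
    return 1.0 if value >= 0 else -1.0

def normalize(value, low, high):
    """Map a controller input from [low, high] to [0, 1], clipping outside."""
    scaled = (value - low) / (high - low)
    return min(max(scaled, 0.0), 1.0)

def lyapunov_torque(energy_now, e_top, q1_dot, z, z_max=10.0):
    """Eq. (4.25) with |z| clipped to z_max (using |z| is an assumption)."""
    direction = sgn((e_top - energy_now) * q1_dot)
    return direction * min(abs(z), z_max)
```

For instance, with the energy below its top value and link 1 rotating in the negative direction, `lyapunov_torque(40.0, 44.1, -0.5, 4.0)` returns a torque of -4.0, so that the product of angular velocity and torque adds energy to the system.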
For the TDGAR, CQGAF and R-GCSE, the original successful region of the variables is −36° ≤ θ1 ≤ 36° and −36° ≤ θ2 ≤ 36°. Initial states of the plant are set within the original successful region. The control goal is defined as “maintaining the plant’s state within G2 for 100,000 time steps.” A trial ends when the control goal is met or a failure occurs, which means that either pendulum deviates beyond the original successful region.
For the TSR-EA, the original successful region of the variables is −36° ≤ θ1 ≤ 36° and −36° ≤ θ2 ≤ 36°. The strict successful region designed by the TSR is defined in Eq. (3.7). Initial states of the plant are set within the original successful region. The control goal is defined as “maintaining the plant’s state within G2 for 100,000 time steps.” A trial ends when the control goal is met or a failure occurs, which means that either pendulum deviates beyond either the original successful region or the strict successful region.
For the QPSO, the original successful region of the variables is −90° ≤ θ1 ≤ 90° and −90° ≤ θ2 ≤ 90°. Initial states of the plant are set within the original successful region, which represents the whole input space. The control goal is defined as “bringing the plant’s state to G2 within 5,000 time steps and maintaining it within G2 for 100,000 time steps.” A trial ends when the control goal is met or a failure occurs, which means that it exceeds 105,000 time steps.
Each algorithm is executed 50 times to compute the average. The performances of all these compared methods are shown in Table 4.7, from which we can see that the QPSO and TSR-EA have a better control rate. The ability of the QPSO to provide a reliable control result during initial learning is still obvious from the control results of the first 10% of learning trials.
Table 4.7: Summary Statistics of Example 2.
Methods                              QPSO       TSR-EA     TDGAR        CQGAF       R-GCSE
% of first 10% trials meeting goal   86         14         2            32          56
% of trials meeting goal             94         88         46           68          82
Time to goal, first 10% trials       40.8±1.9   66.3±2.4   308.2±0      90.6±7.2    45.91±19.8
Average time to goal                 34.6±2.2   57.7±6.6   276.8±31.9   76.2±13.1   131.7±16.5
The testing results, which consist of the angle and angular velocity of both pendulums, are shown in Figs. 4.9-4.13 as follows. Each line in Figs. 4.9-4.13 represents the first 1,000 control time steps of a single run.
Figure 4.9: Control results of the first 1,000 time steps of 50 runs of the double-link inverted pendulum system using the QPSO. (a) Angle of link 1. (b) Angle of link 2. (c) Angular velocity of link 1. (d) Angular velocity of link 2.
Figure 4.10: Control results of the first 1,000 time steps of 50 runs of the double-link inverted pendulum system using the TSR-EA. (a) Angle of link 1. (b) Angle of link 2. (c) Angular velocity of link 1. (d) Angular velocity of link 2.
Figure 4.11: Control results of the first 1,000 time steps of 50 runs of the double-link inverted pendulum system using the TDGAR. (a) Angle of link 1. (b) Angle of link 2. (c) Angular velocity of link 1. (d) Angular velocity of link 2.
Figure 4.12: Control results of the first 1,000 time steps of 50 runs of the double-link inverted pendulum system using the CQGAF. (a) Angle of link 1. (b) Angle of link 2. (c) Angular velocity of link 1. (d) Angular velocity of link 2.
Figure 4.13: Control results of the first 1,000 time steps of 50 runs of the double-link inverted pendulum system using the R-GCSE. (a) Angle of link 1. (b) Angle of link 2. (c) Angular velocity of link 1. (d) Angular velocity of link 2.
From Figs. 4.9-4.13 we can see that the proposed QPSO and TSR-EA have better control accuracy, which is one of the major benefits of applying Lyapunov design principles or the TSR mechanism. The testing results of the last 1,000 control time steps of the QPSO and TSR-EA are shown in Fig. 4.14 and Fig. 4.15 as follows. From Fig. 4.14 and Fig. 4.15 we can see that, with two different kinds of mechanism, the QPSO and TSR-EA are able to attain accurate control results. Trajectories shown in Fig. 4.14 and Fig. 4.15 verify the ability of the QPSO and TSR-EA to maintain their environment in G2.
Figure 4.14: Control results of the last 1,000 time steps of 50 runs of the double-link inverted pendulum system using the QPSO. (a) Angle of link 1. (b) Angular velocity of link 1. (c) Angle of link 2. (d) Angular velocity of link 2.
Figure 4.15: Control results of the last 1,000 time steps of 50 runs of the double-link inverted pendulum system using the TSR-EA. (a) Angle of link 1. (b) Angular velocity of link 1. (c) Angle of link 2. (d) Angular velocity of link 2.