Self-learning Fuzzy Controller with a Fuzzy Supervisor
Chiy-Ferng Perng, National Taiwan University, Taiwan, R.O.C.
Yung-Yaw Chen, National Taiwan University, Taiwan, R.O.C.
Abstract
Through the observation of the learning process of human, we can find punishments and rewards playing a somewhat important role. The judgment of whether the action should be punished or rewarded and the decision of the degree of punishments or rewar& are two major problems in such a scheme. In this paper, we propose a
fuzzy rulebase to provide the guiding criterion for the self- learning fuzzy controller to construct its f m z y control rulebase.
1. Introduction and Motivation
Since Prof. L. A. Zadeh proposed the concept of fuzzy sets in 1965 [I], fuzzy sets theory has become a popular and fast-growing research topic. Fuzzy sets theory is used in many fields to solve practical problems in the real world. The applications of fuzzy sets theory range fiom consumer electronics to medical diagnosis. One of the most successful applications is fuzzy control. The first
fuzzy
control steam engine was produced in 1975 [2]. Since then, in literature, one can find many fruitful examples of fuzzy control applications. However, there exists one problem: construction of a workable fuzzy controller.
A
fuzzy
controller is composed of 4 elements:k i f i c a t i o n , defuzzification, the inference engine, and the rulebase. Basically speaking, the processes of fuzzification, defuzzification, and inference engine are easily to establish because there are certain formulae to follow. Only the construction of the rulebase could be varied with different usage. The rulebase is usually derived from experts' experiences. Nevertheless, only in few control problems one can find so-called experts. Even though, the transformation of experts' experience into a linguistic rulebase is also a difficult job. Learning schemes seem a good way to solve such a problem.
Michie [3] divided the concerned space into several boxes. The learning machine helped the boxes to adjust their output according to the behaviors of the system. Barto and Sutton [4] used the similar structure as Michie's boxes system. They used associate critic elements (ACE) and associate search elements (ASE) to evaluate the effect of the default control effort and correct the output of each box. By the method called "reinforcement learning", the controller can learn how to control a complex plant. Anderson [5] used two neural nets to substitute the functions of ASE and ACE. He called them "evaluation
network (EN)" and "action network (AN)". C. C . Lee [6]
also used the scheme of the reinforcement learning. His controller's structure is the same as Barto's controller but
he used fuzzy rules to control the plant. His ASE is the combination of a fuzzy controller and an associate learning neuron (ALN). ALN adjusted the membership function of the output of each rule. Berenji [7][8] combined Anderson's work with a fuzzy controller. His fuzzy controller is also a neural network. This neural net learned the behavior of the f h z y controller. C. T. Lin and G. C.-S. Lee [9] also developed a hybrid fuzzy controller. This controller uses two learning algorithm, backpropagation and self-organization to construct a neural network. It succeeded to create a new rulebase and make a car pass a curve road through learning. Besides, several other methods have been proposed for the construction the controller by self-learning. [ 1 O ] [ 1 I]
These self-learning schemes indeed solve some problems caused by the construction of rulebases. However, the methods often entangle with complex computations or many meaningless weighting values. In fact, the learning process of a human is usually complex but meaningful and purposeful. The easiest learning process is punishments and rewards. Punishments make clear what should not be done. Rewards let one know what he or she could do or even do better. If the concept of punishments and rewards could be applied to the construction of a fuzzy controller's rulebase, it should make the construction of a rulebase reasonable and improve the efficiency of building the ruiebase.
In this paper we propose a fuzzy supervisor to realize the concept of punishments and rewards. In section 11,
details about the fuzzy supervisor are given. How to add this fuzzy supervisor to a fuzzy controller is explained in Section 111. Section IV contains the simulation results of the inverted pendulum. Our conclusions are given in the last section.
2.
The Fuzzy
Supervisor
Let's consider a plant, whose control purpose is to keep balance in zero state, position zero and velocity zero. When the previous states, velocity and position, are both positive large, the position of the plant's next states will become further from zero without any control signal. Obviously, the control force shouId let the velocity move towards negative to slow down the increase in position or even make the position moving reversely. If the controller- generating force makes velocity move toward negative, then this action should be encouraged, otherwise, and it should be punished..
At first, we have to defiie what the punishment is and what the reward is. The practical method to realize these concepts is to quantify them into small real number. Thus we define them in the domain [-l,l]. Positive number means a reward
and
negative number means a punishment. For a second-order system, the concerned area is divided into four blocks: ( position, velocity ), (+,+), (+,-), (-,+), and (-,-). Then it is necessary to find some parameters of the system to be the reference for the evaluation or criterion. Since we tried to implement a fuzzy rulebase to be the criterion generator. Appropriate sets of parameters are important. In our system, the parameters chosen are not ordinary errors and error dots. What we used are shown in figure 1. At fxst, the error and error dot should be normalized. Normalization makes thevalues of the concemed states bounded by 1 and -1. The criterion given by the supervisor is concerned the system state at time t, X(t), and the state at time t+l, X(t+l). Ra-1 represent the distance between origin and X(t). Ra-2 is the distance between X(t) and X(t+l). An-1 is the angle between e axis to X(t). An-2 is the angle between the vertical line through X(t) and ra-2. After all, the rule in the
fuzzy supervisor is like :
IF an-1 is big and an-2 is big and ra-1 issmall Then EV is a,
where EV is the criterion of the action from X(t) to X(t+l), which is given by the fuzzy supervisor. a is a constant, so the output is a singleton.
State at
-
State at time t*
e 180° 135 90..
ki
...
-'>a:r>c,
450oo
K
a n 2 O 0 45 O 90 an-1Figure 1. The parameters used by the
fuzzy
supervisorBased on the above-mentioned thoughts, we develop a fuzzy rulebase, which is so-called hz z y supervisor. The output a is a singleton. The domain (+,+) and (-,-) shares the same output function because the behavior are similar in these two domain. For example, both elements ofX(t) in the second order system are positive. If the control force makes the state toward more positive, the criterion should be negative. On the contrary, in the (-,-) domain, both elements of X(t) are negative and the control force makes the state toward negative. It is obvious that criterion should be also negative. We can see that the criterions are symmetric in the two domain.
As the same reason, in the domain (-,+), the criterion
functions are also symmetric to the ones in area(+,-). The strategy is to keep the state in this area and not move towards extreme situation. Compare X(t) and X(t+ I), when the error dot of X(t+l) moves towards positive, then the criterion mechanism will give more serious punishments and hope the controller to make the error dot move towards negative. According to the above description, we have construct total 144 rules. And the outputs of the rules of the fuzzy supervisor are shown as following.
The Rulebase Output on 1,3 Quadrant
1 0.5 0 0.5 -1 ' 10 20 30 40 50 60 70
The Rdebase Output on 2,4 Quadrant
I 0.5 0 0.5 -1 - ' 10 20 30 40 50 60 70
Figure 2. The output of the fuzzy supervisor
3. The Structure of the Heuristic Criterion
Self-learning
Fuzzy Controller
We have made some changes in the structure of the fuzzy controller in this paper. Every contol rule in the rule base will be simplified into one round cell in the concerned space. And the parameters of a rule are also simplified into center of the cell, radius of the cell, and output of the cell. Figure 3 shows the basic structure of this kind of cellular fuzzy controller.F U Z Z Y R u l e Cellular Fuzzy
I Rule
c _ _ _ _ _ _ _ _ _ - _ _ _ _ _ _ _ _ _ - - - - _ - -
Figure 3. A cellular fuzzy rule
In the cellular fuzzy controller, the fuzzification process slightly differs &om traditional fuzzy controllers. At first the fire strength depends on the distance between the rule center and the states :
r
-
J(q-
c;12+-
(x2 - - c ~ > ~r (1)
f - S =
where f-s means fwe strength of the rule, and r is the radius of the rule. (xI , x, ) is the current state. ( c;
,
c$ ) is the center of the n-th rule. Of course if, the distance between the default states and the rule center is larger than the rule radius, then the fire strength of the rule will be zero. The states will be normalized before they are sent into the controller. The output of one rule is only one crisp number not a fuzzy set. The detiuzification we used is like weighting sum :C
f - 5 , x f nC
f - s n 7 (2) = n=l, ..., N n=l, ..., Nwhere F is the final output of the fuzzy controller.f, means the nth rule output.
At the beginning of the time period ( t- l ), the system
sensor will feed the plant states at the end of time period (t-2) back to the controller and the criterion mechanism. The fuzzy controller will send a control signal to the plant according to the states. The plant acts under this control signal at the time period (t-1). The criterion mechanism will receive states at the end of time period (t-1). Then it
will evaluate the fitness of the control action adopted in time period (t-1) according to the plant states at the end of time period (t-2) and (t-1). It sends the
p
signal as the result of evaluation. The output of rules will be adjusted according to the following formula:f
(t + 1) =f
( t ) +a x EV(t) x h(t) x ra - 1 x sign(F(t) - F ( t
-
1))(3) where ci is the learning coefficient, EV is the evaluation
ffom the fuzzy supervisor, and h(t) is the fire strength history :
(4) h ( t ) = 0 . 9 x f -s(t)+O.lxh(t-l)
F(t) is the output force at the time period (t). The adjusted
rule outputs will be the bases of controller output at the time period (t). In (3), the sign function is used to express the moving direction of status’ error dot. The structure of our self-learning fuzzy controller with a fuzzy supervisor is shown in figure 4.
Figure 4.The structure of the fuzzy controller with a fuzzy supervisor
4.
$ i m u ~ ~ t i o n
Results
We use the control of the inverted pendulum to test the effect of the self-learning fuzzy controller with a fuzzy supervisor. In the inverted pendulum problem, we will concentrate on the angle control. The basic data of our cart and pole are as follows: length of the pole is 0.5m; mass of
the pole is 0.3kg; mass of the cart is 0.5kg; friction coefficient of pole is 0.000002; friction coefficient of cart is 0.0005; The concerned area: the angle ranges from -12 deg to 12 deg and the angular velocity range from -50 deglsec to 50 deglsec.
The learning coefficient is :
2 1 5 1 f B 0 5 8 5
g
& - 0 5 - - m -1 -1 5 -2 - 3 10 20x- 10+ trial,
. . . . . __. --.
.__
-
___
_____
- i_ --.---
--
._*--__
1 ' 8 ---
-..-
-
- i _ - y _ _ - ..-- o-.-- ' a ; _?_-...A _- --
-- _.-__-- -_ ~ _ _ _
__._. ___-
-I_.-
...-...
___.. --- .- ! I (4) 1.5where trial is the number of training process, trial = 1,
....
48. The concerned area is divided into 5*5=25 cells.In the learning process of the inverted pendulum, the time interval is 0.02 second. The initial outputs of control rules are generated at random and the amplitudes are limited between 10 Nts and -10 Nts. The number of learning processes is 48. The learning time steps in each learning process are 1000. The initial point of each learning process is also generated at random. That will reduce the dependence of a constant initial point.
Figure 5 shows the angle trajectory in the 27'h trial. Figure 6 is the trajectory of angular velocity during the 27" process. Figure 7 shows the final output of rules after 48 trials. After the learning procedure, we choose four initial states to test the controllability of the newly generated fuzzy controller and take off the learning part of the controller. The result is satisfactory. This controller indeed can achieve the request. Figure 8 shows the angle trajectory of the four tests. And figure 9 shows the detail
of the 31d test at final 200 steps. The 27th Training I ' 1 , : ! I : [ : ' i : . : . > : : / .. .. .... L ... i._ - :--. .. . . . . : , ! !: . !. :. . ... -..; 1
.-
5 10 15 20 25Figure 7. The outputs of rules after training Figure 5. The angle trajectory of the 27" train trial
The Rulebase Output After Training
10 8 6 4 - 2 t
g o
-2 -4 -6 -8 U _ ~ - ,.--I
-:I
-~
-~-:
-1 5 . --...
-..
-
.. 100 200 300 400 500 600 700 800 900 time ( 0 02 seclstep ) -2Test 1 Test 2
200 400 600 800 200 400 600 800
Time (0.02sedstep) , T m (0.02Sedstep)
1 -
.
1 1 1! /i ii ' G O . 5 . I 1 a 0 . 5 . i I . ' i . : . . :! I ! !
: ; /. .. . . _ / E / . . . . 200 400 600 800Time (0.02sedstep) Test 4
- 1 ; ; ' :
200 400 600 800
Time (0.02sedstep) Test 3
Figure 8. The angle trajectories in the four tests
T e s t 4 in time step 0 0 0 to 1 0 0 0 1 . 5 4 7 9 : 'O" 1 1 5 4 7 9 1 5 4 7 9 9 l 5 4 7 9 9 a 1 5 4 7 9
e
1 5 4 7 0 1 5 4 7 0 " " i" " li I " , " " ,II
Figure 9. The final 200 steps of the 31d test
V. Conclusions
I 5 4 7 8 I 5 0 1 0 0 1 5 0 2 0 0 2 5 0 1 5 4 7 8 Time (0 OZsechtep)The fuzzy supervisor indeed helps the self-learning fuzzy controller to establish a usable rulebase. The performance of the fizzy controller is good. But there are no mechanisms to inform the controller when it should stop learning. We have to stop learning at a fixed time. So the oscillation around the equilibrium is possible.
The learning process is reasonable, so we can easily tune the learning object as we wish. Moreover, the radius and the location of the cell should be also adjustable. This
will be our next object.
In our train, we introduce the history of punishments and rewards into the learning process. But the effects in not clear. We should think more about this in the future.
References
[l] L. A. Zadeh, 'Fuzzy sets', Information and Control, vol. 8, pp. 338-353, 1965
[2] E. H. Mamdani, 'Application of h z z y algorithms for control of simple dynamic plants', Proc. IEE 121(12), pp.1585-1588, 1974
[3] D. Michie and R. A. Chambers, 'Boxes' as a model of pattern-formation', in Toward a Theoretical Bioloa, vol.
1, Prolegomena, C. H. Waddington, ed. Edinburgh Univ. Press, 1968, pp. 206-215
[4] A. G. Barto, R. S. Sutton, and C. W. Anderson,
'Neuronlike adaptive elements that can solve difficult learning control problems', Trans. SMC, vol. 13, no. 5, 1983, pp. 834-846
[5] C. W. Anderson, 'Strategy learning with multilayer connectionist representation', Tech. Rep. TR87-509.3,
GTE Laboratories Inc., May 1988
C. C. Lee, 'Self-learning rule-based controller employing approximate reasoning and neural-net concepts', Int. J.
Intelligent Systems, vol. 6, pp. 71-93, 1991
H. R. Berenji, 'A reinforcement learning-based architecture for fuzzy logic control', Int. J. Approximate Reasoning, no. [6]
[7l
6, 1992, pp.267-299
[8] H. R. Berenji and P. Khedkar, 'Learning and tuning fuzzy logic controllers through reinforcements', ZEEE Trans. Neural Network, Sep. 1992, pp. 724-740
C.-T. Lin and C.-S. G. Lee, 'Neural-network-based fuzzy
control and decision system', IEEE Trans. Computer, Dec. [9]
1991, pp. 1320-1336
[lo] Y.-Y. Chen, K.-Z. Lin, and S.-T. Hsu, 'A self-learning fuzzy controller', IEEE Internationl Conference on Fuzzy Systems, San Diego, USA, March 8- 12, 1992
[ll] C.-F. Pemg and Y.-Y. Chen, 'A heuristic criterion
algorithm of the self-learning fuzzy controller',
NAFIPS '97, Syracuse, NY, USA, September 21-24, 1997