4.4 DCC Design
4.4.3 FQ-DCC Scheme
This section provides the detailed design of the fuzzy Q-learning based SOC (FQ-SOC).
As shown in Fig. 4.6, the interaction between FQ-SOC and the system at each time instant consists of the following procedures. Based on the information imported from the environment, FQ-SOC identifies the state s. For the current state, the Takagi-Sugeno FIS calculates the truth value of each rule for the input vector [66]. The feasible action set is also approximated based on the current state. For each rule, an action is selected using the exploration and exploitation policy (EEP), and the eligibility factor of each rule is then calculated. The resultant action is converted to the pilot power level of the base station. The reward signal is measured from the system and fed back to FQ-SOC to update the Q-function and the q-value of each rule.
Figure 4.6: Fuzzy Q-learning-based dynamic cell configuration (FQ-DCC) scheme.
A. Representation of Q-function by a FIS
A FIS is based on fuzzy set theory, fuzzy IF-THEN rules, and fuzzy reasoning, as shown in Fig. 4.7. The fuzzifier maps the observed input x to a fuzzy set T(x). The fuzzy rule base is a set of linguistic statements, derived from the designer's knowledge and experience, in the form of IF-THEN rules that describe a fuzzy logic relationship between the input x and the output y. According to the fuzzy rules and the input linguistic terms T(x), the inference engine performs an implication function, a decision-making logic that uses an inference method to obtain the output linguistic terms T(y). The defuzzifier applies a defuzzification function to convert T(y) into a crisp value y. Consider the Sugeno fuzzy model [67], which is the most widely applied model owing to its transparency, the high interpretability of its fuzzy rules, and its systematic approach. It builds the fuzzy rule base from a crisp linear function, y = g(T(y)), where g(·) is a weighted sum in polynomial form.
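The fuzzifier and defuzzifier stages described above can be illustrated with a minimal sketch. The triangular membership shape, the three labels, and their breakpoints are assumptions made only for illustration; the text does not specify the membership functions used in FQ-SOC.

```python
def triangular(x, left, peak, right):
    """Membership degree of x in a triangular fuzzy set (assumed shape)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical term set on a normalized input in [0, 1].
TERMS = {
    "LO": (-0.5, 0.0, 0.5),
    "ME": (0.0, 0.5, 1.0),
    "HI": (0.5, 1.0, 1.5),
}


def fuzzify(x):
    """Map a crisp input x to its fuzzy set T(x): one degree per label."""
    return {label: triangular(x, *abc) for label, abc in TERMS.items()}


def sugeno_output(x, consequents):
    """Zero-order Sugeno output: weighted average of crisp rule outputs.

    Assumes x lies inside the covered range, so at least one degree is nonzero.
    """
    degrees = fuzzify(x)
    total = sum(degrees.values())
    return sum(degrees[label] * consequents[label] for label in degrees) / total
```

With these assumed breakpoints, an input halfway between the LO and ME peaks belongs equally to both sets, so the crisp output is the midpoint of the two rule consequents.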
Figure 4.7: The structure of the fuzzy inference system (FIS).
Assume the state of rule k is Sk = {Sk,1, · · · , Sk,Z}, where Sk,z is the fuzzy label for the z-th input variable in rule k. For input variable x, the FIS function can be defined by the elementary rules k = 1, · · · , K with format
IF x is Sk THEN y= ak WITH qk,
where ak and qk are the action and the q-value of rule k. Thus, representing Q(x, a(x)) by a FIS is equivalent to determining qk. With the input and output variables represented by x and y, respectively, the Q-value of input x and action a(x) is given by the FIS function as
x → y = Q(x, a(x)) = FIS(x), (4.25)
where the input vector x = (x1, x2, · · · , xZ) of size Z represents the system state s.
In FQ-SOC, two input linguistic variables, (ϖM, ϖV), are considered for the input vector x. Their fuzzy term sets are defined as T(ϖM) = {Extremely Low, Medium Low, Low, Medium, High, Extremely High} = {EL, ML, LO, ME, HI, EH} and T(ϖV) = {Extremely Low, Low, Medium, High, Extremely High} = {EL, LO, ME, HI, EH}.
The fuzzy rule base has dimension |T(ϖM)| × |T(ϖV)|, so FQ-SOC has 30 (6 × 5) rules.
The output linguistic variable represents the possible actions for the allocation of the pilot power fraction fb. Thus, the pilot power of base station b, PbI, is equal to fb × P̃b. Here, the output dimension is designed as 13.
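As a rough sketch of the resulting data structures, the rule base and the per-rule q-values could be laid out as follows. The bounds F_MIN and F_MAX and the even spacing of the 13 candidate fractions are hypothetical; the text only fixes the term sets, the rule count, and the output dimension.

```python
from itertools import product

T_M = ["EL", "ML", "LO", "ME", "HI", "EH"]  # term set of the first input variable
T_V = ["EL", "LO", "ME", "HI", "EH"]        # term set of the second input variable
NUM_ACTIONS = 13                            # output dimension stated in the text

# One rule per combination of input labels: 6 x 5 = 30 rules.
RULES = list(product(T_M, T_V))

# Assumption: candidate pilot power fractions evenly spaced on [F_MIN, F_MAX].
F_MIN, F_MAX = 0.05, 0.20  # hypothetical bounds, not taken from the text
FRACTIONS = [
    F_MIN + j * (F_MAX - F_MIN) / (NUM_ACTIONS - 1) for j in range(NUM_ACTIONS)
]

# q[k][j]: q-value of candidate action j in rule k, initialized to zero.
q = [[0.0] * NUM_ACTIONS for _ in RULES]


def pilot_power(fraction, reference_power):
    """Pilot power as the fraction f_b times the reference power of the base station."""
    return fraction * reference_power
```

Each rule then owns one row of 13 q-values, which is the table the learning strategy updates.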
Assume J is the overall action set. For every rule k, each j ∈ J indexes a candidate action ak(j) with its associated q-value qk(j). Moreover, since a greedy policy can easily drive the system toward locally optimal solutions, it is necessary for the FQ-SOC scheme to visit all possible actions for all states. This is the so-called exploration/exploitation dilemma. Here, a pseudo-exhaustive policy is applied, in which the rule action j with the best q-value is chosen with probability Pr(j|qk(j)), given by the Boltzmann distribution; otherwise, the action least recently chosen in the given rule is selected. Through the exploration/exploitation policy (EEP), action ak(j*) is selected from the action set ak(j), where j* ∈ J:
x → ak(j*) = EEP(x). (4.26)
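One possible reading of the pseudo-exhaustive policy for a single rule is sketched below. The temperature parameter, the `last_chosen` bookkeeping, and the function names are assumptions introduced for illustration; the text states only that the best action is taken with a Boltzmann probability and that the least recently chosen action is taken otherwise.

```python
import math
import random


def boltzmann_prob_of_best(q_values, temperature=1.0):
    """Boltzmann (softmax) probability assigned to the action with the best q-value."""
    q_max = max(q_values)
    # Shift by q_max for numerical stability; the best action has weight exp(0) = 1.
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    return 1.0 / sum(weights)


def eep_select(q_values, last_chosen, step, temperature=1.0, rng=random):
    """Pick an action index for one rule with a pseudo-exhaustive policy.

    last_chosen[j] holds the step at which action j was last selected
    (use -1 for "never"); it is updated in place.
    """
    indices = range(len(q_values))
    best = max(indices, key=lambda j: q_values[j])
    if rng.random() < boltzmann_prob_of_best(q_values, temperature):
        choice = best  # exploit: take the action with the best q-value
    else:
        choice = min(indices, key=lambda j: last_chosen[j])  # explore: least recently chosen
    last_chosen[choice] = step
    return choice
```

Passing `rng` explicitly keeps the sketch testable; in a simulation the default `random` module would be used.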
Therefore, in the context of the Sugeno fuzzy model for the FIS [66], the output of the FIS function, comprising the inferred action A(x) and its associated Q-value Q(x, A(x)), becomes:
A(x) =
in which ϕSk,i(xi) is the membership degree of input xi in the corresponding fuzzy set [62], [63]. Note that the inferred action output represents the estimated optimal level of the pilot power, denoted as P̂bI.
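As a minimal sketch, assume the standard zero-order Sugeno form, in which the inferred action and its Q-value are averages of the per-rule choices ak(j*) and qk(j*), weighted by the normalized truth values αk(x) / Σk αk(x) that also appear in (4.31). The product of membership degrees used for αk(x) is likewise an assumption, not a formula taken from the text.

```python
def truth_value(memberships):
    """Truth value alpha_k of one rule: product of its input membership degrees.

    The product t-norm is an assumption made for this sketch.
    """
    alpha = 1.0
    for degree in memberships:
        alpha *= degree
    return alpha


def infer(alphas, chosen_actions, chosen_q):
    """Inferred action A(x) and Q(x, A(x)) as truth-value-weighted averages.

    alphas[k] is the truth value of rule k; chosen_actions[k] and chosen_q[k]
    are the action and q-value selected for rule k by the EEP.
    """
    total = sum(alphas)
    if total == 0.0:
        raise ValueError("no rule is activated by the input")
    action = sum(a * u for a, u in zip(alphas, chosen_actions)) / total
    q_value = sum(a * v for a, v in zip(alphas, chosen_q)) / total
    return action, q_value
```

Under these assumptions, rules with a larger truth value pull the inferred pilot power fraction more strongly toward their own selected action.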
B. Fuzzy Q-Learning Strategies
In the following, the time index t is added as a subscript to some symbols to make the time sequence of the learning process explicit. To update the Q-value, the optimal Q-value of the next state xt+1 in (4.22), Q∗(xt+1, A(xt+1)), then becomes:
Q∗(xt+1, A(xt+1)) =
according to eligibility, the degree to which a state has been visited in the recent past. The eligibility trace constitutes a short-term memory of how frequently a state has been visited. It decays exponentially over time unless it is reactivated by a new visit. Let ek(j) be the replace eligibility of possible action ak(j) in rule k. The eligibility factor is updated after the EEP has chosen the conclusion action ak(j*):
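\[
e_k(j) =
\begin{cases}
\gamma_d \lambda_L e_k(j) + \dfrac{\alpha_k(x_t)}{\sum_{k=1}^{K} \alpha_k(x_t)}, & \text{if } j = j^{*}, \\
\gamma_d \lambda_L e_k(j), & \text{otherwise.}
\end{cases}
\tag{4.31}
\]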
Also, the elementary quality can be immediately updated by:
Δqk(j) = ε × ΔQ × ek(j). (4.32)
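The eligibility update (4.31) and the increment (4.32), together with its accumulation into the q-values, can be sketched as follows. The list-of-lists layout and the parameter names are illustrative assumptions, and ΔQ is taken as a given input rather than computed here.

```python
def update_eligibility(e, alphas, chosen, gamma, lam):
    """Eq. (4.31): decay every trace, then reinforce the chosen action of each rule.

    e[k][j] is the eligibility of action j in rule k, alphas[k] is the truth
    value of rule k, and chosen[k] is the index j* selected for rule k.
    """
    total = sum(alphas)
    for k, row in enumerate(e):
        for j in range(len(row)):
            row[j] *= gamma * lam           # exponential decay of all traces
        row[chosen[k]] += alphas[k] / total  # normalized truth value for j = j*


def update_q(q, e, delta_q, learning_rate):
    """Eq. (4.32) applied in place: q_k(j) += learning_rate * delta_Q * e_k(j)."""
    for k, row in enumerate(q):
        for j in range(len(row)):
            row[j] += learning_rate * delta_q * e[k][j]
```

Because every trace decays on each step, an action that is not reselected contributes less and less to later updates.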
Thus,
qk(j) = qk(j) + Δqk(j). (4.33)
C. Feasible Action Selection and Exploitation and Exploration Policy
The feasible action set As ⊂ A can be obtained based on the current state s. The proposed FQ-DCC scheme applies a simple strategy for feasible action selection: the state ϖM is adopted as an indicator to classify the feasible action sets. For example,
\[
A_s =
\begin{cases}
\{f_{\min}, \cdots, f_{\Theta}\}, & \text{if } \varpi_M \ge \Theta, \\
\{f_{\Theta}, \cdots, f_{\max}\}, & \text{otherwise,}
\end{cases}
\tag{4.34}
\]
where fΘ is the cutting value of the action set, fΘ ∈ [fmin, fmax], and Θ is the threshold of the mean power that serves as the quality-of-service constraint. Since a greedy policy can easily drive the system toward locally optimal solutions, it is necessary to visit all possible actions for all states. This is the so-called exploration/exploitation dilemma. An action for state s, a(s), is selected from the feasible action set As using the exploration and exploitation policy.
Here, the pseudo-exhaustive policy is applied, in which the action with the best Q-value is chosen with a probability based on the Boltzmann distribution; otherwise, the least recently visited action is chosen. The resultant action is converted to the pilot power of the base station. The reward signal is measured from the system and fed back to update the Q-function.
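The feasible-set rule in (4.34) can be sketched as a simple slice of the ordered action list. The argument names and the use of an index to locate fΘ are assumptions for illustration.

```python
def feasible_actions(fractions, mean_power, threshold, cut_index):
    """Eq. (4.34): lower part of the action set if mean_power >= threshold, else upper part.

    fractions is sorted in ascending order (f_min ... f_max); cut_index is the
    position of the cutting value f_Theta, which belongs to both subsets.
    """
    if mean_power >= threshold:
        return fractions[: cut_index + 1]  # {f_min, ..., f_Theta}
    return fractions[cut_index:]           # {f_Theta, ..., f_max}
```

The EEP would then be applied only to the returned subset, so that actions outside As are never selected for the current state.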