FQ-DCC Scheme - DCC Design - A Study of Downlink Soft Handoff Mechanisms and Cell Reconfiguration Planning in Hybrid CDMA Cellular Networks

4.4 DCC Design

4.4.3 FQ-DCC Scheme

In this section, the detailed design of the fuzzy Q-learning-based SOC (FQ-SOC) is provided.

As shown in Fig. 4.6, the interaction between FQ-SOC and the system at each time instant consists of the following steps. Based on the information imported from the environment, FQ-SOC identifies the state s. For the current state, the Takagi-Sugeno FIS calculates the truth value of each rule for the input vector [66], and the feasible action set is obtained from the current state. For each rule, an action is selected using the exploration and exploitation policy (EEP), and the eligibility factor of each rule is then calculated. The resultant action is converted to the pilot power level of the base station. The reward signal is measured from the system and fed back to FQ-SOC to update the Q-function and the q-value of each rule.

Figure 4.6: Fuzzy Q-learning-based dynamic cell configuration (FQ-DCC) scheme.

A. Representation of Q-function by a FIS

A FIS is based on the concepts of fuzzy set theory, fuzzy IF-THEN rules, and fuzzy reasoning, as shown in Fig. 4.7. The fuzzifier maps the observed input x to a fuzzy set T(x). The fuzzy rule base is a set of linguistic statements, derived from the designer's knowledge and experience and written as IF-THEN rules, that describe a fuzzy logic relationship between the input x and the output y. According to the fuzzy rules and the input linguistic terms T(x), the inference engine performs an implication function, a decision-making logic that uses an inference method to obtain the output linguistic terms T(y). The defuzzifier applies a defuzzification function to convert T(y) into a crisp value y. Here we consider the Sugeno fuzzy model [67], which is the most widely applied model owing to the transparency and interpretability of its fuzzy rules and its systematic design approach. It builds the fuzzy rule base from a crisp linear function, y = g(T(y)), where g(·) is a weighted sum in polynomial form.

Figure 4.7: The structure of the fuzzy inference system (FIS).

Assume the state of rule k is $S_k = \{S_{k,1}, \cdots, S_{k,Z}\}$, where $S_{k,z}$ is the fuzzy label of the z-th input variable in rule k. For input x, the FIS function can be defined by the elementary rules k = 1, · · · , K of the form

IF x is S_k THEN y = a_k WITH q_k,

where a_k and q_k are the action and the q-value of rule k. Thus, representing Q(x, a(x)) by a FIS is equivalent to determining q_k. With the input and output variables denoted by x and y, respectively, the Q-value of input x and action a(x) is given by the FIS function as

\[ x \to y = Q(x, a(x)) = \mathrm{FIS}(x), \tag{4.25} \]

where the input vector $x = (x_1, x_2, \cdots, x_Z)$ represents the system state s and Z is the size of the vector.

In FQ-SOC, two input linguistic variables, $\varpi_M$ and $\varpi_V$, are considered for the input x. Their fuzzy term sets are defined as $T(\varpi_M)$ = {Extremely Low, Medium Low, Low, Medium, High, Extremely High} = {EL, ML, LO, ME, HI, EH} and $T(\varpi_V)$ = {Extremely Low, Low, Medium, High, Extremely High} = {EL, LO, ME, HI, EH}.

The fuzzy rule base has dimension $|T(\varpi_M)| \times |T(\varpi_V)|$, so FQ-SOC uses 6 × 5 = 30 rules.
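As a minimal illustration of how the rule base is spanned by the two term sets, the sketch below enumerates one rule per pair of input terms. The variable names `T_M`, `T_V`, and `rules` are chosen here for illustration, and the 0-based ordering of the rules is an assumption; the text fixes only the term sets and the rule count.

```python
from itertools import product

# Term sets as listed in the text (abbreviations only).
T_M = ["EL", "ML", "LO", "ME", "HI", "EH"]  # terms of the first input variable
T_V = ["EL", "LO", "ME", "HI", "EH"]        # terms of the second input variable

# One rule per combination of input terms (ordering is an assumption).
rules = list(product(T_M, T_V))

print(len(rules))  # 30
print(rules[0])    # ('EL', 'EL')
```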

The output linguistic variable represents the possible actions for allocating the pilot power fraction $f_b$. The pilot power of base station b, $P_b^I$, is thus equal to $f_b \times \tilde{P}_b$. Here, the output dimension is set to 13.
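The mapping from an action index to a pilot power can be sketched as follows. The text fixes only the number of output levels (13) and the product form of the pilot power; the range `F_MIN` to `F_MAX`, the even spacing of the levels, and the function names are hypothetical choices made for this example.

```python
# Hypothetical values: only NUM_LEVELS = 13 comes from the text.
F_MIN, F_MAX = 0.04, 0.16  # assumed range of the pilot power fraction f_b
NUM_LEVELS = 13


def pilot_fraction(j: int) -> float:
    """Fraction f_b for action index j, assuming evenly spaced levels."""
    if not 0 <= j < NUM_LEVELS:
        raise ValueError("action index out of range")
    return F_MIN + j * (F_MAX - F_MIN) / (NUM_LEVELS - 1)


def pilot_power(j: int, p_tilde_b: float) -> float:
    """Pilot power P_b^I = f_b * P~_b for action index j."""
    return pilot_fraction(j) * p_tilde_b
```

With these assumed values, consecutive action indices differ by a fraction of 0.01, so the learner adjusts the pilot power in steps of one percent of the reference power.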

Let J be the overall action set. For every rule k, let j ∈ J index a candidate action and its q-value, denoted by $a_k(j)$ and $q_k(j)$, respectively. Since a greedy policy easily drives the system toward locally optimal solutions, the FQ-SOC scheme must visit all possible actions in all states. This is the so-called exploration/exploitation dilemma. Here, a pseudo-exhaustive policy is applied: in rule k, the action j with the best q-value is chosen with probability $\Pr(j \mid q_k(j))$ given by the Boltzmann distribution; otherwise, the action that was least recently chosen in that rule is chosen. Through the exploration/exploitation policy (EEP), action $a_k(j^\star)$ is selected from the action set $a_k(j)$, where $j^\star \in J$:

\[ x \to a_k(j^\star) = \mathrm{EEP}(x). \tag{4.26} \]
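One possible reading of the pseudo-exhaustive policy for a single rule is sketched below. It is an illustration under assumptions, not the thesis implementation: the temperature parameter, the bookkeeping list `last_chosen`, and the function name are hypothetical, and the Boltzmann probability is taken as the softmax weight of the best action.

```python
import math
import random


def pseudo_exhaustive_select(q, last_chosen, temperature=1.0, u=None):
    """Pick an action index j* for one rule.

    q            -- q-values q_k(j) of the rule, one per candidate action
    last_chosen  -- time step at which each action was last chosen (-1 if never)
    temperature  -- Boltzmann temperature (assumed parameter)
    u            -- uniform draw in [0, 1); drawn at random when omitted
    """
    if u is None:
        u = random.random()
    best = max(range(len(q)), key=lambda j: q[j])
    # Boltzmann probability of the best action. Subtracting q[best]
    # keeps the exponentials from overflowing; the best action has weight 1.
    weights = [math.exp((qj - q[best]) / temperature) for qj in q]
    p_best = 1.0 / sum(weights)
    if u < p_best:
        return best  # exploit the best-known action
    # Otherwise explore the action chosen least recently in this rule.
    return min(range(len(q)), key=lambda j: last_chosen[j])
```

Passing `u` explicitly makes the choice reproducible, which is convenient for testing; in normal use the argument is omitted.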

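Therefore, in the context of the Sugeno fuzzy model for the FIS [66], the outputs of the FIS function, namely the inferred action A(x) and its associated Q-value Q(x, A(x)), become

\[ A(x) = \frac{\sum_{k=1}^{K} \alpha_k(x)\, a_k(j^\star)}{\sum_{k=1}^{K} \alpha_k(x)}, \qquad Q(x, A(x)) = \frac{\sum_{k=1}^{K} \alpha_k(x)\, q_k(j^\star)}{\sum_{k=1}^{K} \alpha_k(x)}, \]

where $\alpha_k(x) = \prod_{i=1}^{Z} \phi_{S_{k,i}}(x_i)$ is the truth value of rule k, in which $\phi_{S_{k,i}}(x_i)$ is the membership degree of $x_i$ in the fuzzy set $S_{k,i}$ [62], [63]. Note that the inferred action represents the estimated optimal level of the pilot power, denoted by $\hat{P}_b^I$.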
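The inference step can be sketched numerically as follows. The sketch assumes the usual Sugeno form, in which the truth value of a rule is the product of its membership degrees and the FIS outputs are truth-value-weighted averages of the per-rule conclusions. The function names and the numbers in the example are hypothetical.

```python
def truth_values(memberships):
    """memberships[k][i] is the membership degree of input i in rule k's
    fuzzy label; the truth value is their product (assumed product t-norm)."""
    alphas = []
    for degrees in memberships:
        alpha = 1.0
        for d in degrees:
            alpha *= d
        alphas.append(alpha)
    return alphas


def infer(alphas, chosen_actions, chosen_q):
    """Truth-value-weighted averages of the chosen actions and q-values."""
    total = sum(alphas)
    if total == 0.0:
        raise ValueError("no rule fires for this input")
    action = sum(a * u for a, u in zip(alphas, chosen_actions)) / total
    q_value = sum(a * q for a, q in zip(alphas, chosen_q)) / total
    return action, q_value


# Two rules, two inputs; all numbers are made up for illustration.
alphas = truth_values([[0.5, 1.0], [0.5, 0.5]])  # [0.5, 0.25]
action, q_value = infer(alphas, chosen_actions=[0.10, 0.16], chosen_q=[1.0, 4.0])
```

In this example the first rule fires twice as strongly as the second, so the inferred action (about 0.12) lies closer to the first rule's conclusion.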

B. Fuzzy Q-Learning Strategies

In the following, a time index t is added as a subscript to some symbols to make the time sequence of the learning process explicit. To update the Q-value, the optimal Q-value of the next state $x_{t+1}$ in (4.22), $Q^*(x_{t+1}, A(x_{t+1}))$, then becomes

\[ Q^*(x_{t+1}, A(x_{t+1})) = \frac{\sum_{k=1}^{K} \alpha_k(x_{t+1}) \max_{j \in J} q_k(j)}{\sum_{k=1}^{K} \alpha_k(x_{t+1})}. \]

The q-values are updated according to eligibility, the degree to which a state has been visited in the recent past. The eligibility trace constitutes a short-term memory of how frequently a state is visited. It decays exponentially in time unless it is reactivated by a new visit. Let $e_k(j)$ be the replacing eligibility of candidate action $a_k(j)$ in rule k. The eligibility factor is updated after the EEP chooses the conclusion action $a_k(j^\star)$:

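\[ e_k(j) = \begin{cases} \gamma_d \lambda_L e_k(j) + \dfrac{\alpha_k(x_t)}{\sum_{k'=1}^{K} \alpha_{k'}(x_t)}, & \text{if } j = j^\star, \\ \gamma_d \lambda_L e_k(j), & \text{otherwise.} \end{cases} \tag{4.31} \]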
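The trace update in (4.31), together with the q-value update that it feeds, can be sketched as below. This is a minimal illustration under assumptions: the function and argument names are hypothetical, the nested-list layout is a choice made here, and `delta_big_q` stands for the increment ΔQ of the Q-function, which is passed in as a number because its definition is not restated in this section.

```python
def update_eligibility(e, alphas, chosen, gamma_d, lambda_l):
    """Apply the trace update of (4.31) in place.

    e       -- e[k][j], eligibility of action j in rule k
    alphas  -- truth values alpha_k(x_t), one per rule
    chosen  -- chosen[k] is the index j* picked by the EEP in rule k
    """
    total = sum(alphas)
    if total <= 0.0:
        raise ValueError("no rule fires for this input")
    for k, row in enumerate(e):
        for j in range(len(row)):
            row[j] *= gamma_d * lambda_l     # decay of every trace
        row[chosen[k]] += alphas[k] / total  # reactivation of the chosen action


def update_q(q, e, delta_big_q, epsilon):
    """Apply q_k(j) <- q_k(j) + epsilon * delta_Q * e_k(j) in place."""
    for k, row in enumerate(q):
        for j in range(len(row)):
            row[j] += epsilon * delta_big_q * e[k][j]


# Two rules with two actions each; all numbers are made up for illustration.
e = [[0.0, 1.0], [0.5, 0.0]]
q = [[0.0, 0.0], [0.0, 0.0]]
update_eligibility(e, alphas=[0.75, 0.25], chosen=[0, 1], gamma_d=0.5, lambda_l=0.5)
update_q(q, e, delta_big_q=2.0, epsilon=0.5)
```

Because every q-value moves in proportion to its trace, actions chosen recently in strongly firing rules absorb most of each update, while stale actions are barely changed.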

Also, the elementary quality can be immediately updated by

\[ \Delta q_k(j) = \epsilon \times \Delta Q \times e_k(j). \tag{4.32} \]

Thus,

\[ q_k(j) = q_k(j) + \Delta q_k(j). \tag{4.33} \]

C. Feasible Action Selection and Exploitation and Exploration Policy

A feasible action set $A_s \subset A$ can be obtained from the current state s. In the proposed FQ-DCC scheme, a simple strategy for feasible action selection is applied: the state $\varpi_M$ serves as an indicator to classify the feasible action sets. For example,

\[ A_s = \begin{cases} \{f_{\min}, \cdots, f_\Theta\}, & \text{if } \varpi_M \ge \Theta, \\ \{f_\Theta, \cdots, f_{\max}\}, & \text{otherwise,} \end{cases} \tag{4.34} \]

where $f_\Theta$ is the cutting value of the action set, $f_\Theta \in [f_{\min}, f_{\max}]$, and Θ is the threshold of the mean power that serves as the quality-of-service constraint. Since a greedy policy can easily drive the system toward locally optimal solutions, it is necessary to visit all possible actions in all states. This is the so-called exploration/exploitation dilemma. An action of state s, a(s), is selected from the feasible action set $A_s$ using an exploration and exploitation policy.
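The feasible action set in (4.34) can be sketched as a simple split of the ordered action list. The function name, the argument names, and the example values are hypothetical; the only assumptions are that the actions are sorted from f_min to f_max and that the cutting value f_Θ is identified by its position in that list.

```python
def feasible_actions(fractions, mean_power_state, theta, cut_index):
    """Return the feasible subset A_s of the ordered action list.

    fractions         -- candidate fractions sorted from f_min to f_max
    mean_power_state  -- current value of the mean-power state variable
    theta             -- mean-power threshold (quality-of-service constraint)
    cut_index         -- position of the cutting value f_Theta in fractions
    """
    if mean_power_state >= theta:
        return fractions[: cut_index + 1]  # {f_min, ..., f_Theta}
    return fractions[cut_index:]           # {f_Theta, ..., f_max}


# Made-up numbers for illustration.
levels = [0.04, 0.06, 0.08, 0.10, 0.12]
low_side = feasible_actions(levels, mean_power_state=0.9, theta=0.5, cut_index=2)
high_side = feasible_actions(levels, mean_power_state=0.2, theta=0.5, cut_index=2)
```

Both subsets contain the cutting value itself, so f_Θ remains selectable on either side of the threshold.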

Here, a pseudo-exhaustive policy is applied, in which the action with the best Q-value is chosen with a selection probability based on the Boltzmann distribution; otherwise, the action that was least recently visited is chosen. The resultant action is converted to the pilot power of the base station. The reward signal is measured from the system and fed back to update the Q-function.
