Fuzzy Q-learning Algorithm

Chapter 2 System Model

2.5 Fuzzy Q-learning Algorithm

Fig. 2.6 shows a general learning system, which consists of five elements. The learner selects the optimal action according to its knowledge of the state, obtained by interacting with the environment. After the action is applied, the environment returns a reward to the learner. These rewards help adjust the action decision policy toward better system performance.

Figure 2.6: Block diagram of a learning system

The expectation of the accumulated rewards is called the Q-function, and it is affected by the action selected at each time. Eq. (2-5) gives the Q-function, which depends on the system state, denoted by $x$, and the action, denoted by $a$:

$$Q(x_0, a_0) = E\left\{ \sum_{n=0}^{\infty} \gamma^n r(n) \,\middle|\, x(0) = x_0,\ a(0) = a_0 \right\} \tag{2-5}$$

where $\gamma$ is the discount factor, $r(n)$ is the reinforcement signal (reward), $E\{\cdot\}$ is the expectation operator, and $n$ is the episode index. The Q-value is the output of the Q-function. Because the Q-value is the accumulated future reward from the present onward, it is affected by the action $a_0$ selected under state $x_0$ at the current decision episode $n = 0$. Therefore, reinforcement learning chooses the optimal action that maximizes the accumulated rewards, denoted by $a^*$, as in Eq. (2-6):


Q-learning algorithm. The Q-learning algorithm can solve the above problems and obtain the optimal Q-function at the next stage. It is given by [18] as:

more accurate Q-function approximation and use Eq. (2-7) to decide the optimal action.
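As an illustration of the update-then-act loop described here, the following minimal sketch uses the standard textbook form of the tabular Q-learning update, which may differ in notation from Eq. (2-7). The function names and the toy table are hypothetical; the default learning rate 0.9 and discount (forgetting) factor 0.1 are the values listed in Table 4.1.

```python
def q_learning_update(q, state, action, reward, next_state, eta=0.9, gamma=0.1):
    """Apply one textbook Q-learning step to the table q[state][action].

    eta is the learning rate and gamma the discount factor (assumed defaults).
    """
    best_next = max(q[next_state].values())          # greedy estimate of the next-stage value
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += eta * td_error
    return q[state][action]


def greedy_action(q, state):
    """Pick the action with the largest Q-value in the given state."""
    return max(q[state], key=q[state].get)


# Hypothetical two-state, two-action table used only for illustration.
q = {"s0": {"a0": 0.0, "a1": 0.0}, "s1": {"a0": 1.0, "a1": 2.0}}
updated = q_learning_update(q, "s0", "a1", reward=1.0, next_state="s1")
print(updated, greedy_action(q, "s0"))
```

With these toy numbers the temporal-difference error is 1.0 + 0.1 × 2.0 = 1.2, so the updated entry is about 1.08 and the greedy action in "s0" becomes "a1".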

The fuzzy Q-learning (FQL) algorithm can be regarded as the Q-learning algorithm combined with fuzzy logic. Eq. (2-8) is the general form of the fuzzy if-then rules, which shows the characteristic of this combination:

$$\text{Rule } j: \text{ if } X(n) \text{ is } S_j, \text{ then } a_k \text{ with } q_n(S_j, a_k), \quad 1 \le k \le K \tag{2-8}$$

where $X(n) = [x_1(n), \ldots, x_H(n)]$ is the vector of input linguistic variables, $H$ is the number of input variables, $q_n(S_j, a_k)$ is the Q-value for the state-action pair $(S_j, a_k)$ at episode $n$, $S = \{S_j,\ j = 1, \ldots, J\}$ is the set of system states, and $A = \{a_k,\ k = 1, \ldots, K\}$ is the set of action candidates. Since each $S_j$ corresponds to one rule, there are $J$ rules, each with different Q-values for the pairs $(S_j, a_k)$, $k = 1, \ldots, K$. The select-max strategy is adopted for each rule to choose the suitable


After the optimal action is applied, the reward caused by the action is fed back to update the Q-function, which refers to Eq. (2-7):


the next-stage Q-values $q_{n+1}(S_j, a_k)$, $j = 1, \ldots, J$, $k = 1, \ldots, K$, are not available, $Q_n^*(X(n+1), a(n+1))$ will be calculated from $q_n(S_j, a_j^*)$, $j = 1, \ldots, J$, and is defined as:

$$Q_n^*(X(n+1), a(n+1)) = \sum_{j=1}^{J} \mu_j(n+1) \times q_n(S_j, a_j^*) \tag{2-14}$$
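The combination of per-rule select-max and the weighted sum in Eq. (2-14) can be sketched as follows. This is a minimal illustration, assuming that the weights are rule firing strengths normalized to sum to one; the function names and the toy numbers are hypothetical.

```python
def select_max(q_row):
    """Return (index, value) of the largest Q-value among one rule's actions."""
    best_k = max(range(len(q_row)), key=lambda k: q_row[k])
    return best_k, q_row[best_k]


def global_q_value(firing_strengths, q_table):
    """Weighted sum of each rule's best Q-value.

    firing_strengths[j] plays the role of mu_j; it is normalized here
    (an assumption) so that the weights sum to one.
    """
    total = sum(firing_strengths)
    if total == 0:
        raise ValueError("at least one rule must fire")
    value = 0.0
    for mu_j, q_row in zip(firing_strengths, q_table):
        _, best_q = select_max(q_row)
        value += (mu_j / total) * best_q
    return value


# Hypothetical example: three rules, two candidate actions each.
q_table = [[0.2, 0.8], [0.5, 0.1], [0.0, 0.3]]
mu = [0.5, 0.5, 0.0]
print(global_q_value(mu, q_table))
```

Here only the first two rules fire, so the result is 0.5 × 0.8 + 0.5 × 0.5 = 0.65.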


Chapter 3 Fuzzy Q-learning based MIMO HARQ Scheme

In this chapter, the fuzzy Q-learning-based MIMO HARQ (FQLM-HARQ) scheme is proposed to choose the MIMO transmission mode and the MCS level for each initial transmission. Since the decision in the HARQ scheme is based on the current and past system states, this process can be modeled as a discrete-time Markov decision process (MDP). By applying the fuzzy Q-learning algorithm, the FQLM-HARQ scheme can solve this MDP problem.

The overall structure of the FQLM-HARQ scheme is given in Fig. 3.1. The measured parameters $BLER(n)$, $CQI^1(n)$, and $CQI^2(n)$ are translated into the fuzzy rule base indicator $\beta(n)$ and the input linguistic variables $X(n)$ by the operation block. $\beta(n)$ indicates which fuzzy rule base $X(n)$ is suitable for. There are two fuzzy rule bases in the FQLM-HARQ scheme: fuzzy rule base A (for mode 1) and fuzzy rule base B (for mode 2). Note that there are two transmission modes, spatial diversity (SD) and spatial multiplexing (SM). Here, mode 1 denotes that SD transmission is used and mode 2 denotes that SM transmission is used.

After the calculation of the fuzzy rule base, we obtain the Q-value $q_n$ for each system state. According to these Q-values, the inference engine infers an optimal action for each mode. The action decision block then decides the MCS level for each data stream. When the packet transmission is finished, the reinforcement signal generator creates a reinforcement signal $r(n)$ for the Q-function update block based on system information. The Q-values in the fuzzy rule base are then updated by the Q-function update block. The detailed design is given as follows.

$CQI^1(n)$ and $CQI^2(n)$ are measured by the user equipment (UE) through the common pilot channel (CPICH), and are integer values between 0 and 14. In order to reduce the number of fuzzy rules, we define


$$CQI^L(n) = \min\{CQI^1(n), CQI^2(n)\} \tag{3-2}$$

Assume $CQI^{th}$ is the threshold of the channel quality indicator for the selection of the transmission mode. If $CQI^L(n)$ is less than $CQI^{th}$, the MIMO HSDPA system will use SD transmission. Conversely, if $CQI^L(n)$ is greater than $CQI^{th}$, the MIMO HSDPA system will use SM transmission. However, since the CQI is subject to report delay and measurement inaccuracy, there should be a fuzzy margin for the selection of the MIMO transmission mode. Let $\beta(n)$ be the fuzzy rule base indicator and $\beta_{th}$ be the threshold for the selection of the fuzzy rule base. We then have the following three rules:

(1) if $CQI^L(n) - CQI^{th} < -\beta_{th}$, only fuzzy rule base A is enabled,

(2) if $CQI^L(n) - CQI^{th} > \beta_{th}$, only fuzzy rule base B is enabled,

(3) if $-\beta_{th} \le CQI^L(n) - CQI^{th} \le \beta_{th}$, both fuzzy rule base A and fuzzy rule base B are enabled.

Let $\beta(n) = 0$ when rule (1) holds. This indicates that the channel quality is very poor. To enhance reliability, a single data stream is adopted for SD transmission and only fuzzy rule base A is enabled. Let $\beta(n) = 1$ when rule (2) holds. This implies that the channel quality is very good. Hence, only fuzzy rule base B is enabled and dual data streams are used for SM transmission to increase the system throughput. Let $\beta(n) = 2$ when rule (3) holds. This means that the channel quality is in the indistinct region, so we cannot decide exactly which transmission mode is suitable. Hence, both fuzzy rule bases A and B are enabled. In this case, the actual transmission mode is judged by the action decision.

3.2 Fuzzifier

Let $X(n) = \{BLER(n), CQI^H(n), CQI^L(n)\}$ be the input linguistic variables in the FQLM-HARQ scheme. $BLER(n)$ is the block error rate indicator, defined as the number of packets with retransmissions over the total number of transmitted packets at episode $n$.


for the judgment of channel quality indication which implies the QoS requirement of block error rate while using the corresponding MCS level during this term.

The membership function of a fuzzy term indicates the degree to which the input variable belongs to that fuzzy label, and is designed with prior knowledge of the system. Before designing the membership functions for the fuzzy terms in $T(BLER(n))$, $T(CQI^H(n))$ and $T(CQI^L(n))$, we


where $BLER^*$ denotes the BLER requirement. We then have the membership function of $BLER(n)$ as shown in Fig. 3.4.


Figure 3.5: The membership functions of $CQI^H(n)$ and $CQI^L(n)$

3.3 Fuzzy Rule Base

The fuzzy rule base consists of if-then rules. In the FQLM-HARQ scheme, we have two fuzzy rule bases: fuzzy rule base A for mode 1 and fuzzy rule base B for mode 2. Since the value of $CQI^H(n)$ is greater than or equal to the value of $CQI^L(n)$, there are 108 combinations of $T(BLER(n))$, $T(CQI^H(n))$ and $T(CQI^L(n))$. Therefore, the states in our system are $S_j$, $j = 1, \ldots, 108$. Assume each state has one fuzzy Q-learning rule. We then have the following two rule bases.
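The count of 108 states can be checked by enumeration. The sketch below assumes three fuzzy terms for $BLER(n)$ (Low, Middle, High, as named in this chapter) and eight ordered fuzzy terms for each CQI variable. The number eight is an assumption inferred from 108 = 3 × 36 and is not stated in this excerpt; the function name is hypothetical.

```python
from itertools import combinations_with_replacement

BLER_TERMS = ("Low", "Middle", "High")
N_CQI_TERMS = 8   # assumption: inferred from 108 = 3 * 36, not stated in the text


def enumerate_states(bler_terms=BLER_TERMS, n_cqi_terms=N_CQI_TERMS):
    """List every (BLER term, CQI^H term, CQI^L term) with CQI^H >= CQI^L."""
    states = []
    for bler in bler_terms:
        # combinations_with_replacement yields sorted pairs (low, high).
        for cqi_l, cqi_h in combinations_with_replacement(range(1, n_cqi_terms + 1), 2):
            states.append((bler, cqi_h, cqi_l))
    return states


states = enumerate_states()
print(len(states))
```

Under these assumptions there are 36 admissible $(CQI^H, CQI^L)$ term pairs per BLER term, which gives 108 states in total.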

3.3.1 Fuzzy Rule Base A

In fuzzy rule base A, the Q-value $q_n(S_j, a_{1,k})$ for each state-action pair can be obtained through the fuzzy if-then rule:

$$\text{Rule } j: \text{ if } X(n) \text{ is } S_j, \text{ then } a_{1,k} \text{ with } q_n(S_j, a_{1,k}), \quad k = 1, \ldots, 8 \tag{3-16}$$

where $a_{1,k}$ is the action for mode 1 transmission, which means that the FQLM-HARQ scheme chooses the $k$-th MCS level for the single data stream. The design concept of the MIMO-HARQ scheme is to discriminate the preferred actions $a_{1,k}$ according to $BLER(n)$. If $BLER(n)$


the learning procedure. In the following, we will divide the fuzzy rules into three parts based on the three terms, “Low”, “Middle” and “High”, of $BLER(n)$.

we will choose $k$ up to 8. By this selection region, we want to balance the QoS requirement


obtained through the fuzzy if-then rule as:

$$\text{Rule } j: \text{ if } X(n) \text{ is } S_j, \text{ then } a_{2,k} \text{ with } q_n(S_j, a_{2,k}), \quad k = 1, \ldots, 36 \tag{3-17}$$

where $a_{2,k}$ is the action for mode 2 transmission, which means that the FQLM-HARQ scheme chooses the $p$-th MCS level, denoted by $MCS_p^*$, for the data stream with the best channel quality and the $q$-th MCS level, denoted by $MCS_q^*$, for the data stream with the worst channel quality,

levels. Here, $MCS^H$ is the MCS level for the data stream with the best channel quality and $MCS^L$ is the MCS level for the data stream with the worst channel quality.


Figure 3.6: The relationship between MCS levels and action $a_{2,k}$
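One way to picture the 36 mode 2 actions is as a lookup from the action index $k$ to a pair of MCS levels $(p, q)$. The sketch below assumes eight MCS levels (as in mode 1) and $p \ge q$, which together give exactly 36 pairs. Both the constraint and the ordering of $k$ are hypothetical, since the actual mapping is defined by Fig. 3.6.

```python
N_MCS = 8   # assumption: the same eight MCS levels as in mode 1


def build_mode2_actions(n_mcs=N_MCS):
    """Map a hypothetical 1-based action index k to (p, q) with p >= q.

    p is the MCS level of the stream with the best channel quality,
    q the level of the stream with the worst channel quality.
    """
    actions = {}
    k = 1
    for p in range(1, n_mcs + 1):
        for q in range(1, p + 1):
            actions[k] = (p, q)
            k += 1
    return actions


ACTIONS = build_mode2_actions()
print(len(ACTIONS), ACTIONS[1], ACTIONS[36])
```

With this illustrative ordering, index 1 corresponds to the most conservative pair (1, 1) and index 36 to the most aggressive pair (8, 8).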

The design concept is also to discriminate the preferred actions $a_{2,k}$ according to $BLER(n)$ for each state, in the same form as for fuzzy rule base A. In the following, we will divide the fuzzy rules into three parts based on the three terms, “Low”, “Middle” and “High”, of

$BLER(n)$ is in the safe region, and we also take the more aggressive actions for mode 2 to increase the system throughput. Therefore, we only consider $a_{2,k}$ with


mode 1, and the other is inference engine B for mode 2. Details are given as follows.

3.4.1 Inference Engine A

By using the select-max strategy, a suitable action for each rule in fuzzy rule base A can be


By using the fuzzy select-max strategy, a suitable action for each rule in fuzzy rule base B can be obtained by:


In the block, the final action at episode $n$ is denoted by:

If $l^* = 1$, the mode 1 transmission is selected. Hence, only one data stream is transmitted. The output of this block is set to $MCS^H = a_1^*(n)$ and $MCS^L = 0$. Based on this output signal, the MIMO HARQ system uses $a_1^*(n)$ to transmit the data stream with the largest delay time in one of the two HARQ processes. Note that, for each episode, two HARQ processes can be used.

If $l^* = 2$, the mode 2 transmission is selected. Hence, dual data streams are transmitted. The

3.6 Reinforcement Signal Generator

We design the reinforcement signal for each rule to update the Q-value of each action and accomplish the Q-learning operation. There are two reinforcement signals, one for each transmission mode: the reinforcement signal for mode 1 and the reinforcement signal for mode 2. The details are as follows.

3.6.1 Reinforcement Signal for Mode 1

The reinforcement signals are designed according to the three parts in fuzzy rule base A. Rules in the same part have the same reinforcement signal.

Green Part: $BLER(n)$ is Low

$B_{infor.}$ represents the number of information bits in the packet of the single data stream and $B_{redun.}$ represents the redundancy bits required for successful transmission, including the initial transmission and the retransmissions of this packet. A packet with a higher successfully transmitted data rate is expected to receive a larger reward. Conversely, if the packet is dropped after three failed decoding attempts, we give it a punishment of $-(5 + \alpha_1 \times BLER(n))$. The punishment grows with the block error rate at episode $n$, because the larger the average block error rate, the more load a dropped packet adds to the system. We update the Q-value after each packet transmission is completed.

Yellow Part: $BLER(n)$ is Middle

mode 1 when the packet transmission in the single data stream is completed at episode $n$. This means that the more retransmissions there are, the smaller the reward, which encourages a more conservative policy than in the Green Part. If the packet is dropped, it receives a punishment of $-(5 + \alpha_2 \times BLER(n))$. The punishment is larger than in the Green Part because of the worse $BLER(n)$ performance.

Red Part: $BLER(n)$ is High

If the packet is correctly received at the initial transmission, the reward is designed as $B_{infor.} / (B_{infor.} + B_{redun.})$, which is the same as in the previous part. However, if the packet is correctly received with retransmission, we give it a punishment based on the successfully transmitted data rate of the packet and proportional to $(1 + RT_1(n))$, normalized by dividing by 3. This is because the purpose of this part is to maintain the QoS requirement $BLER^*$. If the packet is dropped, we give it the severest punishment, $-(5 + \alpha_3 \times BLER(n))$, with $BLER(n)$ limited to no more than $1.5 \times BLER^*$.
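The parts of the mode 1 reinforcement signal that are fully specified in the text, namely the drop punishment in each part and the reward for success at the initial transmission, can be sketched as follows. The function names are hypothetical, the constants are the $\alpha_1$, $\alpha_2$, $\alpha_3$ and $BLER^*$ values from Table 4.1, and the cap in the Red Part reflects one reading of "limited to no more than $1.5 \times BLER^*$".

```python
ALPHA = {"green": 100, "yellow": 125, "red": 150}   # alpha_1..alpha_3 from Table 4.1
BLER_TARGET = 0.1                                    # BLER^* from Table 4.1


def mode1_drop_penalty(part, bler):
    """Punishment -(5 + alpha * BLER(n)) when the single-stream packet is dropped."""
    if part == "red":
        # Assumed reading: BLER(n) is clipped at 1.5 * BLER^* in the Red Part.
        bler = min(bler, 1.5 * BLER_TARGET)
    return -(5 + ALPHA[part] * bler)


def initial_success_reward(b_infor, b_redun):
    """Reward B_infor / (B_infor + B_redun) for success at the initial transmission."""
    return b_infor / (b_infor + b_redun)


print(mode1_drop_penalty("green", 0.05), mode1_drop_penalty("red", 0.3))
print(initial_success_reward(1000, 1000))
```

For example, a drop at $BLER(n) = 0.05$ in the Green Part costs about 10, while a drop in the Red Part at $BLER(n) = 0.3$ is evaluated at the capped value 0.15 and costs about 27.5.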

3.6.2 Reinforcement Signal for Mode 2

The reinforcement signals are also designed according to the three parts in fuzzy rule base B. Rules in the same part have the same reinforcement signal.

Green Part: $BLER(n)$ is Low

$B_{infor.total}$ represents the total number of information bits of the packets in the dual data streams and $B_{redun.total}$ represents the total required redundancy bits, including the initial transmission and the retransmissions of these packets. The higher the data rate of the successfully transmitted packets, the larger the reward. $N^{dropped}$ represents the number of dropped packets in the dual data streams. Whenever a packet is dropped, the punishment is $-(5 + \alpha_1 \times BLER(n))(N^{dropped}/2)$, which is proportional to the number of dropped packets. We also update the Q-value after the transmission of the streams is completed.

Yellow Part: $BLER(n)$ is Middle

transmission in the dual data streams is completed at episode $n$. We can see that the larger the average number of retransmissions, the smaller the reward. Here, the punishment has the same form as in the Green Part if a packet is dropped. However, the punishment is larger than in the Green Part, since the $BLER(n)$ performance is worse.


If both packets are successfully received at the initial transmission, the reward follows the achievable data rate $B_{infor.total} / (B_{infor.total} + B_{redun.total})$. If both packets are correctly received with retransmissions, we give a punishment proportional to $(1 + RT_2(n))$, normalized by dividing by 3. However, in this part, if even one packet is dropped after three failed decoding attempts, we give the severe punishment $-(5 + \alpha_3 \times BLER(n))$, with $BLER(n)$ limited to no more than $1.5 \times BLER^*$.
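For mode 2, the drop punishment additionally depends on how many of the two packets were dropped. The sketch below is one possible reading of the text: the Green and Yellow parts scale the punishment by $N^{dropped}/2$, while the part with $\alpha_3$ applies the full punishment as soon as one packet is dropped. Using $\alpha_2$ for the Yellow Part in mode 2 is an assumption (the text only says the punishment is larger than in the Green Part), and the function name is hypothetical.

```python
ALPHA = {"green": 100, "yellow": 125, "red": 150}   # Table 4.1 values; yellow is assumed
BLER_TARGET = 0.1                                    # BLER^* from Table 4.1


def mode2_drop_penalty(part, bler, n_dropped):
    """Drop punishment for dual data streams (n_dropped is 0, 1 or 2)."""
    if n_dropped not in (0, 1, 2):
        raise ValueError("dual data streams carry at most two packets")
    if n_dropped == 0:
        return 0.0                                   # nothing dropped, no drop punishment
    if part == "red":
        # Assumed reading: full punishment for any drop, BLER(n) clipped at 1.5 * BLER^*.
        capped = min(bler, 1.5 * BLER_TARGET)
        return -(5 + ALPHA["red"] * capped)
    # Green/Yellow form: -(5 + alpha * BLER(n)) * (N_dropped / 2)
    return -(5 + ALPHA[part] * bler) * (n_dropped / 2)


print(mode2_drop_penalty("green", 0.05, 1), mode2_drop_penalty("green", 0.05, 2))
```

With $BLER(n) = 0.05$ in the Green Part, one dropped packet costs about 5 and two dropped packets cost about 10, which matches the single-stream punishment when both packets are lost.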

3.7 Q-function Update

According to the feedback reinforcement signal, the purpose of the Q-function update operation is to obtain the new $q_{n+1}(S_j, a_{l^*}(n))$, $j = 1, \ldots, J$, for fuzzy rule base A (for mode 1) if $l^* = 1$ in Eq. (3-24), or for fuzzy rule base B (for mode 2) if $l^* = 2$ in Eq. (3-24). In the following, we describe how the Q-function is updated for these two transmission modes, respectively.

When the FQLM-HARQ scheme decides to use the mode 1 transmission, we use the


reflect the performance in time and achieve the expected goal.


Chapter 4 Simulation Results and Discussions

4.1 System Environment and Parameters

We consider a hexagonal grid cell structure in our simulation. There are 19 base stations (BSs) in the multi-cell system. We assume that the HS-DSCH is allocated at most 80% of the total power of a BS. Hence, we define the HSDPA service power ratio (HSPR) as the ratio of the transmission power on the HS-DSCH to the total transmission power of each antenna at the BS for the user. The residual power is used for other services and control channels within the same cell. Here, HSPR controls the amount of transmission power and the interference from the own cell. The channel condition is assumed to be constant within a TTI and is described in Section 2.4. We assume that the user always has data to transmit, which simplifies the simulation and allows the higher data rates to be reached.

In the simulation, the CQI delay is set to 10 ms. The system performance is evaluated for different HSPR values with the mean user mobility fixed at 60 km/hr, and for different user mobilities with the power allocation fixed at 60%. The detailed simulation parameters are shown in Table 4.1. To evaluate the performance, we discuss the system throughput, BLER, and dropping rate in comparison with other schemes.


Table 4.1: Simulation parameters

| Parameter | Assumption |
|---|---|
| Cellular layout | Hexagonal grid, 19 sites, 1000 m cell radius |
| Path loss model ($\xi(r)$) | $128.1 + 37.6 \log_{10}(r)$, where $r$ is the base station separation in kilometers |
| Decorrelation length ($l_{cor}$) | 30 m |
| $\sigma_L$ | 8.0 |
| Mobility assignment | 0 km/hr to 120 km/hr |
| Carrier frequency | 2.0 GHz |
| Channel bandwidth | 5.0 MHz |
| Chip-rate | 3.84 Mcps |
| Spreading factor | 16 |
| Thermal noise density | -174 dBm/Hz |
| Forgetting factor ($\gamma$) | 0.1 |
| Learning rate ($\eta$) | 0.9 |
| TTI length | 2 ms |
| $CQI^{th}$ | 4 |
| $\beta_{th}$ | 2 |
| $\alpha_1$ | 100 |
| $\alpha_2$ | 125 |
| $\alpha_3$ | 150 |
| $BLER^*$ | 0.1 |
| Power for HSDPA data transmission | Maximum of 80% of total maximum available transmission power |
| ACK/NACK delay | 10 ms |
| HARQ | IR |
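As a quick numerical check of the path loss model listed in Table 4.1, the following sketch evaluates $128.1 + 37.6 \log_{10}(r)$ for a separation $r$ in kilometers. The function name is hypothetical, and interpreting the result in dB is an assumption, since the table does not state the unit.

```python
import math


def path_loss_db(r_km):
    """Evaluate 128.1 + 37.6 * log10(r) for a separation r given in kilometers."""
    if r_km <= 0:
        raise ValueError("separation must be positive")
    return 128.1 + 37.6 * math.log10(r_km)


# Values at the 1000 m cell radius and at one tenth of it.
print(path_loss_db(1.0), path_loss_db(0.1))
```

At 1 km the logarithm vanishes and the model returns 128.1; each tenfold change in distance shifts the result by 37.6.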

4.2 Conventional Schemes

In the simulation, we will compare the proposed FQLM-HARQ scheme with two conventional schemes, which are described in the following:

(1) Fixed threshold selection (FTS):

FTS sets an SINR threshold for each MCS based on the pre-known BLER performance [1], [26]. The SINR threshold of an MCS is the SINR at which that MCS achieves a BLER equal to the requirement of 0.1. At each TTI, this scheme chooses the MCS whose threshold is just under and closest to the measured SINR.

(2) Q-learning based HARQ (QL-HARQ) [9]:

QL-HARQ uses the Q-learning algorithm to learn an optimal policy for each initial transmission. The reinforcement signal is designed as the normalized squared difference between the last received SINR and the required SINR of the last MCS decision. After learning, QL-HARQ chooses an optimal MCS to meet the BLER requirement.
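The FTS selection rule in scheme (1) amounts to a sorted-threshold lookup. The sketch below is illustrative only: the threshold values are made up, the function name is hypothetical, and treating an SINR exactly equal to a threshold as meeting it is an assumption.

```python
from bisect import bisect_right

# Hypothetical SINR thresholds in dB, ascending, one per MCS level (illustrative only).
THRESHOLDS_DB = [-3.0, 0.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]


def fts_select_mcs(sinr_db, thresholds=THRESHOLDS_DB):
    """Return the 1-based MCS level whose threshold is closest below the SINR.

    Returns None when the SINR is below every threshold.
    """
    idx = bisect_right(thresholds, sinr_db)   # number of thresholds <= sinr_db
    return idx if idx > 0 else None


print(fts_select_mcs(7.5), fts_select_mcs(-5.0), fts_select_mcs(30.0))
```

Because the lookup uses only the currently measured SINR, it illustrates why FTS cannot react to channel variation between the measurement and the transmission.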

In the following section, we present the simulation results and discuss them.

4.3 Performance Evaluation and Discussions

Fig. 4.1 shows the BLER versus HSPR with user mobility fixed at 60 km/hr. The motion causes the Doppler effect and, together with the CQI delay, channel variation. Hence, the actual channel condition differs from the channel information used for the decision. It can be seen in Fig. 4.1 that the proposed FQLM-HARQ satisfies the BLER requirement when HSPR is more than 60%, followed by QL-HARQ when HSPR is more than 70%. However, FTS violates the BLER requirement even with HSPR up to 80%. FQLM-HARQ learns better because it considers the situation in the different parts of $BLER(n)$ and then adjusts the selection of the MCS level to maintain the BLER requirement as far as possible. As more BS transmission power is allocated to the HS-DSCH service, more MCS levels can be adaptively selected in the Red Part to reduce the effect of the channel variation. The MCS level selection of FTS depends only on the current channel condition and disregards the channel variation. As a result, its BLER performance violates the BLER requirement. Although QL-HARQ also adjusts the MCS level based on the last transmission decision, it does not take the transmission results into account. Therefore, it is not flexible enough to accommodate the channel variation as FQLM-HARQ does.

Fig. 4.2 shows the dropping rate versus HSPR with user mobility fixed at 60 km/hr. In the simulation, when the total number of transmissions of the same transmission block, including retransmissions, exceeds three, the block is dropped. It can be seen that FQLM-HARQ has the lowest dropping rate and FTS has the highest dropping rate regardless of the HSPR. This is related to the initial BLER shown in Fig. 4.1 and to whether the MCS level is selected conservatively or aggressively. A smaller BLER results in a lower dropping rate. In FQLM-HARQ, once a block is dropped, the reinforcement signal gives the largest punishment in each part according to the value of $BLER(n)$. After the Q-function is updated, a more conservative MCS level selection can be expected. This design limits the dropping rate. QL-HARQ considers only the difference between the received SINR and the required SINR at the last transmission. Therefore, its MCS level selection is more aggressive than that of FQLM-HARQ and results in a higher dropping rate.

Fig. 4.3 shows the system throughput versus HSPR with user mobility fixed at 60 km/hr. It can be seen that when HSPR increases, the throughputs of all three schemes increase. Taking Fig. 4.1, Fig. 4.2 and Fig. 4.3 together, FQLM-HARQ selects the MCS level that gives a larger throughput than the other two schemes, while endeavoring to maintain the BLER requirement and achieving the lowest dropping rate at the same time.


Figure 4.1: The BLER versus HSPR with fixed user mobility at 60 km/hr.

Figure 4.2: The dropping rate versus HSPR with fixed user mobility at 60 km/hr.
