CHAPTER 3 SCENARIO AND PROBLEM FORMULATIONS 15
3.5 Analysis of the Convergence Formulation
3.5.2 Observation of SINR in different CQI and PMI
In the previous subsection, we learned that the distribution of SINR is mathematically associated with the PMI, the CQI, and the co-scheduled CQI, but the actual distributions are still unknown. We therefore run simulations for different CQI and PMI values to observe the pattern of the distribution.
Figure 14: Real CQI and estimated CQI. The real CQI is calculated from h; the estimated CQI is the CQI returned by the UE.
doi:10.6342/NTU201900453
In Fig. 14, only four PMI cases are presented for ease of observation. The x-axis is the real CQI value calculated from the real channel response. The y-axis is the estimated CQI value calculated from limited feedback. The color represents how often a given pair of estimated CQI and real CQI occurs among all cases: the redder a point, the more frequently that pair appears. The figure therefore shows which real CQI values are possible for a given estimated value, so the pattern of the joint distribution of estimated CQI and real CQI can be observed in Fig. 14. It can be seen that the distribution varies with the PMI.
That is, for a given estimated CQI, the likelihood of each corresponding real CQI varies with the PMI. In fact, the variation is large. Therefore, when designing the mechanism, we pay more attention to the differences between PMIs. The CQI distributions for the remaining PMIs are shown in Fig. 16.
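The counting behind a plot such as Fig. 14 can be sketched as follows. This is only an illustration under our own assumptions, not the code used in the thesis: the function names `cqi_pair_counts` and `real_cqi_distribution` and the sample values are hypothetical.

```python
from collections import Counter


def cqi_pair_counts(real_cqi, est_cqi):
    """Count how often each (real CQI, estimated CQI) pair occurs."""
    if len(real_cqi) != len(est_cqi):
        raise ValueError("both sequences must have the same length")
    return Counter(zip(real_cqi, est_cqi))


def real_cqi_distribution(counts, est):
    """Empirical distribution of the real CQI for one estimated CQI."""
    column = {real: n for (real, e), n in counts.items() if e == est}
    total = sum(column.values())
    if total == 0:
        return {}
    return {real: n / total for real, n in sorted(column.items())}


# Hypothetical samples for a single PMI (made-up values for illustration).
real = [7, 8, 8, 9, 7, 8]
est = [7, 7, 8, 8, 7, 7]
counts = cqi_pair_counts(real, est)
dist = real_cqi_distribution(counts, est=7)
```

Repeating the count separately for each PMI gives one panel per PMI, which is how the PMI dependence discussed above would become visible.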
We also investigate the relationship between cos²θ and the CQI.
In Fig. 15, the distributions for different CQI values, calculated as M(|h|), are further distinguished by the estimated CQI and by cos²θ. The real CQI is denoted by

\[ M\left( \frac{\frac{P}{|S|}\,|h_k f_k|^2}{1 + \frac{P}{|S|}\sum_{i \in S \setminus k} |h_k f_i|^2} \right). \]

It is worth noting in Fig. 15 that the variation of the CQI is associated with the quantization error, cos²θ. These observations are further exploited in the design of the model in Chapter 4. In short, the relationship between the estimated CQI and the real CQI is more complicated than a fixed one-to-one mapping.
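As an illustration of the SINR expression inside M(·), the following sketch evaluates it for one user. It assumes unit noise power and an equal power split P/|S| among the co-scheduled users, as the expression suggests; the function name `real_sinr` and its argument layout are our own choices, and the mapping M from SINR to CQI is not reproduced here.

```python
def real_sinr(h_k, precoders, k, total_power):
    """SINR of user k when |S| = len(precoders) users are co-scheduled.

    h_k:         channel vector of user k (sequence of complex numbers)
    precoders:   one precoding vector f_i per co-scheduled user
    k:           position of user k's own precoder in `precoders`
    total_power: transmit power P, normalised so that the noise power is 1
    """
    def gain(f):
        # |h_k f|^2 for a row vector h_k and a column vector f
        return abs(sum(h * w for h, w in zip(h_k, f))) ** 2

    share = total_power / len(precoders)  # P / |S|
    signal = share * gain(precoders[k])
    interference = share * sum(
        gain(f) for i, f in enumerate(precoders) if i != k
    )
    return signal / (1.0 + interference)
```

With orthogonal precoders the interference term vanishes, while a co-scheduled precoder aligned with h_k lowers the SINR, which is the effect the quantization error has in Fig. 15.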
Figure 15: Real SINR under different quantization errors (cos θ) and interference levels for different CQI values. Dots at the same quantization error correspond to different interference levels.
Figure 16: Real CQI and estimated CQI. The real CQI is calculated with the channel vector h; the estimated CQI is the CQI returned by the UE. The estimated CQI is calculated following Eq. (3.12), which is a lower bound in MU-MIMO cases.
CHAPTER 4
PROPOSED REINFORCEMENT LEARNING BASED LINK ADAPTION
In this chapter, we demonstrate how reinforcement learning based link adaption works in the LTE communication system. The implementation and design of the reinforcement learning technique are also explained.
4.1 Motivation of Reinforcement Learning
We found that current studies on convergence are each limited to partial observations of the problem. Specifically, [12] reported that dynamically changing the step size can improve the performance, since a larger step size increases the convergence speed; it proposed a method that gives a larger compensation step size while the estimated BLER is large. [14] states that the step size needs to be large not only when the BLER is large, but also when the BLER is far away from the target BLER region. Such studies then analyze the estimated BLER model or design a decision maker based on their own simulation environment. [30] argued that the initial value is important, so it gathered a large amount of data at the beginning to obtain a proper initial value. However, gathering a large amount of data at the beginning is not always possible in practice. All of these methods are limited to their partial observations of the environment, and the work looks endless, because there is always a new observation to account for.

Reinforcement learning, by contrast, offers the possibility of exploring strategies with less dependence on human intuition. It is also known for its capability of capturing complicated relationships among a large number of parameters. We therefore think that applying the RL technique properly allows us to explore more strategies without human blind spots. What is a proper initial value? When does the step size have to be large? Beyond the factors mentioned in previous research, are there other factors in the environment that affect the performance of the OLLA mechanism? What is the relationship between the initial value, the step size, and the information the base station can obtain? A well-designed reinforcement learning agent can explore the strategy in a more flexible and efficient way to improve the OLLA mechanism, provided the requirement is clear.
Table 3: Notation Table for Reinforcement Learning
Symbols          Definition
s, s'            states
a                action
r                reward
A(s)             set of all possible actions in state s
A                set of all possible actions
R                set of all possible rewards
s_t              state at time t
a_t              action at time t
r_t              reward at time t
G_t              return (cumulative discounted reward) following t
π                policy, decision-making rule
π(s)             action taken in state s
π(a|s)           probability of taking action a in state s
p(s', r|s, a)    probability of transition to state s' with reward r, from state s taking action a
p(s'|s, a)       probability of transition to state s', from state s taking action a
V_π(s)           value of state s under policy π
V_*(s)           value of state s under optimal policy
Consider Eq. (3.13). We attempt to formulate all the relevant probabilities in Eq. (3.13), so that it can be rewritten as

max_π E_π (4.1)

Eq. (4.1) resembles a Markov decision process. However, conventional methods such as dynamic programming cannot solve for the optimal function, because the model, and hence the transition probabilities, are unknown to the agent. Nevertheless, the problems that reinforcement learning aims to deal with are similar to the problem in this thesis: reinforcement learning is used to find a policy that achieves maximal reward over the long run. Many researchers have studied decision-making problems with reinforcement learning, so we can adopt suitable decision-making techniques from this body of research and then make some modifications to optimize the performance.
Following the basic concept of reinforcement learning [27], the framework of problem formulation suitable for reinforcement learning is defined as below,
where π denotes the policy of selecting the action a based on the state s, which is the observation from the environment. π(a|s) is the probability of the policy selecting a in s. G_t is the cumulative return function, which can be defined according to the situation. Basically, if a problem can be transformed into this form, it is appropriate to adopt reinforcement learning methods to solve it.
Compared with Eq. (4.1), the two formulas look similar. We can easily transform Eq. (4.1) into the form of the reinforcement learning problem, Eq. (4.2). G_t, π(a | s), and p(s', r | s, a) can be expressed as
\[
\begin{aligned}
G_t &= \sum_{k=0}^{T} r_{t+k} \\
\pi(a \mid s) &= P(MCS_i \mid s_t) \\
p(s', r \mid s, a) &= P(s', r_i \mid s_t, MCS_i)
\end{aligned}
\tag{4.3}
\]
From the above equations, we can see that the problem formulation can easily be transformed into the form of a reinforcement learning problem.
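As a small worked instance of the return in Eq. (4.3), the sketch below sums the rewards from time t to the end of an episode. The function name `episode_return` and the example reward sequence are hypothetical; no discount factor is applied, matching the sum as written in Eq. (4.3).

```python
def episode_return(rewards, t=0):
    """Undiscounted return G_t = r_t + r_{t+1} + ... until the episode ends."""
    return sum(rewards[t:])


# Hypothetical episode: two search steps, one heavily punished step, then stop.
example_rewards = [-1, -6, -1, 0]
g0 = episode_return(example_rewards)        # return from the first step
g2 = episode_return(example_rewards, t=2)   # return from the third step
```

Under a reward that is negative for every extra step, maximizing this return favors policies that reach the target MCS in as few transmissions as possible.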
In our case, s is the information about the environment that is observable to the base station, together with the historical information. The design of s is further explained in Section 4.3.1. a is the action the base station can take.
In this case, a ∈ M.
4.2 Train an OLLA Agent based on Reinforcement Learning
In this section, we elaborate on how we train an agent to assign the MCS based on the observation of the environment, realizing the OLLA mechanism.
In this work, TensorFlow [31] and Keras [32], which are Python-based, are adopted to train the agent. As a result, we have to duplicate the crucial part of the communication environment from VIENNA on the Python platform in order to train the agent. Although our environments are simulated through VIENNA [21], there are advantages to duplicating the communication environment in Python instead of applying the machine learning algorithm in VIENNA. First, it saves us plenty of time otherwise spent implementing the machine learning algorithm, which is complicated and delicate work. Our major task is to train an agent suitable for the problem in this thesis, rather than to implement the machine learning algorithm. Second, training the agent in VIENNA could cost much more time, because a communication simulator has many tasks, such as scheduling, calculating the channel response, and waiting for the response of the UE or the base station. Carefully duplicating only the essential part of the environment on the Python platform saves us plenty of time. The details of how we duplicate the VIENNA environment are elaborated in the following.
We know that the success of a transmission is decided by the mismatch gap between the real SINR and the tolerable SINR of a certain MCS. If the mismatch is large, the assigned MCS is either too aggressive or too conservative. That is to say, once the real channel is known, we can predict whether the feedback is NACK or ACK without waiting for the response from the UE, which is a time-consuming process. As a result, we can gather all the necessary data from the simulator before training. The simulator used in the thesis is VIENNA, which is able to generate the channel response based on the setting. Although the base station cannot know the perfect channel information, we can still find the perfect channel information in the simulator. Thus, channel responses can be stored and utilized afterward. Fig. 17 shows the components we implemented in VIENNA and Python, respectively.

Figure 17: Training Procedure for the proposed Algorithm. Labeled elements: VIENNA (Collect data; Implement Trained Agent); Python (Build Communication environment; Implement Training Algorithm); exchanged items (data of h for different PMI and CQI; trained model).

The training process is as follows:
1. The simulator generates all the necessary data for training, including the real channel response h_k, the PMI, and the CQI (the feedback estimated by the UE).
2. Duplicate the essential part of the communication system and build the environment in Python.
3. Train the agent in Python with a reinforcement learning algorithm.
4. Implement the trained agent as mentioned in Fig. 18.
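The idea of predicting ACK/NACK from stored channel data, which makes the steps above possible, can be sketched as a tiny offline environment. This is a simplified assumption-laden sketch: the class name `OfflineLinkEnv`, the deterministic per-MCS SINR thresholds, and the threshold values are all hypothetical and are not taken from VIENNA or from the thesis.

```python
class OfflineLinkEnv:
    """Replays stored real SINR values instead of waiting for UE feedback."""

    def __init__(self, sinr_thresholds_db):
        # sinr_thresholds_db[m]: minimum SINR in dB assumed to decode MCS m.
        # The list is expected to be non-decreasing in m.
        self.thresholds = list(sinr_thresholds_db)

    def feedback(self, real_sinr_db, mcs):
        """Return 'ACK' if the stored real SINR supports the MCS, else 'NACK'."""
        return "ACK" if real_sinr_db >= self.thresholds[mcs] else "NACK"

    def best_mcs(self, real_sinr_db):
        """Highest MCS index that would be acknowledged, or None if none is."""
        ok = [m for m, th in enumerate(self.thresholds) if real_sinr_db >= th]
        return max(ok) if ok else None


# Hypothetical thresholds for four MCS levels (illustrative values only).
env = OfflineLinkEnv([-5.0, 0.0, 5.0, 10.0])
```

Because the feedback is computed directly from the stored SINR, each training step costs a table lookup rather than a simulated round trip to the UE.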
When exploiting machine learning techniques, the data have to be obtained carefully. Otherwise, the trained agent could be biased by an improper dataset. This criterion also applies to reinforcement learning, because it is also a machine learning technique. Thus, we run the simulation with a random assignment of PMI. The assignment of SNR for each UE is based on the distribution of SNR given by MTK.
It is worth noting that, during training, the agent performs exploration and exploitation and updates its policy through interactions. By contrast, the trained agent only exploits its current policy and does not update it. The relationship between the trained agent and the other components in the communication system is shown in the next section.
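The difference between the training-time and deployed behavior can be illustrated with an epsilon-greedy rule. Epsilon-greedy is one common exploration scheme and is assumed here for illustration only; this section does not state which scheme is used. The function name `select_action` and the example action values are hypothetical.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over action values (one value per MCS).

    epsilon > 0: training, a random MCS is tried with probability epsilon.
    epsilon = 0: deployment, the best-valued MCS is always chosen.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit


# Hypothetical action values for three MCS levels.
q = [0.1, 0.9, 0.3]
deployed_choice = select_action(q, epsilon=0.0)
training_choice = select_action(q, epsilon=1.0, rng=random.Random(0))
```

Setting epsilon to zero reproduces the deployed agent described above: it only exploits the learned policy and performs no further updates.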
4.3 Proposed Mechanism in Communication System
Fig. 18 shows how we implement the trained agent in the communication system. First, the UEs return their channel state information (CSI), including PMI and CQI, following the LTE standard. The block labeled 'Modify the estimated CQI with Trained Agent' is where the proposed method operates. This block considers not only the CQI but also the historical data and the HARQ information returned by the UE, and it produces the modified CQI'. The scheduler applies CQI' to estimate the channel capacity of the UEs. The scheduler then uses this information to calculate the efficiency of the different combinations and chooses the best combination according to its policy.
Our mission is to train an agent that is capable of choosing the MCS in response to its knowledge of the environment. Thus, we focus on the block labeled 'Modify the estimated CQI with Trained Agent'. In this case, the information available to the base station is the CSI feedback, the ACK/NACK, and the historical data preserved in the base station. Every time the base station assigns a selected MCS to a scheduled UE, it records the assignment for that UE. The scheduled UEs then return ACK or NACK for the MCS assigned to them.
The next section describes how we train the agent.
4.3.1 The Design of State, Reward, and Neural Network
4.3.1.1 State and Reward
The basic components in reinforcement learning are s (state), r (reward), and a (action), as introduced in Chapter 2. s includes only the parameters observable to the agent. a is the decision made by the agent according to the observable s; that is, a = π(s). In the communication system, s is the information that the base station can obtain. We define a as the MCS selected in the next TTI according to s.

Figure 18: Diagram of proposed Mechanism in Communication System. Labeled elements: UE; Base Station; PF-scheduler; 'Modify the estimated CQI with Trained Agent'; Historical Data; PMI; CQI; CQI'; CSI from each user including PMI, CQI; NACK/ACK from the scheduled UE; transmitted data to the scheduled UE.
The design of s is very similar to feature extraction, since the output a is highly dependent on s. Feature extraction is known to have a significant impact on the performance of a system, so we discuss how we design s. The features have to include the relevant factors and exclude irrelevant factors as much as possible. In addition, the normalization of the features is an important issue. In practice, it is hard to know which features are relevant. Nevertheless, knowledge of the problem is very helpful for overall performance. In this work, our target function is defined as Eq. (4.1).
That is, we want to find the target MCS as fast as possible. Intuitively, the historical data, which record the assigned MCSs and the responses (ACK/NACK) to them, have an impact on the next selection of MCS.
Thus, s should take the historical data into consideration. In this work, the form of the historical data is designed carefully. We know that the next selected MCS should not be smaller than the maximal MCS known to be transmitted successfully. Likewise, the next selected MCS should not be larger than the minimal MCS known to lead to a failed transmission. As a result, instead of recording all transmission results, we simplify the form of the recorded historical data. Without simplification, the features are

\[ \left[\; \mathrm{ACK} \quad \mathrm{NACK} \quad \text{Not used} \;\right]_{N_{comb} \times N_{MCS}}. \]
The base station maintains this matrix for each user to record the result for each MCS while co-scheduling with different users. The result can be ACK or NACK. MCSs that have not been assigned are recorded as Not used.
With simplification, the matrix for each user becomes

\[ \left[\; \mathit{maxAckMCS} \quad \mathit{minNackMCS} \;\right]_{N_{comb}}, \]

which records only the maximal MCS receiving ACK and the minimal MCS receiving NACK among the assigned MCSs. In theory, eliminating irrelevant features should improve the performance and prevent overfitting. The simplified matrix is chosen in the thesis because it performs better. The results are verified in Fig. 32 in Chapter 5.
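A minimal sketch of how the simplified record could be maintained is given below. The function names, the list-of-rows layout, and the sentinel values for "no ACK yet" and "no NACK yet" are assumptions of this sketch and are not specified in the thesis.

```python
def new_history(n_comb, n_mcs):
    """One [maxAckMCS, minNackMCS] row per co-scheduling combination.

    Sentinels (an assumption of this sketch): -1 means no ACK has been
    seen yet, n_mcs means no NACK has been seen yet.
    """
    return [[-1, n_mcs] for _ in range(n_comb)]


def update_history(history, pair, mcs, ack):
    """Fold one ACK/NACK result for `mcs` into the row of combination `pair`."""
    row = history[pair]
    if ack:
        row[0] = max(row[0], mcs)  # largest MCS known to succeed
    else:
        row[1] = min(row[1], mcs)  # smallest MCS known to fail
    return history


# Hypothetical sequence of results for combination 0 of one user.
hist = new_history(n_comb=2, n_mcs=16)
update_history(hist, pair=0, mcs=5, ack=True)
update_history(hist, pair=0, mcs=9, ack=False)
update_history(hist, pair=0, mcs=7, ack=True)
```

Compared with the full N_comb × N_MCS matrix, each combination keeps only two numbers, and the gap between them is the range that still has to be searched.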
Based on our analysis, the combination of the other beams has an influence on the desired MCS. Furthermore, it can be observed in Fig. 15 that if cos θ is known, the range of possible MCSs is narrowed down. For example, assume there are five possible co-scheduled PMIs for a certain UE, and that we can obtain a suitable MCS for the UE when it is scheduled with some of these PMIs. The more accurate these MCSs are, the narrower the range of cos θ becomes. As a result, the possible range of MCS for the remaining unknown combinations becomes smaller. That is to say, knowing the correct MCSs of a UE scheduled with distinct PMIs might accelerate finding the MCS.
Also, since it can be observed in Fig. 16 that different PMIs have distinct characteristics, we consider the PMI as one of the features.
In short, we put the historical data, the PMI and CQI of the user, and the PMI of the co-scheduled user into s. How to transform the features is also a crucial issue. PMI and CQI are treated as categorical features, and we transform them with one-hot encoding, a well-known encoding for categorical features. After several comparisons and experiments, our final choice of s is given in Eq. (4.4). The comparison of different feature designs is shown in Fig. 32 in Chapter 5.
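The feature transformation described above can be sketched as follows. Only the one-hot encoding of PMI and CQI is taken from the text; the function names, the zero-based indices, the one-hot encoding of the pair index, and the scaling of the MCS bounds by the number of MCSs are assumptions made for this illustration.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_state(history, pair, cqi, pmi, n_mcs, n_cqi, n_pmi):
    """Flatten the history, pair index, CQI and PMI into one feature vector.

    history: list of [maxAckMCS, minNackMCS] rows, one per combination
    pair, cqi, pmi: zero-based indices (an assumption of this sketch)
    """
    n_comb = len(history)
    # Scale the MCS bounds by n_mcs so they are of order one (assumed
    # normalisation; the thesis does not give the exact scaling here).
    hist = [value / n_mcs for row in history for value in row]
    return (
        hist
        + one_hot(pair, n_comb)
        + one_hot(cqi, n_cqi)
        + one_hot(pmi, n_pmi)
    )
```

The resulting vector has 2·N_comb history entries followed by the three one-hot segments, so its length is fixed and it can be fed directly to a neural network input layer.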
Fig. 19 shows how the proposed OLLA works with the proposed features.
The reward function gives the score for the transition from s_i to s_j when applying a.
Since our aim is to train an agent that can be applied to base stations, the reward rule has to take the practical environment into consideration.
The base station cannot be sure whether the assigned MCS is correct until it finds the lower bound of the unavailable MCSs and the upper bound of the available MCSs. Thus, we take $|\mathit{maxAckMCS}_{i,t}^{Pair_{i,j}} - \mathit{minNackMCS}_{i,t}^{Pair_{i,j}}| \le 1$ as the stopping criterion. Also, the last assigned MCS should be available, so the zero reward additionally requires $a_{i,t} = \mathit{maxAckMCS}_{i,t}^{Pair_{i,j}}$. The punishment score differs according to the condition. Since an MCS that is higher than the current unavailable MCS or lower than the current available MCS obviously should never be tried, it is punished more heavily. The reward function can be represented as Eq. (4.5). An improper reward magnitude might cause the agent to fail to converge. We find the proper magnitude of the reward through experiments.

Table 4: Design of s, r, and a

State:
\[ s_{i,t} = \left( {\left[\; \mathit{maxAckMCS} \quad \mathit{minNackMCS} \;\right]_{N_{comb}}}_{i,t},\; Pair_{i,j},\; CQI_{i,t},\; PMI_{i,t} \right) \tag{4.4} \]

Reward:
\[
r_{i,t} =
\begin{cases}
-6, & \text{if } a_{i,t} < \mathit{maxAckMCS}_{i,t}^{Pair_{i,j}} \text{ or } a_{i,t} > \mathit{minNackMCS}_{i,t}^{Pair_{i,j}} \\
0, & \text{else if } |\mathit{maxAckMCS}_{i,t}^{Pair_{i,j}} - \mathit{minNackMCS}_{i,t}^{Pair_{i,j}}| \le 1 \text{ and } a_{i,t} = \mathit{maxAckMCS}_{i,t}^{Pair_{i,j}} \\
-1, & \text{otherwise}
\end{cases}
\tag{4.5}
\]

Action:
a_{i,t} is the selected MCS in the next TTI according to s_{i,t} for UE_i. a_{i,t} ∈ M.

Parameters        Description
maxAckMCS         The maximal MCS that received ACK among the previously allocated MCSs.
minNackMCS        The minimal MCS that received NACK among the previously allocated MCSs.
N_comb            The maximal possible number of co-scheduled PMIs.
N_MCS             The number of MCSs.
M                 Set of MCSs.
CQI_{i,t}         The CQI of the scheduled UE_i.
PMI_{i,t}         The PMI of the scheduled UE_i.
PMI_{j,t}         The PMI of the co-scheduled UE_j.
Pair_{i,j}        The index of the pair while the scheduled UE_i is co-scheduled with UE_j.
a_{i,t}           The selected MCS for UE_i.
realSINR_{i,j}    The real SINR of UE_i while it is co-scheduled with UE_j.

Figure 19: Block of 'Modify the estimated CQI with Trained Agent' in Fig. 18 in Detail. Labeled element: Trained Neural network.
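The reward rule of Eq. (4.5) translates directly into a small function. The sketch below follows the three cases as written; the function and argument names are our own, and it assumes that both bounds are the ones stored for the current pair at the time the action is chosen.

```python
def reward(action, max_ack_mcs, min_nack_mcs):
    """Reward of Eq. (4.5) for one UE, one pair and one TTI.

    action:       selected MCS index a_{i,t}
    max_ack_mcs:  largest MCS known to be acknowledged for this pair
    min_nack_mcs: smallest MCS known to be rejected for this pair
    """
    if action < max_ack_mcs or action > min_nack_mcs:
        return -6  # outside the range that can still contain the target
    if abs(max_ack_mcs - min_nack_mcs) <= 1 and action == max_ack_mcs:
        return 0   # bounds are adjacent and the available MCS is chosen
    return -1      # still searching inside the admissible range
```

For example, with maxAckMCS = 5 and minNackMCS = 9, choosing MCS 3 costs −6 while choosing MCS 7 costs −1; once the bounds are 5 and 6, choosing MCS 5 yields 0 and ends the search.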