Radio Resource Management in Self-organized Networks: Distributed Learning and Robust Strategies

Doctoral Dissertation

Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University

Student: Li-Chuan Tseng (曾理銓)

Advisor: Ching-Yao Huang (黃經堯)

A Dissertation Submitted to Department of Electronics Engineering and Institute of Electronics, College of Electrical and Computer Engineering, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Electronics Engineering

December 2013

Hsinchu, Taiwan

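Radio Resource Management in Self-organized Networks: Distributed Learning and Robust Strategies

Student: Li-Chuan Tseng

Advisor: Ching-Yao Huang

Doctoral Program, Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University

Abstract (translated from the Chinese original)

Owing to their flexibility in spectrum use, self-organized networks are regarded as an important solution for meeting the ever-growing demand for mobile traffic. In a self-organized network, the nodes sharing the spectrum are distributed, so radio resource management must be carried out by the individual nodes. Moreover, the radio resource management decisions of the nodes affect one another's performance, so distributed radio resource management methods that account for the interaction among nodes are needed. To this end, this dissertation applies a diverse set of mathematical tools, including game theory, information theory, and stochastic learning, to the modeling and solution of radio resource management problems. Although popular topics in self-organized networks, such as heterogeneous networks and cognitive radio networks, have already been studied in depth, the novelty of our work lies in distributed learning algorithms that let each node self-organize and adapt even when information is limited.

The dissertation first introduces the relevant mathematical tools, including the fundamentals of game theory and an introduction to stochastic learning algorithms. A literature survey on cognitive radio networks follows. We then present four application examples. In each example, we build a game-theoretic model for a radio resource management problem that may arise in distributed networks. The nodes in the network are treated as automata with autonomous learning capability, which learn appropriate resource management strategies from their individual action-reward histories. We also evaluate the convergence of the learning process and its performance through numerical simulations.

Keywords: self-organized networks, radio resource management, game theory, stochastic learning algorithms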


Radio Resource Management in Self-organized Networks:

Distributed Learning and Robust Strategies

Student: Li-Chuan Tseng

Advisor: Dr. Ching-Yao Huang

Department of Electronics Engineering & Institute of Electronics

National Chiao Tung University

ABSTRACT

Self-organized networks (SoNs) are regarded as an important solution to the increasing demand for mobile traffic because of their flexibility in spectrum access. In SoNs, the nodes sharing the spectrum are spatially distributed, and radio resource management (RRM) must be performed by individual nodes. Moreover, since the RRM decisions of the nodes affect one another's performance, distributed RRM methods that account for the interactions among nodes are desirable. To this end, this thesis draws on a diverse set of mathematical tools, including game theory, information theory, and stochastic learning, to formulate and solve RRM problems in SoNs. While emerging SoN topics such as heterogeneous networks and cognitive radio networks (CRNs) have been studied intensively, the novelty of our work lies in enabling self-organization and adaptation under limited information by means of distributed learning methods.

We begin with the underlying mathematics, including game theory fundamentals and an introduction to the stochastic learning algorithm. A survey of CRNs follows. Four application examples are then provided. In each example, a game-theoretic framework is adopted to formulate an RRM problem that may arise in distributed networks. The nodes are modeled as self-organized learning automata, which learn proper RRM strategies from their individual action-reward histories. The convergence of the learning procedure and its performance are evaluated via numerical simulations.

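Acknowledgments

With the completion of this dissertation, my long years in the doctoral program are drawing to a close. I first thank my advisor, Dr. Ching-Yao Huang, whose guidance and support throughout made it possible for me to complete this dissertation. Counting my undergraduate project and my master's studies, his mentorship spans nearly ten years, and I will never forget it.

In the fourth year of my doctoral studies, I left National Chiao Tung University for a time, arrived in Paris on a snowy morning, and went on to study at Télécom Sudparis in France. During my stay in France, I was fortunate to receive the help and guidance of Professors D. Zhaglache and A. Marzouki, which made my year and a half abroad very fulfilling. I also owe special thanks to Professor H. Tembine of Supélec, France, whose advice on research directions and detailed explanations of the mathematical theory led to an important breakthrough in my research.

Because I underestimated the difficulty of doctoral study at the outset, my research was not without setbacks. Fortunately, besides the support of my advisors in both Taiwan and France, I received help from generous people at every stage. I especially thank Professor F.-T. Chien (簡鳳村) of the Institute of Electronics, and Dr. R. Y. Chang (張佑榕) and Dr. W.-H. Chung (鍾偉和) of Academia Sinica, for their contributions to my key papers. Despite their own busy research and teaching schedules, all three carefully revised my drafts and offered suggestions, and I learned a great deal from the process. I also thank my classmates and friends 冠穎, Robert, Maria, 金鑫, and others for the exchange of ideas in research and for looking after me in daily life.

Finally, I thank my family for their care, encouragement, and financial support throughout my studies, which allowed me to complete my degree without worry.

Li-Chuan Tseng

December 2013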

I The Backgrounds

2 Stochastic Learning in Games
2.1 Introduction
2.2 Non-cooperative Game Theoretical Concepts
2.2.1 Game with External State
2.2.2 Mixed Strategy Extension
2.2.3 Potential Games
2.3 Evolutionary Game and Replicator Dynamics
2.3.1 Replicator Dynamics
2.3.2 Stochastic Game
2.4 Stochastic Learning Algorithm
2.4.1 Generic SLA Structure
2.5 Update Rules
2.5.1 Bush-Mosteller (BM) Update Rule
2.5.2 Multiplicative-weight Update Rule
2.6 Convergence of the Proposed Algorithm
2.6.1 Potential Games
2.6.2 Non-potential Games
2.7 Applications: Game Theoretic Modeling
Appendix 2.A Assumptions for Stochastic Approximation

3 A Survey on the Spectrum Access of Cognitive Radio Networks
3.1 Cognitive Spectrum Access
3.2 Opportunistic Spectrum Access
3.3 Spectrum Trading with Single Seller
3.4 Spectrum Trading with Multiple Seller
3.4.1 Exclusive Access
3.4.2 Shared Access

II Examples of Fully Distributed Learning

4 Network Selection in Cognitive Heterogeneous Networks
4.1 Introduction
4.1.1 Game-theoretic Problem Mapping
4.2 System Model
4.3 Self-Organized Network Selection
4.3.1 Game Model
4.3.2 Analysis of Nash Equilibrium
4.3.3 Stochastic Learning Procedure
4.4 Numerical Results
4.5 Concluding Remarks

5 Spectrum Trading in Multiple-Seller Cognitive Radio Networks
5.1 Introduction
5.2.1 Spectrum Trading Mechanism
5.2.2 Two-level Competition as a Stackelberg Game
5.3 Service Selection of Secondary Users
5.3.1 Game Model
5.3.3 Stochastic Learning Procedure for Service Selection
5.3.4 Social Welfare and Price of Anarchy
5.4 Price Competition among Service Providers
5.4.1 Game Model
5.4.2 Stochastic Learning Procedure for Price Competition
5.5.1 Convergence Behavior of the Lower-level Game
5.5.2 Performance Comparison in the Lower-level Game
5.5.3 Convergence Behavior of the Upper-level Game

5.6 Conclusion and Open Issues

III Examples of Distributed Learning with Partial Cooperation

6 Self-organized Channel Assignment in Two-tier Distributed Networks
6.1 Introduction
6.1.1 Examples of Two-tier Distributed Networks
6.1.2 Contributions
6.1.3 Game-theoretic Problem Mapping
6.2 Related Works
6.2.1 Variations of Frequency Planning
6.2.2 Learning-based Methods
6.4 Game-theoretic Model
6.4.1 Problem Formulation and Game Model
6.5 Stochastic Learning Procedure
6.6.1 Convergence of the proposed SL-based learning algorithm
6.6.2 Capacity performance
6.7 Concluding Remarks

7 Distributed Channel Allocation in Network MIMO
7.1 Introduction
7.2 Related Works
7.2.1 Precoding with BS Cooperation
7.2.2 Spectrum Sharing
7.3.1 The Network MIMO System
7.4 Channel Selection for Network MIMO
7.4.1 Game-Theoretic Formulation
7.4.2 Existence of Nash Equilibrium
7.4.3 Acquisition of the Interference Information
7.5 Stochastic Learning-based Channel Selection Algorithm
7.5.1 Algorithm Description
7.5.2 Convergence Properties of the Proposed Algorithm
7.6 Numerical Results and Discussions
7.6.1 Convergence Behaviors of the Proposed Learning Algorithm
7.6.2 Capacity Performance for Different Channel Selection Strategies
7.6.3 Capacity Performance and Fairness for Different Precoding Schemes
7.6.4 Performance Results for Distributed Networks with Random Geometry
7.7 Concluding Remarks and Open Issues

8 Conclusion and Perspectives

List of Figures

1.1 Two scenarios in cooperative communications.

2.1 Replicator dynamics in evolutionary game and stochastic game.

2.2 Generic SLA structure.

3.1 Cognitive radio network architecture.

4.1 An exemplary heterogeneous network with 2 SPs, 3 PUs, and 4 SUs. The filled and blank blocks in the licensed band of each SP denote the busy channels currently used by the PUs and the residual channels available for serving the SUs, respectively.

4.2 Evolution of the mixed strategies (choice probability of actions) of some players, using different learning rates.

4.3 Test of unilateral deviation from the resulting strategy profile of each of the 10 players, using different learning rates.

5.1 An exemplary cognitive radio network with 2 SPs, 3 PUs, and 4 SUs. The filled and blank blocks in the licensed band of each SP denote the busy channels currently used by the PUs and the residual channels available for serving the SUs, respectively.

5.2 Evolution of the mixed strategies (probability of taking different actions) of all players. Each pair of $p_{i,1}(j)$ and $p_{i,2}(j)$ shows the behavior of a player $i \in \mathcal{N}$.

5.3 Test of unilateral deviation from the learned strategy profile of each of the $N = 6$ players, with learning rates $b = 0.3$ and $b = 0.5$.

5.4 Evolution of the actions $a_i(j)$ for selected players.

5.5 Comparison of the average (normalized) utility per SU for different service selection schemes.

5.7 Evolution of the mixed strategies (probability of taking different actions) of the $M = 2$ sellers.

5.8 Price and revenue dynamics of the $M = 2$ sellers.

5.9 Test of different strategies. For each seller, the four bars show its revenues when taking the four different pricing strategies, while its opponent sticks to the learned strategy.

5.10 Drift in user loads. For each seller, the dynamics of prices, revenues, and estimated revenues are shown.

6.1 Possible interference scenarios related to femtocell communications.

6.2 Dual-stripe deployment of sensor clusters.

6.3 Exemplary time slot allocation in a frame. In the first slot, cluster heads A and C assign channels for sensor nodes A1 and C1, respectively.

6.4 Evolution of the mixed strategies (probability of taking different actions) of all players. Each pair of $p_{i,1}(t)$ and $p_{i,2}(t)$ shows the behavior of player $i$.

6.5 Test of unilateral deviation from the resulting strategy profile of each of the 10 players.

6.6 Evolution of the actions $a_i(j)$ for some players.

6.7 Evolution of the mixed strategies (probability of taking different actions) of all players with active ratios of 50% and 75%. Each pair of $p_{i,1}(t)$ and $p_{i,2}(t)$ shows the behavior of a player $i \in \mathcal{N}$.

6.8 Test of unilateral deviation from the resulting strategy profile of each of the 10 players.

7.1 Illustration of distributed channel selection with joint precoding in multicell networks. For MS1, $C_1 = \{1, 3\}$ and $D_1 = \{1\}$, where BS1 and BS3 both receive CSI feedback from MS1 and perform interference mitigation but only BS1 serves MS1. For MS2, $C_2 = \{1, 2, 3\}$ and $D_2 = \{2, 3\}$, where BS2 and BS3 jointly serve MS2 while all three BSs perform interference mitigation. For MS3, $C_3 = \{3\}$ and $D_3 = \{3\}$, where only BS3 serves MS3.

7.2 Evolution of the mixed strategies (probability of taking different actions) of four selected players when joint processing is adopted.

7.3 Evolution of the estimated cost of taking different actions for two selected players (marked by blue and red colors, respectively) when joint processing is adopted.

7.4 Cost and capacity for each player for the NE strategy and unilateral

7.5 Comparison of the achievable capacity for three channel selection strategies when joint processing is adopted.

7.6 A snapshot of the nodes' positions and network topology. The link ID is shown in parentheses next to the link.

7.7 Evolution of the mixed strategies of four selected players when local precoding is applied to the distributed network.

7.8 (a) Achievable capacity for different channel selection strategies. (b) Cost for each player for the NE strategy and unilateral deviation from the NE strategy.

List of Tables

2.1 Comparison

2.2 Information Available to the Players

2.3 Summary of Notations in Game-theoretic Formulation

2.4 Summary of Symbols for Game-theoretic Formulation

4.1 Mapping to game-theoretic formulation.

4.2 Summary of Symbols for Game-theoretic Formulation

4.3 Comparison of the achievable expected system throughput of three network selection schemes

5.1 Elements in a Stackelberg game

5.2 Summary of Notations for Game-theoretic Formulation

5.3 Simulation Parameters

6.1 Mapping to game-theoretic formulation.

6.3 Simulation Parameters

6.4 Comparison of the capacity and fairness for different channel assignment schemes

7.2 The Simulation Setup

7.3 Capacity per MS (bps/Hz) for Different Combinations of Channel Selection and Precoding Schemes

7.4 JFI (7.31) for Different Combinations of Channel Selection and Precoding Schemes


Chapter 1 Introduction

The continuous evolution of wireless networking technology over the last decade has significantly changed the way we communicate, acquire information, and entertain ourselves. Through the mobile Internet, today's personal communication devices provide more and more services, such as social networking and video streaming. As a consequence, the demand for wireless resources keeps increasing. To better utilize the shared wireless medium, new network topologies and resource management schemes are important. This constitutes the goal of this thesis: improving spectrum efficiency in wireless systems through cooperative communications and self-organized, distributed radio resource management.

1.1 Background and Motivations

Achieving reliable, high-data-rate communication over wireless links remains a challenging problem. In fact, the inherent nature of the wireless medium has created a number of new research topics. Unlike in wireline communication, the wireless medium is a ubiquitous resource that can be accessed simultaneously by multiple transmissions. Sharing the medium among multiple links results in a mutually interfering environment and gives rise to challenges in resource management. In conventional cellular networks consisting of multiple base stations, frequency planning is adopted. However, we have to consider universal frequency reuse, for two reasons. First, a frequency reuse factor larger than one limits the spectrum efficiency, in that only a fraction of the spectrum is utilized by each cell regardless of the actual interference condition. Second, in newly developed network topologies, the base stations can be deployed in a distributed manner, which makes cell planning hard. Obviously, universal frequency reuse among nearby cells results in inter-cell interference (ICI) and degrades performance. This observation, though straightforward, lies at the basis of many research topics in wireless communications. We mention two examples.

Cooperative communications.

The broadcast nature of wireless communications means that a receiver node can overhear the source signal transmitted toward a neighboring node. Instead of treating the overheard information as interference and trying to mitigate its negative effect, cooperative communication takes advantage of the proximity of nodes to create spatial diversity, thereby improving spectrum efficiency and reliability. In practice, the cooperation can be implemented in different ways. In relay (multi-hop) networks, the signal is received and processed at the surrounding nodes, then retransmitted toward the destination. When multi-antenna systems are considered, on the other hand, signal processing techniques can be applied to transmit the signal simultaneously from multiple nodes. In this case, the signal to be transmitted is pre-processed to suppress the ICI and obtain diversity gain. Assuming perfect backhaul connections, the network consisting of multiple cells can be viewed as a virtual MIMO system. The two scenarios are shown in Figure 1.1.

Self-organized resource management.

The limited coordination available in distributed networks gives rise to new challenges for resource management. Self-organized network (SoN) capability has therefore received much attention because, unlike negotiation-based approaches, it does not suffer from information exchange overhead. SoN has been considered in different settings. From the spectrum utilization perspective, dynamic spectrum access (DSA) calls for a distributed decision-making mechanism that accounts for a possibly varying environment. Another example is heterogeneous networks, in which the spectrum is owned by multiple service providers and users need proper network selection. Two fundamental mathematical tools frequently used in SoN are game theory and reinforcement learning (RL). Game theory investigates the interaction among self-acting agents, in either cooperative or non-cooperative ways. A game-theoretic formulation defines the solution concept of an equilibrium, a point from which no agent can obtain a better result by unilateral deviation. RL algorithms, on the other hand, help individual agents learn a better strategy based on their own action-reward history. Interestingly, the two tools may be combined; several reinforcement learning techniques have been proved to reach the equilibrium point.

Figure 1.1: Two scenarios in cooperative communications. (Panels: relay network; virtual MIMO.)

This thesis investigates distributed resource management in wireless communications. Specifically, we study the use of reinforcement learning under game-theoretic formulations. The motivation is that, while the problem structures can be quite different, we would like to propose a unifying scheme suitable for various applications. A general guideline of the proposed scheme is as follows. First, some components of the network (e.g., base stations or users) are identified as the agents (players) of the game. Second, the utility function is defined to reflect the agents' interests, either individual or common. Finally, assuming they are rational and selfish devices, the agents act as learning automata and learn the strategies that maximize their individual payoffs. Note that, in addition to the interaction among players, the time-varying external state is also considered in the learning procedure.

Starting with the seminal contributions of von Neumann and Morgenstern [1] and Nash [2], game theory was extensively investigated in the previous century. While early works focused on economics, game theory has become a popular tool for researchers in wireless networks. Comprehensive surveys of game-theoretic studies for different wireless network applications can be found in [3,4]. RL algorithms have also developed rapidly over the past few decades. Q-learning [5] is a simple way for agents to learn how to act optimally in controlled Markovian domains. It works by successively improving its evaluations of the quality (Q-value) of particular actions at particular states. Another learning method, referred to as stochastic learning (SL), is based on updating the action probabilities. Using techniques from stochastic approximation [6], the SL process tracks the ordinary differential equation (ODE) of the underlying dynamics, and the resulting state depends on the learning rule adopted. Hybrid learning, in which the agents may adopt different learning rules to obtain their strategies, was discussed in [7]. SL has been applied to several areas in wireless networks, for example, precoder selection [8], network selection [9], and cognitive radio [10]. The connection between learning and games has been investigated by Sastry et al. [11]. The authors proposed an SL algorithm and pointed out that a Nash equilibrium (NE) can be achieved when the algorithm is applied to common-payoff games. In this thesis, we further show that the same algorithm achieves NE for potential games, of which the common-payoff game is a special case.
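To make the probability-update idea concrete, the following is a minimal sketch of one well-known SL rule, linear reward-inaction, applied to a toy two-player common-payoff game. The payoff values, the learning rate `b`, and all function names are illustrative assumptions rather than the update rules or parameters of this thesis (those are presented in Chapter 2), and rewards are assumed to be normalized to [0, 1].

```python
import random


def lri_update(p, action, reward, b):
    """Linear reward-inaction step: move probability mass toward the chosen
    action in proportion to the reward (assumed normalized to [0, 1])."""
    return [(1 - b * reward) * q + (b * reward if k == action else 0.0)
            for k, q in enumerate(p)]


def common_payoff(a0, a1):
    """Hypothetical common payoff: 1 if the players pick different channels,
    0.2 if they collide on the same channel."""
    return 1.0 if a0 != a1 else 0.2


def run(steps=5000, b=0.05, seed=0):
    """Let two learning automata play the toy game repeatedly and return
    their final mixed strategies."""
    rng = random.Random(seed)
    probs = [[0.5, 0.5], [0.5, 0.5]]
    for _ in range(steps):
        # Each player samples an action from its own mixed strategy.
        acts = [rng.choices([0, 1], weights=p)[0] for p in probs]
        r = common_payoff(*acts)
        # Each player updates using only its own action and the reward.
        probs = [lri_update(p, a, r, b) for p, a in zip(probs, acts)]
    return probs
```

Each player observes only its own action and the resulting reward, which matches the limited-information setting described above. A smaller `b` slows learning but typically makes convergence to a pure strategy more reliable.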

1.2 Thesis Outline and Contributions

The main content of the thesis is divided into three parts. In Part I (Chapters 2 and 3), we review the fundamental mathematical tools and provide a survey on cognitive radio networks. Part II (Chapters 4 and 5) provides two application examples of fully distributed learning in distributed resource management. Part III (Chapters 6 and 7) studies the case of distributed learning with partial cooperation. The following is an overview of each chapter.

Chapter 2. This chapter introduces the different concepts that will be used throughout the thesis, together with the fundamental mathematics. The basic ideas in game theory are first reviewed. The resource management problem is formulated as a non-cooperative game, and the existence and multiplicity of the Nash equilibrium (NE) solution are investigated for two different network models. In the second part of this chapter, the stochastic learning algorithm (SLA) is explained in detail. We give the structure of the SLA and present several update rules. At the end, we show that under certain conditions, the SLA converges to NE.

Chapter 3. The first three examples in this thesis are all related to the spectrum access behaviors of cognitive radio networks (CRNs). Therefore, before entering the examples, we devote a chapter to reviewing the previous works on CRNs. The spectrum access in CRNs is classified into different models according to the way the spectrum is granted to the secondary users. The representative works on each model are then summarized.

Chapter 4. This chapter presents the first application: the network selection problem in cognitive heterogeneous networks (HetNets), where multiple radio access technologies (RATs) coexist. We formulate the network selection problem as a non-cooperative game in which the secondary users (SUs) are the players. In particular, under a cognitive access scenario, the availability of channels for SUs depends on the traffic demands of the primary users (PUs) and is treated as the time-varying external state. With a reasonably designed utility function, we prove that the game is an ordinal potential game (OPG). The SLA is adopted, and each SU's strategy progressively evolves toward the Nash equilibrium (NE) based on its own action-reward history, without the need to know the actions of other SUs. The convergence property and the performance in terms of throughput and fairness are shown through simulations.

Chapter 5. As the second application example of the SLA, this chapter studies spectrum trading in CRNs. Unlike in the first example, the licensed spectrum opportunities are now sold to multiple unlicensed secondary users by multiple service providers. The spectrum trading is modeled as a multi-leader multi-follower Stackelberg game with two levels of competition. In the lower-level competition, each secondary user selects a service provider with time-varying channel availability. The service selection is determined by the prices and the quality of service, which depends on the number of residual channels and the behavior of other secondary users. In the upper-level competition, service providers adjust their pricing strategies to maximize their individual revenues. We further propose decentralized, stochastic learning-based algorithms for both levels, in which a player's strategy progressively evolves toward the Nash equilibrium (NE) based on its own action-reward history, without information about other players' actions. The convergence properties of the proposed algorithms toward NE points are theoretically and numerically verified. The proposed method demonstrates good utility and fairness performance for the secondary users compared to other service selection schemes.

Chapter 6. The third example considers channel assignment in OFDMA-based two-tier distributed networks. The secondary users are modeled as the players, and the strategy is the channel assignment. There are two major differences from the previous examples. First, unlike the previous examples, where a resource unit is granted by the owner to a specific user, here we consider the case in which all users access the same spectrum, so that an interference mitigation game is formed. Second, each player is allowed to know the actions of its neighbors. In this way, a proper utility function can be defined, and the channel assignment problem is formulated as an ordinal potential game, which has at least one pure-strategy Nash equilibrium (NE). The stochastic learning algorithm discussed in Chapter 2 is then applied. The convergence property toward pure-strategy NE points is verified through system-level simulations. In addition, performance evaluation is carried out by comparing the proposed algorithm with other methods.

Chapter 7. The last example addresses joint processing and distributed channel assignment in network MIMO systems. Cooperative frequency reuse among base stations (BSs) can improve the system spectral efficiency by reducing the inter-cell interference (ICI) through channel selection and precoding. We present a game-theoretic study of channel selection for realizing network MIMO operation under time-varying wireless channels. We propose a new joint precoding scheme with enhanced interference mitigation and capacity improvement abilities for network MIMO systems. We formulate the channel selection problem as a non-cooperative game with the BSs as the players, and show that our game is an exact potential game (EPG) given the proposed utility function. A decentralized, stochastic learning-based algorithm is proposed in which each BS progressively moves toward the Nash equilibrium (NE) strategy based on its own action-reward history rather than on the actions taken by others. The convergence properties of the proposed learning algorithm toward a pure-strategy NE point are theoretically shown and numerically verified for different network topologies. Extensive link-level simulations also show that the proposed learning algorithm achieves good capacity and fairness performance compared to other schemes.

1.3 Publications

The research work conducted during the three years of the thesis has led to several publications.

International Journal Articles

• L.-C. Tseng, F.-T. Chien, D. Zhang, R. Y. Chang, W.-H. Chung, and C.-Y. Huang, “Network Selection in Cognitive Heterogeneous Networks Using Stochastic Learning,” to appear in IEEE Communications Letters.

International Conference Proceedings

• C.-H. Lin, L.-C. Tseng, and C.-Y. Huang, “Cognitive Radio Networks: Game Modeling and Self-organization Using Stochastic Learning,” in Proc. IEEE PIMRC 2013, Sept. 2013, pp. 3006-3010.

• L.-C. Tseng, X. Jin, A. Marzouki, and C.-Y. Huang, “Downlink Scheduling in Network MIMO Using Two-Stage Channel State Feedback,” in Proc. IEEE VTC


• L.-C. Tseng, C.-Y. Huang and A. F. Hanif, “Dynamic resource management for OFDMA-based Femtocells in the Uplink,” in Proc. IEEE IWCMC ’11, July 2011, pp. 528-533.

Submitted Articles

• L.-C. Tseng, F.-T. Chien, C.-Y. Huang, R. Y. Chang, W.-H. Chung, and A. Marzouki, “Self-Organized Cognitive Sensor Networks: Distributed Channel Assignment for Pervasive Sensing” (Submitted).

• L.-C. Tseng, F.-T. Chien, C.-Y. Huang, R. Y. Chang, W.-H. Chung, and A. Marzouki, “Distributed Channel Allocation for Network MIMO: Game-Theoretic Formulation and Stochastic Learning” (Submitted).


Part I


Chapter 2 Stochastic Learning in Games

This chapter introduces the stochastic learning algorithm used in the game models of this thesis.

2.1 Introduction

There has been much interest in designing learning algorithms that converge to the NE in non-cooperative games. In our setting, however, the external state (channel state information, CSI) is unknown, and the action is selected by each player simultaneously and independently in each play. Therefore, previous algorithms that require complete information and an implicit ordering of the acting players (e.g., those based on better response dynamics (BRD) [12] and fictitious play (FP) [13]) may not be feasible in our self-organized multicell resource allocation problem. In this chapter, we develop a decentralized SL-based algorithm in which the BSs move toward the equilibrium strategy based on their individual action-reward histories.

2.2 Non-cooperative Game Theoretical Concepts

In this section, we briefly review some game-theoretic concepts that serve as the basis for the rest of the manuscript. We consider rational and selfish game players, in the sense that a player chooses its best strategy to maximize its own benefit [12].

2.2.1 Game with External State

The four basic components of a non-cooperative game $\mathcal{G}$ with external state are:

• The external state space $\mathcal{X}$. The state is represented by an independent random variable, and the transitions between the states are independent of the chosen actions.

• The set of players, $\mathcal{N} = \{1, \ldots, N\}$, where $N$ is the total number of players.

• The action spaces $\mathcal{A} = \{\mathcal{A}_1, \ldots, \mathcal{A}_N\}$, where $\mathcal{A}_i$ is the set of actions that player $i$ can take. These nonempty sets can be discrete or continuous, finite or infinite.

• The preference structure of the players, $\{u_i\}_{i \in \mathcal{N}}$, where $u_i$ is the utility function of player $i$, which depends on its own action as well as the actions of other players.

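The strategic form (also called normal form) of a game $\mathcal{G}$ is represented by a 4-tuple:

$$\mathcal{G} = (\mathcal{X}, \mathcal{N}, \{\mathcal{A}_i\}_{i \in \mathcal{N}}, \{u_i\}_{i \in \mathcal{N}}) \tag{2.1}$$

For a game with external state, the utility is defined as the expectation of the random reward, i.e.,

$$u_i(a_i, a_{-i}) = \mathbb{E}_{\mathcal{X}}[r_i \mid (a_i, a_{-i})],$$

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation operator.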
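As a small illustration of the utility defined as an expectation over the external state, the sketch below averages a state-dependent reward over a discrete state distribution. The state distribution, the reward function, and all names are hypothetical and chosen only to make the computation easy to follow; they are not the system model of this thesis.

```python
def expected_utility(reward, state_probs, profile, player):
    """u_i(a) = E_X[r_i | a]: average the random reward over the external state."""
    return sum(prob * reward(player, profile, state)
               for state, prob in state_probs.items())


# Hypothetical external state: the number of free channels (1 or 3),
# each occurring with probability 0.5, independent of the actions.
state_probs = {1: 0.5, 3: 0.5}


def reward(player, profile, state):
    """Hypothetical reward: players that transmit (action 1) share the
    free channels equally; idle players (action 0) earn nothing."""
    if profile[player] == 0:
        return 0.0
    return state / sum(profile)


# Both players transmit, so each gets half of the free channels on average.
u0 = expected_utility(reward, state_probs, (1, 1), player=0)
```

With both players transmitting, player 0 receives 0.5 · (1/2) + 0.5 · (3/2) = 1.0 in expectation.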

In the case of non-cooperative games, in which the players act in a selfish and independent manner, the Nash equilibrium (NE) introduced in [2] provides a solution concept of the game. It represents an operating point that is both predictable and robust to unilateral deviations (which is realistic, considering that the players are assumed to be non-cooperative and to act in an isolated manner). This means that once the system is operating in this state, no player has any incentive to deviate, because doing so cannot increase its own benefit. The mathematical definition of the NE is as follows:

Definition 2.2.1 (Nash equilibrium). A strategy profile $a^* = (a_1^*, \ldots, a_N^*)$ is a (pure-strategy) Nash equilibrium if

$$u_i(a_i^*, a_{-i}^*) \ge u_i(a_i', a_{-i}^*), \quad \forall i \in \mathcal{N},\ \forall a_i' \in \mathcal{A}_i, \tag{2.2}$$

where $a_{-i}^* = (a_1^*, \ldots, a_{i-1}^*, a_{i+1}^*, \ldots, a_N^*)$ denotes the actions of the other players.
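For a small finite game, condition (2.2) can be checked by brute force. The following sketch enumerates all pure-strategy profiles of a hypothetical two-player, two-channel game and keeps those from which no player can gain by deviating unilaterally. The game and the function names are assumptions made for illustration only.

```python
from itertools import product


def is_pure_ne(utils, action_sets, profile):
    """Check (2.2): no player can strictly gain by a unilateral deviation."""
    for i, actions in enumerate(action_sets):
        current = utils[i](profile)
        for alt in actions:
            deviated = profile[:i] + (alt,) + profile[i + 1:]
            if utils[i](deviated) > current:
                return False
    return True


def pure_nash_equilibria(utils, action_sets):
    """Enumerate every pure-strategy profile and keep the equilibria."""
    return [a for a in product(*action_sets)
            if is_pure_ne(utils, action_sets, a)]


# Hypothetical two-player channel game: a player earns 1 when it is alone
# on its channel and 0.5 when it shares the channel with the other player.
action_sets = [(0, 1), (0, 1)]
utils = [lambda a, i=i: 1.0 if a[i] != a[1 - i] else 0.5 for i in range(2)]

equilibria = pure_nash_equilibria(utils, action_sets)  # [(0, 1), (1, 0)]
```

The two equilibria are the profiles in which the players occupy different channels. Enumeration grows exponentially with the number of players, which is one reason learning-based methods are of interest.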

2.2.2 Mixed Strategy Extension

We can easily extend the non-cooperative game to a mixed-strategy form as in [11]. Let $p_{i,s_i}$ be the probability that player $i$ selects strategy $s_i \in \mathcal{A}_i$, and let $p_i = [p_{i,1}, \ldots, p_{i,K}]^T$ be the mixed strategy of player $i$, $\forall i \in \mathcal{N}$. Let $\mathcal{P}_i$ be the set of probability distributions over the action space of player $i$, i.e.,

$$\mathcal{P}_i := \left\{ p_i \;\middle|\; p_{i,s_i} \in [0, 1],\ \sum_{s_i \in \mathcal{A}_i} p_{i,s_i} = 1 \right\}. \tag{2.3}$$

Then, the mixed extension of the utility function, $\psi_i : \times_{i \in \mathcal{N}} \mathcal{P}_i \mapsto \mathbb{R}$, is defined as

$$\psi_i(p_i, p_{-i}) := \mathbb{E}_{p_1, \ldots, p_N}[u_i] = \sum_{a_1, \ldots, a_N} u_i(a_1, \ldots, a_N) \left( \prod_{j=1}^{N} p_{j,a_j} \right), \tag{2.4}$$

where $p_{-i}$ is the mixed strategy of the players other than $i$. The definition of NE in mixed strategies is as follows.

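Definition 2.2.2 (mixed-strategy NE). A strategy profile $P^* = (p_1^*, \ldots, p_N^*)$ is a mixed-strategy Nash equilibrium (NE) point of the non-cooperative game $\mathcal{G}$ if and only if

$$\psi_i(p_i^*, p_{-i}^*) \ge \psi_i(p_i, p_{-i}^*), \quad \forall i \in \mathcal{N},\ \forall p_i \in \mathcal{P}_i. \tag{2.5}$$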
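The mixed extension of the utility in (2.4) is a weighted sum over all pure action profiles, with each profile weighted by the product of the players' action probabilities. A minimal sketch for a hypothetical two-player game follows; the utility function and the names are illustrative assumptions.

```python
from itertools import product
from math import prod


def mixed_utility(u_i, strategies):
    """psi_i from (2.4): sum of u_i(a) weighted by prod_j p_{j, a_j}."""
    action_ranges = [range(len(p)) for p in strategies]
    return sum(
        u_i(a) * prod(p[a_j] for p, a_j in zip(strategies, a))
        for a in product(*action_ranges)
    )


def u0(a):
    """Hypothetical utility of player 0: 1 if the two actions differ, else 0."""
    return 1.0 if a[0] != a[1] else 0.0


# Player 0 mixes uniformly; player 1 picks action 1 with probability 0.75.
value = mixed_utility(u0, [[0.5, 0.5], [0.25, 0.75]])
```

Here `value` is the probability that the two actions differ, 0.5 · 0.75 + 0.5 · 0.25 = 0.5. A pure strategy is recovered as the special case of a probability vector with a single entry equal to one.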

2.2.3 Potential Games

While the concept of NE describes a possible steady state for a non-cooperative game, pure-strategy NE points do not always exist. An important class of games for which the existence of a pure-strategy NE is guaranteed is the potential game introduced in [12]. We first define different kinds of potential games:

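Definition 2.2.3. A strategic-form game $\mathcal{G} = (\mathcal{N}, \{\mathcal{A}_i\}_{i \in \mathcal{N}}, \{u_i\}_{i \in \mathcal{N}})$ is an exact potential game (EPG) if there exists a potential function $\Phi : \mathcal{A} \mapsto \mathbb{R}_+$ such that

$$u_i(a_i', a_{-i}) - u_i(a_i, a_{-i}) = \Phi(a_i', a_{-i}) - \Phi(a_i, a_{-i}), \quad \forall i \in \mathcal{N}. \tag{2.6}$$

Definition 2.2.4. A strategic-form game $\mathcal{G} = (\mathcal{N}, \{\mathcal{A}_i\}_{i \in \mathcal{N}}, \{u_i\}_{i \in \mathcal{N}})$ is a weighted potential game (WPG) if there exist a potential function $\Phi : \mathcal{A} \mapsto \mathbb{R}_+$ and a weight vector $w = [w_1, \ldots, w_N] \in \mathbb{R}_+^N$ such that

$$u_i(a_i', a_{-i}) - u_i(a_i, a_{-i}) = w_i \big[ \Phi(a_i', a_{-i}) - \Phi(a_i, a_{-i}) \big], \quad \forall i \in \mathcal{N}. \tag{2.7}$$

Definition 2.2.5. A strategic-form game $\mathcal{G} = (\mathcal{N}, \{\mathcal{A}_i\}_{i \in \mathcal{N}}, \{u_i\}_{i \in \mathcal{N}})$ is an ordinal potential game (OPG) if there exists a potential function $\Phi : \mathcal{A} \mapsto \mathbb{R}_+$ such that

$$u_i(a_i', a_{-i}) \ge u_i(a_i, a_{-i}) \Leftrightarrow \Phi(a_i', a_{-i}) \ge \Phi(a_i, a_{-i}), \quad \forall i \in \mathcal{N}. \tag{2.8}$$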

An important property of potential games is that the objectives of all players align with a common objective, namely the maximization of the potential function $\Phi$. Following [12], the local maxima of the potential function are NE points of the game. Thus, every finite potential game has at least one pure-strategy NE.
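To make the exact-potential condition (2.6) concrete, the sketch below checks it numerically for a small congestion game and confirms that a maximizer of the potential is a pure-strategy NE. The three-player, two-channel game and the Rosenthal-style potential are assumptions chosen for illustration; they are not the games studied in the thesis.

```python
from itertools import product

def is_exact_potential(utilities, potential, action_sets):
    """Check condition (2.6) for every player and every unilateral deviation."""
    for profile in product(*action_sets):
        for i, actions in enumerate(action_sets):
            for alt in actions:
                deviated = profile[:i] + (alt,) + profile[i + 1:]
                du = utilities[i](deviated) - utilities[i](profile)
                dphi = potential(deviated) - potential(profile)
                if abs(du - dphi) > 1e-12:
                    return False
    return True

# Hypothetical congestion game: a player's utility is minus the number of
# players sharing its channel.
N_PLAYERS, CHANNELS = 3, (0, 1)
OFFSET = N_PLAYERS * (N_PLAYERS + 1) // 2  # keeps the potential non-negative

def load(profile, channel):
    return sum(1 for a in profile if a == channel)

def make_utility(i):
    return lambda profile: -float(load(profile, profile[i]))

def rosenthal_potential(profile):
    # Sum over channels of 1 + 2 + ... + load, subtracted from a constant.
    congestion = sum(k for c in CHANNELS for k in range(1, load(profile, c) + 1))
    return OFFSET - congestion

utilities = [make_utility(i) for i in range(N_PLAYERS)]
action_sets = [CHANNELS] * N_PLAYERS
exact = is_exact_potential(utilities, rosenthal_potential, action_sets)
# A maximizer of the potential; by the property above it is a pure NE.
best_profile = max(product(*action_sets), key=rosenthal_potential)
```

The potential maximizer spreads the players as evenly as possible over the two channels, which is also what each selfish player prefers.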


2.2.4 Achieving NE: Previous Methods

We briefly discuss two previously developed methods to achieve NE.

• Fictitious Play

Introduced by G.W. Brown [13], in fictitious play, each player presumes that the opponents are playing stationary (possibly mixed) strategies. At each round, each player thus best responds to the empirical frequency of play of his opponent. Such a method is of course adequate if the opponent indeed uses a stationary strategy, while it is flawed if the opponent’s strategy is nonstationary. The opponent’s strategy may for example be conditioned on the fictitious player’s last move.

• Best response dynamics

The players select actions sequentially. In each time slot, a player selects the action that is a best response to the actions chosen by the other players in the previous time slot. A best response $BR(\cdot)$ is a correspondence (multi-valued mapping) from $\prod_{j \ne i} \mathcal{A}_j$ to $2^{\mathcal{A}_i}$:

$$a_i \in BR(a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_N). \tag{2.9}$$

Furthermore, in finite potential games, iterative best-response algorithms converge to one of the NE states, depending on the initial point.
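The sequential best-response procedure can be sketched as follows. This is an illustrative implementation under stated assumptions: the congestion game (three links, two channels) is hypothetical, and the rule of switching only on strict improvement is a design choice made here to avoid cycling between equally good actions.

```python
def best_response_dynamics(utilities, action_sets, start, max_rounds=100):
    """Sequential best-response updates; stops when no player changes action."""
    profile = tuple(start)
    for _ in range(max_rounds):
        changed = False
        for i, actions in enumerate(action_sets):
            def payoff(alt):
                return utilities[i](profile[:i] + (alt,) + profile[i + 1:])
            best = max(actions, key=payoff)
            # Switch only on strict improvement to avoid cycling between ties.
            if payoff(best) > payoff(profile[i]):
                profile = profile[:i] + (best,) + profile[i + 1:]
                changed = True
        if not changed:
            return profile  # no player can improve: a pure-strategy NE
    return profile

# Hypothetical congestion game: 3 links, 2 channels, utility = -(channel load).
def make_utility(i):
    return lambda profile: -float(sum(1 for a in profile if a == profile[i]))

utilities = [make_utility(i) for i in range(3)]
action_sets = [(0, 1)] * 3
final_profile = best_response_dynamics(utilities, action_sets, start=(0, 0, 0))
```

Starting from all links on channel 0, a single move by the first link already balances the load, and no further link can strictly improve.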

2.3 Evolutionary Game and Replicator Dynamics

Evolutionary game theory studies the behavior of large populations of agents who repeatedly engage in strategic interactions. Here we review the replicator dynamics, an important part of evolutionary games [14]. When considering the replicator dynamics, it is useful to think of a large population of agents who each play a pre-programmed pure strategy and are randomly matched to play against each other. The growth rate of the proportion of players using a certain pure strategy is the difference between the expected payoff of that pure strategy, given the proportions of players using every pure strategy, and the average expected payoff in the population. The strategy is inherited.

2.3.1 Replicator Dynamics

Consider a population of players. Suppose that there is some evolutionary game (two-player and symmetric) that these players play with each other. This game has a set of pure strategies $S$ and a payoff function $\pi(s, s')$, the payoff to an agent playing strategy $s$ against another agent playing $s'$.

Let $\phi_s(t)$ be the measure of the set of players using pure strategy $s$ at time $t$, and let

$$\theta_s(t) = \frac{\phi_s(t)}{\sum_{s'} \phi_{s'}(t)}$$

be the fraction of players using $s$. Then the expected payoff of using pure strategy $s$ at time $t$ is $u_s(t) \triangleq \sum_{s'} \theta_{s'}(t)\, \pi(s, s')$, and the average utility of the whole population is $\bar{u}(t) \triangleq \sum_{s} \theta_s(t)\, u_s(t)$. Suppose that each individual is genetically programmed to play some pure strategy, and that this programming is inherited.¹ Suppose that the net reproduction rate of each individual is proportional to its score in the stage game, i.e.,

$$\dot{\phi}_s(t) = \phi_s(t)\, u_s(t). \tag{2.10}$$

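Then the continuous-time dynamics of the population shares are

$$\dot{\theta}_s(t) = \frac{\dot{\phi}_s(t) \sum_{s'} \phi_{s'}(t) - \phi_s(t) \sum_{s'} \dot{\phi}_{s'}(t)}{\big( \sum_{s'} \phi_{s'}(t) \big)^2} = \theta_s(t)\, [u_s(t) - \bar{u}(t)]. \tag{2.11}$$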

Equation (2.10) says that strategies with negative scores have negative net growth rates. The population size varies; if all payoffs are negative, the entire population shrinks. This is consistent with the biological interpretation; in economic applications we tend to think of the number of agents playing the game as constant. Note, however, that even if the rewards are negative, the population shares always sum to unity. Note also that if the initial share of strategy $s$ is positive, then its share remains positive: the share can shrink toward zero, but zero is not reached in finite time. Notice that the population share of strategies that are not best responses to the other players' current actions can grow, as long as these strategies perform better than the population average. This is a key property that distinguishes the replicator dynamics from best-response dynamics and fictitious play.

¹ Indeed, mutation is also considered in studies of evolutionary game theory; however, it is out of scope here.
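The behavior described above can be observed with a forward-Euler discretization of the replicator equation (2.11). The payoff matrix, the step size `dt`, and the initial shares below are assumptions chosen for illustration.

```python
def replicator_step(shares, payoff, dt):
    """One forward-Euler step of (2.11): d(theta_s)/dt = theta_s * (u_s - u_bar)."""
    n = len(shares)
    # u_s: expected payoff of pure strategy s against the current population.
    u = [sum(shares[k] * payoff[s][k] for k in range(n)) for s in range(n)]
    u_bar = sum(shares[s] * u[s] for s in range(n))
    return [shares[s] + dt * shares[s] * (u[s] - u_bar) for s in range(n)]

# Hypothetical symmetric 2x2 game in which strategy 0 strictly dominates.
payoff = [[3.0, 2.0],
          [1.0, 0.0]]
shares = [0.01, 0.99]
for _ in range(5000):
    shares = replicator_step(shares, payoff, dt=0.01)
```

Even though strategy 0 starts with a 1% share, it takes over the population, while the share of strategy 1 shrinks toward zero without reaching it, and the shares keep summing to one.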

2.3.2 Stochastic Game

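Now we change the population concept into a stochastic form of a standard game. Formally, we may write the ODE:

$$\frac{dp_{i,s_i}(t)}{dt} = p_{i,s_i}(t) \Big[ \psi_i(e_{s_i}, P_{-i}) - \sum_{s_i' \in \mathcal{A}_i} \psi_i(e_{s_i'}, P_{-i})\, p_{i,s_i'}(t) \Big], \tag{2.12}$$

where $e_{s_i}$ denotes the pure strategy $s_i$, i.e., the unit vector that places probability one on $s_i$.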

Although the game setting looks quite different, the concepts of replicator dynamics can still be applied. The two interpretations are shown in Figure 2.1. Figure 2.1(a) shows an evolutionary game with two types of players, in which the population of players taking each strategy changes. In the strategic game in Figure 2.1(b), on the other hand, the number of players is fixed, and the weighting of each strategy changes. A detailed comparison is given in Table 2.1.

Table 2.1: Comparison

| Property | Evolutionary Game | Stochastic Game |
| --- | --- | --- |
| Player rationality | not rational | rational |
| Strategy adopted by each player | same strategy as inherited | mixed strategy with varying weight |
| Variables in replicator equations | population share | probability of each strategy |

Strategy with higher

Figure 2.1: Replicator dynamics in evolutionary game and stochastic game. (a) Evolutionary Game. (b) Stochastic Game.

2.4 Stochastic Learning Algorithm

In this section, we present the structure of the stochastic learning algorithm (SLA), which will be used in later sections. Learning has two major advantages over conventional methods. First, robustness: the strategy learned by each player is robust against the external state. Second, learning is possible under limited information. According to the information available to individual players, the learning is classified as follows.

1. Fully-distributed learning: The available information is restricted to the action-reward history of each individual player. A player knows nothing about its opponents. Fully-distributed learning is usually applied when the payoff is given by an outsider that is not a member of the player set.

2. Distributed learning with partial cooperation: Sometimes the reward is calculated by the individual player instead of being obtained from the environment. In this case, the players may have partial knowledge of other players, including their past actions and their observations of the external state. However, each player keeps its own learning process, and the decision making is uncoupled. Notice that the major difference between uncoupled learning and best response dynamics (BRD) is that the former allows simultaneous strategy updates by the players, while the latter requires an implicit ordering of strategy updates.

In this thesis, the examples considered include both cases. Table 2.2 summarizes the information available to the players.

Table 2.2: Information Available to the Players

| Information | Fully-distributed learning | Distributed learning with partial cooperation |
| --- | --- | --- |
| Awareness of being in a game | No | Yes |
| Existence of opponents | No | Partial |
| Observation of external state | No | Partial |
| Action spaces of the others | No | No |
| Joint strategy | No | No |
| Current action of others | No | No |
| Last action of the opponents | No | Partial |
| Last own-action | Yes | Yes |
| Observation of own reward | Yes | Yes |
| Own reward function form | No | Yes |
| Reward function form of the others | No | No |

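Table 2.3: Summary of Notations in Game-theoretic Formulation

| Symbol | Meaning |
| --- | --- |
| $\mathcal{X}$ | external state space |
| $X$ | random matrix for the external state |
| $\mathcal{N}$ | set of players |
| $\mathcal{A}_i$ | set of actions of player $i$ |
| $s_i \in \mathcal{A}_i$ | an element of $\mathcal{A}_i$ |
| $a_i(n) \in \mathcal{A}_i$ | action of player $i$ at iteration $n$ |
| $a_{-i}(n) \in \mathcal{A}_{-i}$ | actions of players except for $i$ at iteration $n$ |
| $\mathcal{P}_i := \Delta(\mathcal{A}_i)$ | set of probability distributions over $\mathcal{A}_i$ |
| $p_i(n) \in \mathcal{P}_i$ | mixed strategy of player $i$ at slot $n$ |
| $r_i(n) \in \mathbb{R}$ | instantaneous reward of player $i$ at slot $n$ |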

2.4.1 Generic SLA Structure

Under the SLA, the players can learn their expected payoffs and their optimal strategies by using simple iterative techniques based on their action-reward history. The actions that give good performance are reinforced, and new actions are explored. Therefore, such an approach belongs to the class of reinforcement learning.

Figure 2.2: Generic SLA structure. A dynamic environment (current action profile, current reward profile, external state) interacts with players 1, ..., N; each player i plays strategy a_i(t), receives reward r_i(t), and performs its strategy updates, possibly with negotiation among players.

The generic stochastic learning algorithm is described in Algorithm 2.1.

Algorithm 2.1 Generic Stochastic Learning

1: Initially, set $n = 0$, and set the action probability vector and the utility estimation as $p_{i,s_i}(0) = 1/|\mathcal{A}_i|$, $\hat{u}_{i,s_i}(-1) = 0$, $\forall i \in \mathcal{N}, s_i \in \mathcal{A}_i$.

2: At the beginning of the $n$th iteration, each player selects an action $a_i(n)$ according to the current action probability vector (i.e., mixed strategy) $p_i(n)$.

3: At the completion of the $n$th iteration, each player calculates or receives the instantaneous reward $r_i(n)$.

4: All players update their utility estimation and action probability vector according to the update rules.

In each iteration, an action is selected from the action set of each player. After each iteration, a player obtains the instantaneous reward from the outsider or through calculation. It updates the action probability vector (i.e., mixed strategy) $p_i(n)$ as well as the utility estimation vector $\hat{u}_i(n)$. The utility estimation serves as a reinforcement signal, so that higher utility (lower cost) leads to higher probability in the next play. Notably, the proposed learning algorithm is distributed: the strategy selection is based on individual observations instead of guidance from a centralized controller. The update rules for the action probability vector and the utility estimation are investigated in the next section.

2.5 Update Rules

The general form of an update rule can be expressed as

$$\begin{cases} \hat{u}_i(n+1) = f_i\big( \lambda_i(n), a_i(n), r_i(n), \hat{u}_i(n), p_i(n) \big) \\ p_i(n+1) = g_i\big( \nu_i(n), a_i(n), r_i(n), \hat{u}_i(n), p_i(n) \big) \end{cases} \tag{2.13}$$

where $\lambda_i(n)$ and $\nu_i(n)$ are the learning rates for the utility estimation and the action probability, respectively. Their values are chosen so that

$$\lambda_i(n) \ge 0,\ \sum_n \lambda_i(n) = +\infty,\ \sum_n \lambda_i^2(n) < \infty, \qquad \nu_i(n) \ge 0,\ \sum_n \nu_i(n) = +\infty,\ \sum_n \nu_i^2(n) < \infty. \tag{2.14}$$
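As an illustration of the learning-rate conditions (an assumed example, not a choice stated in the thesis), polynomially decaying rates with exponent $\alpha \in (1/2, 1]$ satisfy both requirements, by the standard convergence test for $p$-series:

```latex
\lambda_i(n) = \frac{1}{(n+1)^{\alpha}}, \quad \tfrac{1}{2} < \alpha \le 1
\;\Longrightarrow\;
\sum_{n \ge 0} \lambda_i(n) = +\infty
\quad \text{and} \quad
\sum_{n \ge 0} \lambda_i^2(n) = \sum_{n \ge 0} \frac{1}{(n+1)^{2\alpha}} < \infty .
```

A constant rate, by contrast, satisfies the first condition but not the second, which is why constant-rate results are typically stated for a sufficiently small rate.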

In this section, we introduce two probability update rules, namely, the Bush-Mosteller update rule and the multiplicative-weight update rule. We investigate their ODE approximations.

2.5.1 Bush-Mosteller (BM) Update Rule

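With the BM update rule, the mixed strategies are updated as follows:

$$\begin{cases} p_{i,s_i}(n+1) = p_{i,s_i}(n) + b\, \tilde{r}_i(n)\, (1 - p_{i,s_i}(n)), & s_i = a_i(n) \\ p_{i,s_i}(n+1) = p_{i,s_i}(n) - b\, \tilde{r}_i(n)\, p_{i,s_i}(n), & s_i \ne a_i(n) \end{cases} \tag{2.15}$$

where $b$ is the learning step size and $\tilde{r}_i(n) \in [0, 1]$ is the normalized instantaneous reward, i.e.,

$$\tilde{r}_i(n) = \frac{r_i(n) - r_{\min}}{r_{\max} - r_{\min}}, \tag{2.16}$$

where $r_{\max}$ and $r_{\min}$ are the maximum and minimum values of the reward, respectively. The ODE approximation of the BM rule is given in the following proposition.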

Proposition 2.5.1. With sufficiently small $b$, the probability matrix sequence $\{P(n)\}$ converges to $P^*$, which is the solution of the following ODE:

$$\frac{dp_{i,s_i}(t)}{dt} = p_{i,s_i}(t)\, [\psi_i(e_{s_i}, P_{-i}) - \psi_i(P)]. \tag{2.17}$$

The boundary condition is given by $P(0) = P_0$, where $P_0$ is the initial action probability matrix.

Although the SL-based algorithm with the BM rule converges to NE points for potential games, it requires the normalization of the instantaneous reward. This requirement makes the algorithm inapplicable when the extreme values of the reward functions are unavailable. Therefore, another update rule is also considered in our work.
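A minimal sketch of the BM rule in a learning loop is given below. The two-link, two-channel game, the step size `b = 0.05`, the iteration count, and all function names are assumptions made for illustration; the only parts taken from the text are the probability update and the reward normalization.

```python
import random

def bm_update(p, action, reward, b):
    """Bush-Mosteller update (2.15); reward must already lie in [0, 1]."""
    return [
        q + b * reward * (1.0 - q) if s == action else q - b * reward * q
        for s, q in enumerate(p)
    ]

def normalize(r, r_min, r_max):
    """Reward normalization (2.16)."""
    return (r - r_min) / (r_max - r_min)

def run_bm(num_iters=5000, b=0.05, seed=0):
    # Hypothetical two-link, two-channel game: reward 1 on a free channel,
    # 0 on a collision, so r_min = 0 and r_max = 1.
    rng = random.Random(seed)
    strategies = [[0.5, 0.5], [0.5, 0.5]]
    for _ in range(num_iters):
        actions = [rng.choices((0, 1), weights=p)[0] for p in strategies]
        raw = 1.0 if actions[0] != actions[1] else 0.0
        r = normalize(raw, 0.0, 1.0)
        # Each link uses only its own action and its own reward.
        strategies = [bm_update(p, a, r, b) for p, a in zip(strategies, actions)]
    return strategies

learned = run_bm()
```

The update keeps each strategy on the probability simplex by construction, since the mass added to the played action equals the mass removed from the others. In this anti-coordination game one would expect the two strategies to concentrate on different channels.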


2.5.2 Multiplicative-weight Update Rule

The multiplicative-weight update rule consists of iterative updates of the utility estimations and the mixed strategies. The rule is described as follows:

$$\begin{cases} \hat{u}_{i,s_i}(n+1) - \hat{u}_{i,s_i}(n) = \eta_i\, \mathbb{1}_{\{a_i(n) = s_i\}} \big( r_i(n) - \hat{u}_{i,s_i}(n) \big) \\[4pt] p_{i,s_i}(n+1) = \dfrac{p_{i,s_i}(n)\, (1 + \epsilon_i)^{\hat{u}_{i,s_i}(n)}}{\sum_{s_i' \in \mathcal{A}_i} p_{i,s_i'}(n)\, (1 + \epsilon_i)^{\hat{u}_{i,s_i'}(n)}} \end{cases} \tag{2.18}$$

where $\eta_i$ and $\epsilon_i$ are the learning rates for the utility estimation and the action probability, respectively.
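The two coupled updates of the multiplicative-weight rule can be sketched for a single player as follows. The stationary two-action environment, the rates `eta` and `eps`, and the function names are hypothetical; unlike the BM rule, no reward normalization is needed.

```python
import random

def mw_update(p, u_hat, action, reward, eta, eps):
    """One iteration of (2.18) for a single player.

    p: mixed strategy, u_hat: utility estimates, action: index just played.
    """
    # Only the estimate of the action actually played moves toward r_i(n).
    new_u = [
        u + eta * (reward - u) if s == action else u
        for s, u in enumerate(u_hat)
    ]
    # Probabilities are reweighted with the estimates from iteration n.
    weights = [q * (1.0 + eps) ** u for q, u in zip(p, u_hat)]
    total = sum(weights)
    return [w / total for w in weights], new_u

def run_single_player(num_iters=3000, eta=0.1, eps=0.05, seed=1):
    # Hypothetical stationary environment: action 1 pays 1.0, action 0 pays 0.2.
    rng = random.Random(seed)
    mean_reward = (0.2, 1.0)
    p, u_hat = [0.5, 0.5], [0.0, 0.0]
    for _ in range(num_iters):
        a = rng.choices((0, 1), weights=p)[0]
        p, u_hat = mw_update(p, u_hat, a, mean_reward[a], eta, eps)
    return p, u_hat

p_final, u_final = run_single_player()
```

In this toy setting the estimate of the better action approaches its true reward, and the probability mass shifts toward that action at a rate controlled by `eps`.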

Its ODE approximation is discussed in the following proposition.

Proposition 2.5.2. With sufficiently small learning rates $\eta$ and $\epsilon$:

1. The estimated utility converges to

$$\hat{u}_{i,s_i} \to \psi_i(e_{s_i}, P_{-i}). \tag{2.19}$$

2. Asymptotically, the probability matrix sequence $\{P(n)\}$ can be approximated by the trajectory of the following ODE:

$$\frac{dp_{i,s_i}(t)}{dt} = p_{i,s_i}(t)\, [\psi_i(e_{s_i}, P_{-i}) - \psi_i(P)], \tag{2.20}$$

where $p_{i,s_i}(t)$ is the continuous-time version of $p_{i,s_i}(n)$, and the boundary condition is given by $P(0) = P_0$, where $P_0$ is the initial mixed strategy matrix.

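Proof: For better understanding, we reproduce the proof from [7, Section 4.3]. The limit of the estimated utility can be given as

$$\hat{u}_{i,s_i} \to \psi_i(e_{s_i}, P_{-i}), \quad \text{if } \eta_i \to 0, \tag{2.21}$$

and the tracked ODE can be given as [7, Section 4.3], [11, Theorem 3.1]

$$\frac{dp_{i,s_i}(t)}{dt} = \lim_{\epsilon_i \to 0} \frac{p_{i,s_i}(n+1) - p_{i,s_i}(n)}{\epsilon_i}. \tag{2.22}$$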

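Next we show that the RHS of the above is exactly that of (2.20). Let $S = \sum_{s_i' \in \mathcal{A}_i} p_{i,s_i'}(t)\, (1 + \epsilon_i)^{\hat{u}_{i,s_i'}}$. Then we have

$$\begin{aligned} \frac{p_{i,s_i}(t+1) - p_{i,s_i}(t)}{\epsilon_i} &= \frac{p_{i,s_i}(t)}{\epsilon_i} \left[ \frac{(1 + \epsilon_i)^{\hat{u}_{i,s_i}}}{S} - 1 \right] \\ &= \frac{p_{i,s_i}(t)}{S} \left[ \frac{(1 + \epsilon_i)^{\hat{u}_{i,s_i}} - 1 + 1 - S}{\epsilon_i} \right] \\ &= \frac{p_{i,s_i}(t)}{S} \left[ \frac{(1 + \epsilon_i)^{\hat{u}_{i,s_i}} - 1}{\epsilon_i} - \sum_{s_i' \in \mathcal{A}_i} p_{i,s_i'}(t) \left( \frac{(1 + \epsilon_i)^{\hat{u}_{i,s_i'}} - 1}{\epsilon_i} \right) \right], \end{aligned}$$

where we have employed the update rule for $p_{i,s_i}(t+1)$ in (2.18) to obtain the first equality, and $\sum_{s_i'} p_{i,s_i'}(t) = 1$ to obtain the last.

With the result of (2.21) and $\lim_{\epsilon \to 0} \frac{(1 + \epsilon)^{u} - 1}{\epsilon} = u$, it follows that

$$\lim_{\epsilon_i \to 0} \frac{p_{i,s_i}(t+1) - p_{i,s_i}(t)}{\epsilon_i} = p_{i,s_i}(t) \Big[ \psi_i(e_{s_i}, P_{-i}) - \sum_{s_i' \in \mathcal{A}_i} \psi_i(e_{s_i'}, P_{-i})\, p_{i,s_i'}(t) \Big] = p_{i,s_i}(t)\, [\psi_i(e_{s_i}, P_{-i}) - \psi_i(P)], \tag{2.23}$$

where we have used the fact that $\lim_{\epsilon \to 0} S = 1$. Combining (2.22) and (2.23), we complete the proof.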
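The first-order limit used in the proof can be checked numerically. The sketch below evaluates the difference quotient for both the $(1+\epsilon)^{u}$ form that appears in the update rule (2.18) and the $(1-\epsilon)^{-u}$ form; both tend to $u$ as $\epsilon \to 0$. The value `u = 0.7` and the function name are arbitrary illustrative choices.

```python
def first_order_ratio(u, eps, form="plus"):
    """Difference quotient whose eps -> 0 limit equals u.

    form="plus" uses (1 + eps)**u; form="minus" uses (1 - eps)**(-u).
    """
    base = (1.0 + eps) ** u if form == "plus" else (1.0 - eps) ** (-u)
    return (base - 1.0) / eps

u = 0.7
# Approximation error for eps = 1e-1, 1e-3, 1e-5.
errors_plus = [abs(first_order_ratio(u, 10.0 ** -k, "plus") - u) for k in (1, 3, 5)]
errors_minus = [abs(first_order_ratio(u, 10.0 ** -k, "minus") - u) for k in (1, 3, 5)]
```

The error shrinks roughly in proportion to $\epsilon$, which is consistent with the requirement of a sufficiently small learning rate in Proposition 2.5.2.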

Note that $\psi_i(e_{s_i}, P_{-i})$ is the expected reward of player $i$ when it employs pure strategy $s_i$ while every other player $j$, $\forall j \in \mathcal{N}, j \ne i$, employs a mixed strategy $p_j$, and its value is learned by player $i$ as the estimated utility $\hat{u}_{i,s_i}$, as shown in (2.19). On the other hand, the ODE for the mixed strategy in (2.20) is the replicator equation [14], in which the probability of taking a strategy increases if the current estimated utility of this strategy is larger than the average utility over all strategies, and decreases otherwise. Compared to the best response dynamics [12], where a player changes its strategy in the next iteration to the best action given the other players' actions (i.e., the best response), with the replicator dynamics a player selects an action according to a probability distribution over the strategy set and adjusts the weighting of each possible action in each iteration based on the utility estimation.

2.6 Convergence of the Proposed Algorithm

Convergence toward pure-strategy NE points is an important feature of the proposed learning algorithm. Similar to the discussions in [11] and [10], here we theoretically demonstrate the convergence properties of the proposed SL-based algorithm. First, by using the ordinary differential equation (ODE) approximation, we characterize the long-term behavior of the sequence $\{P(n)\}$. Second, we establish a sufficient condition for the arrival at NE points for the proposed learning algorithm and prove that the game $\mathcal{G}$ satisfies this condition.

Note that the ODE in (2.20) is the replicator equation [14], in which the probability of taking a strategy grows if this strategy's current estimated utility is larger than the average utility over all strategies, and declines otherwise. Compared to the best response dynamics, where a player changes its strategy in the next iteration to the best action given the other players' actions, with the replicator dynamics a player adjusts the weighting of each possible action in each iteration.

Proposition 2.6.1 (Folk theorems). The proposed learning algorithm has the following properties:

1. All Nash equilibria are stationary points of (2.20);

2. All stable stationary points of (2.20) are Nash equilibria.

Proposition 2.6.1 is an instance of the Folk theorems in evolutionary game theory [14], and these properties follow directly from the replicator equation in (2.20). For an intuitive explanation, observe that $\psi_i(e_{s_i}, P_{-i})$ is the expected reward of player $i$ if it employs pure strategy $s_i$ while every other player $j$, $\forall j \in \mathcal{N}, j \ne i$, employs a mixed strategy $p_j$. From the definition of Nash equilibrium, the condition

$$\psi_i(e_{s_i}, P_{-i}^*) = \psi_i(P^*), \quad \forall i \in \mathcal{N},\ \forall s_i \in \mathcal{A}_i \text{ with } p_{i,s_i}^* > 0 \tag{2.24}$$

must hold for an NE strategy profile $P^*$. Therefore, any Nash equilibrium makes the right-hand side of (2.20) zero, and thus constitutes a stationary point of (2.20). It is worth noting that, for a mixed-strategy NE, all surviving pure strategies (i.e., $s_i$ with $p_{i,s_i} > 0$) of player $i$ perform equally well when the other players follow the mixed strategy $P_{-i}^*$.

From the ODE approximation, we find a way to describe the asymptotic behavior of the discrete updates of the mixed strategies for different update rules. In the following, we investigate the convergence property.

2.6.1 Potential Games

We first consider the case that the game is a potential game.

Proposition 2.6.2. Suppose that there exists a bounded differentiable function $\Psi : \mathbb{R}^{|\mathcal{A}|} \to \mathbb{R}$ such that

$$\Psi(e_{s_i}, P_{-i}) = \frac{\partial \Psi(P)}{\partial p_{i,s_i}} \tag{2.25}$$

is an increasing function of $\psi_i(e_{s_i}, P_{-i})$. Then, the SL-based algorithm converges to an NE point.

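Proof. First, we rewrite the ODE in (2.20) as follows:

$$\frac{dp_{i,s_i}(t)}{dt} = p_{i,s_i}(t) \sum_{s_i' \in \mathcal{A}_i} p_{i,s_i'}(t) \big[ \psi_i(e_{s_i}, P_{-i}) - \psi_i(e_{s_i'}, P_{-i}) \big]. \tag{2.26}$$

Given that $\Psi(e_{s_i}, P_{-i}) = \partial \Psi(P) / \partial p_{i,s_i}$ is an increasing function of $\psi_i(e_{s_i}, P_{-i})$, and letting $D_{i,s_i,s_i'} = \psi_i(e_{s_i}, P_{-i}) - \psi_i(e_{s_i'}, P_{-i})$ and $E_{i,s_i,s_i'} = \Psi(e_{s_i}, P_{-i}) - \Psi(e_{s_i'}, P_{-i})$, we may write

$$D_{i,s_i,s_i'} > 0 \Leftrightarrow E_{i,s_i,s_i'} > 0. \tag{2.27}$$

By applying (2.26) and (2.27), the derivative of $\Psi(P)$ with respect to $t$ is given by

$$\begin{aligned} \frac{d\Psi(P)}{dt} &= \sum_{i \in \mathcal{N}} \sum_{s_i \in \mathcal{A}_i} \frac{\partial \Psi(P)}{\partial p_{i,s_i}} \frac{dp_{i,s_i}}{dt} \\ &= \sum_{i \in \mathcal{N}} \sum_{s_i, s_i' \in \mathcal{A}_i} p_{i,s_i}\, p_{i,s_i'}\, \Psi(e_{s_i}, P_{-i}) \cdot D_{i,s_i,s_i'} \\ &= \frac{1}{2} \sum_{i \in \mathcal{N}} \sum_{s_i, s_i' \in \mathcal{A}_i} p_{i,s_i}\, p_{i,s_i'}\, E_{i,s_i,s_i'} \cdot D_{i,s_i,s_i'} \\ &\ge 0, \end{aligned} \tag{2.28}$$

where the third equality follows by symmetrizing the sum over $(s_i, s_i')$ and using $D_{i,s_i',s_i} = -D_{i,s_i,s_i'}$, and the last inequality holds since, given the condition in (2.27), $D_{i,s_i,s_i'}$ and $E_{i,s_i,s_i'}$ always have the same sign.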

Thus $\Psi(\cdot)$ is non-decreasing along the trajectories of the ODE, and asymptotically all the trajectories will be in the set $\{P \in \mathcal{P} : \frac{d\Psi(P)}{dt} = 0\}$. From (2.26) and (2.28), the following is known:

$$\begin{aligned} \frac{d\Psi(P)}{dt} = 0 &\Rightarrow p_{i,s_i}\, p_{i,s_i'} \big[ \psi_i(e_{s_i}, P_{-i}) - \psi_i(e_{s_i'}, P_{-i}) \big]^2 = 0, \quad \forall i, s_i, s_i' \\ &\Rightarrow \frac{dp_{i,s_i}}{dt} = 0, \quad \forall i, s_i \\ &\Rightarrow P \text{ is a stationary point of the ODE (2.20).} \end{aligned} \tag{2.29}$$

In other words, when starting from an interior point of the simplex of the mixed strategy space $\mathcal{P}$, the sequence $P(n)$ converges to a stationary point of the ODE in (2.26). By Proposition 2.6.1, we complete the proof.
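The monotonicity argument in the proof can be illustrated numerically: along an Euler discretization of the replicator ODE (2.20), the mixed extension of the potential should not decrease. The sketch assumes a two-player identical-interest game (both utilities equal to a matrix `phi`, hence an exact potential game), a step size of 0.01, and an arbitrary interior starting point; none of these come from the thesis.

```python
def replicator_step(p1, p2, phi, dt):
    """Euler step of (2.20) for a two-player game with common utility phi."""
    n1, n2 = len(p1), len(p2)
    # psi_i(e_s, P_-i): expected utility of each pure strategy.
    v1 = [sum(phi[s][k] * p2[k] for k in range(n2)) for s in range(n1)]
    v2 = [sum(phi[k][s] * p1[k] for k in range(n1)) for s in range(n2)]
    avg1 = sum(p1[s] * v1[s] for s in range(n1))
    avg2 = sum(p2[s] * v2[s] for s in range(n2))
    new_p1 = [p1[s] + dt * p1[s] * (v1[s] - avg1) for s in range(n1)]
    new_p2 = [p2[s] + dt * p2[s] * (v2[s] - avg2) for s in range(n2)]
    return new_p1, new_p2

def mixed_potential(p1, p2, phi):
    """Mixed extension of the potential: sum of phi weighted by probabilities."""
    return sum(phi[a][b] * p1[a] * p2[b]
               for a in range(len(p1)) for b in range(len(p2)))

# Hypothetical identical-interest game (u_1 = u_2 = phi).
phi = [[4.0, 0.0],
       [0.0, 2.0]]
p1, p2 = [0.3, 0.7], [0.6, 0.4]
trace = [mixed_potential(p1, p2, phi)]
for _ in range(3000):
    p1, p2 = replicator_step(p1, p2, phi, dt=0.01)
    trace.append(mixed_potential(p1, p2, phi))
```

In this example the trajectory settles at the pure profile in which both players choose the first action, where the potential attains its maximum value of 4.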

Proposition 2.6.2 establishes a sufficient condition that guarantees the convergence toward NE. In what follows, we prove that an ordinal potential game $\mathcal{G}$ satisfies this condition, and hence that the SL-based algorithm converges to an NE point of such a game.

Proposition 2.6.3. When applied to OPGs, the proposed SLA with both update rules converges to a (possibly mixed-strategy) NE point.

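Proof. For OPGs, let $\Psi(P)$ be the mixed extension of the potential function,

$$\Psi(P) = \sum_{a_1, \ldots, a_N} \Phi(a_1, \ldots, a_N) \prod_{j=1}^{N} p_{j,a_j}, \qquad \Psi(e_{s_i}, P_{-i}) = \frac{\partial \Psi(P)}{\partial p_{i,s_i}} = \sum_{a_l,\, l \ne i} \Phi(s_i, a_{-i}) \prod_{j \ne i} p_{j,a_j}. \tag{2.30}$$

By extending the definition of OPG to mixed strategies, we have that for OPGs

$$\Psi(e_{s_i'}, P_{-i}) - \Psi(e_{s_i}, P_{-i}) > 0 \Leftrightarrow \psi_i(e_{s_i'}, P_{-i}) - \psi_i(e_{s_i}, P_{-i}) > 0, \quad \forall s_i, s_i' \in \mathcal{A}_i,\ \forall i \in \mathcal{N}. \tag{2.31}$$

By Proposition 2.6.2, we complete the proof.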

Corollary 2.6.1. When applied to WPGs and EPGs, the proposed SLA with both update rules converges to a (possibly mixed-strategy) NE point.

Note that the learning rates $(\epsilon_i, \eta_i)$ play an important role in the convergence behavior of the proposed SL-based learning algorithm. In particular, smaller learning rates lead to slower convergence. The choice of learning rates poses a trade-off between accuracy and speed, and may be determined by training in practice.

Remark 2.6.1. Propositions 2.6.2 and 2.6.3 do not guarantee the convergence toward a pure-strategy NE. However, our simulation shows that a pure-strategy NE rather than a mixed-strategy NE is reached.

2.6.2 Non-potential Games

While the OPG already relaxes the constraints on the problem formulation, there are cases in which a potential game cannot be formed. When trying to apply the SLA in such cases, we encounter two major questions:

(1) Does the SLA still converge?

(2) If the SLA converges, what are the properties of the resulting strategy profile (e.g., is it an NE point)?

As with multi-agent Q-learning (MAQL), convergence is not theoretically guaranteed but is usually observed in practical applications. We may also set a maximum number of rounds to avoid an infinite loop. Furthermore, because the algorithm is a stochastic approximation of the trajectory of the replicator dynamics, the remaining mixed strategy is a good strategy against the opponents, though it may not be an NE point. Therefore, while the proposed SLA possesses some good properties when applied to potential games, we believe that it is still suitable for other problem formulations in which learning is required.

Proposition 2.6.4. If the proposed algorithm converges to a stationary point of (2.20), the limiting point must be a (possibly mixed-strategy) NE point.

2.7 Applications: Game Theoretic Modeling

The last section of this chapter is devoted to an overview of how to establish a game-theoretic formulation for a radio resource management (RRM) problem in wireless communication systems. A mapping of game theory components to the RRM problem is given in Table 2.4.

The players in the game are the mobile users and/or the networks. Players seeking to maximize their payoffs can choose between different strategies, such as available bandwidth, subscription plan, or available service providers. The payoffs can be estimated using utility functions based on various decision criteria: monetary cost, energy conservation, network load, availability, etc. The games can be formulated to target different objectives, such as maximizing or minimizing different resources: bandwidth, power, etc.

Table 2.4: Summary of Symbols for Game-theoretic Formulation

| Game Component | Network Selection Environment Correspondent |
| --- | --- |
| Players | The agents who are playing the game: users and/or networks |
| Strategies | A plan of actions to be taken by the player during the game: available/requested bandwidth, subscription plan, offered prices, available service providers, etc. |
| Payoffs | The motivation of players, represented by profit and estimated using utility functions based on various parameters: monetary cost, quality, network load, QoS, etc. |
| Resources | The resources for which the players involved in the game are competing: bandwidth, power, etc. |
| External State | The external state of the game that is not controlled by the players: channel availability, channel coefficients, etc. |

Appendix 2.A Assumptions for Stochastic Approximation

In this appendix, we summarize the basic assumptions for stochastic approximation. Please refer to [6] for more details.

Consider the difference equation $p(n+1) = p(n) + \lambda(n)\, [f(p(n)) + M(n+1)]$ in $\mathbb{R}^{|\mathcal{A}|}$ and assume that

(A1.) $f$ is Lipschitz.

(A3.) $M(n+1)$ is a martingale difference sequence with respect to the increasing family of sigma-fields $\mathcal{F}(n) = \sigma(p(n'), \hat{u}(n'), M(n'), n' \le n)$, i.e., $\mathbb{E}[M(n+1) \mid \mathcal{F}(n)] = 0$.

(A4.) $M(n)$ is square-integrable and there is a constant $c > 0$ such that

$$\mathbb{E}\big[ \|M(n+1)\|^2 \mid \mathcal{F}(n) \big] \le c\, (1 + \|p(n)\|^2) \tag{2.32}$$

almost surely, for all $n \ge 0$.

(A5.) $\sup_n \|p(n)\| < \infty$ almost surely.

Then, the asymptotic pseudo-trajectory of the difference equation is given by the ordinary differential equation (ODE) $\dot{p}(t) = f(p(t))$, with $p(0)$ fixed.
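A scalar toy example may help to see how the noisy difference equation tracks its ODE. The sketch below assumes $f(p) = 0.3 - p$ (Lipschitz, with the unique stable equilibrium $p = 0.3$ of $\dot{p} = f(p)$), step sizes $\lambda(n) = 1/(n+1)$, and bounded zero-mean uniform noise; all of these are illustrative choices, not settings from the thesis.

```python
import random

def stochastic_approximation(f, p0, num_iters, noise, seed=0):
    """Iterate p(n+1) = p(n) + lam(n) * (f(p(n)) + M(n+1)) with lam(n) = 1/(n+1)."""
    rng = random.Random(seed)
    p = p0
    for n in range(num_iters):
        lam = 1.0 / (n + 1)
        p = p + lam * (f(p) + noise(rng))
    return p

# Hypothetical scalar example: the ODE dp/dt = 0.3 - p converges to p = 0.3.
limit_estimate = stochastic_approximation(
    f=lambda p: 0.3 - p,
    p0=0.9,
    num_iters=20000,
    noise=lambda rng: rng.uniform(-0.5, 0.5),  # zero-mean, bounded noise
)
```

With these step sizes the iterate reduces to 0.3 plus the running average of the noise samples, so the noise is averaged out and the iterate approaches the ODE equilibrium.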

Chapter 3 A Survey on the Spectrum Access of Cognitive Radio Networks

Before showing the application examples, we provide a survey on cognitive radio networks in this chapter. An overview of different access scenarios in cognitive radio networks is given first. Then the examples are given in brief, with their pros and cons.

3.1 Cognitive Spectrum Access

In cognitive radio networks (CRNs), the cognitive radio (CR) users obtain spectrum access rights in different ways. Extending the work of Akyildiz et al. [15] with subsequent investigations, we categorize the spectrum access scenarios of CRNs into four types. The four scenarios are depicted in Fig. 3.1 and briefly introduced as follows.

Opportunistic spectrum access. Nodes in CRNs communicate with each other in an ad-hoc manner on both licensed and unlicensed spectrum bands. Each connection opportunistically accesses the spectrum, taking into account the co-tier and cross-tier interference.

Figure 3.1: Cognitive radio network architecture. Labeled scenarios: Opportunistic Spectrum Access, Spectrum Trading with Single Seller, Spectrum Trading with Multiple Sellers, and CR Primary Access.

Spectrum trading with single seller. CR users communicate via their own CR base stations (CRBSs), on both licensed and unlicensed spectrum bands. The CRBS determines the amount of resources (i.e., bandwidth) to request from the spectrum seller. When the spectrum is granted by the seller, since all interactions occur inside the CR cluster, the spectrum sharing policy can be independent of that of the primary network.

Spectrum trading with multiple sellers. The CRBS may choose to request the spectrum from different sellers. Therefore, it has to determine the best seller as well as the amount of requested spectrum. Then the CR users access the spectrum through their own CRBSs.

CR primary access. CR users can access the primary base station through the licensed band. CR users require a re-configurable medium access control (MAC) protocol, which enables roaming over multiple primary networks with different access technologies.

3.2 Opportunistic Spectrum Access

Opportunistic spectrum access (OSA) is a promising technique for tackling the spectrum scarcity problem by exploiting the temporally unutilized spectrum bands [16, 17]. In the CR ad hoc access model, the CR links access the spectrum based on sensing and contention. We provide five examples.

Learning-based OSA with adaptive hopping. Derakhshani and Le-Ngoc [18] presented an adaptive hopping transmission strategy for secondary users (SUs) to access temporarily idle frequency slots of a licensed frequency band, taking into account the random return of primary users (PUs) and aiming to maximize the overall SU throughput.

Learning the hidden Markov model. The work of Choi et al. [19] is based on learning, and considers the hidden Markov model (HMM) and the partially observable Markov decision process (POMDP).

Learning under unknown environments. Xu et al. [10] considered opportunistic spectrum access in which the CR links contend for the spectrum. The strategy is the channel selection. A CR can access a channel if it wins the contention and the PU is not using this channel. Under the unknown dynamic environment, a stochastic learning algorithm is applied to learn the equilibrium of the expected game.

Adaptive channel recommendation. Chen et al. [20] proposed a dynamic spectrum access scheme in which secondary users cooperatively recommend "good" channels to each other and access them accordingly. The spectrum access problem was formulated as an average reward-based Markov decision process (MDP).

OSA for mobile CR. While most existing work focuses on enabling OSA for stationary CRs, Min et al. [21] considered the mobility of secondary users (SUs). In this work, the channel availability experienced by a mobile SU was modeled as a two-state continuous-time Markov chain (CTMC). To protect PU communications from SU interference, the authors introduce a guard distance in the space domain and derive the optimal guard distance that


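Acknowledgments

With the completion of this dissertation, my long years as a doctoral student are drawing to a close. First, I thank my advisor, Dr. Ching-Yao Huang, whose guidance and support throughout allowed me to complete this dissertation. Counting my undergraduate project and my master's studies, he has mentored me for nearly ten years, which I will never forget.

In the fourth year of my doctoral studies, I took leave from National Chiao Tung University, arrived in Paris on a snowy early morning, and went on to study at Télécom Sudparis in France. During my stay, I was fortunate to receive the help and guidance of Professors D. Zhaglache and A. Marzouki, which made my year and a half abroad very fulfilling. I am especially grateful to Professor H. Tembine of Supélec, France, whose advice on research directions and detailed explanations of the mathematical theory led to an important breakthrough in my research.

Because I initially underestimated the difficulty of pursuing a doctorate, my research inevitably met with setbacks. Fortunately, besides the support of my advisors in Taiwan and France, I received help from generous people at every stage. I especially thank Professor 簡鳳村 of the Institute of Electronics, and Dr. 張佑榕 and Dr. 鍾偉和 of Academia Sinica, for their contributions to my key papers. Despite their own busy research and teaching, the three of them carefully revised my drafts and offered suggestions, and I benefited greatly from the process. Of course, I also thank my classmates and friends 冠穎, Robert, Maria, 金鑫, and others for the exchange of ideas in research and for their care in daily life.

Finally, I thank my family for their care, encouragement, and financial support throughout my studies, which freed me from worries and allowed me to complete my degree.

Li-Chuan Tseng

December 2013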

Contents

I The Backgrounds

II Examples of Fully Distributed Learning

III Examples of Distributed Learning with Partial Cooperation
