Cognitive Medium Access: Exploration, Exploitation and Competition


arXiv:0710.1385v1 [cs.IT] 6 Oct 2007


Lifeng Lai, Hesham El Gamal, Hai Jiang and H. Vincent Poor

Abstract— This paper establishes the equivalence between cognitive medium access and the competitive multi-armed bandit problem. First, the scenario in which a single cognitive user wishes to opportunistically exploit the availability of empty frequency bands in the spectrum with multiple bands is considered.

In this scenario, the availability probability of each channel is unknown to the cognitive user a priori. Hence efficient medium access strategies must strike a balance between exploring the availability of other free channels and exploiting the opportunities identified thus far. By adopting a Bayesian approach for this classical bandit problem, the optimal medium access strategy is derived and its underlying recursive structure is illustrated via examples. To avoid the prohibitive computational complexity of the optimal strategy, a low complexity asymptotically optimal strategy is developed. The proposed strategy does not require any prior statistical knowledge about the traffic pattern on the different channels. Next, the multi-cognitive user scenario is considered and low complexity medium access protocols, which strike the optimal balance between exploration and exploitation in such competitive environments, are developed. Finally, this formalism is extended to the case in which each cognitive user is capable of sensing and using multiple channels simultaneously.

I. INTRODUCTION

Recently, the opportunistic spectrum access problem has been the focus of significant research activity [1]–[3]. The underlying idea is to allow unlicensed users (i.e., cognitive users) to access the available spectrum when the licensed users (i.e., primary users) are not active. The presence of high priority primary users and the requirement that the cognitive users should not interfere with them define a new medium access paradigm which we refer to as cognitive medium access.

The overarching goal of our work is to develop a unified framework for the design of efficient, and low complexity, cognitive medium access protocols.

The spectral opportunities available to the cognitive users are expected to be time-varying on different time-scales. For example, on a small scale, multimedia data traffic of the primary users will tend to be bursty [4]. On a large scale, one would expect the activities of each user to vary throughout the day. Therefore, to avoid interfering with the primary network, the cognitive users must first probe to determine whether there are primary activities in each channel before transmission.

Under the assumption that each cognitive user cannot access all of the available channels simultaneously, the main task of the medium access protocol is to distributively choose which channels each cognitive user should attempt to use in different time slots, in order to fully (or maximally) utilize the spectral opportunities. This decision process can be enhanced by taking into account any available statistical information about the primary traffic. For example, with a single cognitive user capable of accessing (sensing) only one channel at a time, the problem becomes trivial if the probability that each channel is free is known a priori. In this case, the optimal rule is for the cognitive user to access the channel with the highest probability of being free in all time slots. However, such time-varying traffic information is typically not available to the cognitive users a priori. The need to learn this information on-line creates a fundamental tradeoff between exploitation and exploration. Exploitation refers to the short-term gain resulting from accessing the channel with the estimated highest probability of being free (based on the results of previous sensing decisions) whereas exploration is the process by which the cognitive user learns the statistical behavior of the primary traffic (by choosing possibly different channels to probe across time slots). In the presence of multiple cognitive users, the medium access algorithm must also account for the competition between different users over the same channel.

L. Lai and H. V. Poor ({llai,poor}@princeton.edu) are with the Department of Electrical Engineering at Princeton University. H. El Gamal ([email protected]) is with the Department of Electrical and Computer Engineering at the Ohio State University and is currently visiting Nile University, Cairo, Egypt. H. Jiang ([email protected]) is with the Department of Electrical and Computer Engineering at the University of Alberta. This research was supported by the National Science Foundation under Grants ANI-03-38807 and CNS-06-25637.

In this paper, we develop a unified framework for the design and analysis of cognitive medium access protocols. As argued in the sequel, this framework allows for the construction of strategies that strike an optimal balance among exploration, exploitation and competition. The key observation motivating our approach is the equivalence between our problem and the classical multi-armed bandit problem (see [5] and references therein). This equivalence allows for building a solid foundation for cognitive medium access using tools from reinforcement machine learning [6]. The connection between cognitive medium access and the multi-armed bandit problem has been independently and concurrently observed in [7]. That work, however, is limited to special cases of the general approach presented here. In particular, in [7], the channels are assumed to be independent and the goal is to maximize the discounted sum of throughput, which is the problem addressed in Example 4 in Section III below. A related work also appears in [8], in which the availability of each channel is assumed to follow a Markov chain, whose transition matrix is known to the cognitive user. The only uncertainty faced by the cognitive user in that work is the particular realization of the channel, while in our work the cognitive users also need to learn the statistics of the channel in real time.

We consider three scenarios in this paper. In the first
scenario, we assume the existence of a single cognitive user capable of accessing only a single channel at any given time. In this setting, we derive an optimal sensing rule that maximizes the expected throughput obtained by the cognitive user. Compared with a genie-aided scheme, in which the cognitive user knows a priori the primary network traffic information, there is a throughput loss suffered by any medium access strategy. We obtain a lower bound on this loss and further construct a linear complexity single index protocol that achieves this lower bound asymptotically (when the primary traffic behavior changes very slowly). In the second scenario, we design distributed sensing rules that account for the competitive dimension of the problem in which the cognitive users must also take the competition from other cognitive users into consideration when making sensing decisions. We first characterize the optimal distributed sensing rule for the case in which the traffic information of the primary network is available to the cognitive users. Under this idealistic assumption, we show that the throughput loss of the proposed distributed sensing rule, compared with a throughput optimal centralized scheme, goes to zero exponentially as the number of cognitive users increases. To prevent any possible misbehavior by the cognitive users, we further design a game theoretically fair sensing rule, whose loss compared with the throughput optimal centralized rule also goes to zero exponentially. Building on these results, we then devise distributed sensing rules that do not require prior knowledge about the traffic and converge to the optimal distributed rule and game theoretically fair rule, respectively. In the third scenario, we extend our work to the case in which the cognitive user is capable of accessing more than one channel simultaneously.

The rest of the paper is organized as follows. Our network model is detailed in Section II. Section III analyzes the scenario in which a single cognitive user capable of sensing one channel at a time is present. The extension to the multi-user case is reported in Section IV whereas the multi-channel extension is studied in Section V. Finally, Section VI summarizes our conclusions.

II. NETWORK MODEL

Throughout this paper, upper-case letters (e.g., $X$) denote random variables, lower-case letters (e.g., $x$) denote realizations of the corresponding random variables, and calligraphic letters (e.g., $\mathcal{X}$) denote finite alphabet sets over which corresponding variables range. Also, upper-case boldface letters (e.g., $\mathbf{X}$) denote random vectors and lower-case boldface letters (e.g., $\mathbf{x}$) denote realizations of the corresponding random vectors.

Figure 1 shows the channel model of interest. We consider a primary network consisting of N channels, $\mathcal{N} = \{1, \cdots, N\}$, each with bandwidth B. The users in the primary network operate in a synchronous time-slotted fashion. We use i to refer to the channel index, j to refer to the time-slot index, and k to refer to the index of the cognitive users. We assume that at each time slot, channel i is free with probability θi. Let Zi(j) be a random variable that equals 1 if channel i is free at time slot j and equals 0 otherwise. Hence, given θi, Zi(j) is a Bernoulli random variable with probability density function (pdf)

\[ h_{\theta_i}(z_i(j)) = \theta_i \delta(1) + (1 - \theta_i) \delta(0), \]

where δ(·) is the delta function. Furthermore, for a given θ = (θ1, · · · , θN), the Zi(j) are independent for each i and j. We consider a block-varying model in which the value of θ is fixed for a block of T time slots and randomly changes at the beginning of the next block according to some joint pdf f(θ).

Our results can also be extended to the scenarios in which the Zi(j)s follow a Markov chain model.

[Figure 1: channels 1 through N over one block of T time slots (t = 1 to t = T); each slot is marked either as occupied by the primary users or as a spectrum opportunity.]

Fig. 1. Channel model.

In our model, the cognitive users attempt to exploit the availability of free channels in the primary network by sensing the activity at the beginning of each time slot. Our work seeks to characterize efficient strategies for choosing which channels to sense (access). The challenge here stems from the fact that the cognitive users are assumed to be unaware of θ a priori.

We consider two cases in which the cognitive user either has or does not have prior information about the pdf of θ, i.e., f(θ).

To further illustrate the point, let us consider our first scenario in which a single cognitive user capable of sensing only one channel is present. At time slot j, the cognitive user selects one channel S(j) ∈ N to access. If the sensing result shows that channel S(j) is free, i.e., $Z_{S(j)}(j) = 1$, the cognitive user can send B bits over this channel; otherwise, the cognitive user will wait until the next time slot and pick a possibly different channel to access (throughout the paper, it is assumed that the outcome of the sensing algorithm is error free). Therefore, the total number of bits that the cognitive user is able to send over one block (of T time slots) is

\[ W = \sum_{j=1}^{T} B Z_{S(j)}(j). \]

It is now clear that W is a random variable that depends on the traffic in the primary network and, more importantly for us, on the medium access protocols employed by the cognitive user. Therefore, the overarching goal of Section III is to construct low complexity medium access protocols that maximize

\[ E\{W\} = E\Bigl\{ \sum_{j=1}^{T} B Z_{S(j)}(j) \Bigr\}. \tag{1} \]
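To make the quantity in (1) concrete, the following Python sketch simulates one block of T slots and accumulates W. It is only an illustration under stated assumptions: the names `simulate_block` and `genie_policy`, the 0-based channel indices and the policy interface (a callable that maps the sensing history to the next channel) are hypothetical choices made here and do not come from the paper.

```python
import random

def simulate_block(theta, policy, T, B, rng):
    """Simulate one block of T slots and return the number of bits W sent.

    theta  : per-channel free probabilities, fixed within the block
    policy : hypothetical callable mapping the history, a list of
             (channel, outcome) pairs, to the next channel index (0-based)
    """
    history = []
    W = 0
    for _ in range(T):
        s = policy(history)                       # channel S(j) to sense
        z = 1 if rng.random() < theta[s] else 0   # Z_{S(j)}(j) ~ Bernoulli(theta_s)
        W += B * z                                # B bits if the channel is free
        history.append((s, z))
    return W

def genie_policy(theta):
    """Policy that knows theta and always senses the best channel."""
    best = max(range(len(theta)), key=lambda i: theta[i])
    return lambda history: best

# Sample mean of W under the genie policy; its expectation is
# T * B * max(theta) = 100 * 1 * 0.7 = 70.
rng = random.Random(0)
theta = [0.2, 0.7, 0.5]
runs = [simulate_block(theta, genie_policy(theta), T=100, B=1, rng=rng)
        for _ in range(2000)]
print(sum(runs) / len(runs))
```

Any learning strategy can be plugged in through the same `policy` argument, which makes it easy to compare its sample mean of W against the genie-aided value.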

Intuitively, the cognitive user would like to select the channel with the highest probability of being free in order to obtain more transmission opportunities. If θ is known, then this problem is trivial: the cognitive user should choose the channel $i^* = \arg\max_{i \in \mathcal{N}} \theta_i$ to sense. The uncertainty in θ imposes a fundamental tradeoff between exploration, in order to learn θ, and exploitation, by accessing the channel with the highest estimated free probability based on current available information, as detailed in the following sections.

III. SINGLE USER–SINGLE CHANNEL

We start by developing the optimal solution to the single user–single channel scenario under the idealized assumption that f(θ) is known a priori by the cognitive user. As argued next, the optimal medium access algorithm suffers from a prohibitive computational complexity that grows exponentially with the block length T. This motivates the design of low complexity asymptotically optimal approaches that are considered next. Interestingly, the proposed low complexity technique does not require prior knowledge about f(θ).

A. Bayesian Approach

Our single user–single channel cognitive medium access problem belongs to the class of bandit problems. In this setting, the decision maker must sequentially choose one process to observe from N ≥ 2 stochastic processes. These processes usually have parameters that are unknown to the decision maker and, associated with each observation, is a utility function. The objective of the decision maker is to maximize the sum or discounted sum of the utilities via a strategy that specifies which process to observe for every possible history of selections and observations. The following classical example illustrates the challenge facing our decision maker: A gambler enters a casino having N slot machines, the ith of which has winning probability θi, i ∈ N. The gambler does not know the values of the θi's and must sequentially choose machines to play. The goal is to maximize the overall gain for a total of T plays. In this example, the stochastic processes are the outcomes of the slot machines, the utility function is the reward that the gambler gains each time and the gambling strategy specifies which machine to play based on each possible past information pattern. A comprehensive treatment covering different variants of bandit problems can be found in [5].

We are now ready to rigorously formulate our problem. The cognitive user employs a medium access strategy Γ, which selects channel S(j) ∈ N to sense at time slot j for any possible causal information pattern obtained through the previous j − 1 observations,

\[ \Psi(j) = \{ s(1), z_{s(1)}(1), \cdots, s(j-1), z_{s(j-1)}(j-1) \}, \quad j \ge 2, \]

i.e., $s(j) = \Gamma(f, \Psi(j))$. Notice that $z_{s(j)}(j)$ is the sensing outcome of the jth time slot, in which s(j) is the channel being accessed. If j = 1, there is no accumulated information, thus Ψ(1) = ∅ and s(1) = Γ(f). Γ could be stochastic, i.e., for certain Ψ(j), the cognitive user may randomly pick channel i from a set $\mathcal{A} \subseteq \mathcal{N}$ with probability $p_i$, such that $\sum_{i \in \mathcal{A}} p_i = 1$.

The utility that the cognitive user obtains by making decision S(j) at time slot j is the number of bits it can transmit at time slot j, which is $B Z_{S(j)}(j)$. We denote the expected value of the payoff obtained by a cognitive user who uses strategy Γ as

\[ W_\Gamma = E_f\Bigl\{ \sum_{j=1}^{T} B Z_{S(j)}(j) \Bigr\}. \tag{2} \]

We denote $V^*(f, T) = \sup_{\Gamma} W_\Gamma$, which is the largest throughput that the cognitive user could obtain when the spectral opportunities are governed by f(θ) and the exact value of each realization of θ is not known by the user.

Each medium access decision made by the cognitive user has two effects. The first one is the short term gain, i.e., an immediate transmission opportunity if the chosen channel is found free. The second one is the long term gain, i.e., the updated statistical information about f (θ). This information will help the cognitive user in making better decisions in the future stages. There is an interesting tradeoff between the short and long term gains. If we only want to maximize the short term gain, we can pick the one with the highest free probability to sense, based on the current information. This myopic strategy maximally exploits the existing information.

On the other hand, by picking other channels to sense, we gain valuable statistical information about f (θ) which can effectively guide future decisions. This process is typically referred to as exploration.

More specifically, let $f^j(\theta)$ be the updated pdf after making j − 1 observations. We begin with $f^1(\theta) = f(\theta)$. After observing $z_{s(j)}(j)$, we update the pdf using the following Bayesian formula.

1) If $z_{s(j)}(j) = 1$,

\[ f^{j+1}(\theta) = \frac{\theta_{s(j)} f^j(\theta)}{\int \theta_{s(j)} f^j(\theta)\, d\theta}. \tag{3} \]

2) If $z_{s(j)}(j) = 0$,

\[ f^{j+1}(\theta) = \frac{\bigl(1 - \theta_{s(j)}\bigr) f^j(\theta)}{\int \bigl(1 - \theta_{s(j)}\bigr) f^j(\theta)\, d\theta}. \tag{4} \]

Now, Lemma 2.3.1 of [5] proves that every bandit problem with finite horizon has an optimal solution. Applying this result to our set-up, we obtain the following.

Lemma 1: For any prior pdf f, there exists an optimal strategy $\Gamma^*$ to the channel selection problem (2), and $V^*(f, T)$ is achievable. Moreover, $V^*$ satisfies the following condition:

\[ V^*(f, T) = \max_{s(1) \in \mathcal{N}} E_f\bigl\{ B Z_{s(1)} + V^*\bigl(f_{Z_{s(1)}}, T - 1\bigr) \bigr\}, \tag{5} \]

where $f_{Z_{s(1)}}$ is the conditional pdf updated using (3) and (4) as if the cognitive user chooses s(1) and observes $Z_{s(1)}$. Also, $V^*(f_{Z_{s(1)}}, T - 1)$ is the value of a bandit problem with prior information $f_{Z_{s(1)}}$ and T − 1 sequential observations. □

In principle, Lemma 1 provides the solution to problem (2).

Effectively, it decouples the calculation at each stage, and hence, allows the use of dynamic programming to solve the problem. The idea is to solve the channel selection problem with a smaller dimension first and then use backward deduction to obtain the optimal solution for a problem with a larger dimension. Starting with T = 1, the second term inside the expectation in (5) is 0, since T − 1 = 0. Hence, the optimal solution is to choose the channel i with the largest $E_f\{B Z_i\}$, which can be calculated as

\[ E_f\{B Z_i\} = B \int \theta_i f(\theta)\, d\theta, \]

and $V^*(f, 1) = \max_{i \in \mathcal{N}} E_f\{B Z_i\}$.

With the solution for T = 1 at hand, we can now solve the T = 2 case using (5). First, for every possible choice of s(1) and possible observation $z_{s(1)}$, we calculate the updated pdf $f_{z_{s(1)}}$ using (3) and (4). Next, we calculate $V^*(f_{z_{s(1)}}, 1)$ (which is equivalent to the T = 1 problem described above). Finally, applying (5), we have the following equation for the channel selection problem with T = 2:

\[ V^*(f, 2) = \max_{i \in \mathcal{N}} \int \bigl[ B\theta_i + \theta_i V^*(f_{z_i = 1}, 1) + (1 - \theta_i) V^*(f_{z_i = 0}, 1) \bigr] f(\theta)\, d\theta. \]

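Correspondingly, the optimal solution is $\Gamma^*(f) = \arg\max_{i \in \mathcal{N}} V^*(f, 2)$, where the arg max refers to the channel i that attains the maximum in the expression for $V^*(f, 2)$; i.e., in the first step, the cognitive user should choose $i^*(1) = \arg\max_{i \in \mathcal{N}} V^*(f, 2)$ to sense. After observing $z_{i^*(1)}$, the cognitive user has $\Psi(2) = \{i^*(1), z_{i^*(1)}\}$, and it should choose $i^*(2) = \arg\max_{i \in \mathcal{N}} V^*(f_{z_{i^*(1)}}, 1)$, implying that $\Gamma^*(f, \Psi(2)) = \arg\max_{i \in \mathcal{N}} V^*(f_{z_{i^*(1)}}, 1)$.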

Similarly, after solving the T = 2 problem, one can proceed to solve the T = 3 case. Using this procedure recursively, we can solve the problem with T − 1 observations. Finally, our original problem with T observations is solved as follows:

\[ V^*(f, T) = \max_{i \in \mathcal{N}} \int \bigl[ B\theta_i + \theta_i V^*(f_{z_i = 1}, T - 1) + (1 - \theta_i) V^*(f_{z_i = 0}, T - 1) \bigr] f(\theta)\, d\theta. \]

Example 1: Suppose we have two channels and two observations per block, i.e., $\mathcal{N} = \{1, 2\}$ and T = 2. The channels are known to be either both very busy or both relatively idle, which is reflected in the following joint pdf:

\[ f(\theta_1, \theta_2) = \frac{4}{5} \delta(0.1, 0) + \frac{1}{5} \delta(0.8, 1), \]

where δ(x, y) is the delta function at point (x, y). For simplicity of presentation, we assume that B = 100.

In this example, on average, channel 1 is available with probability 4/5 × 0.1 + 1/5 × 0.8 = 0.24, whereas channel 2 is available with probability 4/5 × 0 + 1/5 × 1 = 0.2. Hence, if the cognitive user ignores the information gained from sensing, it should always choose channel 1 to sense, resulting in an average throughput of 2 × 0.24 × 100 = 48 bits per block. Now, we use the procedure described above to derive the optimal rule and corresponding throughput.

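1) First, calculate all possible updated pdfs after one step. If s(1) = 1 and $z_{s(1)} = 1$, we have

\[ P(\theta_1 = 0.1, \theta_2 = 0 \mid z_{s(1)} = 1) = \frac{P(z_1 = 1 \mid \theta_1 = 0.1, \theta_2 = 0)\, P(\theta_1 = 0.1, \theta_2 = 0)}{P(z_1 = 1)} = \frac{0.1 \times 0.8}{0.8 \times 0.1 + 0.2 \times 0.8} = \frac{1}{3}. \]

Hence, for this case, we have

\[ f_{\{s(1)=1, z_{s(1)}=1\}} = \frac{1}{3} \delta(0.1, 0) + \frac{2}{3} \delta(0.8, 1). \]

Similarly, we obtain the following updated pdfs:

\[ f_{\{s(1)=1, z_{s(1)}=0\}} = \frac{18}{19} \delta(0.1, 0) + \frac{1}{19} \delta(0.8, 1), \]
\[ f_{\{s(1)=2, z_{s(1)}=1\}} = \delta(0.8, 1), \]
\[ f_{\{s(1)=2, z_{s(1)}=0\}} = \delta(0.1, 0). \]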

2) With the updated distribution information, we solve four channel-selection problems with T = 1. For example, with $f_{\{s(1)=1, z_{s(1)}=1\}} = \frac{1}{3} \delta(0.1, 0) + \frac{2}{3} \delta(0.8, 1)$, if the cognitive user chooses channel 1, the expected payoff would be

\[ 100 \times \Bigl( \frac{1}{3} \times 0.1 + \frac{2}{3} \times 0.8 \Bigr) = \frac{170}{3}. \]

If the cognitive user chooses channel 2, the expected payoff would be

\[ 100 \times \Bigl( \frac{1}{3} \times 0 + \frac{2}{3} \times 1 \Bigr) = \frac{200}{3}. \]

Thus

\[ V^*(f_{\{s(1)=1, z_{s(1)}=1\}}, 1) = \max\{170/3, 200/3\} = 200/3, \]

and the user should choose channel 2. Similarly, we have

\[ V^*(f_{\{s(1)=1, z_{s(1)}=0\}}, 1) = 100 \times \max\{26/190, 1/19\} = 260/19, \]

and the user should choose channel 1;

\[ V^*(f_{\{s(1)=2, z_{s(1)}=1\}}, 1) = \max\{80, 100\} = 100, \]

and the user should choose channel 2;

\[ V^*(f_{\{s(1)=2, z_{s(1)}=0\}}, 1) = \max\{10, 0\} = 10, \]

and the user should choose channel 1.

3) Finally, we solve the problem with pdf f and T = 2. If the cognitive user chooses channel 1 in the first step, we calculate

\[
\begin{aligned}
E_f\{B Z_1 + V^*(f_{Z_1}, 1)\}
&= P(\theta_1 = 0.1) \bigl[ 100 \times 0.1 + 0.1 \times V^*(f_{\{s(1)=1, z_{s(1)}=1\}}, 1) + (1 - 0.1) \times V^*(f_{\{s(1)=1, z_{s(1)}=0\}}, 1) \bigr] \\
&\quad + P(\theta_1 = 0.8) \bigl[ 100 \times 0.8 + 0.8 \times V^*(f_{\{s(1)=1, z_{s(1)}=1\}}, 1) + (1 - 0.8) \times V^*(f_{\{s(1)=1, z_{s(1)}=0\}}, 1) \bigr] \\
&= 252/5.
\end{aligned}
\]

Similarly, if the cognitive user chooses channel 2 in the first step, we calculate

\[
\begin{aligned}
E_f\{B Z_2 + V^*(f_{Z_2}, 1)\}
&= P(\theta_2 = 0) \bigl[ 100 \times 0 + 0 \times V^*(f_{\{s(1)=2, z_{s(1)}=1\}}, 1) + V^*(f_{\{s(1)=2, z_{s(1)}=0\}}, 1) \bigr] \\
&\quad + P(\theta_2 = 1) \bigl[ 100 + V^*(f_{\{s(1)=2, z_{s(1)}=1\}}, 1) + (1 - 1) V^*(f_{\{s(1)=2, z_{s(1)}=0\}}, 1) \bigr] \\
&= P(\theta_2 = 0) V^*(f_{\{s(1)=2, z_{s(1)}=0\}}, 1) + P(\theta_2 = 1) \bigl[ 100 + V^*(f_{\{s(1)=2, z_{s(1)}=1\}}, 1) \bigr] \\
&= 240/5.
\end{aligned}
\]

Thus

\[ V^*(f, 2) = \max_{s(1) \in \mathcal{N}} E_f\bigl\{ B Z_{s(1)} + V^*\bigl(f_{Z_{s(1)}}, 1\bigr) \bigr\} = \max\{252/5, 240/5\} = 252/5. \]

Hence, the optimal strategy is $\Gamma^*(f) = 1$, $\Gamma^*(f, z_1 = 1) = 2$, $\Gamma^*(f, z_1 = 0) = 1$. In other words, the cognitive user should sense channel 1 in the first time slot. Interestingly, if channel 1 is found free, the user should switch to channel 2 in the second time slot. On the other hand, if channel 1 is found busy, the cognitive user should keep sensing channel 1 at the second time slot. Finally, we observe that the optimal strategy offers a gain of 12/5 bits, on average, as compared with the myopic strategy. □
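The numbers in Example 1 can be checked with a short dynamic program. The sketch below is a minimal illustration, assuming a discrete prior represented as a list of (weight, θ) pairs and 0-based channel indices; the helper names `marginal`, `posterior` and `optimal_value` are hypothetical and introduced only for this check. Exact rational arithmetic is used so that the results can be compared directly with the fractions above.

```python
from fractions import Fraction as F

def marginal(prior, i):
    """Probability that channel i is free under a discrete prior.

    prior: list of (weight, theta) pairs, theta a tuple of per-channel values.
    """
    return sum(w * th[i] for w, th in prior)

def posterior(prior, i, z):
    """Bayesian update (3)-(4) after sensing channel i and observing z."""
    like = [(w * (th[i] if z == 1 else 1 - th[i]), th) for w, th in prior]
    total = sum(w for w, _ in like)
    return [(w / total, th) for w, th in like if w > 0]

def optimal_value(prior, T, B):
    """Return (V*(f, T), best first channel) via the recursion (5)."""
    if T == 0:
        return F(0), None
    n = len(prior[0][1])
    best_val, best_ch = None, None
    for i in range(n):
        p = marginal(prior, i)          # P(Z_i = 1) under the current pdf
        val = B * p                     # immediate expected payoff
        if p > 0:                       # branch: channel found free
            val += p * optimal_value(posterior(prior, i, 1), T - 1, B)[0]
        if p < 1:                       # branch: channel found busy
            val += (1 - p) * optimal_value(posterior(prior, i, 0), T - 1, B)[0]
        if best_val is None or val > best_val:
            best_val, best_ch = val, i
    return best_val, best_ch

# Prior of Example 1: (0.1, 0) with weight 4/5 and (0.8, 1) with weight 1/5.
prior = [(F(4, 5), (F(1, 10), F(0))), (F(1, 5), (F(4, 5), F(1)))]
value, first = optimal_value(prior, 2, 100)
baseline = 2 * 100 * marginal(prior, 0)   # always sense channel 1, no learning
print(value, first + 1, value - baseline)  # prints: 252/5 1 12/5
```

The recursion enumerates every sensing history, so its cost grows exponentially with T; it is meant for checking small examples, not as a practical protocol.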

The optimal solution presented above can be simplified when f (θ) has a certain structure, as illustrated by the following examples.

Example 2: (Symmetric Channels) We have N = 2 channels. Without loss of generality, let 0 ≤ θb < θa ≤ 1. At any block, either 1) channel 1 has probability θa of being free and channel 2 has probability θb of being free, or 2) channel 1 has probability θb of being free and channel 2 has probability θa of being free. The cognitive user does not know exactly which case happens. The prior pdf information is thus given by

\[ f(\theta_1, \theta_2) = \xi \delta(\theta_a, \theta_b) + (1 - \xi) \delta(\theta_b, \theta_a), \]

where ξ is a parameter. The optimal strategy under this scenario is the following.

1) At the first time slot, choose channel 1 if ξ > 1/2. If ξ = 1/2, randomly choose channel 1 or channel 2. Otherwise, choose channel 2.

2) At time slots j ≥ 2, update the pdf based on $\Psi(j) = \{s(1), z_{s(1)}(1), \cdots, s(j-1), z_{s(j-1)}(j-1)\}$ using (3) and (4). It is easy to see that $f^j$ has the following form:

\[ f^j(\theta_1, \theta_2) = \xi_j \delta(\theta_a, \theta_b) + (1 - \xi_j) \delta(\theta_b, \theta_a). \]

Then, choose channel 1 if ξj > 1/2, randomly choose channel 1 or 2 if ξj = 1/2, and choose channel 2 otherwise.

The optimality of this myopic strategy was proved in [9].

The previous myopic strategy is also optimal for some other special scenarios. For example, if the prior pdf is f(θ) = ξδ(a, b) + (1 − ξ)δ(c, d), then any of the following conditions ensures the optimality of the myopic strategy [10]: 1) a + b = c + d = 1, 2) a ≤ b and c ≤ d, 3) a ≥ b and c ≥ d. □
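For the symmetric two-channel prior of Example 2, the update in (3) and (4) reduces to tracking the single weight ξj. The following sketch shows this reduction together with the myopic selection rule; the function names `update_xi` and `myopic_choice` and the 1-based channel labels are assumptions made for illustration.

```python
import random

def update_xi(xi, s, z, theta_a, theta_b):
    """Posterior weight of the hypothesis (theta_1, theta_2) = (theta_a, theta_b).

    xi : current weight xi_j of that hypothesis
    s  : sensed channel, 1 or 2
    z  : sensing outcome, 1 if free and 0 if busy
    The call fails with a division by zero if the observation has
    probability zero under both hypotheses.
    """
    # Free probability of the sensed channel under each hypothesis.
    p_first = theta_a if s == 1 else theta_b    # hypothesis (theta_a, theta_b)
    p_second = theta_b if s == 1 else theta_a   # hypothesis (theta_b, theta_a)
    like_first = p_first if z == 1 else 1 - p_first
    like_second = p_second if z == 1 else 1 - p_second
    num = xi * like_first
    return num / (num + (1 - xi) * like_second)

def myopic_choice(xi, rng):
    """Channel 1 if xi > 1/2, channel 2 if xi < 1/2, random on a tie."""
    if xi > 0.5:
        return 1
    if xi < 0.5:
        return 2
    return rng.choice([1, 2])

# One step starting from xi = 1/2 with theta_a = 0.8 and theta_b = 0.2:
# finding channel 1 free raises the weight of (theta_a, theta_b).
rng = random.Random(0)
xi = update_xi(0.5, s=1, z=1, theta_a=0.8, theta_b=0.2)
print(xi, myopic_choice(xi, rng))
```

Because the posterior keeps the two-point form, the whole history Ψ(j) is summarized by one number, which is why the myopic rule is cheap to run in this example.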

Example 3: (One Known Channel) We have N = 2 channels with independent traffic distributions. Moreover, θ2 is known. The traffic pattern of channel 1 is unknown, and the probability density function of θ1 is given by f1(θ1).

Since channel 2 is known and is independent of channel 1, sensing channel 2 will not provide the cognitive user with any new information. Hence, once the cognitive user starts accessing channel 2 (meaning that at a certain stage, sensing channel 2 is optimal), there would be no reason to return to channel 1 in the optimal strategy. A generalized version of this assertion was first proved in Lemma 4.1 of [11]. Restated in our channel selection setup, we have the following lemma.

Lemma 2: In the optimal medium access strategy, once the cognitive user starts accessing channel 2, it should keep picking the same channel in the remaining time slots, regardless of the outcome of the sensing process. □

This lemma essentially converts the channel selection problem to an optimal stopping problem [12], [13], where we only need to focus on the strategies that decide at which time slot we should stop sensing channel 1, if it is ever accessed. The following lemma derives the optimal stopping rule.

Lemma 3: For any f1(θ1) and any T, if θ2 ≥ Λ(f1, T), then we should sense channel 2. Here

\[ \Lambda(f_1, T) = \max_{\Gamma(f_1) = 1} \frac{E_{f_1}\bigl\{ \sum_{j=1}^{M} Z_1(j) \bigr\}}{E_{f_1}\{M\}}, \tag{6} \]

where the maximization is over the set of strategies Γ that start with channel 1 and never switch back to channel 1 after selecting channel 2, and M is a random number that represents the last time slot in which channel 1 is sensed when the cognitive user follows a strategy in this set.

Proof: This result follows as a direct application of Theorem 5.3.1 and Corollary 5.3.2 of [5].
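The ratio in (6) can be evaluated numerically for simple priors. The sketch below is one possible approach, not the method of [5]: it assumes a Beta(a, b) prior on θ1 (an assumption made here purely for illustration) and uses the standard calibration view of a ratio of expectations, namely that Λ is the per-slot charge λ at which the best stopping rule for channel 1 exactly breaks even. The function name `stopping_index` and the bisection tolerance are hypothetical.

```python
from functools import lru_cache

def stopping_index(a, b, T, tol=1e-9):
    """Approximate Lambda(f1, T) in (6) for a Beta(a, b) prior on theta_1.

    Lambda is found as the charge lam per sensing slot at which the best
    stopping rule (sense channel 1 at least once, at most T times) has
    zero expected net gain.
    """
    def net_gain(lam):
        @lru_cache(maxsize=None)
        def g(s, f, left):
            # Expected net gain of sensing channel 1 once more and then
            # continuing only while that is profitable; s and f count the
            # free and busy observations made so far.
            if left == 0:
                return 0.0
            p = (a + s) / (a + b + s + f)        # posterior mean of theta_1
            cont_free = max(0.0, g(s + 1, f, left - 1))
            cont_busy = max(0.0, g(s, f + 1, left - 1))
            return p - lam + p * cont_free + (1 - p) * cont_busy
        return g(0, 0, T)

    lo, hi = 0.0, 1.0                 # net_gain(0) >= 0 >= net_gain(1)
    while hi - lo > tol:              # net_gain is decreasing in lam
        mid = (lo + hi) / 2
        if net_gain(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# With a uniform prior and T = 2, sensing once and continuing only after a
# free slot gives E{sum Z} = 5/6 and E{M} = 3/2, i.e. a ratio of 5/9.
print(stopping_index(1, 1, 1), stopping_index(1, 1, 2))
```

For T = 1 the index equals the prior mean of θ1, and it can only grow with the horizon, since a longer block leaves more room to benefit from what is learned about channel 1.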

One can now combine Lemma 2 and Lemma 3 to obtain the following optimal strategy.

1) At any time slot j, if channel 2 was sensed at time slot j − 1, keep sensing channel 2.

2) If channel 1 was sensed at time slot j − 1, update the pdf $f^j$ using (3) and (4) and compute $\Lambda(f_1^j, T - j + 1)$ using (6). If $\Lambda(f_1^j, T - j + 1) < \theta_2$, switch to channel 2; otherwise, keep sensing channel 1. □

Example 4: (Independent Channels) We have N independent channels with $f(\theta) = \prod_{i=1}^{N} f_i(\theta_i)$.

This case has a simple form of solution in the asymptotic scenario T → ∞, assuming the following discounted form for the utility function:

\[ W = E_f\Bigl\{ \sum_{j=1}^{\infty} \alpha^j B Z_{S(j)}(j) \Bigr\}, \]

where 0 < α < 1 is a discount factor. As discussed in the introduction, this scenario has been considered in [7], and the optimal strategy for this scenario is the following.

1) If channel l was selected at time slot j − 1, then we get the updated pdf $f_l^j$ using equations (3) and (4), based on the sensing result $z_l(j - 1)$. For other channels, we let $f_i^j = f_i^{j-1}$, $\forall i \neq l$, $i \in \mathcal{N}$. That is, we only update the pdf of the channel that was just accessed (due to the independence assumption).

2) For each channel, we calculate an index using the following equation:

\[ \Lambda_i(f_i^j) = \max_{\Gamma(f_i^j) = i} \frac{E_{f_i^j}\bigl\{ \sum_{j=1}^{M} \alpha^j Z_i(j) \bigr\}}{E_{f_i^j}\bigl\{ \sum_{j=1}^{M} \alpha^j \bigr\}}, \]

where the maximization is over the set of strategies Γ for the equivalent One-Known-Channel selection problem (with channel i having the unknown parameter) and M is a random number corresponding to the last time slot in which channel i will be selected in the equivalent One-Known-Channel case. Λi is typically referred to as the Gittins index [14].

3) Choose the channel with the largest Gittins index to sense at time slot j.

The optimality of this strategy is a direct application of the elegant result of Gittins and Jones [14]. Computational methods for evaluating the Gittins Index Λ could be found in [15] and references therein.

B. Non-parametric Asymptotic Analysis and Asymptotically Optimal Strategies

The optimal solution developed in Section III-A suffers from a prohibitive computational complexity. In particular, the dimensionality of the search space grows exponentially with the block length T. Moreover, one can envision many practical scenarios in which it would be difficult for the cognitive user to obtain the prior information f(θ). This motivates our pursuit of low complexity non-parametric protocols that maintain certain optimality properties. In this section, we analyze non-parametric schemes that do not explicitly use f(θ); thus, the rules Γ considered here depend only on Ψ(j). To this end, we study the asymptotic performance of several low complexity approaches as the block length T increases. The section concludes with our asymptotically optimal non-parametric protocols, which require only linear computational complexity.

For a certain strategy Γ, the expected number of bits the cognitive user is able to transmit through a block with certain parameters θ is

\[ E\Bigl\{ \sum_{j=1}^{T} B Z_{S(j)}(j) \Bigr\} = \sum_{j=1}^{T} B \sum_{i=1}^{N} \theta_i \Pr\{\Gamma(\Psi(j)) = i\}. \]

Recall that Γ(Ψ(j)) = i means that, following strategy Γ, the cognitive user should choose channel i at time slot j, based on the available information Ψ(j). Here Pr{Γ(Ψ(j)) = i} is the probability that the cognitive user will choose channel i at time slot j, following the strategy Γ.

Compared with the idealistic case where the exact value of θ is known, in which the optimal strategy for the cognitive user is to always choose the channel with the largest free probability, the loss entailed by Γ is given by

\[ L(\theta; \Gamma) = \sum_{j=1}^{T} B \theta_{i^*} - \sum_{j=1}^{T} B \sum_{i=1}^{N} \theta_i \Pr\{\Gamma(\Psi(j)) = i\}, \]

where $\theta_{i^*} = \max\{\theta_1, \cdots, \theta_N\}$. We say that a strategy Γ is consistent if, for any θ ∈ [0, 1]^N, there exists β < 1 such that L(θ; Γ) scales as¹ O(T^β). For example, consider a royal scheme in which the cognitive user selects channel i at the beginning of a block and sticks to it. If θi is the largest one among θ, L(θ; Γ) = 0. On the other hand, if θi is not the largest one, L(θ; Γ) ∼ O(T). Hence, this royal scheme is not consistent. The following lemma characterizes the fundamental limits of any consistent scheme.

Lemma 4: For any θ and any consistent strategy Γ, we have

\[ \liminf_{T \to \infty} \frac{L(\theta; \Gamma)}{\ln T} \ge B \sum_{i \in \mathcal{N} \setminus \{i^*\}} \frac{\theta_{i^*} - \theta_i}{D(\theta_i \| \theta_{i^*})}, \tag{7} \]

where $D(\theta_i \| \theta_l)$ is the Kullback-Leibler divergence between the two Bernoulli random variables with parameters θi and θl, respectively:

\[ D(\theta_i \| \theta_l) = \theta_i \ln \frac{\theta_i}{\theta_l} + (1 - \theta_i) \ln \frac{1 - \theta_i}{1 - \theta_l}. \]

Proof: The proof is an application of a theorem proved in [16]. More specifically, for a general bandit problem, let x be the random payoff obtained by choosing bandit i (not necessarily Bernoulli), and let $h_{\theta_i}(x)$ be the pdf of x for a given θi.

Let µi denote the average payoff of bandit i, i.e.,

\[ \mu_i = \int x\, h_{\theta_i}(x)\, dx, \]

and note that the Kullback-Leibler divergence between bandits i and l is given by

\[ D(\theta_i \| \theta_l) = \int \bigl[ \ln h_{\theta_i}(x) - \ln h_{\theta_l}(x) \bigr] h_{\theta_i}(x)\, dx. \]

Let $i^* = \arg\max_{i \in \mathcal{N}} \mu_i$, i.e., the index of the channel with the largest average payoff. Under mild regularity conditions on $h_{\theta_i}(x)$, it has been proved in Theorem 1 of [16] that for any consistent strategy Γ,

\[ \liminf_{T \to \infty} \frac{L(\theta; \Gamma)}{\ln T} \ge \sum_{i \in \mathcal{N} \setminus \{i^*\}} \frac{\mu_{i^*} - \mu_i}{D(\theta_i \| \theta_{i^*})}. \tag{8} \]

In our cognitive radio channel selection problem, given θ, x is a random variable with

\[ h_{\theta_i}(x) = \theta_i \delta(B) + (1 - \theta_i) \delta(0); \]

hence µi = Bθi, and

\[ D(\theta_i \| \theta_l) = \theta_i \ln \frac{\theta_i}{\theta_l} + (1 - \theta_i) \ln \frac{1 - \theta_i}{1 - \theta_l}. \]

Substituting these parameters into (8), the proof is complete.

¹ In this paper, we use Knuth's asymptotic notations: 1) $g_1(N) = o(g_2(N))$ means $\forall c > 0, \exists N_0, \forall N > N_0, g_1(N) < c\, g_2(N)$; 2) $g_1(N) = \omega(g_2(N))$ means $\forall c > 0, \exists N_0, \forall N > N_0, g_2(N) < c\, g_1(N)$; 3) $g_1(N) = O(g_2(N))$ means $\exists c_2 \ge c_1 > 0, N_0, \forall N > N_0, c_1 g_2(N) \le g_1(N) \le c_2 g_2(N)$.

Lemma 4 shows that the loss of any consistent strategy scales at least as ω(ln T). An intuitive explanation of this loss is that we need to spend at least O(ln T) time slots on sampling each of the channels with smaller θi, in order to get a reasonably accurate estimate of θ, and hence, use it to determine the channel having the largest θi to sense. We say that a strategy Γ is order optimal if L(θ; Γ) ∼ O(ln T).
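To get a feel for the size of the bound in (7), the following sketch evaluates its right-hand side multiplied by ln T for a given θ. The function names `kl_bernoulli` and `loss_lower_bound` are illustrative, and the code assumes max(θ) < 1 and skips any channel that ties with the best one (both are assumptions made here so that every divergence is finite and nonzero).

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence D(p || q) between Bernoulli(p) and Bernoulli(q)."""
    def term(x, y):
        # Convention 0 * ln(0 / y) = 0.
        return 0.0 if x == 0 else x * math.log(x / y)
    return term(p, q) + term(1 - p, 1 - q)

def loss_lower_bound(theta, T, B):
    """Right-hand side of (7) multiplied by ln T.

    Assumes max(theta) < 1; channels equal to the best one are skipped.
    """
    best = max(theta)
    const = sum((best - th) / kl_bernoulli(th, best)
                for th in theta if th < best)
    return B * const * math.log(T)

# The bound grows only logarithmically in the block length T.
for T in (100, 1000, 10000):
    print(T, loss_lower_bound([0.5, 0.25], T, B=1))
```

Channels whose θi is close to the best one contribute the most to the constant, since the divergence in the denominator shrinks faster than the gap in the numerator.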

Now, the first question that arises is whether there exist order optimal strategies. As shown later in this section, we can design suboptimal strategies that have loss of order $O(\ln T)$. Thus the answer to this question is affirmative. Before proceeding to the proposed low complexity order-optimal strategy, we first analyze the loss order of some heuristic strategies which may appear appealing in certain applications.

The first simple rule is the random strategy $\Gamma_r$ where, at each time slot, the cognitive user randomly chooses a channel from the available $N$ channels. The fraction of time slots the cognitive user spends on each channel is therefore $1/N$, leading to the loss

$$L(\theta;\Gamma_r) = \frac{B\sum_{i=1}^{N}(\theta_{i^*}-\theta_i)}{N}\,T \sim O(T).$$
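As a quick illustration of the linear growth, the loss of the random strategy can be evaluated directly from the expression above. This is only a sketch; `random_strategy_loss` is a hypothetical helper name.

```python
from typing import Sequence

def random_strategy_loss(theta: Sequence[float], B: float, T: int) -> float:
    """Expected loss of uniform random channel selection over T slots."""
    best = max(theta)
    # Each channel is sensed a fraction 1/N of the time; the gap to the best channel is lost.
    return B * sum(best - t for t in theta) / len(theta) * T

if __name__ == "__main__":
    for T in (1000, 2000, 4000):
        print(T, random_strategy_loss([0.2, 0.5, 0.8], B=1.0, T=T))
```

Doubling $T$ doubles the loss, in contrast with the logarithmic growth of an order optimal strategy.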

The second one is the myopic rule $\Gamma_g$, in which the cognitive user keeps updating $f^j(\theta)$ and chooses the channel with the largest value of

$$\hat{\theta}_i = \int \theta_i f^j(\theta)\,d\theta$$

at each stage. Since there are no convergence guarantees for the myopic rule, that is, $\hat{\theta}$ may never converge to $\theta$ due to the lack of sufficiently many samples for each channel [17], the loss of this myopic strategy is $O(T)$.
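The myopic rule can be sketched in a few lines under one extra assumption that is not made in the text: independent uniform priors on each $\theta_i$. In that case the posterior of $\theta_i$ after $Y_i$ sensing attempts with $X_i$ free outcomes is Beta$(X_i+1,\,Y_i-X_i+1)$, so the posterior mean is $(X_i+1)/(Y_i+2)$. The names `posterior_mean` and `myopic_choice` are illustrative.

```python
from typing import Sequence

def posterior_mean(free_count: int, sense_count: int) -> float:
    """Posterior mean of theta_i under an assumed uniform (Beta(1,1)) prior."""
    return (free_count + 1) / (sense_count + 2)

def myopic_choice(X: Sequence[int], Y: Sequence[int]) -> int:
    """Index of the channel with the largest posterior mean (ties go to the lowest index)."""
    means = [posterior_mean(x, y) for x, y in zip(X, Y)]
    return max(range(len(means)), key=means.__getitem__)

if __name__ == "__main__":
    # Channel 0 was busy on its single trial, channel 1 was free on its single trial.
    print(myopic_choice(X=[0, 1], Y=[1, 1]))
```

Because the rule never deliberately samples a channel whose current estimate is low, an unlucky early observation on the best channel may never be corrected.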

The third protocol we consider is staying with the winner and switching from the loser rule ΓSW where the cognitive user randomly chooses a channel in the first time slot. In the succeeding time-slots 1) if the accessed channel was found to be free, it will choose the same channel to sense; 2) otherwise, it will choose one of the remaining channels based on a certain switching rule.

Lemma 5: No matter what the switching rule is, L(θ; ΓSW) ∼ O(T ).

Proof: Let $i^* = \arg\max_{i\in\mathcal{N}} \theta_i$ and $i^{**} = \arg\max_{i\in\mathcal{N}\setminus\{i^*\}} \theta_i$, i.e., $i^*$ is the best channel, and $i^{**}$ is the second best channel.

To avoid trivial conditions, without loss of generality we assume that $\theta_{i^*} \neq \theta_{i^{**}}$ and $\theta_{i^*} \neq 1$. We can upper bound the performance of the staying with the winner and switching from the loser rule by assuming that the cognitive user has the following extra knowledge.

1) In the first time slot, the cognitive user is able to choose $i^*$ correctly.

2) Once $i^*$ is sensed busy, the cognitive user somehow knows which channel is the second best, and switches to $i^{**}$.

3) Once $i^{**}$ is sensed busy, the cognitive user is always able to switch back to $i^*$.

We denote this optimistic rule by $\Gamma^*_{SW}$. With any realistic switching rule $\Gamma_{SW}$, we have

$$L(\theta;\Gamma_{SW}) \ge L(\theta;\Gamma^*_{SW}).$$

Now with the optimistic rule $\Gamma^*_{SW}$, the system can be modelled as the Markov process shown in Figure 2, in which we have two states: 1) sensing channel $i^*$ and 2) sensing channel $i^{**}$. The transition probability matrix is

$$P = \begin{bmatrix} \theta_{i^*} & 1-\theta_{i^*} \\ 1-\theta_{i^{**}} & \theta_{i^{**}} \end{bmatrix}.$$

Fig. 2. A Markov process representation of the optimistic strategy $\Gamma^*_{SW}$.

The probability $P_{i^{**}}$ that the cognitive user will sense channel $i^{**}$ can be obtained by solving the following stationary equation

$$P_{i^{**}} = (1-\theta_{i^*})(1-P_{i^{**}}) + \theta_{i^{**}}P_{i^{**}},$$

from which we obtain

$$P_{i^{**}} = \frac{1-\theta_{i^*}}{(1-\theta_{i^*}) + (1-\theta_{i^{**}})}.$$

Hence in the nontrivial cases, we have

$$L(\theta;\Gamma^*_{SW}) = B P_{i^{**}}(\theta_{i^*}-\theta_{i^{**}})\,T,$$

implying that, for any switching rule, $L(\theta;\Gamma_{SW}) \sim O(T)$.
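The stationary probability in the proof of Lemma 5 is easy to check numerically. The sketch below compares the closed form with a fixed-point iteration of the stationary equation; all function names are illustrative, and the nontrivial case $\theta_{i^*} < 1$ is assumed.

```python
def switch_probability(theta_best: float, theta_second: float) -> float:
    """Closed-form stationary probability of sensing the second best channel."""
    return (1 - theta_best) / ((1 - theta_best) + (1 - theta_second))

def switch_probability_iterated(theta_best: float, theta_second: float, steps: int = 500) -> float:
    """Iterate P <- (1 - theta_best)(1 - P) + theta_second * P from P = 0."""
    p = 0.0
    for _ in range(steps):
        p = (1 - theta_best) * (1 - p) + theta_second * p
    return p

def optimistic_sw_loss(theta_best: float, theta_second: float, B: float, T: int) -> float:
    """Loss of the optimistic stay-with-winner rule; linear in T."""
    return B * switch_probability(theta_best, theta_second) * (theta_best - theta_second) * T

if __name__ == "__main__":
    print(switch_probability(0.8, 0.5), switch_probability_iterated(0.8, 0.5))
    print(optimistic_sw_loss(0.8, 0.5, B=1.0, T=700))
```

Even with the genie-aided knowledge granted in the proof, a constant fraction of slots is spent on the second best channel, which is why the loss stays linear in $T$.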

There are several strategies that have loss of order $O(\ln T)$. We adopt the following linear complexity strategy which was proposed and analyzed in [18].

Rule 1: (Order optimal single index strategy)

The cognitive user maintains two vectors $\mathbf{X}$ and $\mathbf{Y}$, where each $X_i$ records the number of time slots for which the cognitive user has sensed channel $i$ to be free, and each $Y_i$ records the number of time slots for which the cognitive user has chosen channel $i$ to sense. The strategy works as follows.

1) Initialization: at the beginning of each block, sense each channel once.

2) After the initialization period, the cognitive user obtains an estimate $\hat{\theta}$ at the beginning of time slot $j$, given by

$$\hat{\theta}_i(j) = \frac{X_i(j)}{Y_i(j)},$$

and assigns an index

$$\Lambda_i(j) = \hat{\theta}_i(j) + \sqrt{\frac{2\ln j}{Y_i(j)}}$$

to the $i$th channel. The cognitive user chooses the channel with the largest value of $\Lambda_i(j)$ to sense at time slot $j$. After each sensing, the cognitive user updates $\mathbf{X}$ and $\mathbf{Y}$.
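Rule 1 can be simulated with a few lines of code. The sketch below follows the two steps above for a single block; the channel state in each slot is drawn as an independent Bernoulli variable, and the function name, seed handling and example parameters are assumptions made for illustration only.

```python
import math
import random
from typing import List, Sequence, Tuple

def single_index_strategy(theta: Sequence[float], T: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Simulate Rule 1 for T >= N slots; return (X, Y) = (free counts, sensing counts)."""
    rng = random.Random(seed)
    N = len(theta)
    X = [0] * N  # X[i]: slots in which channel i was sensed free
    Y = [0] * N  # Y[i]: slots in which channel i was sensed
    for j in range(1, T + 1):
        if j <= N:
            i = j - 1  # initialization: sense each channel once
        else:
            # index = empirical mean + exploration bonus sqrt(2 ln j / Y_i)
            i = max(range(N), key=lambda c: X[c] / Y[c] + math.sqrt(2 * math.log(j) / Y[c]))
        X[i] += int(rng.random() < theta[i])  # channel i is free with probability theta[i]
        Y[i] += 1
    return X, Y

if __name__ == "__main__":
    X, Y = single_index_strategy([0.2, 0.5, 0.8], T=5000)
    print("sensing counts per channel:", Y)
```

In runs of this kind the inferior channels are expected to receive only a number of samples that grows slowly with $T$, while the remaining slots go to the channel with the largest $\theta_i$.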

The intuition behind this strategy is that as long as $Y_i$ grows as fast as $O(\ln T)$, $\Lambda_i$ converges to the true value of $\theta_i$ in probability, and the cognitive user will choose the channel with the largest $\theta_i$ eventually. The loss of $O(\ln T)$ comes from the time spent on sampling the inferior channels in order to learn the value of $\theta$. This price, however, is inevitable as established in the lower bound of Lemma 4.²

Finally, we observe that the difference between the myopic rule and the order optimal single index rule is the additional term $\sqrt{2\ln j/Y_i(j)}$ added to the current estimate $\hat{\theta}_i$. Roughly speaking, this additional term guarantees enough sampling time for each channel, since if we sample channel $i$ too sparsely, $Y_i(j)$ will be small, which will increase the probability that $\Lambda_i$ is the largest index. When $Y_i(j)$ scales as $\ln T$, $\hat{\theta}_i$ will be the dominant term in the index $\Lambda_i$, and hence the channel with the largest $\theta_i$ will be chosen much more frequently.

IV. MULTIUSER–SINGLE CHANNEL

The presence of multiple cognitive users adds an element of competition to the problem. In order for a cognitive user to get hold of a channel now, it must be free from the primary traffic and the other competing cognitive users. More rigorously, we assume the presence of a set $\mathcal{K} = \{1,\cdots,K\}$ of cognitive users and consider the distributed medium access decision processes at the multiple users with no prior coordination. We denote by $\mathcal{K}_i(j) \subseteq \mathcal{K}$ the random set of users who choose to sense channel $i$ at time slot $j$. We assume that the users follow a generalized version of the Carrier Sense Multiple Access/Collision Avoidance (CSMA-CA) protocol to access the channel after sensing the main channel to be free, i.e., if channel $i$ is free, each user $k$ in the set $\mathcal{K}_i(j)$ will generate a random number $t_k(j)$ according to a certain probability density function $g$, and wait the time specified by the generated random number. At the end of the waiting period, user $k$ senses the channel again, and if it is found free, the packet from user $k$ will be transmitted. The probability that user $k$ in the set $\mathcal{K}_i(j)$ gains access to the channel is the same as the probability that $t_k(j)$ is the smallest random number generated by the users in the set $\mathcal{K}_i(j)$. Thus, the throughput user $k$ achieves in a block is

$$W_k = \sum_{j=1}^{T} B\,Z_{S_k(j)}(j)\; I\left\{k = \arg\min_{q\in\mathcal{K}_{S_k(j)}(j)} t_q(j)\right\}.$$

Therefore, user $k$ should devise a sensing rule $\Gamma_k$ that maximizes

$$E\{W_k\} = E\left\{\sum_{j=1}^{T} B\,Z_{S_k(j)}(j)\; I\left\{k = \arg\min_{q\in\mathcal{K}_{S_k(j)}(j)} t_q(j)\right\}\right\}.$$

Clearly, with multiple cognitive users, it is no longer optimal for all the users to always choose the channel with the largest $\theta_i$ to sense. In particular, if all the users choose the channel with the largest $\theta_i$, the probability that a given user gains control of the channel decreases, while potential opportunities in the other channels in the primary network are wasted.
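The contention step of the access model described in this section can be sketched as follows: every user sensing a free channel draws a random waiting time, and the smallest draw wins. The density $g$ is left unspecified in the text; the sketch assumes a uniform density on $[0,1)$ purely for illustration, and `contend` and `win_frequencies` are hypothetical names.

```python
import random
from typing import Iterable, List

def contend(users: Iterable[int], rng: random.Random) -> int:
    """Return the user with the smallest random waiting time t_k."""
    # Assumption: g is uniform on [0, 1); any continuous density gives the same symmetry.
    waiting_time = {k: rng.random() for k in users}
    return min(waiting_time, key=waiting_time.get)

def win_frequencies(num_users: int, trials: int, seed: int = 0) -> List[float]:
    """Empirical fraction of contention rounds won by each of num_users users."""
    rng = random.Random(seed)
    wins = [0] * num_users
    for _ in range(trials):
        wins[contend(range(num_users), rng)] += 1
    return [w / trials for w in wins]

if __name__ == "__main__":
    print(win_frequencies(num_users=4, trials=20000))
```

With i.i.d. waiting times each of the $n$ contending users wins with probability $1/n$, which illustrates why crowding onto the best channel reduces the chance that any given user gains it.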

A. Known θ Case

To enable a succinct presentation, we first consider the case in which the values of θ are known to all the cognitive users.

The users distributively choose channels to sense and compete for access if the channels are free.

1) The Optimal Symmetric Strategy: Without loss of generality, we consider a mixed strategy where user $k$ will choose channel $i$ with probability $p_{k,i}$. Furthermore, we let $\mathbf{p}_k = [p_{k,1},\cdots,p_{k,N}]$ and consider the symmetric solution in which $\mathbf{p} = \mathbf{p}_1 = \cdots = \mathbf{p}_K$. The symmetry assumption implies that all the users in the network distributively follow the same rule to access the spectral opportunities present in the primary network, in order to maximize the same average throughput each user can obtain. The following result derives the optimal solution in this situation.

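Lemma 6: For a cognitive network with $K > 1$ cognitive users and $N$ channels with probability $\theta$ of being free, the optimal $\mathbf{p}^*$ is given by

$$p_i^* = \begin{cases} \left\{1 - \left(\dfrac{\lambda^*}{K\theta_i}\right)^{1/(K-1)}\right\}^{+}, & \text{for } \theta_i > 0,\\[2mm] 0, & \text{for } \theta_i = 0, \end{cases}$$

where $\lambda^*$ is a constant such that $\sum_{i=1}^{N} p_i^* = 1$. Here $\{x\}^+ = \max\{0, x\}$.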

Proof: With a strategy $\mathbf{p}$, the probability that user $k$ chooses channel $i$ and, at the same time, there are $l$ other users choosing channel $i$ to sense is

$$p_i\binom{K-1}{l}p_i^{l}(1-p_i)^{K-1-l}.$$

Under this scenario, the average number of bits transmitted in one slot by each user is $B\theta_i/(l+1)$. Hence, the average throughput $W_k$ of user $k$ is

$$W_k = T\sum_{i=1}^{N}\sum_{l=0}^{K-1}\frac{B\theta_i}{l+1}\,p_i\binom{K-1}{l}p_i^{l}(1-p_i)^{K-1-l}.$$

Based on our symmetry assumption, we drop the subscript $k$ and write the average throughput of each user as $W$, leading to

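$$\begin{aligned}
W &= BT\sum_{i=1}^{N} p_i\theta_i \sum_{l=0}^{K-1}\binom{K-1}{l}\frac{p_i^{l}(1-p_i)^{K-1-l}}{l+1}\\
&= BT\sum_{i=1}^{N} p_i\theta_i \sum_{l=0}^{K-1}\frac{(K-1)!}{l!\,(K-1-l)!}\,\frac{p_i^{l}(1-p_i)^{K-1-l}}{l+1}\\
&= BT\sum_{i=1}^{N} \frac{\theta_i}{K} \sum_{l=0}^{K-1}\binom{K}{l+1}p_i^{l+1}(1-p_i)^{K-1-l}\\
&= BT\sum_{i=1}^{N} \frac{\theta_i}{K} \left[\sum_{l'=0}^{K}\binom{K}{l'}p_i^{l'}(1-p_i)^{K-l'} - (1-p_i)^{K}\right]\\
&= BT\sum_{i=1}^{N} \frac{\theta_i}{K}\left(1-(1-p_i)^{K}\right).
\end{aligned}$$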
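The binomial identity used in the derivation of $W$ can be sanity-checked numerically. The sketch below compares the sum over the number $l$ of competing users with the closed form $(1-(1-p_i)^K)/K$ for a single channel; the function names are illustrative.

```python
from math import comb

def share_by_summation(p: float, K: int) -> float:
    """p * sum_{l=0}^{K-1} C(K-1, l) p^l (1-p)^(K-1-l) / (l+1)."""
    return p * sum(
        comb(K - 1, l) * p**l * (1 - p) ** (K - 1 - l) / (l + 1)
        for l in range(K)
    )

def share_closed_form(p: float, K: int) -> float:
    """(1 - (1-p)^K) / K, the last line of the derivation for one channel."""
    return (1 - (1 - p) ** K) / K

if __name__ == "__main__":
    for p, K in [(0.1, 2), (0.3, 5), (0.7, 10)]:
        print(p, K, share_by_summation(p, K), share_closed_form(p, K))
```

The closed form has a direct reading: $1-(1-p_i)^K$ is the probability that at least one of the $K$ users senses channel $i$, and by symmetry each user receives a $1/K$ share of it.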

Now, we should solve the following optimization problem

$$\max\; W = BT\sum_{i=1}^{N}\frac{\theta_i}{K}\left(1-(1-p_i)^{K}\right), \quad \text{s.t.}\ \sum_{i=1}^{N}p_i = 1,\ \mathbf{p}\ge 0.$$

This optimization problem is equivalent to the following:

$$\min\; y = \sum_{i=1}^{N}\theta_i(1-p_i)^{K}, \quad \text{s.t.}\ \sum_{i=1}^{N}p_i = 1,\ \mathbf{p}\ge 0. \tag{9}$$

Since

$$\frac{\partial^2 y}{\partial p_i^2} = \theta_i K(K-1)(1-p_i)^{K-2} \ge 0$$

for $0 \le p_i \le 1$, and each term of $y$ depends on a single $p_i$, $y$ is a convex function of $\mathbf{p}$ in the region of interest, i.e., $\mathbf{p} \in [0,1]^N$. Also, the feasible set is the intersection of a convex set and a hyperplane, and is therefore convex. Hence our problem reduces to a convex optimization problem whose Karush-Kuhn-Tucker (KKT) conditions [19] for optimality are

$$\mathbf{p}^* \ge 0, \qquad \sum_{i=1}^{N}p_i^* = 1, \qquad p_i^*\left[\lambda^* - K\theta_i(1-p_i^*)^{K-1}\right] = 0, \qquad \lambda^* \ge K\theta_i(1-p_i^*)^{K-1},$$

where $\lambda^*$ is the Lagrange multiplier.

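It is easy to check that if $K > 1$,

$$p_i^* = \begin{cases} \left\{1 - \left(\dfrac{\lambda^*}{K\theta_i}\right)^{1/(K-1)}\right\}^{+}, & \text{for } \theta_i > 0,\\[2mm] 0, & \text{for } \theta_i = 0, \end{cases} \tag{10}$$

satisfies the KKT conditions, in which $\lambda^*$ is the constant that satisfies $\sum_{i} p_i^* = 1$.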

If $K = 1$, then the choice $p^*_{i^*} = 1$, where $i^* = \arg\max_{i\in\mathcal{N}}\theta_i$, and $p_l^* = 0$ for $l \in \mathcal{N}\setminus\{i^*\}$, satisfies the KKT conditions.
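The constant $\lambda^*$ in (10) has no closed form in general, but the sum of the $p_i^*$ is decreasing in $\lambda^*$, so it can be found by bisection. The sketch below is one possible way to do this for $K > 1$; the function name, the iteration count and the handling of the single-channel case are assumptions rather than part of the paper.

```python
from typing import List, Sequence

def optimal_symmetric_p(theta: Sequence[float], K: int, iters: int = 200) -> List[float]:
    """Solve (10) for p* by bisection on the Lagrange multiplier lambda (requires K > 1)."""
    if K < 2:
        raise ValueError("this sketch covers K > 1 only")
    positive = [t for t in theta if t > 0]
    if not positive:
        raise ValueError("at least one channel must have theta_i > 0")
    if len(positive) == 1:
        # Only one useful channel: every user senses it.
        return [1.0 if t > 0 else 0.0 for t in theta]

    def p_of(lam: float) -> List[float]:
        return [
            max(0.0, 1.0 - (lam / (K * t)) ** (1.0 / (K - 1))) if t > 0 else 0.0
            for t in theta
        ]

    lo, hi = 0.0, K * max(theta)  # sum(p_of(lo)) >= 1 and sum(p_of(hi)) = 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(p_of(mid)) > 1.0:
            lo = mid  # lambda too small: probabilities sum to more than one
        else:
            hi = mid
    return p_of((lo + hi) / 2)

if __name__ == "__main__":
    print(optimal_symmetric_p([0.3, 0.6], K=2))
    print(optimal_symmetric_p([0.9, 0.5, 0.05], K=2))
```

In the second example the third channel is so rarely free that its probability is clipped to zero by the $\{\cdot\}^+$ operation, and the two remaining channels share the users.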

So, the total throughput of the $K$ cognitive users is

$$KW = BKT\sum_{i=1}^{N}\frac{\theta_i}{K}\left(1-(1-p_i^*)^{K}\right) = BT\sum_{i=1}^{N}\theta_i\left(1-(1-p_i^*)^{K}\right).$$

On the other hand, the average total spectral opportunities of the primary network is $BT\sum_{i=1}^{N}\theta_i$. This upper bound can be achieved by a centralized channel allocation strategy when $K > N$ (simply by assigning one cognitive user to each channel). Therefore, the loss of the distributed protocol as compared with the centralized scheduling is

$$L = BT\sum_{i=1}^{N}\theta_i(1-p_i^*)^{K},$$

which is the same as the objective in (9) up to a constant factor. There is an intuitive explanation of this loss. If there is a spectral opportunity in channel $i$ but there are no users choosing channel $i$ to sense, a loss occurs. The probability that there is no user choosing channel $i$ to sense is $(1-p_i^*)^{K}$, and hence the probability of loss occurring at channel $i$ is $\theta_i(1-p_i^*)^{K}$. To obtain further insights on the performance of the cognitive network, we study the following special cases.

1) $N \ge 1$, $K = 1$. As stated above, $p^*_{i^*} = 1$, and $p_l^* = 0$, $l \in \mathcal{N}\setminus\{i^*\}$. Hence, the user should choose the channel with the largest free probability to sense, and

$$L = BT\sum_{i\in\mathcal{N}\setminus\{i^*\}}\theta_i.$$

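2) $N = 2$, $K = 2$. Substituting $N = 2$ and $K = 2$ into (10), we obtain

$$p_1^* = \frac{\theta_1}{\theta_1+\theta_2} \quad\text{and}\quad p_2^* = \frac{\theta_2}{\theta_1+\theta_2}.$$

Furthermore,

$$W = \frac{BT\theta_1}{2}\left(1-\frac{\theta_2^2}{(\theta_1+\theta_2)^2}\right) + \frac{BT\theta_2}{2}\left(1-\frac{\theta_1^2}{(\theta_1+\theta_2)^2}\right),$$

and, substituting into $L = BT\sum_{i}\theta_i(1-p_i^*)^{K}$,

$$L = BT\,\frac{\theta_1\theta_2^2+\theta_2\theta_1^2}{(\theta_1+\theta_2)^2} = \frac{BT\,\theta_1\theta_2}{\theta_1+\theta_2}.$$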

3) $N$ is fixed, and $K \to \infty$. We have the following asymptotic characterization.

Lemma 7: Let $2 \le Q \le N$ be the number of channels for which $\theta_i > 0$. We have $p_i^* \to 1/Q$, and $L \to 0$ exponentially as $K$ increases, i.e.,

$$L \sim O(e^{-c_1 K}), \quad\text{where } c_1 = \ln\frac{Q}{Q-1}.$$

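Proof: Without loss of generality, we assume that $\theta_i \neq 0$ for $1 \le i \le Q$. For the moment, we assume (we will show that this is true if $K$ is large enough) that if $\theta_i \neq 0$,

$$p_i^* = \left\{1-\left(\frac{\lambda^*}{K\theta_i}\right)^{1/(K-1)}\right\}^{+} = 1-\left(\frac{\lambda^*}{K\theta_i}\right)^{1/(K-1)}.$$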

Together with $\sum_{i=1}^{N}p_i^* = \sum_{i=1}^{Q}p_i^* = 1$, we have

$$(\lambda^*)^{1/(K-1)} = \frac{K^{1/(K-1)}(Q-1)}{\sum_{j=1}^{Q}\theta_j^{-1/(K-1)}}$$

and

$$p_i^* = 1-\frac{(Q-1)\,\theta_i^{-1/(K-1)}}{\sum_{j=1}^{Q}\theta_j^{-1/(K-1)}}, \quad\text{for } 1\le i\le Q.$$

To satisfy the condition $\mathbf{p} \ge 0$, we need to show

$$\frac{(Q-1)\,\theta_i^{-1/(K-1)}}{\sum_{j=1}^{Q}\theta_j^{-1/(K-1)}} \le 1$$

for all $i$ with $\theta_i > 0$.

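With $i^* = \arg\max_{i\in\mathcal{N}}\theta_i$ and $l^* = \arg\min_{1\le l\le Q}\theta_l$, we have for all $i$

$$\frac{(Q-1)\,\theta_i^{-1/(K-1)}}{\sum_{j=1}^{Q}\theta_j^{-1/(K-1)}} \le \frac{(Q-1)\,\theta_{l^*}^{-1/(K-1)}}{Q\,\theta_{i^*}^{-1/(K-1)}}.$$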

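For any $1 < \vartheta \le Q/(Q-1)$, if $K$ is large enough, we have

$$\left(\frac{\theta_{i^*}}{\theta_{l^*}}\right)^{1/(K-1)} \le \vartheta, \quad\text{since}\quad \lim_{K\to\infty}\left(\frac{\theta_{i^*}}{\theta_{l^*}}\right)^{1/(K-1)} = 1.$$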

Hence, for all $1 \le i \le Q$, we have

$$\frac{(Q-1)\,\theta_i^{-1/(K-1)}}{\sum_{j=1}^{Q}\theta_j^{-1/(K-1)}} \le \frac{Q-1}{Q}\,\vartheta \le 1.$$

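Now, a straightforward limit calculation shows that $p_i^* \to 1/Q$ as $K$ increases. Moreover, with $c_1 = \ln\frac{Q}{Q-1}$ we have $(1-p_i^*)\,e^{c_1} = Q\,\theta_i^{-1/(K-1)}/\sum_{j=1}^{Q}\theta_j^{-1/(K-1)}$, so that $(1-p_i^*)^{K}e^{c_1K} \to \big(\prod_{j=1}^{Q}\theta_j\big)^{1/Q}/\theta_i$, and therefore

$$\lim_{K\to\infty}\frac{L}{e^{-c_1K}} = \lim_{K\to\infty}\frac{BT\sum_{i=1}^{Q}\theta_i(1-p_i^*)^{K}}{e^{-c_1K}} = BT\,Q\Big(\prod_{i=1}^{Q}\theta_i\Big)^{1/Q},$$

which is finite and positive.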

The reason for the exponential decrease in the loss is that, as the number of cognitive users increases, the probability that there is no user sensing any particular channel decreases exponentially. If $Q = 1$, there is no loss of performance, since all the users will always sense the channel with non-zero availability probability.
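The exponential decay in Lemma 7 can be observed numerically with the unclipped closed form for $p_i^*$ from the proof. The sketch below assumes all listed channels have $\theta_i > 0$ and that $K$ is large enough for every $p_i^*$ to be nonnegative; `large_k_p` and `normalised_loss` are hypothetical names, and the loss is reported per unit of $BT$.

```python
import math
from typing import List, Sequence

def large_k_p(theta: Sequence[float], K: int) -> List[float]:
    """Closed-form p* over the Q channels with theta_i > 0 (valid once all entries are >= 0)."""
    Q = len(theta)
    a = [t ** (-1.0 / (K - 1)) for t in theta]
    total = sum(a)
    p = [1.0 - (Q - 1) * ai / total for ai in a]
    if min(p) < 0:
        raise ValueError("K is too small for the unclipped closed form")
    return p

def normalised_loss(theta: Sequence[float], K: int) -> float:
    """L * exp(c1 * K) / (B * T) with c1 = ln(Q / (Q - 1))."""
    Q = len(theta)
    c1 = math.log(Q / (Q - 1))
    p = large_k_p(theta, K)
    return sum(t * (1 - pi) ** K for t, pi in zip(theta, p)) * math.exp(c1 * K)

if __name__ == "__main__":
    theta = [0.2, 0.5, 0.8]
    for K in (50, 100, 200, 400):
        print(K, [round(x, 4) for x in large_k_p(theta, K)], normalised_loss(theta, K))
```

As $K$ grows, the printed probabilities approach $1/Q$ and the normalised loss settles to a finite positive value, which is consistent with $L$ decaying at rate $e^{-c_1K}$.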

2) The Game Theoretic Model: The optimality of the distributed protocol proposed in the previous section hinges on the assumption that all the users will follow the symmetric rule. However, it is straightforward to see that if a single cognitive user deviates from the rule specified in Lemma 6, it will be able to transmit more bits. If this selfish perspective propagates through the network, it may lead to a significant reduction in the overall throughput. This observation motivates our next step in which the channel selection problem is modeled as a non-cooperative game, where the cognitive users are the players, the Γks are the strategies and the average throughput of each user is the payoff. The following result derives a sufficient condition for the Nash equilibrium [20] in the asymptotic scenario K → ∞.

Lemma 8: $(\Gamma_1,\cdots,\Gamma_K)$ is a Nash equilibrium if $K$ is large and, at each time slot, there are $\tau_iK$ users sensing channel $i$, where $\tau_i$ satisfies

$$\tau_i = \frac{\theta_i}{\sum_{l=1}^{N}\theta_l}. \tag{11}$$

At this equilibrium, each user has probability $\frac{1}{K}\sum_{i=1}^{N}\theta_i$ of transmitting at each time slot.

Proof: We prove this by backward induction. At the last time slot $T$, if the $\tau_i$'s satisfy equation (11), the probability of user $k$ gaining a channel is

$$p_k = \frac{\theta_i}{\tau_iK} = \frac{\sum_{l=1}^{N}\theta_l}{K}.$$

Now, if user $k$ deviates from this strategy and chooses channel $i'$, the number of users sensing channel $i'$ is $\tau_{i'}K+1$, and the probability of user $k$ gaining the channel is

$$p'_k = \frac{\theta_{i'}}{\tau_{i'}K+1} < \frac{\theta_{i'}}{\tau_{i'}K} = p_k.$$

Hence the strategy that has $\tau_iK$ users sensing channel $i$ at time slot $T$ is a Nash equilibrium. Now, we know the optimal strategy for the last time slot, so we can ignore this time slot. Then time slot $T-1$ becomes the last slot, in which this strategy is optimal. Similarly, we show that this strategy is optimal for all other time slots.
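The deviation argument in the proof of Lemma 8 can be checked for a concrete example. The sketch below computes the shares $\tau_i$ from (11), the access probability at the equilibrium, and the access probability of a single user that moves to another channel; the helper names and the example values are illustrative, and $\tau_iK$ is treated as a real number rather than rounded.

```python
from typing import List, Sequence

def equilibrium_shares(theta: Sequence[float]) -> List[float]:
    """tau_i = theta_i / sum_l theta_l, as in (11)."""
    total = sum(theta)
    return [t / total for t in theta]

def access_probability(theta_i: float, users_on_channel: float) -> float:
    """Probability that one of the users on a channel gains it in a slot."""
    return theta_i / users_on_channel

if __name__ == "__main__":
    theta, K = [0.2, 0.3, 0.5], 1000
    tau = equilibrium_shares(theta)
    stay = [access_probability(t, s * K) for t, s in zip(theta, tau)]
    move = [access_probability(t, s * K + 1) for t, s in zip(theta, tau)]
    print("at equilibrium:", stay)
    print("after joining a channel as an extra user:", move)
```

Every channel offers the same access probability $\sum_l\theta_l/K$ at the equilibrium, and joining any channel as an additional user lowers it, so no unilateral deviation is profitable.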

We note that in the lemma we implicitly assume that $\tau_iK$ is an integer. In practice, this is not always true. However, since $K$ is large, rounding $\tau_iK$ to the nearest integer will have minor effects. The Nash equilibrium is also optimal from a system perspective, in the sense that this strategy maximizes the total throughput of the whole network by fully utilizing the available spectral opportunities when $K$ is large (i.e., on the average, each user will be able to transmit $\frac{BT\sum_i\theta_i}{K}$ bits per block, and the total throughput of the network is $BT\sum_i\theta_i$).

With this equilibrium result, the cognitive users can use the following stochastic sensing strategy to approximately work on the equilibrium point for a large but finite $K$. Let $s_k(j)$ be the channel chosen by user $k$ at time slot $j$. At each time slot, each user independently selects channel $i$ with probability $\tau_i = \theta_i/\sum_{l\in\mathcal{N}}\theta_l$, i.e., $\Pr\{s_k(j) = i\} = \tau_i$. Then at each time slot, the number of users sensing channel $i$ will be $\sum_{k=1}^{K} I\{s_k(j) = i\}$, where the $I\{s_k(j) = i\}$'s are i.i.d. Bernoulli random variables. Hence, the total number of users sensing channel $i$ is a binomial random variable, and the fraction of users sensing channel $i$ converges to $\tau_i$ in probability as $K$ increases, i.e.,

$$\tau'_i = \frac{\sum_{k=1}^{K} I\{s_k(j) = i\}}{K} \to \tau_i$$

in probability. Hence, as $K$ increases, the operating point will converge to the Nash equilibrium in probability.
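The stochastic sensing strategy described above is straightforward to simulate for one time slot. In the sketch below each of the $K$ users draws its channel independently with probabilities $\tau_i$, and the empirical fractions are compared with $\tau$; the function name, the seed and the example values are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple

def empirical_shares(theta: Sequence[float], K: int, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Return (fraction of users per channel, tau) for one slot of the stochastic strategy."""
    rng = random.Random(seed)
    total = sum(theta)
    tau = [t / total for t in theta]
    # Each user independently picks channel i with probability tau_i.
    choices = rng.choices(range(len(theta)), weights=tau, k=K)
    fractions = [choices.count(i) / K for i in range(len(theta))]
    return fractions, tau

if __name__ == "__main__":
    for K in (100, 1000, 20000):
        fractions, tau = empirical_shares([0.2, 0.3, 0.5], K)
        print(K, fractions, tau)
```

The deviation between the empirical fractions and $\tau$ shrinks on the order of $1/\sqrt{K}$, in line with the convergence in probability stated in the text.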