Stochastic Learning Procedure for Price Competition

5.4 Price Competition among Service Providers

5.4.2 Stochastic Learning Procedure for Price Competition

In the upper-level game, learning-based algorithms help the SPs gradually adjust their pricing strategies based on the service selections of the SUs at the equilibrium of the lower-level game. A seller’s pricing strategy is defined over a probability space of its candidate price levels.

Algorithm 5.2 Self-organized Pricing (SoP)

1: Initially, set k = 0. Set the pricing probability vector and utility estimation as p_m,s_m(0) = 1/|A_m|, û_m,s_m(−1) = 0, ∀m ∈ M, s_m ∈ A_m.

2: At the beginning of the kth iteration, each seller selects an action a_m(k) according to the current pricing strategy p_m(k).

3: When the service selection of SUs converges, each seller m receives the utility u_m(k) specified by (5.2) depending on the user load.

4: All SPs update their utility estimation and pricing probability vector in iteration k according to the update rules (the probability update being the multiplicative-weight rule (5.18)), where η and ϵ are the learning rates for utility estimation and pricing probability, respectively.

As in the lower-level game, two main issues are considered when designing the learning algorithm for the upper-level game, namely, the learning rule and the convergence property. First, since the total number of SUs is unknown to the SPs, it is difficult to obtain an upper bound on the revenue and normalize it. Therefore, a probability update rule different from (5.10) is needed. In this work, we consider the multiplicative-weight rule for the mixed-strategy update. The learning procedure for the self-organized pricing (SoP) competition is described in Algorithm 5.2. The multiplicative-weight update rule in (5.18) belongs to the family of combined fully distributed payoff and strategy reinforcement learning (CODIPAS-RL) [7], in which learning applies to both the expected payoffs and the strategies.
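The steps of Algorithm 5.2 can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the exact forms of the two update rules are assumed here (a running-average estimate for the played action, and a probability update proportional to p(1 + ϵ)^û followed by normalization, in the spirit of multiplicative-weight CODIPAS-RL schemes), and `toy_revenue` is a hypothetical stand-in for the utility (5.2) obtained after the lower-level game converges.

```python
import random

PRICES = [1.0, 1.5, 2.0, 2.5]   # candidate price levels A_m (Table 5.3)
ETA, EPS = 0.1, 0.05            # learning rates (eta, epsilon) from Table 5.3


def toy_revenue(prices, m, n_users=6):
    """Hypothetical stand-in for (5.2): user load falls with the squared price."""
    weights = [1.0 / (q * q) for q in prices]
    load = n_users * weights[m] / sum(weights)
    return prices[m] * load


def sop(num_sps=2, iterations=500, seed=0):
    rng = random.Random(seed)
    n = len(PRICES)
    # Step 1: uniform pricing probabilities and zero utility estimates.
    p = [[1.0 / n] * n for _ in range(num_sps)]
    u_hat = [[0.0] * n for _ in range(num_sps)]
    for _ in range(iterations):
        # Step 2: each SP samples a price index from its mixed strategy.
        acts = [rng.choices(range(n), weights=p[m])[0] for m in range(num_sps)]
        chosen = [PRICES[a] for a in acts]
        for m in range(num_sps):
            # Step 3: observe the realized utility of the played price.
            u = toy_revenue(chosen, m)
            # Step 4a (assumed form): move the estimate of the played action.
            a = acts[m]
            u_hat[m][a] += ETA * (u - u_hat[m][a])
            # Step 4b (assumed form): multiplicative weights, then normalize.
            w = [p[m][s] * (1.0 + EPS) ** u_hat[m][s] for s in range(n)]
            total = sum(w)
            p[m] = [x / total for x in w]
    return p, u_hat


p, u_hat = sop()
```

Because the weights are normalized in every iteration, no upper bound on the revenue is needed, which is the motivation given above for departing from (5.10).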

The second issue is the convergence behavior when the SL algorithm is applied to the price competition game. Unlike the lower-level game, the upper-level competition with the utility in (5.14) is not a potential game. Thus, we are unable to provide a theoretical proof that guarantees convergence toward an NE for the price competition game. However, the algorithm still has some useful properties when applied to general strategic games.

We first investigate the approximation of a (continuous-time) ordinary differential equation (ODE) by the discrete-time mixed-strategy update rule, and then the theoretical perspectives on the convergence behavior. The notations p_m, P_−m, P, and ψ_m(P) are defined as in Section 5.3, and e_s_m is a unit probability vector (of appropriate dimension) with the s_m-th component being one and all other components being zero.

Proposition 5.4.2. With sufficiently small learning rates η and ϵ:

1. The estimated utility converges to

\[ \hat{u}_{m,s_m} \to \psi_m(e_{s_m}, P_{-m}). \tag{5.19} \]

2. Asymptotically, the probability matrix sequence {P(k)} can be approximated by the trajectory of the following ODE:

\[ \frac{d p_{m,s_m}(t)}{dt} = p_{m,s_m}(t) \left[ \psi_m(e_{s_m}, P_{-m}) - \psi_m(P) \right], \tag{5.20} \]

where p_m,s_m(t) is the continuous-time version of p_m,s_m(k), and the boundary condition is given by P(0) = P₀, where P₀ is the initial mixed-strategy matrix.

Proof: See [7, Section 4.3].

Notice that ψ_m(e_s_m, P_−m) is the utility of player m if it employs pure strategy s_m while every other player m′ ∈ M, m′ ≠ m, employs its mixed strategy p_m′; its value is learned by player m as the estimated utility û_m,s_m, as shown in (5.19). On the other hand, the ODE for the mixed strategy in (5.20) is the replicator equation [14], in which the probability of taking a strategy increases if the current estimated utility of this strategy is larger than the average utility over all strategies, and decreases otherwise. In the best response dynamics [12], a player switches in the next iteration to the best action given the other players’ actions (i.e., the best response). With the replicator dynamics, by contrast, a player selects an action according to a probability distribution over the strategy set and adjusts the weight of each possible action in each iteration based on the utility estimation.
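The behavior of the replicator equation (5.20) can be illustrated with a short numerical sketch. The code below integrates (5.20) by forward Euler for a two-player game; the payoff table `PAY`, the step size, and the function names are hypothetical choices for illustration and are not taken from the pricing game itself. Payoff tables are indexed as [own strategy][opponent strategy].

```python
def expected_utils(payoff, opp):
    """psi_m(e_s, P_-m) for each pure strategy s against the opponent mix."""
    return [sum(a * q for a, q in zip(row, opp)) for row in payoff]


def replicator_rhs(p, payoff, opp):
    """Right-hand side of (5.20) for one player."""
    psi_pure = expected_utils(payoff, opp)
    psi_avg = sum(ps * u for ps, u in zip(p, psi_pure))   # psi_m(P)
    return [ps * (u - psi_avg) for ps, u in zip(p, psi_pure)]


def integrate(p1, p2, pay1, pay2, dt=0.01, steps=5000):
    """Forward-Euler integration of (5.20) for a two-player game."""
    for _ in range(steps):
        d1 = replicator_rhs(p1, pay1, p2)
        d2 = replicator_rhs(p2, pay2, p1)
        p1 = [x + dt * d for x, d in zip(p1, d1)]
        p2 = [x + dt * d for x, d in zip(p2, d2)]
    return p1, p2


# Hypothetical symmetric game in which the second strategy is dominant.
PAY = [[3.0, 0.0], [5.0, 1.0]]
p1, p2 = integrate([0.5, 0.5], [0.5, 0.5], PAY, PAY)
```

Starting from the uniform mixed strategy, the probability mass of both players drifts toward the second strategy, whose utility exceeds the average in every state. The weights shift gradually rather than jumping to the best response in one step.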

Proposition 5.4.3. The proposed learning algorithm has the following properties:

1. All Nash equilibria are stationary points of (5.20);

2. All stable stationary points of (5.20) are Nash equilibria.

Proof: Proposition 5.4.3 is an instance of the folk theorem of evolutionary game theory [14, Chapter 3], and these properties follow directly from the replicator equation in (5.20). See also [7, Section 4.3].

For an intuitive explanation, observe that for a mixed-strategy NE profile P^∗, all surviving pure strategies (i.e., s_m with p^∗_m,s_m > 0) of player m perform equally well when the other players follow the mixed strategy P^∗_−m. That is, the condition

\[ \psi_m(e_{s_m}, P^*_{-m}) = \psi_m(P^*), \quad \forall m \in M,\ s_m \in A_m \text{ with } p^*_{m,s_m} > 0 \tag{5.21} \]

must hold. Therefore, any NE drives the right-hand side of (5.20) to zero (the bracketed term vanishes for every surviving strategy, and the factor p_m,s_m is zero for every other strategy) and thus constitutes a stationary point of (5.20). Conversely, if the proposed algorithm converges to a stable stationary point of (5.20), the limiting point must be a (possibly mixed-strategy) NE. Although there is no theoretical convergence proof as in the lower-level game (since G2 is not an EPG), convergence toward an NE in the upper-level game is still observed in numerical simulations.
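Condition (5.21) and the resulting stationarity can be checked numerically on a small example. The sketch below uses matching pennies, a hypothetical two-player game chosen only because its unique NE is the fully mixed profile (1/2, 1/2); it is not the pricing game of this chapter, and the function name is illustrative.

```python
def replicator_terms(p, payoff, opp):
    """Return (psi_m(e_s, P_-m) for all s, psi_m(P), right-hand side of (5.20))."""
    psi_pure = [sum(a * q for a, q in zip(row, opp)) for row in payoff]
    psi_avg = sum(ps * u for ps, u in zip(p, psi_pure))
    rhs = [ps * (u - psi_avg) for ps, u in zip(p, psi_pure)]
    return psi_pure, psi_avg, rhs


# Matching pennies for player 1, indexed as [own strategy][opponent strategy].
PAY1 = [[1.0, -1.0], [-1.0, 1.0]]

# At the mixed NE both pure strategies earn the average utility, as in (5.21),
# so the right-hand side of (5.20) vanishes.
p_star = [0.5, 0.5]
psi_pure_ne, psi_avg_ne, rhs_ne = replicator_terms(p_star, PAY1, p_star)

# If the opponent deviates from the NE mix, (5.21) fails and the
# right-hand side of (5.20) is no longer zero for player 1.
psi_pure_off, psi_avg_off, rhs_off = replicator_terms(p_star, PAY1, [0.7, 0.3])
```

In the first call every entry of `psi_pure_ne` equals `psi_avg_ne`, which is exactly the equal-performance condition (5.21). In the second call the first strategy earns more than the average, so its probability would grow under (5.20).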

5.5 Numerical Results

In order to evaluate the performance of the proposed scheme and algorithms, we conduct a series of simulations. The distribution of the number of residual channels (i.e., spectrum opportunities) offered by SP_m is described by a vector x_m = [x_m,0, ..., x_m,c, ..., x_m,K_m], where x_m,c denotes the probability that SP_m possesses c residual channels. The default values of simulation parameters are given in Table 5.3, and these values are adopted in the simulations unless otherwise specified.

Table 5.3: Simulation Parameters

Parameter                          Value
Number of SPs                      M = 2
Max. number of channels            K_m = 3
Ch. availability of SP_1           x_1 = [0, 0.1, 0.3, 0.6]
Ch. availability of SP_2           x_2 = [0, 0.4, 0.3, 0.3]
Pricing strategies                 A_m = [1, 1.5, 2, 2.5], ∀m
Learning rates of SPs              (η, ϵ) = (0.1, 0.05)
Number of SUs                      N = 6
Learning rate of SUs               b = 0.3
Waiting time for obtaining NE      T_conv = 400
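As a small illustration of how the availability vectors in Table 5.3 can be used, the following sketch computes the expected number of residual channels of each SP and draws one random realization. The helper names are hypothetical, and the sampling routine is only one plausible way of realizing x_m in a simulation.

```python
import random

# Channel availability vectors x_m from Table 5.3; index c = number of channels.
X = {
    1: [0.0, 0.1, 0.3, 0.6],
    2: [0.0, 0.4, 0.3, 0.3],
}


def expected_channels(x):
    """Mean number of residual channels, sum over c of c * x_{m,c}."""
    return sum(c * prob for c, prob in enumerate(x))


def sample_channels(x, rng):
    """Draw the number of residual channels c with probability x_{m,c}."""
    return rng.choices(range(len(x)), weights=x)[0]


rng = random.Random(0)
means = {m: expected_channels(x) for m, x in X.items()}
draws = {m: sample_channels(x, rng) for m, x in X.items()}
```

With the default parameters, SP_1 offers about 2.5 residual channels on average and SP_2 about 1.9, so SP_1 has the larger expected capacity.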

We first study the lower-level game under a given price vector (q₁, q₂) = (1, 1.5) to observe the convergence behavior and the performance of the proposed algorithm. Then, the upper-level game is included to observe the price competition.
