5.4 Price Competition among Service Providers
5.4.2 Stochastic Learning Procedure for Price Competition
In the upper-level game, learning-based algorithms help the SPs gradually adjust their pricing strategies based on the service selections of the SUs at the equilibrium of the lower-level game. A seller's pricing strategy is defined over a probability space of its candidate price levels.

Algorithm 5.2 Self-organized Pricing (SoP)
1: Initialization: set $k = 0$. Set the pricing probability vector and the utility estimation as $p_{m,s_m}(0) = 1/|\mathcal{A}_m|$ and $\hat{u}_{m,s_m}(-1) = 0$, $\forall m \in \mathcal{M}$, $s_m \in \mathcal{A}_m$.
2: At the beginning of the $k$th iteration, each seller $m$ selects an action $a_m(k)$ according to its current pricing strategy $p_m(k)$.
3: When the service selection of the SUs converges, each seller $m$ receives the utility $u_m(k)$ specified by (5.2), which depends on the user load.
4: All SPs update their utility estimation and pricing probability vector in iteration $k$ according to the update rules (the probability update being the multiplicative-weight rule (5.18)), where $\eta$ and $\epsilon$ are the learning rates for utility estimation and pricing probability, respectively.
As in the lower-level game, two main issues are considered when designing the learning algorithm for the upper-level game, namely, the learning rule and the convergence property. First, since the total number of SUs is unknown to the SPs, it is difficult to obtain an upper bound on the revenue and to normalize it. Therefore, a probability update rule different from (5.10) is needed. In this work, we consider the multiplicative-weight rule for the mixed-strategy update. The learning procedure in the self-organized price (SoP) competition is described in Algorithm 5.2. The multiplicative-weight update rule in (5.18) belongs to the family of combined fully distributed payoff and strategy reinforcement learning (CODIPAS-RL) [7], in which learning applies to both the expected payoff and the strategies.
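The loop of Algorithm 5.2 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the two update rules inside `sop_iteration` are a common CODIPAS-style form (an estimate update for the action just played, followed by a multiplicative reweighting by `(1 + eps) ** estimate`) and are not taken from this section, and `toy_utility` is a hypothetical stand-in for the utility in (5.2), which in the actual scheme is obtained from the converged lower-level game.

```python
import numpy as np

PRICES = np.array([1.0, 1.5, 2.0, 2.5])   # candidate price levels, shared by both sellers

def toy_utility(actions):
    # HYPOTHETICAL stand-in for (5.2): six users split in proportion to 1/price^2,
    # and each seller earns price times its (fractional) share of users.
    q = PRICES[actions]
    share = (1.0 / q**2) / np.sum(1.0 / q**2)
    return q * 6 * share

def sop_iteration(p, u_hat, utility_fn, eta, eps, rng):
    """One iteration of the sketch; p[m] and u_hat[m] are 1-D arrays per seller m."""
    # Step 2: every seller samples a price index from its current mixed strategy.
    actions = [rng.choice(len(pm), p=pm) for pm in p]
    # Step 3: utilities observed after the (here simulated) SU selection settles.
    u = utility_fn(actions)
    new_p, new_u_hat = [], []
    for m, (pm, um_hat) in enumerate(zip(p, u_hat)):
        est = um_hat.copy()
        # ASSUMED estimation rule: only the estimate of the played action moves.
        est[actions[m]] += eta * (u[m] - est[actions[m]])
        # ASSUMED multiplicative-weight rule: reweight, then renormalize.
        w = pm * (1.0 + eps) ** est
        new_p.append(w / w.sum())
        new_u_hat.append(est)
    return new_p, new_u_hat, actions

rng = np.random.default_rng(0)            # arbitrary seed
p = [np.full(4, 0.25) for _ in range(2)]  # Step 1: uniform initial strategies
u_hat = [np.zeros(4) for _ in range(2)]   # Step 1: zero initial estimates
for k in range(500):
    p, u_hat, _ = sop_iteration(p, u_hat, toy_utility, eta=0.1, eps=0.05, rng=rng)
```

Because the weights are renormalized in every iteration, no upper bound on the revenue is needed to keep each `p[m]` a valid probability vector, which is the motivation given above for moving away from (5.10).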
The second issue is the convergence behavior when the SL algorithm is applied in the price competition game. Unlike the lower-level game, the upper-level competition with the utility in (5.14) is not a potential game. Thus, we are unable to provide a theoretical proof that guarantees convergence toward an NE for the price competition game. However, the algorithm still has some useful properties when applied to general strategic games.
We first investigate how the discrete-time mixed-strategy update rule can be approximated by a (continuous-time) ordinary differential equation (ODE), and then the theoretical aspects of its convergence behavior. The notations $p_m$, $P_{-m}$, $P$, and $\psi_m(P)$ are defined as in Section 5.3, and $e_{s_m}$ is a unit probability vector (of appropriate dimension) whose $s_m$-th component is one and whose other components are all zero.
Proposition 5.4.2. With sufficiently small learning rates η and ϵ:
1. The estimated utility converges as follows:
$$\hat{u}_{m,s_m} \to \psi_m(e_{s_m}, P_{-m}). \tag{5.19}$$
2. Asymptotically, the probability matrix sequence $\{P(k)\}$ can be approximated by the trajectory of the following ODE:
$$\frac{dp_{m,s_m}(t)}{dt} = p_{m,s_m}(t)\left[\psi_m(e_{s_m}, P_{-m}) - \psi_m(P)\right], \tag{5.20}$$
where $p_{m,s_m}(t)$ is the continuous-time version of $p_{m,s_m}(k)$, and the initial condition is given by $P(0) = P_0$, where $P_0$ is the initial mixed-strategy matrix.
Proof: See [7, Section 4.3].
Notice that $\psi_m(e_{s_m}, P_{-m})$ is the utility of player $m$ when it employs the pure strategy $s_m$ while every other player $m' \in \mathcal{M}$, $m' \neq m$, employs its mixed strategy $p_{m'}$. Player $m$ learns this value as the estimated utility $\hat{u}_{m,s_m}$, as shown in (5.19). The ODE for the mixed strategy in (5.20), on the other hand, is the replicator equation [14], in which the probability of taking a strategy increases if the current estimated utility of that strategy is larger than the average utility over all strategies, and decreases otherwise. In the best response dynamics [12], a player switches in the next iteration to the best action given the other players' actions (i.e., the best response). With the replicator dynamics, by contrast, a player selects an action according to a probability distribution over the strategy set and adjusts the weight of each possible action in each iteration based on the utility estimation.
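The behavior of the replicator equation (5.20) can be illustrated with a forward-Euler integration. The sketch below assumes a hypothetical two-player game with payoff matrices `A` and `B` (not the pricing game of this chapter) in which the second strategy strictly dominates the first; the step size `dt` and the number of steps are arbitrary choices.

```python
import numpy as np

def replicator_rhs(p1, p2, A, B):
    """Right-hand side of (5.20) for a two-player game.

    A[i, j] and B[i, j] are HYPOTHETICAL payoffs of players 1 and 2 when
    player 1 plays pure strategy i and player 2 plays pure strategy j.
    """
    pure1 = A @ p2        # psi_1(e_i, P_{-1}) for each pure strategy i
    pure2 = B.T @ p1      # psi_2(e_j, P_{-2}) for each pure strategy j
    avg1 = p1 @ pure1     # psi_1(P): average utility of player 1
    avg2 = p2 @ pure2     # psi_2(P): average utility of player 2
    return p1 * (pure1 - avg1), p2 * (pure2 - avg2)

def integrate(p1, p2, A, B, dt=0.01, steps=5000):
    # Forward-Euler approximation of the ODE trajectory.
    for _ in range(steps):
        d1, d2 = replicator_rhs(p1, p2, A, B)
        p1, p2 = p1 + dt * d1, p2 + dt * d2
    return p1, p2

# Symmetric toy game in which strategy index 1 strictly dominates index 0.
A = np.array([[3.0, 0.0],
              [5.0, 1.0]])
B = A.T

p1_end, p2_end = integrate(np.array([0.5, 0.5]), np.array([0.5, 0.5]), A, B)
```

In this toy game the dominant strategy always earns more than the average, so its probability grows along the trajectory while the probability of the dominated strategy shrinks, which is the mechanism described in the paragraph above.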
Proposition 5.4.3. The proposed learning algorithm has the following properties:
1. All Nash equilibria are stationary points of (5.20);
2. All stable stationary points of (5.20) are Nash equilibria.
Proof: Proposition 5.4.3 is an instance of the folk theorem of evolutionary game theory [14, Chapter 3], and these properties follow directly from the replicator equation in (5.20). See also [7, Section 4.3].
For an intuitive explanation, observe that for a mixed-strategy NE profile $P^*$, all surviving pure strategies of player $m$ (i.e., $s_m$ with $p^*_{m,s_m} > 0$) perform equally well when the other players follow the mixed strategy $P^*_{-m}$. That is, the condition
$$\psi_m(e_{s_m}, P^*_{-m}) = \psi_m(P^*), \quad \forall m \in \mathcal{M},\ s_m \in \mathcal{A}_m \text{ with } p^*_{m,s_m} > 0 \tag{5.21}$$
must hold. Therefore, any NE drives the right-hand side of (5.20) to zero and thus constitutes a stationary point of (5.20). Conversely, by Proposition 5.4.3, if the proposed algorithm converges to a stationary point of (5.20), the limiting point must be a (possibly mixed-strategy) NE. Although there is no theoretical proof as in the lower-level game (since G2 is not an EPG), convergence toward an NE in the upper-level game is still observed in numerical simulations.
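Condition (5.21) can be checked numerically on a small example. The sketch below assumes a hypothetical 2x2 coordination game (payoff matrices `A` and `B`, not the pricing game G2) whose mixed-strategy NE can be derived by hand from the indifference conditions: player 1 plays its first strategy with probability 2/3 and player 2 plays its first strategy with probability 1/3.

```python
import numpy as np

# HYPOTHETICAL payoffs: A[i, j] for player 1, B[i, j] for player 2.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
B = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# Mixed-strategy NE of this toy game, obtained from the indifference conditions.
p1_star = np.array([2.0 / 3.0, 1.0 / 3.0])
p2_star = np.array([1.0 / 3.0, 2.0 / 3.0])

# psi_m(e_s, P*_{-m}): utility of each pure strategy against the opponent's mix.
pure1 = A @ p2_star
pure2 = B.T @ p1_star

# psi_m(P*): average utility under the full mixed profile.
avg1 = p1_star @ pure1
avg2 = p2_star @ pure2

# Right-hand side of (5.20) evaluated at P*.
rhs1 = p1_star * (pure1 - avg1)
rhs2 = p2_star * (pure2 - avg2)
```

Both pure strategies of each player earn the same utility as the mixed profile, so `rhs1` and `rhs2` vanish up to floating-point error and the profile is a stationary point of (5.20).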
5.5 Numerical Results
To evaluate the performance of the proposed scheme and algorithms, we conduct a series of simulations. The distribution of the number of residual channels (i.e., spectrum opportunities) offered by SP $m$ is described by a vector $x_m = [x_{m,0}, \ldots, x_{m,c}, \ldots, x_{m,K_m}]$, where $x_{m,c}$ denotes the probability that SP $m$ possesses $c$ residual channels. The default values of the simulation parameters are given in Table 5.3, and these values are adopted in the simulations unless otherwise specified.

Table 5.3: Simulation Parameters
Parameter                        Value
Number of SPs                    M = 2
Max. number of channels          K_m = 3
Ch. availability of SP 1         x_1 = [0, 0.1, 0.3, 0.6]
Ch. availability of SP 2         x_2 = [0, 0.4, 0.3, 0.3]
Pricing strategies               A_m = [1, 1.5, 2, 2.5], ∀m
Learning rate                    (η, ϵ) = (0.1, 0.05)
Number of SUs                    N = 6
Learning rate of SUs             b = 0.3
Waiting time for obtaining NE    T_conv = 400
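As a small illustration of the channel-availability vectors, the following sketch encodes $x_1$ and $x_2$ from Table 5.3, computes the expected number of residual channels per SP, and draws random samples. The helper names, the seed, and the sample size are hypothetical choices made only for this example.

```python
import numpy as np

# Values copied from Table 5.3; x[m][c] is the probability that SP m
# possesses c residual channels, for c = 0, ..., K_m.
x = {
    1: np.array([0.0, 0.1, 0.3, 0.6]),
    2: np.array([0.0, 0.4, 0.3, 0.3]),
}

def expected_channels(xm):
    # Mean of the residual-channel distribution: sum over c of c * x_{m,c}.
    return float(np.arange(len(xm)) @ xm)

def sample_channels(xm, size, rng):
    # Draw `size` independent realizations of the number of residual channels.
    return rng.choice(len(xm), size=size, p=xm)

rng = np.random.default_rng(0)             # arbitrary seed
mean_sp1 = expected_channels(x[1])         # 0.1*1 + 0.3*2 + 0.6*3
mean_sp2 = expected_channels(x[2])         # 0.4*1 + 0.3*2 + 0.3*3
draws_sp1 = sample_channels(x[1], 10000, rng)
```

With the default parameters, SP 1 offers 2.5 residual channels on average and SP 2 offers 1.9, so SP 1 has more spectrum opportunities to sell.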
We first study the lower-level game under a given price vector $(q_1, q_2) = (1, 1.5)$ to observe the convergence behavior and the performance of the proposed algorithm. We then include the upper-level game to observe the price competition.