
Dynamic Pricing With Limited Inventory


Academic year: 2022


Department of Electrical Engineering

College of Electrical Engineering and Computer Science

National Taiwan University Master Thesis

Dynamic Pricing With Limited Inventory

Jia-Cheng Huang (黃加成)

Advisor: Ho-Lin Chen (陳和麟), Ph.D.

July 2018

Certification by the Oral Examination Committee

Dynamic Pricing With Limited Inventory

(R05921041)

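Acknowledgements

In completing this thesis, I would first like to thank my advisor, Professor Ho-Lin Chen (陳和麟). He always gives his students a great deal of room and freedom to dig into the areas and topics that interest them, and he offers guidance and help at the right moments. Discussions with him were always a pleasure. His grasp of problems and the angles from which he approached them never failed to impress me, and they gave me a deeper appreciation of theoretical research. He also gave me many suggestions on the structure and revision of the thesis, which helped me present my research results as a complete thesis more efficiently.

Next, I would especially like to thank Professor Chi-Jen Lu (呂及人) of Academia Sinica. Professor Lu is my second advisor, and many parts of this thesis were inspired by his guidance. His mathematical foundation is very deep, and in every discussion he could quickly point out the blind spots in my thinking or derivations. He also gave me a great deal of freedom in research: from coming up with a topic to choosing research methods, he always gave me enough time to explore. On technical matters, he always gave me very practical hints that saved me many detours. Being advised by two such good teachers at the same time during my master's studies was truly fortunate.

I would also like to thank the two members of my oral examination committee, Professor I-Hsiang Wang (王奕翔) and Professor Tian-Li Yu (于天立). They took time out of their busy schedules to serve on my committee, and since the defense happened to fall on a typhoon day, I am deeply grateful that they still came through the rain. They also gave me very practical suggestions on my research and on future directions. In fact, I had taken courses from both professors as an undergraduate, and in my first year of the master's program I joined a reading group organized by Professor Wang. Their courses are always highly praised in the department, and because they put extra emphasis on theoretical insight and mathematical derivation, I always learned a great deal. As someone who is also interested in mathematics, it was a great honor to have them on my committee.

Finally, I would like to thank my family, my friends, and my lab mates, who supported and encouraged me when I hit a wall in my research and felt anxious, so that I could devote myself to research without distraction. Without any one of them, the road to completing this thesis would certainly have been much harder. This thesis is a snapshot of my two years as a master's student and the first milestone of my research career. I hope that I can apply what I have learned and do more and better research in the future.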

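Abstract (Chinese)

In this thesis, we study the dynamic pricing problem [18] and present two interesting scenarios. We mainly design algorithms for the online learning setting and consider the case of limited inventory. The seller's goal is to sell a limited stock of goods within a limited time so as to maximize cumulative revenue. To describe how the two variables we care about affect this problem, we construct two theoretical models. In the first model, the seller has some knowledge of each buyer: before setting a price for a buyer, the seller first obtains some information about that buyer. Such scenarios are common in settings like online shopping. We make no distributional assumption on buyer types (the information the seller sees), which, together with the limited-inventory assumption, makes it hard to evaluate online dynamic pricing algorithms. We propose a criterion for evaluating online dynamic pricing algorithms, design an algorithm for this criterion, and provide a theoretical guarantee on its expected revenue. In the second model, we assume that each buyer may stay in the market for a period of time and that, even when buyers see an acceptable price, they may strategically wait for a lower one. For this model, we propose a new selling mechanism and design an online dynamic pricing algorithm based on it. Likewise, we provide a theoretical guarantee on the expected revenue of this algorithm.

Keywords: dynamic pricing, revenue management, online learning, multi-armed bandit, contextual bandit, game theory, mechanism design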

Abstract

This thesis introduces scenarios for the well-known dynamic pricing problem [18] and presents corresponding learning algorithms. Different from previous works, we mainly focus on the scenario in which the seller is initially given a finite inventory and wants to sell it out within a finite period of time.

We build two theoretical models to describe this problem under different concerns. In the first model, the seller observes a context vector for each consumer before deciding the posted price for her, and the context of each consumer is given adversarially. In general, the objective of the seller is to maximize revenue; however, this is not trivial in the adversarial setting with limited inventory. We introduce a criterion to evaluate the performance of a learning algorithm and then design an algorithm with a performance guarantee with respect to this criterion. In the second model, consumers may stay in the market for a period of time, and they may wait for a lower payment in order to maximize their utility. For this model, we introduce a new selling mechanism with good properties and design a learning algorithm with a performance guarantee based on the new mechanism.

Keywords: Dynamic Pricing, Revenue Management, Online Learning, Multi-Armed Bandit, Contextual Bandit, Game Theory, Mechanism Design

Contents

Certification by the Oral Examination Committee

Acknowledgements

Abstract (Chinese)

Abstract

1 Introduction
1.1 Contextual Dynamic Pricing Problem
1.1.1 Contextual Bandit Problem
1.2 Dynamic Pricing with Strategic Buyers
1.3 Our Results
1.3.1 Results for Contextual Bandit Problem
1.3.2 Results for the Contextual Dynamic Pricing Problem
1.3.3 Results for Dynamic Pricing for Strategic Buyers
1.4 Organization of the Thesis

2 Problem Formulation
2.1 Preliminary
2.2 Contextual Bandit
2.2.1 K-Armed Contextual Bandit
2.2.2 Batch K-Armed Contextual Bandit
2.3 Contextual Dynamic Pricing
2.3.1 CDP With Limited Inventory
2.4 Dynamic Pricing With Strategic Buyers
2.4.1 Modified Robust Dynamic Pricing

3 Contextual bandit
3.1 LinUCB Algorithm
3.1.1 Proof of Lemma 1
3.2 Generalized Linear Bandit
3.2.1 Online Newton Method
3.2.2 For K-Armed Contextual Bandit
3.2.3 For Batch K-Armed Contextual Bandit

4 Contextual dynamic pricing
4.1 Regret Bound For Contextual Dynamic Pricing
4.2 Regret Bound For CDP With Limited Inventory
4.2.1 Proof of Theorem 6

5 Dynamic Pricing With Strategic Buyers
5.1 Dynamic Installment Mechanism
5.1.1 Properties for Dynamic Installment Mechanism
5.2 Useful Results in Previous work
5.2.1 Simple Robust Pricing Policy
5.2.2 Direct Dynamic Mechanism
5.3 Two-Stage Learning Algorithm
5.3.1 Performance guarantee of Algorithm 10

6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work

Bibliography

Chapter 1 Introduction

Revenue management [24, 10] is a well-known problem in game theory and machine learning because of its importance to industry. Dynamic pricing [18, 15] is a canonical setting within it. A dynamic pricing problem can be described as follows: a seller sells a single product over a finite time horizon, say a finite number of time steps, and seeks to maximize total revenue by properly choosing the selling price at each time step. Many aspects have been studied, such as different assumptions on consumers, the information available to the seller, constraints on inventory, and the parameters that the seller needs to learn over the time horizon.

Many different scenarios of the dynamic pricing problem have been discussed. For example, the classic work [18] considers a non-contextual scenario in which consumers enter the market over a continuous time horizon and do not stay in the market. Many other scenarios and works are mentioned and discussed in [15]; however, most of them are very different from the ones we are concerned with. A few works that are closely related to this thesis are discussed in the following paragraphs.

First, [6] provides an algorithm for non-contextual dynamic pricing with limited supply and proves that its revenue is comparable to that of any fixed-price policy when the private valuations of consumers are drawn i.i.d. from any distribution. They also consider the case in which the inventory is extremely small. Second, [29] studies the contextual dynamic pricing problem and provides an algorithm with adaptive partitioning to solve it. In that work, the action space is a metric space. Third, [8] considers contextual dynamic pricing with limited inventory. Their work is in the stochastic setting, in which the context, the rewards of the actions, and the resource consumption are sampled i.i.d. in each round from a fixed joint distribution.

In this work, we study the following problems.

First, we study the contextual dynamic pricing problem on top of [3, 13], which is closely related to the well-known contextual bandit problem. In the contextual dynamic pricing problem, each price can be viewed as an arm (action) in a contextual bandit problem. However, a contextual dynamic pricing problem has several properties that a contextual bandit problem does not have; for example, the selling probability increases as the price decreases. To the best of our knowledge, there is no previous work on contextual dynamic pricing with limited inventory in the adversarial setting.

For the contextual bandit problem, there are several works closely related to ours, in addition to [3, 13] mentioned before. First, [2] provided different algorithms for the case of linear rewards based on the notion of Thompson Sampling [2]. However, their regret bounds appear worse than ours. Second, [14, 1, 31] studied a different setting, in which at each step no context vectors are given to the learner to restrict her action, and she has the freedom to choose any action she likes. In our setting, on the other hand, the learner is given some context vectors at each step, over which she has no control, and the reward she can possibly receive is determined by those vectors. Thus, their results cannot be applied here directly, although we can borrow some of their techniques. Finally, some variants of the problem have also been introduced, such as adding knapsack constraints on the actions [7] or optimizing submodular utility functions [30], but their settings are again different from ours.

Next, we study dynamic pricing with strategic buyers. For both problems, we are mainly interested in the scenario in which the inventory of the product is finite and non-replenishable; however, we also study the unlimited versions to obtain some simpler results and implications. We introduce these problems in the next two sections.

There are also works on dynamic pricing with strategic consumers who stay in the market for a period of time, such as [12] and . Although their settings are very different from ours, their works contain some inspiring viewpoints and methods.

1.1 Contextual Dynamic Pricing Problem

The contextual dynamic pricing problem models scenarios such as selling products on the internet. The price may be decided after observing some information about the user visiting the website, so the owner of the website can dynamically decide which price to post in order to maximize revenue.

In this problem, a consumer enters the market at each step. Each consumer may only decide whether to purchase within the round in which she enters the market, and she does not stay for the upcoming time steps. The seller picks a price at each step. In an ordinary dynamic pricing problem, the only information the seller learns is whether the consumer purchases in each round. In the "contextual" dynamic pricing problem, the seller observes some additional context information about each consumer before setting a price for her. This may help the seller make better decisions.

Here, we introduce a simple extension: the batched setting. In real-world situations, it may be difficult to show every consumer a different price. In the batched setting, one price is decided for a batch of consumers after observing all of their context information. Normally, a group of consumers of the same type may arrive within a short period of time. Hence, viewing this short time interval as a single time step, it is natural to customize the posted price for a batch of consumers.

More formally, in the batched contextual dynamic pricing problem, there are T rounds in total. At each round, n consumers enter the market, and the seller observes n context vectors, each of which is d-dimensional. The context vectors are arbitrary and may be given adversarially. The original problem without batching is the special case n = 1. The seller must choose a price out of K fixed candidates and receives feedback on whether each consumer makes a purchase. In the case of limited inventory, there are only M units of the product to sell.
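As a minimal sketch of the interaction protocol just described, the loop below runs T rounds with n consumers per round, K candidate prices, and an inventory of M units. The function and parameter names (`simulate_batched_pricing`, `choose_price`, `purchase_prob`) are hypothetical, the contexts are drawn at random only for illustration (the thesis allows them to be adversarial), and the purchase model is left to the caller.

```python
import math
import random


def simulate_batched_pricing(T, n, d, prices, M, choose_price, purchase_prob, seed=0):
    """Run T rounds of the batched protocol and return (revenue, remaining inventory)."""
    rng = random.Random(seed)
    inventory, revenue = M, 0.0
    for t in range(T):
        # n context vectors of dimension d with norm at most 1
        # (random here for illustration; they may be adversarial in the thesis).
        contexts = [[rng.uniform(-1, 1) / math.sqrt(d) for _ in range(d)]
                    for _ in range(n)]
        k = choose_price(t, contexts)  # index into the K candidate prices
        feedback = []
        for x in contexts:
            # A purchase needs both a willing consumer and remaining stock.
            buys = inventory > 0 and rng.random() < purchase_prob(x, prices[k])
            if buys:
                inventory -= 1
                revenue += prices[k]
            feedback.append(buys)
        # A learning seller would update its estimates from `feedback` here.
    return revenue, inventory


# Example: always post the middle price, with an assumed purchase probability 1 - p.
prices = [0.25, 0.5, 0.75]
revenue, left = simulate_batched_pricing(
    T=10, n=3, d=2, prices=prices, M=5,
    choose_price=lambda t, ctxs: 1,
    purchase_prob=lambda x, p: 1.0 - p,
)
```

Setting n = 1 recovers the non-batched problem, and passing a very large M mimics the unlimited-inventory case.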

In such a specific setting, there was no known fundamental limit on the performance of an online algorithm. In this thesis, for the case of unlimited inventory, we first derive a regret bound that depends on T, n, and d.

For limited inventory, we present an algorithm that can achieve revenue close to a fraction of a benchmark revenue, namely the revenue of any given policy that knows all hidden parameters from the start. It may sound odd that we compare with an arbitrary policy instead of the "optimal" one. The reason is that an optimal policy cannot be defined: the context vectors do not follow any distribution, and for any given policy the context vectors may be given adversarially so that either no product is sold throughout the time horizon or the products are sold out too soon, leading to a very large regret.

In this thesis, we study the contextual dynamic pricing problem by first deriving some results for the contextual bandit problem and then developing algorithms for the contextual dynamic pricing problem on top of those results. In the following subsection, we introduce the contextual bandit problem.

1.1.1 Contextual Bandit Problem

A fundamental problem in the area of machine learning is the multi-armed bandit problem [26, 4], in which a learner iteratively chooses some actions to play and receives corresponding rewards for a number of steps. The goal of the learner is to maximize her (or his) total reward, but she can only learn from very limited feedback information: the reward of her action at each step. This models many scenarios in our daily life and has found many applications in computer science as well as other disciplines.

In some scenarios, we also have some additional context information which may help us make better decisions if used properly. Consider for example an online advertising system associated with a search engine which chooses to display an ad for each incoming search query. One can see the possible ads as actions to choose from, and the reward of an action at a time step corresponds to whether or not the user who sends the query actually clicks on the displayed ad. Note that in this scenario, the only feedback the system can receive at each step is whether or not the user clicks on the displayed ad, and the system has no idea at all about whether or not the user would click on a different ad had the system chosen to display it. This naturally fits the bandit feedback setting discussed before, but the main difference is that the same action (ad) can have a very different reward distribution at each time step (due to a very different search query). As it seems reasonable to assume here that the reward distribution is in fact governed by the corresponding search query, the contextual bandit problem [22, 9] has been introduced to model this.

Formally, in the contextual bandit problem, a learner has to play in the following way for a total of T steps. At each step, the learner must choose one of some K actions to play based on some context vectors associated with these actions at that step. After choosing the action, the learner then receives some reward corresponding to that action, which is the only feedback she has, and then proceeds to the next step. The standard goal of the learner is to minimize her regret, which is the gap between the total reward she receives and the highest possible one (by taking the best action at each step).

For this problem, linear reward functions are usually assumed, and algorithms achieving a regret of $O(\sqrt{T d \log^3(KT)})$ have been shown in [3, 13], where d is the dimension of the context vectors. However, both algorithms are quite complicated and may not be practical enough for real-life applications. In particular, both algorithms consist of two procedures, a master one and a base one, with the master one dividing the time steps into several parts, ensuring some independence guarantee within each part, and calling the base one on each part. The reason for going to such trouble is that they can only show that their base procedures work under some independence condition. In fact, the algorithm of Chu et al. [13] is modified from that of Auer [3] by using a different base procedure called LinUCB, introduced by Li et al. [22]. The LinUCB algorithm itself is known to work well in practice [22], but unfortunately no theoretical guarantee has been shown [13]. It is not clear if the LinUCB algorithm, with the advantage of being relatively simple, can actually achieve a small regret, comparable to that of the much more complicated algorithms of [3, 13].
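To make the discussion of LinUCB concrete, the following is a minimal sketch of the commonly described form of the algorithm with a single shared parameter vector and one context vector per arm: it keeps a ridge-regression estimate and adds an upper-confidence width to each arm's predicted reward. It is not the thesis's exact algorithm or analysis. The class name, the exploration weight `alpha`, and the use of a Sherman-Morrison update to maintain the inverse matrix are choices made here for illustration.

```python
import math


class LinUCB:
    """Sketch of LinUCB with one shared parameter vector (illustrative, not tuned)."""

    def __init__(self, d, alpha=1.0):
        self.d = d
        self.alpha = alpha  # exploration weight (assumed tuning knob)
        # Inverse of A = I_d + sum of x x^T over the played context vectors.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.b = [0.0] * d  # sum of reward * played context vector

    def _matvec(self, x):
        return [sum(row[j] * x[j] for j in range(self.d)) for row in self.A_inv]

    def choose(self, contexts):
        theta = self._matvec(self.b)  # ridge estimate A^{-1} b

        def ucb(x):
            Ax = self._matvec(x)
            mean = sum(t * xi for t, xi in zip(theta, x))
            width = math.sqrt(max(sum(a * xi for a, xi in zip(Ax, x)), 0.0))
            return mean + self.alpha * width

        return max(range(len(contexts)), key=lambda a: ucb(contexts[a]))

    def update(self, x, reward):
        Ax = self._matvec(x)
        denom = 1.0 + sum(a * xi for a, xi in zip(Ax, x))
        # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - (A^{-1}x)(A^{-1}x)^T / (1 + x^T A^{-1} x)
        for i in range(self.d):
            for j in range(self.d):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        for i in range(self.d):
            self.b[i] += reward * x[i]
```

A typical loop calls `choose` on the K context vectors of the current step, plays the returned arm, and then calls `update` with that arm's context vector and the observed reward.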

In order to relate the results back to the dynamic pricing problem, several extensions and modifications of the problem will also be discussed in this thesis.


1.2 Dynamic Pricing with Strategic Buyers

Next, we turn to a completely different scenario. So far, we have considered the scenario with a single consumer in each round. In the previous section, when a consumer enters the market, she sees a take-it-or-leave-it price. Whether or not she decides to buy, she leaves the market after the round. This may not be a realistic setting for a real-world market. In general, a consumer may be willing to stay in the market and wait for a satisfactory price. Moreover, even if a consumer sees a price lower than her current private valuation, she may want to wait, believing that a much lower price will occur in the near future.

There are few previous works that consider the online learning problem in this scenario. Different from the previous works, we are interested in the situation in which the distribution of buyers' valuations is unknown, since it may be more realistic. The main difficulty is that, since all the buyers are strategic, their actions are highly unpredictable under a general pricing strategy. Usually, in order to make strategic consumers act myopically, some strong restriction on the pricing strategy is necessary. In [12], in order to ensure that the buyers act myopically, it is assumed that the seller may only choose a "robust dynamic pricing policy", in which the price may not decline too fast over time. However, such an assumption seems very unnatural for an online learning problem, since the exploration of different prices would be limited.

[12] claims that with the simple robust pricing policy they propose, there is a unique equilibrium, since it induces a dominant strategy for all buyers. However, such an equilibrium holds only if the monopoly's pricing policy is restricted to robust dynamic pricing, as defined in [12]. A so-called "restricted sub-martingale" constraint must be satisfied, and the behavior of buyers is predictable only under this constraint. However, this is highly unnatural when deciding prices within a finite horizon.

To sum up, in order to apply simple, efficient algorithms to the dynamic pricing problem with strategic buyers, the constraint should be relaxed to an operable degree. At the same time, the behavior of the buyers must still be predictable; that is, a buyer with a certain private valuation of the product should make the same choice (to buy or not) no matter what she believes about the pricing policy.

1.3 Our Results

Our results are divided into three parts: first, the results for the contextual bandit problem; second, the results for the contextual dynamic pricing problem, which apply the results of the first part; and third, the results for dynamic pricing with strategic buyers.

1.3.1 Results for Contextual Bandit Problem

Although [1] derived a regret bound for an algorithm similar to LinUCB, we provide a different analysis method and achieve a similar regret: $O(\sqrt{T}\, d \log T)$. Interestingly, compared to [13], the simple algorithm turns out to have a regret incomparable to those of the more sophisticated counterparts: it is smaller by a $\sqrt{\log T}$ factor but larger by a $\sqrt{d}$ factor, which makes it a better choice for a large enough T.

Our second result is to provide a different algorithm, which is as simple as the LinUCB algorithm and achieves the same regret bound. The algorithm is based on the online Newton step of [20]. We show that the algorithm can work for a more general class of reward functions known as the generalized linear functions [16], which include logistic functions considered in [31].

Both algorithms are simple, which may have their advantage in practice. They are designed based on different principles and we analyze them in two very different ways. Both algorithms are not new, and we consider our contribution as providing new understanding for them. In particular, we show that the LinUCB algorithm not only works well in practice but in fact has a strong theoretical guarantee. On the other hand, the online Newton step algorithm was originally designed for a different problem (online convex optimization in the full-information setting), but inspired by [31], we make it work for the contextual bandit problem, with not only linear but also some generalized linear reward functions.

The second result can be applied to some other settings, which will be formally discussed in Chapter 2 and Chapter 3.

Remark. Our results for the contextual bandit problem were obtained in 2017 and were submitted to the International Conference on Machine Learning (ICML). However, our paper was not accepted, and since then there have been further results for broader models than ours, such as [23] and [21]. We apply our results instead of the new ones for consistency of our research, and the difference between our results and the new ones is not a point of concern in our application.

1.3.2 Results for the Contextual Dynamic Pricing Problem

Our first result is for unlimited inventory. We directly apply the second result for the contextual bandit problem and achieve an $O(T^{2/3} n d \log T)$ regret bound with a simple algorithm in the batched setting, which becomes $O(T^{2/3} d \log T)$ in the non-batched setting.

Our second result is for limited inventory. We introduce an algorithm that, given a policy for choosing prices with full knowledge of the parameters, achieves total revenue with regret $O\left(K T^{2/3} (\beta + d \log T)\right)$ compared to some ratio of the revenue under that policy. The performance guarantee holds for "any" context vectors.

In order to deal with context vectors that may be given adversarially, the algorithm we design decides dynamically when to explore and when to exploit. In order to avoid selling at prices that are too low, which would lead to selling out too early, our algorithm picks prices in a "conservative" sense.

Both algorithms we present are very simple; however, as described above, they carry several implications. We consider our contribution to be defining an interesting scenario that is much more general and providing a simple algorithm with a performance guarantee.

1.3.3 Results for Dynamic Pricing for Strategic Buyers

We use a model similar to the one described by Chen and Farias [12], and solve the learning version of it. We have 2 main results.

Our first result is to design a selling mechanism, so that instead of using robust dynamic pricing as in [12], the seller can pick her selling policy from a much broader action set.

We call the mechanism the dynamic installment mechanism, since after buying the product the payment is spread over a period of time, and the average payment per time step may decrease dynamically, depending on the prices posted after the time of purchase. Under this mechanism, the actions of buyers are predictable even with a broader action set.

Specifically, when a buyer has valuation $v$ and faces a posted price $p$ (which is not the payment in this mechanism), her dominant strategy is to buy immediately if $v \ge p$ and to wait if $v \le p/\theta$, for some $\theta > 1$ that will be specified in Chapter 5. Hence, for each posted price, the seller can learn some information about the buyers' valuations while they decide whether to purchase or not.

Second, we design a simple two-stage learning algorithm for the problem and achieve revenue close to that of an approximate algorithm.

1.4 Organization of the Thesis

The rest of the thesis is organized as follows. First, we provide some preliminaries and formulate all the models we study in Chapter 2. In Chapter 3 we present our regret analysis for the LinUCB algorithm. We also show a different algorithm, based on the online Newton step, which works not only for linear but also for some generalized linear reward functions. In Chapter 4, we apply the results in Chapter 3 and develop algorithms with performance guarantees for contextual dynamic pricing. In Chapter 5, we show a new selling mechanism for dynamic pricing with strategic buyers. Finally, Chapter 6 includes the conclusion and a discussion of possible future work.


Chapter 2

Problem Formulation

In this chapter, we give formal definitions for all the settings introduced in Chapter 1. The chapter is organized as follows.

First, in Section 2.1, we introduce some notations and definitions that will appear throughout the thesis.

Next, we define the contextual bandit problem in Section 2.2, which is the foundation for us to deal with the contextual dynamic pricing problem. We will deal with the common setting carefully in Section 3.1. However, in order to solve the contextual dynamic pricing problem, it requires simple modifications on the setting. Those modifications will be shown in Subsection 2.2.1 and Subsection 2.2.2.

Then, we will define the contextual dynamic pricing problem in Section 2.3, and such problem with limited inventory will be defined in Subsection 2.3.1.

2.1 Preliminary

Let us first introduce some notations and definitions which we will use later. Let $\mathbb{N}$ denote the set of positive integers, $\mathbb{R}$ the set of real numbers, and $\mathbb{R}^+$ the set of positive real numbers. For $K \in \mathbb{N}$, let $[K]$ denote the set $\{1, \dots, K\}$. For a vector $x$, let $\|x\|$ denote its $L_2$ norm, and for a positive-semidefinite (PSD) matrix $A$, we will also consider the weighted norm $\|x\|_A = x^\top A x$. For $d \in \mathbb{N}$, let $0_d$ denote the d-dimensional vector with every entry equal to 0, and let $I_d$ denote the $d \times d$ identity matrix. For $d_1, d_2 \in \mathbb{N}$, let $0_{d_1,d_2}$ denote the $d_1 \times d_2$ matrix in which all entries are 0.

For a d-dimensional distribution $D$, we write $x \sim D$ to mean that $x$ is a d-dimensional vector drawn from $D$, and for a sequence of vectors $I = (x_1, \dots, x_T)$, we write $I \sim D^T$ to mean that $x_1, \dots, x_T$ are drawn i.i.d. according to $D$.

Throughout this thesis, we will consider several different settings, which will be introduced in turn in the following sections.

2.2 Contextual Bandit

In this thesis, we study the following contextual bandit problem, which can be seen as a generalization of the classical multi-armed bandit problem. In the problem, there are $K$ arms (or actions) for us to choose, but they may have different reward distributions at each time step due to different context (or feature) vectors. Formally, at each time step $t$, we are given $K$ different context vectors $x_{t,1}, \dots, x_{t,K}$, and we need to choose an arm $a_t$ to play. After our choice, we receive a corresponding random reward $r_t$ sampled from some distribution with mean $R(x_{t,a_t}, \theta)$, for some known function $R$ with respect to some unknown parameter vector $\theta$. We repeat this for a total of $T$ steps, and our goal is to minimize the total expected regret, defined as

\[ \sum_{t=1}^{T} \max_{a} R(x_{t,a}, \theta) - \sum_{t=1}^{T} R(x_{t,a_t}, \theta), \tag{2.1} \]

which is the difference between the best possible total expected reward and that of our algorithm. For simplicity of exposition, we will assume that each feature vector has norm $\|x_{t,a}\| \le 1$ and the target vector also has $\|\theta\| \le 1$. This is without loss of generality because we can divide them by some factors if necessary.

In this thesis, we will focus on the case of binary rewards, with each $r_t$ being a binary value in $\{0, 1\}$, which models for example whether or not a user clicks on the ad we (the search engine) chose to display. Moreover, for the density function of $r_t$, we will focus on the type of functions known as the generalized linear functions, such that

\[ R(x_{t,a}, \theta) = \Pr\left[ r_t = 1 \mid x_{t,a} \right] = L\left( x_{t,a}^\top \theta \right), \tag{2.2} \]

for some increasing function $L$ known as the link function. Note that this includes the case of linear reward functions considered by [13], because by taking $L(y) = y$, we have

\[ R(x_{t,a}, \theta) = x_{t,a}^\top \theta. \]

This also includes the case of logistic functions considered by [31], because by taking $L(y) = 1/(1 + e^{-y})$, we have

\[ R(x_{t,a}, \theta) = \frac{1}{1 + \exp\left( -x_{t,a}^\top \theta \right)}. \tag{2.3} \]
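The definitions (2.1) to (2.3) can be sketched in a few lines of code. The helper names below (`logistic`, `expected_reward`, `expected_regret`) are hypothetical and are used only for illustration; the regret function takes the true parameter vector as input, so it serves as an evaluation tool and is not something a learner could compute.

```python
import math


def logistic(y):
    """Logistic link L(y) = 1 / (1 + exp(-y)), as used in (2.3)."""
    return 1.0 / (1.0 + math.exp(-y))


def expected_reward(x, theta, link=logistic):
    """R(x, theta) = L(x^T theta), as in (2.2)."""
    return link(sum(xi * ti for xi, ti in zip(x, theta)))


def expected_regret(contexts_per_step, chosen_arms, theta, link=logistic):
    """Total expected regret (2.1) for a given sequence of arm choices."""
    total = 0.0
    for contexts, a_t in zip(contexts_per_step, chosen_arms):
        rewards = [expected_reward(x, theta, link) for x in contexts]
        total += max(rewards) - rewards[a_t]
    return total
```

Passing `link=lambda y: y` gives the linear reward case of [13], while the default gives the logistic case of [31].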

We will assume the following, which is satisfied by the logistic functions.

Assumption 1. There exist some constants $\rho_0 \in (0, 1)$, $\rho_1 > 0$, and $\rho_2 > 0$, such that for any $y \in [-1, 1]$, the following three conditions hold:

1. $\rho_0 \le L(y) \le 1 - \rho_0$,

2. $L'(y) \le \rho_1$, and

3. $L''(y) \ge \rho_2$.

The last two conditions will be used later to imply our Assumption 5 in Subsection ??.

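The first condition is used to help us bound the regret. More precisely, it ensures that for $\bar{a} = \arg\max_a R(x_{t,a}, \theta)$ and $a^* = \arg\max_a x_{t,a}^\top \theta$,

\[ R(x_{t,\bar{a}}, \theta) - R(x_{t,a_t}, \theta) \le \rho_1 \left( x_{t,\bar{a}}^\top \theta - x_{t,a_t}^\top \theta \right) \le \rho_1 \left( x_{t,a^*}^\top \theta - x_{t,a_t}^\top \theta \right). \]

Thus with a constant $\rho_1$, to bound the regret defined in (2.1), it suffices to bound the following surrogate regret

\[ \sum_{t=1}^{T} \left( \max_{a} x_{t,a}^\top \theta - x_{t,a_t}^\top \theta \right). \tag{2.4} \]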
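The reduction to the surrogate regret can be checked numerically for the logistic link. The sketch below assumes $\rho_1 = 1/4$, which is the maximum of the logistic derivative $L'(y) = L(y)(1 - L(y))$, and verifies on random instances that each per-step gap in expected reward is at most $\rho_1$ times the corresponding gap in the linear scores. The function name `check_surrogate_bound` and the sampling scheme are hypothetical.

```python
import math
import random


def logistic(y):
    return 1.0 / (1.0 + math.exp(-y))


def check_surrogate_bound(num_trials=1000, d=3, K=4, rho1=0.25, seed=0):
    """Return the smallest value of rho1 * (linear gap) - (reward gap) seen."""
    rng = random.Random(seed)

    def unit_ball_vector():
        # Random direction scaled to a random length in [0, 1).
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        scale = rng.random() / norm
        return [c * scale for c in v]

    worst_slack = float("inf")
    for _ in range(num_trials):
        theta = unit_ball_vector()
        scores = [sum(xi * ti for xi, ti in zip(unit_ball_vector(), theta))
                  for _ in range(K)]
        best = max(scores)
        for s in scores:
            reward_gap = logistic(best) - logistic(s)
            surrogate_gap = rho1 * (best - s)
            worst_slack = min(worst_slack, surrogate_gap - reward_gap)
    return worst_slack
```

A non-negative return value (up to floating-point error) indicates that the bound held on every sampled instance.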

2.2.1 K-Armed Contextual Bandit

A K-armed contextual bandit is a well-known variation of the setting in Section 2.2 and can be a more suitable setting for contextual dynamic pricing. Similarly, there are $K$ arms for us to choose, with possibly different reward distributions at each time step due to different context vectors. Formally, at each time step $t$, we are given a single context vector $x_t$, and we need to choose an arm $a_t$ to play. In this variation, there are $K$ unknown parameter vectors $\theta_{*,1}, \dots, \theta_{*,K}$ and $K$ known functions $R_1, \dots, R_K$, one for each arm. After our choice, we receive a corresponding random reward $r_t$ with mean $R_{a_t}(x_t, \theta_{*,a_t})$.

As in Section 2.2, we repeat this for a total of $T$ steps, aiming to minimize the total expected regret, defined as

\[ \sum_{t=1}^{T} \max_{a} R_a(x_t, \theta_{*,a}) - \sum_{t=1}^{T} R_{a_t}(x_t, \theta_{*,a_t}). \tag{2.5} \]

We will also assume $\|x_t\| \le 1$ and $\|\theta_{*,a_t}\| \le 1$. Furthermore, for the density function of $r_t$, we will focus on generalized linear functions, such that

\[ R_a(x_t, \theta_{*,a}) = L_a\left( x_t^\top \theta_{*,a} \right), \tag{2.6} \]

for some increasing link functions $L_1, \dots, L_K$. We assume all of these link functions satisfy Assumption 1.

Here, since we have $K$ different link functions, we will have $K$ tuples $(\rho_0, \rho_1, \rho_2)$, denoted as $(\rho_{0,1}, \rho_{1,1}, \rho_{2,1}), \dots, (\rho_{0,K}, \rho_{1,K}, \rho_{2,K})$. For simplicity, take $\rho_0 = \inf_a \rho_{0,a}$, $\rho_1 = \sup_a \rho_{1,a}$ and $\rho_2 = \inf_a \rho_{2,a}$, so that the three conditions hold for every arm simultaneously; this can be arranged as Assumption 2. Notice that $K$ is finite here, so simply taking the minimum or maximum is well defined, but we use the infimum and supremum to allow a possible extension to a larger action space.

Assumption 2. There exist some constants $\rho_0 \in (0, 1)$, $\rho_1 > 0$, and $\rho_2 > 0$, such that for any $y \in [-1, 1]$ and for every action $a$, the following three conditions hold:

1. $\rho_0 \le L_a(y) \le 1 - \rho_0$,

2. $L_a'(y) \le \rho_1$, and

3. $L_a''(y) \ge \rho_2$.

In this setting, it is no longer enough to consider the surrogate regret as in Section 2.2, but the above assumption will be enough for us to mimic the actual regret by the surrogate regret

\[ \sum_{t=1}^{T} \left( \max_{a} x_t^\top \theta_{*,a} - x_t^\top \theta_{*,a_t} \right). \tag{2.7} \]

We will deal with this in Chapter 4.

2.2.2 Batch K-Armed Contextual Bandit

In this subsection, we introduce a straightforward extension of the K-armed contextual bandit. That is, instead of one single context vector per time step, we are given $n$ of them, $x_{t,1}, \dots, x_{t,n}$, at each time step $t$. Unlike in the first definition in Section 2.2, these vectors do not represent arms; rather, the arm decision of one step should be made considering $n$ different context vectors.

In each step, we choose an arm $a_t$ to play and receive corresponding rewards $r_{t,1}, \dots, r_{t,n}$ with means $R_{a_t}(x_{t,1}, \theta_{*,a_t}), \dots, R_{a_t}(x_{t,n}, \theta_{*,a_t})$, and we repeat this for a total of $T$ steps, as in Subsection 2.2.1. The expected regret we want to minimize for this extended model is defined as

\[ \sum_{t=1}^{T} \max_{a} \left\{ \sum_{i=1}^{n} R_a(x_{t,i}, \theta_{*,a}) \right\} - \sum_{t=1}^{T} \sum_{i=1}^{n} R_{a_t}(x_{t,i}, \theta_{*,a_t}), \tag{2.8} \]

which is a direct extension of (2.5).

Notice that the extension can be undone by taking $n = 1$.
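As an illustrative sketch, the batch regret (2.8) can be computed as follows when the true parameters are known. The function name `batch_expected_regret` and the argument layout (one parameter vector and one link function per arm) are assumptions made for this example. With batches of size one it reduces to the regret (2.5).

```python
import math


def logistic(y):
    return 1.0 / (1.0 + math.exp(-y))


def batch_expected_regret(batches, chosen_arms, thetas, links):
    """Expected regret (2.8).

    batches[t] is the list of n context vectors at step t, chosen_arms[t] is
    the arm played at step t, and thetas[a], links[a] define arm a.
    """
    def arm_value(a, batch):
        # Sum over the batch of L_a(x^T theta_a).
        return sum(
            links[a](sum(xi * ti for xi, ti in zip(x, thetas[a])))
            for x in batch
        )

    regret = 0.0
    for batch, a_t in zip(batches, chosen_arms):
        values = [arm_value(a, batch) for a in range(len(thetas))]
        regret += max(values) - values[a_t]
    return regret
```

Because the maximum is taken over the sum for the whole batch, the best arm for a batch need not be the best arm for each individual context vector in it.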


2.3 Contextual Dynamic Pricing

Subsection 2.2.1 may be easily applied to construct a contextual dynamic pricing (CDP) problem.

A classical dynamic pricing problem can be modeled as: A monopoly wants to sell a single kind of commodity over T rounds. In round t, the monopoly posts a price pt, and a consumer comes with a private valuation vt, drawn independently from an fixed unknown distribution D. A purchase will occur if and only if vt≥ pt. The goal for the monopoly is to maximize his expected revenue, which is

$$\sum_{t=1}^{T} p_t \cdot \Pr\left[ v_t \ge p_t \right], \tag{2.9}$$

and the regret is the difference between the maximum expected revenue and that of our algorithm, that is,

$$T \max_{p} \left\{ p \cdot \Pr\left[ v_t \ge p \right] \right\} - \sum_{t=1}^{T} p_t \cdot \Pr\left[ v_t \ge p_t \right]. \tag{2.10}$$
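To make the regret (2.10) concrete, here is a small Python sketch under the assumption that valuations are uniform on [0, 1], so that Pr[v ≥ p] = 1 − p. The demand curve, the price grid, and the function names are hypothetical choices made only for illustration; in the learning problem the distribution D is unknown to the seller.

```python
def purchase_prob(p):
    """Assumed demand curve: valuations uniform on [0, 1], so Pr[v >= p] = 1 - p."""
    return max(0.0, min(1.0, 1.0 - p))

def pricing_regret(posted_prices, price_grid):
    """Regret (2.10) of a price sequence against the best fixed price in price_grid."""
    best = max(p * purchase_prob(p) for p in price_grid)
    earned = sum(p * purchase_prob(p) for p in posted_prices)
    return len(posted_prices) * best - earned

grid = [i / 100 for i in range(101)]
print(pricing_regret([0.5] * 10, grid))       # always posting the best fixed price
print(pricing_regret([0.3, 0.7, 0.5], grid))  # two exploratory prices cost revenue
```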

For a contextual dynamic pricing (CDP) problem, we furthermore assume that we have some “information” about each buyer, and that the probability distribution of her valuation depends on it. In addition, to make our model more reasonable, we extend the problem to the batch setting, in which more than one consumer arrives in each round.

Formally, at each round t, n consumers arrive, represented by n context vectors xt,1, . . . , xt,n, and we propose a price pt. Each consumer comes with a private valuation vt,i, drawn independently from a fixed unknown distribution D, which depends on her context vector. A purchase occurs if vt,i ≥ pt. Hence, the regret of this problem is defined as

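$$\sum_{t=1}^{T} \max_{p} \left\{ \sum_{i=1}^{n} p \cdot \Pr\left[ v_{t,i} \ge p \mid x_{t,i} \right] \right\} - \sum_{t=1}^{T} \sum_{i=1}^{n} p_t \cdot \Pr\left[ v_{t,i} \ge p_t \mid x_{t,i} \right]. \tag{2.11}$$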

We make an additional assumption on the distribution D. For a given price p, the expected reward of each purchase is given by a link function Lp with an unknown parameter θ∗,p, that is,

$$p \Pr\left[ v_{t,i} \ge p \mid x_{t,i} \right] = L_p\left( x_{t,i}^\top \theta_{*,p} \right). \tag{2.12}$$

As usual, Lp is increasing and satisfies Assumption 1, that is, there exist constants ρ0,p, ρ1,p, ρ2,p for each price p. Also, $\|x_{t,i}\| \le 1$ and $\|\theta_{*,p}\| \le 1$. Furthermore, we assume p ∈ [0, 1] for simplicity.

Since Lp represents the expected reward of selling in a single round, Lp can be assumed to be nonnegative. For convenience, we extend the domain of each Lp from [−1, 1] to [−1, 1] ∪ {−∞} and define

Lp(−∞) = 0 (2.13)

for each p.
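The following Python sketch illustrates one summand of the regret (2.11) under the link model (2.12), including the convention (2.13). The particular link function, the price set, and all names are hypothetical; the sketch only assumes that each Lp is increasing in its argument and does not claim that this link satisfies Assumption 1.

```python
import math

def link(p, y):
    """Hypothetical link L_p(y) = p * sigmoid(4 * (y - p)), extended by L_p(-inf) = 0."""
    if y == -math.inf:
        return 0.0
    return p / (1.0 + math.exp(-4.0 * (y - p)))

def dot(x, theta):
    return sum(a * b for a, b in zip(x, theta))

def round_regret(batch, thetas, prices, posted):
    """One summand of (2.11): best batch revenue over `prices` minus that of `posted`."""
    def revenue(p):
        return sum(link(p, dot(x, thetas[p])) for x in batch)
    return max(revenue(p) for p in prices) - revenue(posted)

# Toy instance with two candidate prices and a batch of n = 2 consumers.
prices = [0.2, 0.8]
thetas = {0.2: (1.0, 0.0), 0.8: (0.0, 1.0)}
batch = [(0.9, 0.1), (0.1, 0.9)]
print(round_regret(batch, thetas, prices, posted=0.2))
print(round_regret(batch, thetas, prices, posted=0.8))
```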

2.3.1 CDP With Limited Inventory

We now add another constraint to contextual dynamic pricing, limiting the total, non-replenishable inventory to M. Consequently, for a purchase to take place, not only must vt,i ≥ pt hold, but there must also be commodity left. Denote by rt,i the indicator of whether a purchase occurs, that is,

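$$r_{t,i} = \mathbb{1}\left\{ v_{t,i} \ge p_t,\ \sum_{j=1}^{t-1} \sum_{k=1}^{n} r_{j,k} + \sum_{k=1}^{i-1} r_{t,k} < M \right\}, \tag{2.14}$$

and

$$\begin{aligned}
\Pr\left[ r_{t,i} = 1 \right]
&= \Pr\left[ v_{t,i} \ge p_t,\ \sum_{j=1}^{t-1} \sum_{k=1}^{n} r_{j,k} + \sum_{k=1}^{i-1} r_{t,k} < M \right] \\
&= \Pr\left[ v_{t,i} \ge p_t \right] \Pr\left[ \sum_{j=1}^{t-1} \sum_{k=1}^{n} r_{j,k} + \sum_{k=1}^{i-1} r_{t,k} < M \right] \\
&= L_{p_t}\left( x_{t,i}^\top \theta_{*,p_t} \right) \Pr\left[ \sum_{j=1}^{t-1} \sum_{k=1}^{n} r_{j,k} + \sum_{k=1}^{i-1} r_{t,k} < M \right].
\end{aligned}$$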

The expected revenue is then,

$$\mathbb{E}\left[ \sum_{t=1}^{T} \sum_{i=1}^{n} p_t r_{t,i} \right]. \tag{2.15}$$
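The effect of the inventory constraint on the realized revenue can be sketched as follows. The function name and the toy valuations are assumptions made for illustration; the loop simply applies the two conditions for a sale, namely that the valuation reaches the posted price and that stock remains.

```python
def simulate_revenue(prices, valuations, inventory):
    """Realized revenue sum_t sum_i p_t * r_{t,i} under a limited inventory.
    prices[t] is the posted price, valuations[t][i] the i-th buyer's valuation."""
    sold = 0
    revenue = 0.0
    for p_t, batch in zip(prices, valuations):
        for v in batch:
            # A sale needs both v >= p_t and remaining stock.
            if v >= p_t and sold < inventory:
                sold += 1
                revenue += p_t
    return revenue, sold

# Two rounds, inventory of 3: the last buyer is willing to buy but finds no stock.
revenue, sold = simulate_revenue([0.5, 0.4], [[0.6, 0.7, 0.2], [0.9, 0.5]], inventory=3)
print(revenue, sold)
```

Averaging the returned revenue over many draws of the valuations would give a Monte Carlo estimate of the expectation in (2.15).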

In this problem, we define the regret of CDP with limited inventory as the difference between the expected revenue of a given robust fixed policy profile and that of our algorithm, where a robust fixed policy profile is defined as follows. A fixed policy is a mapping from context vectors to a price, that is, a policy π : X^n → [0, 1] takes only the current context vectors as input, and not the history of context vectors or purchases. Let Π be the set of all policies. A fixed policy profile ϖ is a mapping from parameters to policies, that is, it picks a π ∈ Π according to the θ∗,p’s.

Finally, we say a fixed policy profile is robust if it satisfies the following assumptions:

Assumption 3. (Robustness)

1. (Consistency) The price that ϖ picks depends only on the values $x^\top\theta_{*,p}$. That is, ϖ selects a price based only on the purchase probabilities, since Lp is a fixed function for each p. Moreover, ϖ picks without randomness and always picks exactly one price, that is, ties never occur.

2. (Monotonicity) For any fixed price p, a higher purchase probability leads to a higher preference. That is, for any two prices pa and pb, if ϖ prefers pa over pb when $x^\top\theta_{*,p_a} = s_a$ and $x^\top\theta_{*,p_b} = s_b$, then ϖ also prefers pa over pb whenever $x^\top\theta_{*,p_a} \ge s_a$ and $x^\top\theta_{*,p_b} \le s_b$.

3. (Reasonability) A price is not picked if its expected reward is 0, unless all prices have expected reward 0. By convention, the highest price is chosen when all prices have expected reward 0.

4. (Conservative) There exists some η ∈ [1, ∞) such that if ϖ prefers pa over pb, then the expected rewards $r_a \doteq L_{p_a}\left( x^\top\theta_{*,p_a} \right)$ and $r_b \doteq L_{p_b}\left( x^\top\theta_{*,p_b} \right)$ satisfy

$$r_a \ge \frac{r_b}{\eta}.$$

Notice that pairwise comparison is used in the assumptions of monotonicity and conservativeness. Such a comparison can be realized by “turning off” all other prices, that is, by creating a situation in which every other price has purchase probability 0.
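As an illustration of Assumption 3, the sketch below implements one natural candidate profile: pick the price with the largest expected reward and break ties toward the higher price. This greedy rule and the toy link function are our own hypothetical examples. The rule appears to be consistent with the four conditions with η = 1 (monotonicity relies on each Lp being increasing), but it is only one of many profiles the definition allows.

```python
def greedy_profile(prices, scores, link):
    """prices are sorted ascending; scores[k] plays the role of x^T theta_{*,p_k};
    link(p, s) plays the role of L_p(s). Returns the selected price."""
    best_price, best_reward = None, float("-inf")
    for p, s in zip(prices, scores):
        reward = link(p, s)
        if reward >= best_reward:  # '>=' breaks ties toward the higher price
            best_price, best_reward = p, reward
    return best_price

def toy_link(p, s):
    """Hypothetical link: expected reward p * max(0, s), zero for a 'turned off' price."""
    return p * max(0.0, s)

prices = [0.2, 0.5, 0.9]
print(greedy_profile(prices, [1.0, 0.5, 0.1], toy_link))
print(greedy_profile(prices, [0.0, 0.0, 0.0], toy_link))  # all rewards 0: highest price
```

Passing a score of −∞ for a price corresponds to “turning off” that price in a pairwise comparison.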

When prices can only be selected from a set P, we denote the set of all robust fixed policy profiles by ΓP; we have P = [0, 1] when there is no restriction on the price set. For a sequence of context vectors I = x1, . . . , xT, denote the expected revenue of a robust fixed policy profile ϖ ∈ Γ[0,1] by EI[Rev(ϖ)]. For a learning algorithm A, denote the expected revenue of A while mimicking ϖ by EI[Rev(A(ϖ))]. Formally, the regret of an online algorithm A is defined as

$$\sup_{\varpi \in \Gamma_{[0,1]}} \sup_{I} \mathbb{E}_I\left[ \mathrm{Rev}(\varpi) - \mathrm{Rev}(A(\varpi)) \right]. \tag{2.16}$$

Since A is generally not a fixed policy, the regret of A may be negative.

The regret is defined to characterize the ability of an algorithm to mimic ϖ. This is a reasonable variation, considering that the context sequence is selected adversarially. If the context vectors were stochastic, we could imagine an online algorithm mimicking a fixed policy profile that is “optimal”, where optimality is understood in expectation. However, in an adversarial setting, we cannot define “optimal”, since any policy profile may perform badly on some context sequence.

We hence take a step further, requiring an algorithm to mimic any fixed policy profile under reasonable assumptions, and leave the selection of the policy profile flexible. Notice that the policy profile is given for an algorithm to mimic under any context sequence, and only the parameters θ∗,p are unknown.

For simplicity, when considering a robust fixed policy profile ϖRob with |P| = K for some K ∈ N, denote P = {p1, . . . , pK} with p1 < p2 < · · · < pK. By the assumption of robustness, it is legitimate for ϖRob to take only $x^\top\theta_{*,p_1}, \ldots, x^\top\theta_{*,p_K}$ as input. That is, a fixed price $p = \varpi_{\mathrm{Rob}}(x^\top\theta_{*,p_1}, \ldots, x^\top\theta_{*,p_K})$ with p ∈ P is output.


2.4 Dynamic Pricing With Strategic Buyers

For this problem, we will study a continuous time model similar to [12]. First, the selling horizon for the monopoly is [0, T]. We also denote the initial inventory by x0, the inventory process by Xt, and the sales process by Nt = x0 − Xt.

As in [12], buyers arrive during the selling horizon according to a Poisson process with a fixed rate λ, which is known to the seller. A buyer arriving at time t has a private valuation v and a valuation decline rate r. Unlike [12], the arrival of each buyer is revealed once she enters the market, and t is common knowledge for the buyer and the seller. The valuation of each buyer is drawn i.i.d. from an unknown distribution, whereas the valuation decline rate is not assumed to follow any distribution and is only bounded in some interval $[\underline{r}, \bar{r}]$. Here, for simplicity and unlike [12], we assume that there is no time discount factor for the buyer. After a buyer arrives at time t, her possible actions for the rest of the time are to purchase, to wait, or to leave the market.
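The arrival model can be simulated with a few lines of Python. This is a minimal sketch that uses the standard fact that the inter-arrival times of a Poisson process with rate λ are i.i.d. exponential with the same rate; the function name and the parameter values are illustrative assumptions.

```python
import random

def poisson_arrivals(rate, horizon, rng):
    """Arrival times of a Poisson process with the given rate on [0, horizon]."""
    times = []
    t = rng.expovariate(rate)  # first inter-arrival time
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times

rng = random.Random(0)
arrivals = poisson_arrivals(rate=2.0, horizon=10.0, rng=rng)
print(len(arrivals))  # about rate * horizon arrivals on average
```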

A consumer of type φ is described by the tuple

$$\varphi \doteq (t_\varphi, v_\varphi, r_\varphi),$$

and we define the purchase tuple

$$y_\varphi \doteq (\tau_\varphi, a_\varphi, p_\varphi),$$

where τφ is the time at which the consumer leaves the market (with or without a purchase), $a_\varphi = \mathbb{1}\{\text{seller has inventory left and buyer } \varphi \text{ makes a purchase}\}$, and pφ is the payment. The utility of a buyer is defined as

$$U(\varphi, y_\varphi) = a_\varphi \left( v_\varphi \left( 1 - r_\varphi (\tau_\varphi - t_\varphi) \right) - p_\varphi \right),$$

so the utility is 0 if she leaves without buying. This is slightly different from the one defined in [12], since there is no time discount factor and the valuation decline rate plays a slightly different role from the monitoring cost in [12]. Notice that the payment may only be determined at some time no earlier than τφ, and the time of paying does not change the utility.
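The utility of a buyer under the linear decline model can be written as a short Python function. The argument names and the numbers in the example are hypothetical; the formula is valuation times (1 − decline rate × waiting time) minus payment, and zero without a purchase.

```python
def buyer_utility(arrival, valuation, decline_rate, leave_time, purchased, payment):
    """Utility of a buyer: a * (v * (1 - r * (tau - t)) - p), with a = 1 on a purchase."""
    if not purchased:
        return 0.0
    return valuation * (1.0 - decline_rate * (leave_time - arrival)) - payment

# Buying immediately at price 7 versus waiting 2 time units for price 6.
buy_now = buyer_utility(0.0, 10.0, 0.1, 0.0, True, 7.0)
wait = buyer_utility(0.0, 10.0, 0.1, 2.0, True, 6.0)
print(buy_now, wait)
```

In this example the lower price does not compensate for the decline in valuation, so a strategic buyer would prefer to buy immediately.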


As usual, before applying any learning algorithm, we make an additional assumption for simplicity of exposition, that is, the valuation is in some interval [a, b] almost surely, with b > a > 0. Although here the distribution for the buyers’ valuation is unknown, we will still make the standard assumption:

Assumption 4. Let F(·) be the c.d.f. and f(·) the corresponding p.d.f. of the distribution of valuations. Denote F̄(·) = 1 − F(·). Then $v - \bar{F}(v)/f(v)$ is non-decreasing in v ∈ [a, b] and has a root in [a, b].
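As a sanity check of what Assumption 4 requires, the sketch below evaluates the function v − F̄(v)/f(v) on a grid for valuations that are uniform on [a, b]. The uniform distribution, the grid resolution, and the function names are assumptions made only for this example. For the uniform case the function equals 2v − b, so it is non-decreasing, and a root in [a, b] exists exactly when a ≤ b/2.

```python
def virtual_value(v, a, b):
    """v - (1 - F(v)) / f(v) for valuations uniform on [a, b] (an assumed example)."""
    survival = (b - v) / (b - a)  # 1 - F(v)
    density = 1.0 / (b - a)       # f(v)
    return v - survival / density

def satisfies_assumption(a, b, steps=1000):
    """Grid check: non-decreasing, and a sign change (hence a root) on [a, b]."""
    grid = [a + (b - a) * k / steps for k in range(steps + 1)]
    values = [virtual_value(v, a, b) for v in grid]
    nondecreasing = all(x <= y + 1e-12 for x, y in zip(values, values[1:]))
    has_root = values[0] <= 0.0 <= values[-1]
    return nondecreasing and has_root

print(satisfies_assumption(1.0, 3.0))  # root at v = 1.5 lies inside [1, 3]
print(satisfies_assumption(2.0, 3.0))  # root at v = 1.5 lies outside [2, 3]
```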

Furthermore, in order to make exploration of the distribution possible, we assume that there is a lower bound P0 > 0 such that F̄(v) ≥ P0 for all v ∈ [a, b], or simply F̄(b) ≥ P0. We also assume P0T ≪ x0, so that simply setting the price to b is not near-optimal, and P0x10 so that exploration is not too costly.

In this thesis, we extend the selling mechanism to a broader class than the simple posted price mechanism. In each round, instead of posting a single “take it or leave it” price, the monopoly may propose a “contract”, and the consumer makes a purchase by accepting the contract. Formally, a contract is defined as follows.

Definition 1. A contract Πt, proposed at time t, specifies the payment p(t) that a buyer makes to obtain the commodity, where p(t) is measurable with respect to FT. The filtration Ft contains all the public information up to time t, and it is fixed once the selling mechanism is decided. Formally, Ft = σ(Πt, Xt).

For simplicity, we only consider contracts with nonnegative payments, since a contract with negative payments can only yield lower revenue than the modified contract in which all negative payments are replaced by 0.

Notice that the payment of a contract need not be a fixed price at the time the contract is signed, but it is a deterministic value at the end of the horizon T. The definition of a contract is quite abstract; roughly, a contract is built not just from a fixed price, but its payment may depend on what happens in the future.

In order to construct contracts with performance guarantees, we would like to discuss contracts in a more specific sense. Hence, we define the “posted price contract”.


Definition 2. A posted price contract is a contract constructed from posted prices. That is, at each time t, as in the posted price mechanism, there is a posted price π(t) ≥ 0. A contract Πt is a mapping from {π(τ)}τ∈[0,T] to a payment.
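To make Definition 2 less abstract, the following Python sketch shows a purely hypothetical posted price contract: the buyer pays the lowest price posted from the signing time until the end of the horizon. This is not the mechanism developed in this thesis; it only illustrates a payment that is unknown at signing time but is a deterministic function of the posted price path once the horizon ends. The price path is assumed to be sampled at finitely many times.

```python
def lowest_price_contract(sign_time, price_path):
    """Hypothetical contract: pay the lowest posted price from sign_time onward.
    price_path is a list of (time, posted_price) samples of pi over [0, T]."""
    later = [price for time, price in price_path if time >= sign_time]
    if not later:
        raise ValueError("no posted price at or after sign_time")
    return min(later)

# Sampled posted prices over the horizon [0, 3].
path = [(0.0, 5.0), (1.0, 4.0), (2.0, 6.0), (3.0, 3.0)]
print(lowest_price_contract(1.0, path))  # determined only at the end of the horizon
```

The ordinary posted price mechanism corresponds to the contract that returns the price posted at the signing time itself.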

Remark. We have some remarks for our model:

• Above, we assume that all accepted contracts will be fulfilled. We can imagine that there is a trusted third party, for instance a court, that would punish anyone who breaks a contract with heavy fines.

• We define a valuation decline rate instead of a monitoring cost. This is because, even if the product can be used for a long time, the demand of each buyer may last only a short period. For example, suppose a student needs a vocabulary book for a GRE exam one month from now. If he buys it immediately, he can read it for the whole month. However, if he buys it 15 days later, he has to borrow one until then. Since the book may be useless to him after the exam, his valuation of the book declines over time.

• It is natural to assume that each consumer's valuation of the product declines over time. Several different models have been studied in previous works, such as linearly decaying valuations [17], exponentially decaying valuations [5], or much more complex models [27, 19]. For simplicity, throughout this thesis, we apply the linear model.

• Although we ignore the monitoring cost here, the valuation decline rate has a similar impact on the buyer's utility.

Recall from the discussion in Section 1.2 that [12] defines “robust dynamic pricing” with some constraints under which strategic buyers act myopically. However, we would like to modify the constraints to obtain a more natural, different strategy space. In the following, we define “modified robust dynamic pricing”, and all the results in Section 5 follow such a policy.


2.4.1 Modified Robust Dynamic Pricing

In [12], a robust dynamic pricing policy {π(t)} has to satisfy the following constraints:

• π(t) is left-continuous and adapted to Ft.

• (restricted sub-martingale) For all t such that Xt > 0 and for all t′ > t,

$$\mathbb{E}\left[ (\pi(t) - \pi(t'))^{+} \mid \mathcal{F}_{t-} \right] \le \theta (t' - t),$$

where θ is the lower bound of the monitoring cost, which was defined in [12] and is similar to the valuation decline rate we define here.

• π(t) = ∞ if Xt = 0 and π(t) < ∞ otherwise.

In this thesis, since we want to develop a learning algorithm, a policy with such strict restrictions may not work, not to mention that it is somewhat unrealistic both for the seller to obey these restrictions and for the buyers to believe that the seller will do so. Moreover, since we introduce the concept of contract, the first restriction no longer holds. Hence, we need a relaxed version.

Moreover, in a continuous time model, it is natural to assume that there is a reaction time ϵ > 0 for the seller to adjust the posted price. That is, the posted price π(t) can only depend on the information before t − ϵ. If we take ϵ → 0, this reduces to the first constraint of the robust dynamic pricing policy. However, in this thesis, we let ϵ be very small but positive.

A modified robust dynamic pricing (contract) policy {Πt} has to satisfy the following constraints:

• Πt is left-continuous and adapted to FT. For the posted price, however, the constraint that π(t) is left-continuous and adapted to Ft−ϵ does not change.

• For a buyer φ arriving at any time t, denote her strategy set by Φ. Then, for all t < t′ ≤ t + 1/r,

$$\mathbb{E}_{\varphi \text{ uses } a}\left[ (\pi(t) - \pi(t'))^{+} \mid \mathcal{F}_t \right]$$

is a constant for all a ∈ Φ.

• Πt = ∞ if Xt−ϵ = 0 and Πt < ∞ otherwise.

The first constraint is a direct relaxation due to the concept of contract. As for the third constraint, since we assume that valuations are bounded in [a, b] almost surely, it can be modified as follows: Πt = ∞ almost surely if Xt−ϵ = 0, and Πt ∈ [a, b] almost surely otherwise.

To keep the analysis simple, we ignore the effect of ϵ on the revenue, since ϵ is very small. Moreover, by assuming that no consumer arrives within the reaction delay following a point of discontinuity of π, we avoid the situation in which a consumer arrives at some time t with Xt = 0 and Πt < ∞.

The second constraint requires more attention. For the original robust dynamic pricing, it means that the price may not decline too fast. In the modified version, we only need the posted price (not the payment) not to decline faster because of a single “non-purchase”. The seller can now change the price in a more flexible way. As for the consumer, she no longer believes that the price will not drop too fast; instead, she believes that the decision she makes (especially a decision to wait) does not make the price drop faster, at least not before she leaves the market.

Remark. Since we have relaxed the second constraint, the buyers may not act myopically. However, the relaxed version is much easier to follow and is also more reasonable.

The goal of the seller is to maximize the total revenue

$$\mathbb{E}\left[ \int_{0}^{T} \Pi_t \, dN_t \right],$$

subject to the above constraints.


Chapter 3

Contextual bandit

In this chapter, we focus on the mathematical analysis of contextual bandits. In Section 3.1, we give a different regret upper bound analysis for the well-known LinUCB algorithm. In Section 3.2, we extend the contextual bandit problem from linear rewards to generalized linear rewards. We first prove a regret upper bound of O(√T d log T) using the online Newton method in Subsection 3.2.1. In Subsection 3.2.2, we apply the analysis of Subsection 3.2.1 to prove a similar regret upper bound of O(√KT d log T) for the K-armed contextual bandit with generalized linear rewards, and then extend it to a batch setting in Subsection 3.2.3, which gives a regret upper bound of O(√KT nd log T).

3.1 LinUCB Algorithm

In this section, we consider the LinUCB algorithm of [13], shown in Algorithm 1. While [13] conjectured that the algorithm actually works, they were not able to prove so, which forced them to build a sophisticated procedure on top of it. We provide a new analysis, which shows that the algorithm indeed works. The algorithm depends on a parameter α, and the following theorem shows that for a properly chosen α, Algorithm 1 achieves a regret of order √T d log T.

Algorithm 1 LinUCB
Input: $\alpha \in \mathbb{R}_+$, $K, d \in \mathbb{N}$
$A_1 \leftarrow I_d$, $b_1 = 0_d$
for t = 1 to T do
    $\theta_t \leftarrow A_t^{-1} b_t$
    Observe K feature vectors $x_{t,1}, \ldots, x_{t,K} \in \mathbb{R}^d$
    for a = 1 to K do
        $p_{t,a} \leftarrow \theta_t^\top x_{t,a} + \alpha \sqrt{x_{t,a}^\top A_t^{-1} x_{t,a}}$
    end for
    Choose action $a_t = \arg\max_a p_{t,a}$. Receive reward $r_t \in \{0, 1\}$.
    $A_{t+1} \leftarrow A_t + x_{t,a_t} x_{t,a_t}^\top$
    $b_{t+1} \leftarrow b_t + x_{t,a_t} r_t$
end for

Theorem 1. For any δ ∈ (0, 1), there exists some

$$\alpha \le O\left( \sqrt{d \log T + \log(K/\delta)} \right)$$

such that with probability at least 1 − δ, the expected regret of the LinUCB algorithm is at most

$$O\left( \alpha \sqrt{T d \log T} \right),$$

which is O(√T d log T) for $K \le T^{O(d)}$ and constant δ.
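For readers who prefer code, the following is a minimal Python sketch of Algorithm 1 that uses only the standard library. Instead of inverting At in every round, it keeps the inverse up to date with the Sherman–Morrison formula, which is an implementation choice of this sketch and is algebraically equivalent to the updates in Algorithm 1. The class name, the toy environment, and the value α = 1 are illustrative assumptions.

```python
import math
import random

class LinUCB:
    """Sketch of Algorithm 1; keeps A_t^{-1} up to date with Sherman-Morrison."""

    def __init__(self, alpha, d):
        self.alpha = alpha
        self.d = d
        # A_1 = I_d, so its inverse is I_d as well; b_1 = 0_d.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.b = [0.0] * d

    def _matvec(self, x):
        return [sum(row[j] * x[j] for j in range(self.d)) for row in self.A_inv]

    def choose(self, features):
        """Return the index a maximizing theta_t^T x_{t,a} + alpha * s_{t,a}."""
        theta = self._matvec(self.b)  # theta_t = A_t^{-1} b_t
        best, best_score = 0, -math.inf
        for a, x in enumerate(features):
            Ax = self._matvec(x)
            width = math.sqrt(max(0.0, sum(xi * yi for xi, yi in zip(x, Ax))))
            score = sum(ti * xi for ti, xi in zip(theta, x)) + self.alpha * width
            if score > best_score:
                best, best_score = a, score
        return best

    def update(self, x, reward):
        """A_{t+1} = A_t + x x^T and b_{t+1} = b_t + x * r."""
        Ax = self._matvec(x)
        denom = 1.0 + sum(xi * yi for xi, yi in zip(x, Ax))
        for i in range(self.d):
            for j in range(self.d):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [bi + xi * reward for bi, xi in zip(self.b, x)]

# Toy environment (an assumption): two fixed arms with Bernoulli rewards
# whose means are linear in a hidden parameter.
random.seed(0)
theta_star = [0.8, 0.1]
arms = [[1.0, 0.0], [0.0, 1.0]]
agent = LinUCB(alpha=1.0, d=2)
pulls = [0, 0]
for _ in range(500):
    a = agent.choose(arms)
    mean = sum(t * x for t, x in zip(theta_star, arms[a]))
    agent.update(arms[a], 1.0 if random.random() < mean else 0.0)
    pulls[a] += 1
print(pulls)
```

The value of α suggested by Theorem 1 grows like the square root of d log T + log(K/δ); the constant used above is only for demonstration.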

Before proving the theorem, let us introduce some definitions:

• Dt: the (t − 1) × d matrix with $x_{\tau,a_\tau}^\top$ as its τ-th row.

• yt: the (t − 1)-dimensional vector with rτ as its τ-th component.

• $s_{t,a} = \sqrt{x_{t,a}^\top A_t^{-1} x_{t,a}}$.

According to the definitions, we have

$$A_t = I_d + D_t^\top D_t \quad \text{and} \quad b_t = D_t^\top y_t.$$

To prove the theorem, we rely on the following key lemma.

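Lemma 1. For any δ ∈ (0, 1), there exists some $\alpha \le O\left( \sqrt{d \log T + \log(K/\delta)} \right)$ such that for any a and t,

$$\Pr\left[ \left| x_{t,a}^\top \theta_t - x_{t,a}^\top \theta^* \right| > \alpha s_{t,a} + \frac{2}{T} \right] \le \frac{\delta}{KT}.$$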

We will prove the lemma in Subsection 3.1.1, and now we use it to prove the theorem.

According to the lemma together with a union bound, we have with probability 1 − δ that

$$\forall a \in [K],\ t \in [T]: \quad \left| x_{t,a}^\top \theta_t - x_{t,a}^\top \theta^* \right| \le \alpha s_{t,a} + \frac{2}{T}. \tag{3.1}$$

In the following, let us assume that the condition (3.1) above indeed holds. Recall that the total expected regret is

$$\sum_{t \in [T]} \left( x_{t,a^*}^\top \theta^* - x_{t,a_t}^\top \theta^* \right),$$

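where we let $a^*$ denote the optimal action at step t, which is $\arg\max_{a \in [K]} x_{t,a}^\top \theta^*$. Let us decompose the term $x_{t,a^*}^\top \theta^* - x_{t,a_t}^\top \theta^*$ as

$$x_{t,a^*}^\top \theta_t + x_{t,a^*}^\top (\theta^* - \theta_t) - x_{t,a_t}^\top \theta_t + x_{t,a_t}^\top (\theta_t - \theta^*),$$

which by (3.1) is at most

$$x_{t,a^*}^\top \theta_t + \alpha s_{t,a^*} + \frac{2}{T} - x_{t,a_t}^\top \theta_t + \alpha s_{t,a_t} + \frac{2}{T}.$$

As $x_{t,a^*}^\top \theta_t + \alpha s_{t,a^*} \le x_{t,a_t}^\top \theta_t + \alpha s_{t,a_t}$ according to our choice of $a_t$, we have

$$x_{t,a^*}^\top \theta^* - x_{t,a_t}^\top \theta^* \le 2 \alpha s_{t,a_t} + \frac{4}{T}.$$

Then by summing over t, we have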

$$\sum_{t \in [T]} \left( x_{t,a^*}^\top \theta^* - x_{t,a_t}^\top \theta^* \right) \le 2 \sum_{t \in [T]} \alpha s_{t,a_t} + 4.$$

Finally, Theorem 1 follows from the following lemma of [13].

Lemma 2. $\sum_{t \in [T]} s_{t,a_t} \le O\left( \sqrt{T d \log T} \right)$.
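The bound of Lemma 2 can be checked numerically. The sketch below relies on a standard argument that is not necessarily the one used in [13]: when every chosen feature vector has norm at most 1, each squared width is at most 1, the sum of the squared widths is at most 2 log det A_{T+1}, which is at most 2d log(1 + T/d), and the Cauchy–Schwarz inequality then bounds the sum of the widths by the square root of 2Td log(1 + T/d). The random unit vectors and all names are assumptions of this example.

```python
import math
import random

def sum_of_widths(xs):
    """Sum over t of s_t = sqrt(x_t^T A_t^{-1} x_t), with A_1 = I and A_{t+1} = A_t + x_t x_t^T."""
    d = len(xs[0])
    A_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    total = 0.0
    for x in xs:
        Ax = [sum(A_inv[i][j] * x[j] for j in range(d)) for i in range(d)]
        q = sum(xi * yi for xi, yi in zip(x, Ax))  # s_t squared
        total += math.sqrt(max(0.0, q))
        for i in range(d):
            for j in range(d):
                A_inv[i][j] -= Ax[i] * Ax[j] / (1.0 + q)  # Sherman-Morrison update
    return total

def unit_vector(d, rng):
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

rng = random.Random(1)
T, d = 400, 3
xs = [unit_vector(d, rng) for _ in range(T)]
bound = math.sqrt(2.0 * T * d * math.log(1.0 + T / d))
print(sum_of_widths(xs), bound)
```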

3.1.1 Proof of Lemma 1

Fix any a ∈ [K] and t ∈ [T]. Following [13], let us
