The contextual dynamic pricing problem models scenarios such as selling products on the internet. The price may be decided after observing some information about the user visiting the website, so the owner of the website can dynamically decide which price to post in order to maximize revenue.
In this problem, a consumer enters the market at each step. Each consumer decides whether or not to purchase only within the round in which she enters the market, and does not stay for later time steps. The seller picks a price at each step. In the standard dynamic pricing problem, the only information the seller learns is whether the consumer purchases in each round. In the “contextual” dynamic pricing problem, the seller also observes some context information about each consumer before setting a price for her, which may help the seller make better decisions.
Here, we introduce a simple extension: the batched setting. In real-world situations, it may be difficult to show a different price to every consumer. In the batched setting, a single price is decided for a batch of consumers after observing all of their context information. Typically, a group of consumers of the same type may arrive within a short period of time. Hence, viewing this short time interval as a single time step, it is natural to customize the posted price for a batch of consumers.
More formally, in the batched contextual dynamic pricing problem, there are T rounds in total. In each round, n consumers enter the market, and the seller observes n context vectors, each of which is d-dimensional. The context vectors are arbitrary and may be chosen adversarially. The original problem without batching is the special case n = 1.
In each round, the seller must choose a price out of K fixed candidates, and then receives feedback on whether or not each consumer makes a purchase. In the case of limited inventory, there are only M products to sell.
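The interaction described above can be sketched as a short simulation loop. This is only an illustrative sketch: the names `run_batched_pricing`, `policy`, `buys`, and `linear_buys` are hypothetical, and the purchase rule (a consumer buys when a linear valuation of her context is at least the posted price) is an assumption made for the example, not a model stated in this section. The sketch also assumes that once the inventory is exhausted, a willing consumer is simply recorded as not purchasing.

```python
def run_batched_pricing(batches, prices, policy, buys, inventory=None):
    """Simulate the batched protocol.

    batches   : list of T batches, each a list of n context vectors (length d)
    prices    : list of the K candidate prices
    policy    : hypothetical callable (contexts, history) -> index into prices
    buys      : hypothetical callable (context, price) -> bool
    inventory : None for unlimited inventory, otherwise the number M of products
    """
    history = []   # everything the seller has observed so far
    revenue = 0.0
    for contexts in batches:
        k = policy(contexts, history)   # one price for the whole batch
        price = prices[k]
        feedback = []
        for x in contexts:
            # A sale needs a willing consumer and, if limited, a product in stock.
            sold = buys(x, price) and (inventory is None or inventory > 0)
            if sold and inventory is not None:
                inventory -= 1
            feedback.append(sold)
        revenue += price * sum(feedback)
        history.append((contexts, k, feedback))
    return revenue, history


def linear_buys(x, price, theta=(1.0, 2.0)):
    # ASSUMPTION for illustration: buy iff the linear valuation theta . x >= price.
    return sum(a * b for a, b in zip(theta, x)) >= price
```

Passing the batches in explicitly, rather than sampling them, reflects the fact that the context vectors are arbitrary and need not follow any distribution.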
For this specific setting, there was no known fundamental limit on the performance of an online algorithm. In this thesis, for the case of unlimited inventory, we first derive a regret bound that depends on T, n, and d.
For limited inventory, we present an algorithm that can achieve revenue close to a fraction of the objective revenue, that is, the revenue of any given policy that knows all hidden parameters in advance. It may sound strange that we compare with an arbitrary policy instead of the “optimal” one. The reason is that an optimal policy cannot be defined: the context vectors do not follow any distribution, and for any given policy, the context vectors may be chosen adversarially so that either no products are sold throughout the time horizon or the products sell out too soon, leading to a very large regret.
In this thesis, we study the contextual dynamic pricing problem by first deriving some results for the contextual bandit problem and then developing algorithms for the contextual dynamic pricing problem on top of those results. In the following subsection, we introduce the contextual bandit problem.
1.1.1 Contextual Bandit Problem
A fundamental problem in the area of machine learning is the multi-armed bandit problem [26, 4], in which a learner iteratively chooses actions to play and receives the corresponding rewards for a number of steps. The goal of the learner is to maximize her (or his) total reward, but she can only learn from very limited feedback information: the reward of her action at each step. This models many scenarios in our daily life and has found many applications in computer science as well as other disciplines.
In some scenarios, we also have some additional context information which may help us make better decisions if used properly. Consider for example an online advertising system associated with a search engine which chooses to display an ad for each incoming search query. One can see the possible ads as actions to choose from and the reward of an action at a time step corresponds to whether or not the user who sends the query actually clicks on the displayed ad. Note that in this scenario, the only feedback the system can receive at each step is whether or not the user clicks on the displayed ad, and the system has no idea at all about whether or not the user would click on a different ad had the system chosen to display it. This naturally fits the bandit feedback setting discussed before, but the main difference is that the same action (ad) can have a very different reward distribution at each time step (due to a very different search query). As it seems reasonable to assume here that the reward distribution is in fact governed by the corresponding search query, the contextual bandit problem [22, 9] has been introduced to model this.
Formally, in the contextual bandit problem, a learner has to play in the following way for a total of T steps. At each step, the learner must choose one of some K actions to play based on some context vectors associated with these actions at that step. After choosing the action, the learner then receives some reward corresponding to that action, which is the only feedback she has, and then proceeds to the next step. The standard goal of the learner is to minimize her regret, which is the gap between the total reward she receives and the highest possible one (by taking the best action at each step).
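The regret defined above can be made concrete with a small sketch. The function name `regret` and the table `rewards` are hypothetical: the table holds the reward of every action at every step, which the learner never sees in full, since she only observes the entry for the action she played.

```python
def regret(rewards, actions):
    """Gap between the best achievable total reward and the reward received.

    rewards[t][a] : reward of action a at step t (hidden from the learner,
                    who only observes rewards[t][actions[t]])
    actions[t]    : index of the action the learner played at step t
    """
    best = sum(max(row) for row in rewards)               # best action at each step
    received = sum(row[a] for row, a in zip(rewards, actions))
    return best - received


# Three steps, two actions; the learner plays actions 0, 0, 1.
example_rewards = [[0.2, 0.9], [0.5, 0.1], [0.3, 0.3]]
example_regret = regret(example_rewards, [0, 0, 1])       # about 0.7
```

In this example the learner loses reward only at the first step, where she plays action 0 instead of action 1.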
For this problem, linear reward functions are usually assumed, and algorithms achieving a regret of $O(\sqrt{Td \log^3(KT)})$ have been shown in [3, 13], where d is the dimension of the context vectors. However, both algorithms are quite complicated and may not be practical enough for real-life applications. In particular, both algorithms consist of two procedures, a master one and a base one: the master procedure divides the time steps into several parts, ensures an independence guarantee within each part, and calls the base procedure on each part. The reason for going to such trouble is that the base procedures are only shown to work under an independence condition. In fact, the algorithm of Chu et al. [13] is modified from that of Auer [3] by using a different base procedure called LinUCB, introduced by Li et al. [22]. The LinUCB algorithm itself is known to work well in practice [22], but unfortunately no theoretical guarantee has been shown for it [13].
It is not clear if the LinUCB algorithm, with the advantage of being relatively simple, can actually achieve a small regret, comparable to that of the much more complicated algorithms of [3, 13].
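To illustrate why LinUCB is considered relatively simple, the following is a minimal sketch of a LinUCB-style rule under the linear reward assumption. It is an illustration, not the exact procedure analyzed in [13] or [22]. The class name, the exploration weight `alpha`, the identity regularizer, and the use of a Sherman-Morrison update to maintain the inverse matrix are all choices made for this example.

```python
import math


class LinUCB:
    """LinUCB-style rule with one parameter vector shared by all actions."""

    def __init__(self, d, alpha=1.0):
        self.d = d
        self.alpha = alpha  # hypothetical exploration weight
        # A_inv is the inverse of A = I + (sum of x x^T over played contexts).
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.b = [0.0] * d  # sum of reward * x over played contexts

    def _matvec(self, x):
        # Computes A_inv times the vector x.
        return [sum(row[j] * x[j] for j in range(self.d)) for row in self.A_inv]

    def choose(self, contexts):
        """contexts[a] is the context vector of action a; returns the index
        with the largest upper confidence bound (first index on ties)."""
        theta = self._matvec(self.b)  # ridge-regression estimate A^{-1} b

        def ucb(x):
            Ax = self._matvec(x)
            mean = sum(t * xi for t, xi in zip(theta, x))
            width = math.sqrt(max(0.0, sum(xi * ai for xi, ai in zip(x, Ax))))
            return mean + self.alpha * width

        return max(range(len(contexts)), key=lambda a: ucb(contexts[a]))

    def update(self, x, reward):
        """Record the reward observed for the played context x."""
        # Sherman-Morrison: (A + x x^T)^{-1}
        #   = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x)
        Ax = self._matvec(x)
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, Ax))
        for i in range(self.d):
            for j in range(self.d):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        for i in range(self.d):
            self.b[i] += reward * x[i]
```

At each step the learner would call `choose` on the current context vectors, play the returned action, and pass the played context and the observed reward to `update`. There is no master procedure and no partition of the time steps, which is the simplicity referred to above.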
In order to relate the results back to the dynamic pricing problem, several extensions and modifications of the problem will also be discussed in this thesis.