Literature Review - Intelligent Inventory Control Based on Deep Reinforcement Learning: The Case of a High-Tech Supply Chain

National Chengchi University (國立政治大學)

the last section, we summarize the entire research and discuss implications, limitations, and further directions for applying DRL to inventory problems.

2. Literature Review

2.1 Reinforcement Learning

Reinforcement learning (RL) (Sutton and Barto, 2018), a branch of machine learning, combines dynamic programming and supervised learning (Gosavi, 2009) and is considered a powerful framework for solving MDPs, a representative type of sequential decision-making problem under uncertainty. RL has been applied in several fields, such as game playing (Silver et al., 2017), energy efficiency optimization (Qi et al., 2016), and inventory control (Oroojlooyjadid et al., 2019b). RL trains an agent to learn an optimal policy through trial-and-error interactions with its environment (see Figure 2).

At each time step $t$, the agent observes the current state $s_t \in S$ (where $S$ is the set of possible states), chooses an action $a_t \in A(s_t)$ (where $A(s_t)$ is the set of possible actions in the current state), and receives a reward $r_t \in \mathbb{R}$; the environment then transitions randomly to another state $s_{t+1} \in S$. The transition probability matrix $P_a(s, s')$ gives the probability of moving from state $s_t$ to $s_{t+1}$ by taking action $a$, i.e., $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$. Correspondingly, $R_a(s, s')$ denotes the reward matrix. The objective of RL algorithms is to determine a policy $\pi_t : S \to A$ that maximizes the expected sum of discounted rewards, $\sum_{t=0}^{\infty} \gamma^t E[R_{a_t}(s_t, s_{t+1})]$, where $a_t = \pi_t(s_t)$ and $0 \le \gamma \le 1$ is the discount factor.
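As a small illustration of the discounted objective, the following sketch computes the discounted sum for a finite reward sequence. The function name `discounted_return` and the sample rewards are hypothetical and chosen only for illustration; the infinite-horizon sum in the text is truncated to the length of the list.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite sequence."""
    total = 0.0
    # Work backwards so each step needs only one multiplication by gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards r_0, r_1, r_2 with gamma = 0.5:
# 1 + 0.5 * 2 + 0.25 * 3
print(discounted_return([1.0, 2.0, 3.0], gamma=0.5))
```

With $\gamma < 1$, later rewards contribute less to the total, which is what keeps the infinite sum finite when rewards are bounded.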

Figure 2: Agent and environment for RL


policy through dynamic programming, such as value iteration and policy iteration (Gosavi, 2009), or linear programming (Sutton and Barto, 2018). In practice, however, $P_a(s, s')$ is usually unknown or difficult to estimate. Instead of computing the value function from a known model, model-free methods learn directly from trial-and-error interactions with the environment and estimate the value function by stochastic approximation. Model-free approaches are therefore generally preferred when the underlying model of the environment is not accessible (Arulkumaran et al., 2017).

Model-free RL algorithms can be further divided into two categories: on-policy and off-policy (Sutton and Barto, 2018). Two policies are involved in the RL framework. The policy used to generate behavior is called the behavior policy, and the policy being evaluated or improved is called the target policy. On-policy and off-policy methods differ in whether these two policies are identical: on-policy methods evaluate and improve the same policy that generates behavior, whereas off-policy methods separate the two. This separation allows off-policy methods to keep sampling all possible actions while the target policy is deterministic (e.g., greedy). SARSA and Q-learning are representative on-policy and off-policy methods, respectively.
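The distinction can be made concrete by comparing the bootstrap targets of the two methods. The sketch below is illustrative only: the function names and the Q-values are hypothetical, and it shows the standard textbook targets rather than any implementation from this study.

```python
def sarsa_target(reward, gamma, q_next, next_action):
    """On-policy target: bootstrap from the action the behavior policy took."""
    return reward + gamma * q_next[next_action]


def q_learning_target(reward, gamma, q_next):
    """Off-policy target: bootstrap from the greedy action in the next state."""
    return reward + gamma * max(q_next)


# Hypothetical Q-values Q(s', a) for three actions in the next state.
q_next = [1.0, 4.0, 2.0]

# Suppose the behavior policy explored and picked action 2 in s'.
print(sarsa_target(1.0, 0.9, q_next, next_action=2))  # uses Q(s', 2)
print(q_learning_target(1.0, 0.9, q_next))            # uses max over a
```

The two targets coincide only when the behavior policy happens to pick the greedy action in the next state.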

Q-learning is an important RL algorithm that solves an MDP by obtaining a policy $\pi$ that maximizes the Q-values for any $s \in S$ and $a = \pi(s)$, i.e.:

$$Q^*(s, a) = \max_{\pi} E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \mid s_t = s, a_t = a, \pi \right] \tag{1}$$

To compute $Q^*(s, a)$, we use the Bellman equation to transform Equation (1) into Equation (2) and solve the MDP for the best decision.

$$Q^*(s, a) = \max_{\pi} E\left[ r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a, \pi \right] \tag{2}$$


continues to be updated by Equation (3), a weighted average of the old Q-value and the newly learned Q-value:

$$Q(s_t, a_t) = (1 - \alpha_t) Q(s_t, a_t) + \alpha_t \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right), \quad \forall t = 1, 2, \ldots, \tag{3}$$

where $\max_a Q(s_{t+1}, a)$ is an estimate of the optimal future value and $\alpha_t$ is the learning rate at time step $t$.

However, traditional RL approaches suffer from the curse of dimensionality when the state and action spaces are large. Although other functions have been proposed to approximate the value function (Sutton and Barto, 2018), such as linear regression, non-linear functions, or neural network approximators, these approximators can produce unstable or diverging Q-values because of non-stationarity and correlations in the sequence of observations (Mnih et al., 2013). To address this, Mnih et al. (2015) proposed the seminal deep Q-network (DQN) algorithm, which combines Q-learning with a deep neural network (DNN) and experience replay memory (Lin, 1992). DQN uses a DNN as the value function approximator to predict the optimal Q-values: an evaluation network with weights $\theta$ is trained with the objective function presented as Equation (5). Mnih et al. (2015) implemented experience replay memory to decrease the correlation among time-dependent observations and to keep training on the agent's new behavior. In addition, they use an $\varepsilon$-greedy approach to enhance the agent's exploration ability: at time $t$, the agent chooses the greedy action, i.e., $a_{t+1} = \arg\max_a Q(s_{t+1}, a)$, with probability $1 - \varepsilon_t$, or a random action with probability $\varepsilon_t$.
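The three ingredients described above (the weighted-average Q-update, $\varepsilon$-greedy action selection, and experience replay) can be sketched in a few lines. This is a minimal tabular sketch under stated assumptions, not the DQN of Mnih et al. (2015): a dictionary stands in for the neural network, and the names `epsilon_greedy`, `q_update`, and `ReplayMemory`, the action set, and the transitions are all hypothetical.

```python
import random
from collections import defaultdict, deque


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Return a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    # Ties are broken in favor of the first action in `actions`.
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Weighted average of the old Q-value and the learned Q-value."""
    best_next = max(q[(next_state, a)] for a in actions)
    learned = reward + gamma * best_next
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * learned
    return q[(state, action)]


class ReplayMemory:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity, rng):
        self.buffer = deque(maxlen=capacity)  # oldest items drop out first
        self.rng = rng

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return self.rng.sample(list(self.buffer), batch_size)


rng = random.Random(0)
q = defaultdict(float)   # tabular Q, all values start at zero
actions = [0, 1, 2]      # hypothetical action set
memory = ReplayMemory(capacity=100, rng=rng)

# Store a few made-up transitions (state, action, reward, next_state).
for step in range(5):
    action = epsilon_greedy(q, step, actions, epsilon=0.1, rng=rng)
    memory.push((step, action, 1.0, step + 1))

# Replay a random minibatch rather than the most recent transitions.
for state, action, reward, next_state in memory.sample(3):
    q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9)
```

Sampling uniformly from the buffer breaks up the temporal ordering of the transitions, which is the mechanism by which replay reduces correlation between consecutive training samples.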


2.2 Deep Reinforcement Learning for Inventory Control

Inventory control is usually conducted by companies that want to maximize profit while holding as little inventory as possible without compromising service levels. In a supply chain network, there are different players, and each of them has to decide on its inventory replenishment policy (Giannoccaro and Pontrandolfo, 2002). Our study lies at the intersection of inventory control, machine learning, and RL. There is a vast body of prior work in each of these domains. Hence, rather than reviewing each topic separately, we discuss only the most relevant literature in this section.

Giannoccaro and Pontrandolfo (2002) were the first to apply an RL algorithm to inventory problems in serial supply chains. They use a semi-Markov average reward technique to make ordering decisions for a supply chain system with three agents facing stochastic supply lead time and demand. Chaharsooghi et al. (2008) consider an environment with four agents and a finite time horizon, and propose a Q-learning based algorithm for ordering decisions. Both studies report that RL algorithms deliver competitive results. However, classical RL algorithms do not scale to large state and action spaces, so the application of RL to inventory control has drawn limited attention.

Recent advances, such as using deep neural networks as function approximators in DRL, relax this limitation and enhance the applicability of RL. To the best of our knowledge, only Oroojlooyjadid et al. (2019b) and Gijsbrechts et al. (2019) develop DRL agents for inventory control problems. Oroojlooyjadid et al. (2019b) replace one agent i in the beer game supply chain with a DQN agent, while the other agents follow a base-stock policy (rational) or the Sterman formula (irrational). Their simulation environment consists of state variables defined as the last m periods of observations of agent i, the d+x rule (where x is restricted to a bounded range) for the action space, and a reward function defined as the sum of the stockout and holding costs of all agents. Moreover, they propose a penalization procedure to inform the DQN agent about the local cost (of the DQN agent) and the global cost (of all agents). They show that the DQN agent outperforms other policies in minimizing total supply chain cost. Gijsbrechts et al. (2019) demonstrate how DRL can be implemented to solve dual-sourcing, lost sales, and multi-echelon inventory problems. Using inventory level and backlogs as state variables, all possible order quantities from different suppliers as the action space, and the sum of sourcing and holding costs as the reward function, they employ an Asynchronous Advantage Actor-Critic (A3C) algorithm and compare its performance with notable policies and heuristics. Their results show that the DRL agent reaches reasonably small optimality gaps in example problems but is not the best performer among the tested policies and heuristics.
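To illustrate how a d+x rule keeps the action space small, the sketch below maps a discrete action index to an order quantity. It is only a reading of the rule as summarized above, not the implementation of Oroojlooyjadid et al. (2019b): the function name `order_quantity`, the bounds of -2 and 2 on x, the interpretation of d as the most recently observed demand, and the clipping at zero are all assumptions made for illustration.

```python
def order_quantity(demand, action_index, x_min=-2, x_max=2):
    """Map a discrete action index to an order of size d + x.

    `demand` plays the role of d; x ranges over the integers in
    [x_min, x_max], so the agent has x_max - x_min + 1 actions.
    """
    offsets = list(range(x_min, x_max + 1))
    x = offsets[action_index]
    # Assumption: an order cannot be negative, so clip at zero.
    return max(0, demand + x)


# With the hypothetical bounds above there are five actions per state.
for index in range(5):
    print(index, order_quantity(demand=10, action_index=index))
```

The number of actions depends only on the bounds on x, not on the scale of demand, which is why such a rule becomes restrictive when order quantities vary widely.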

Although we share the objective of solving inventory control problems with DRL algorithms, our study differs substantially from Oroojlooyjadid et al. (2019b) and Gijsbrechts et al. (2019) in several respects (see Table 1). First, both studies assume that demand is an independent and identically distributed (i.i.d.) random variable that follows a probability distribution, such as a gamma or uniform distribution. This assumption is hard to justify in the EMS sector. We therefore let the DRL agent learn directly from noisy demand signals rather than imposing parametric assumptions on the demand structure. Second, instead of restricting the action space to a small set of values, we relax this constraint by allowing order quantities in the tens or hundreds of thousands, which creates serious challenges for model development. Third, because the training period is taken directly from a real dataset and is therefore limited, our DRL agent can interact with the environment for only 96 periods per episode, which makes training the agent more difficult.

Moreover, unlike Oroojlooyjadid et al. (2019b), our study centers on lost sales, in which unmet demand in a given period cannot be fulfilled in the future. This constraint implies that stockouts weigh more heavily than holding inventory and increases the difficulty of modeling.
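The lost-sales assumption can be illustrated with a one-period inventory transition. The sketch below is a generic textbook-style step, not the environment used in this study: the function name `lost_sales_step`, the zero lead time, the linear cost structure, and the cost parameters are all assumptions made for illustration.

```python
def lost_sales_step(inventory, order, demand, holding_cost, stockout_cost):
    """Advance one period under lost sales.

    Returns (next_inventory, lost_demand, reward), where the reward is
    the negative of the period cost.
    """
    available = inventory + order     # assumption: orders arrive immediately
    sales = min(available, demand)
    lost = demand - sales             # unmet demand is lost, not backlogged
    next_inventory = available - sales
    cost = holding_cost * next_inventory + stockout_cost * lost
    return next_inventory, lost, -cost


# Hypothetical period: 5 units on hand, 3 ordered, demand of 10.
print(lost_sales_step(5, 3, 10, holding_cost=1.0, stockout_cost=4.0))
```

Under backlogging, the two lost units in this example would instead be carried forward as negative inventory and filled by later orders; under lost sales they disappear from the state, so the inventory level never falls below zero.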

To solve a finite-horizon lost sales inventory problem with more realistic assumptions on demand and possible actions, we attempt to develop archetypal analytics using DRL techniques and to assess the viability and value of deploying DRL agents for practitioners.


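In document: Intelligent Inventory Control Based on Deep Reinforcement Learning: The Case of a High-Tech Supply Chain - 政大學術集成 (NCCU Academic Hub) (pages 9-14)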
