A Deep Reinforcement Learning Approach for Intelligent Inventory Control in High-Tech Supply Chains - NCCU Academic Hub


National Chengchi University, Department of Management Information Systems
Master's Thesis
Advisor: Dr. Howard Hao-Chun Chuang (莊皓鈞)
A Deep Reinforcement Learning Approach for Intelligent Inventory Control in High-Tech Supply Chains (基於深度強化學習的智能存貨控制：以高科技供應鏈為例)
Graduate student: 廖信堯
February 2020 (ROC year 109)
DOI:10.6814/NCCU202000375.

Acknowledgements

I would like to express my gratitude to all those who have given me support and encouragement in completing this thesis. Any contribution of this thesis is attributable to your valuable assistance and stimulation.

First and foremost, I want to extend my heartfelt appreciation to my supervisor, Dr. Howard Hao-Chun Chuang, for his continuous support of my master's thesis, and for his patience, motivation, enthusiasm, and immense knowledge. He guided me through the whole process of researching and writing this thesis, which made my accomplishments possible. Big thanks once again go to him, for without him this work would never have seen the light as it is today.

My sincere gratitude also goes to Dr. Yen-Chun Chou, Dr. Jui-Chung Yang, and Dr. Yu-Ju Tu. Their advice and inspiration helped me re-examine my research, solidify the research foundation of this paper, and widen my knowledge of the subject.

I wish to extend my thanks to my beloved classmates and the scholars in my research laboratory for their constant support, tolerance, and encouragement.

My thanks also go to the authors whose books and articles inspired me in the writing of this paper.

Last but not least, my indebtedness extends to my family and my girlfriend, Dora, who have been assisting, supporting, and caring for me all my life. Their support and care motivate me to hold on to my dreams and pursue my goals steadily and indefatigably.

Abstract

Machine learning is revolutionizing business operations across industry sectors. Among different learning techniques, deep reinforcement learning (DRL) has received broad attention in recent years due to the salient performance of AlphaGo, an artificial intelligence (AI) system empowered by DRL. DRL is a model-free and data-driven approach to developing near-optimal policies for sequential decision-making problems. Intrigued by the success of DRL in various fields, in this study we assess the applicability of DRL to multi-period inventory control under stochastic demand, which is a classical Markov Decision Process problem. Working with the largest distributor of electronics manufacturing services (EMS) in the world, we propose deep Q-networks (DQN) for intelligent inventory control (IIC). Facing erratic and non-stationary demand for electronic components with limited market life cycles, the distributor could not infer the exact demand distribution and solve the inventory optimization problem analytically in a finite-horizon setting with lost sales. Hence, we develop DQN by specifying relevant state and decision inputs and then designing a data-driven simulation environment in which the agent is trained over thousands of episodes. For trained items, DQN outperforms the benchmark in a few ways. First, DQN can reduce total inventory by at least 40% while achieving a better service level. Second, when the penalty parameter increases, DQN can effectively reduce the amount of out-of-stock. When we transfer the trained DQN to testing sets within the same item, the out-of-sample performance is excellent. For other unseen items, we use the Maximum Entropy Bootstrap to train an ensemble DDQN and make our DRL agent more robust. Given the promising results in our experiments, we discuss implications, limitations, and further directions for applying DRL/DQN to business decision-making problems.

Keywords: Reinforcement learning, deep learning, inventory optimization, simulation, operations management.

Contents
1. Introduction
2. Literature Review
  2.1 Reinforcement Learning
  2.2 Deep Reinforcement Learning for Inventory Control
3. A Deep Reinforcement Learning Agent
  3.1 Problem Situation and Simulation Environment
  3.2 Deep Q-Networks for Inventory Control
4. Training Performance and Comparisons
  4.1 RL Simulation Design and Neural Net Architecture Tuning
  4.2 Comparisons Between DRL and Benchmark
5. The Applicability of Transferring Trained Agents
  5.1 Transfer Learning Performance on Latter Period of the Same Item
  5.2 Transfer Learning Performance on New Unseen Items
  5.3 DDQN with Ensemble Learning Method
6. Conclusion and Discussion
References

List of Figures
Figure 1: Schematic diagram of a high-tech supply chain
Figure 2: Agent and environment for RL
Figure 3: Demand realizations of example item
Figure 4: Deep Q-Networks for inventory control
Figure 5: Double Deep Q-Networks for inventory control
Figure 6: Training process of Double Deep Q-Networks
Figure 7: Flow chart of research
Figure 8: Performance of DQN and DDQN
Figure 9: Comparison between DDQN and BS Policy
Figure 10: Sensitivity analysis on LT, LD and δ
Figure 11: Decision comparison between Optimal, BS and DDQN policies
Figure 12: Test performance on latter period of the same item
Figure 13: Decisions made by different policies (LT=12)
Figure 14: Demand realizations of new unseen item
Figure 15: Test performance on new unseen item
Figure 16: Decisions made by different policies on unseen item (LT=12)
Figure 17: Demand realizations of training item and simulated sequences

List of Tables
Table 1: DRL related literature review
Table 2: RL simulation environment design
Table 3: Hyperparameters of DDQN
Table 4: Sensitivity analysis on Interval and NN Structure while LT equals 5
Table 5: Performance comparison between different policies
Table 6: Performance comparison between single and ensemble DDQN

1. Introduction

The high-tech supply chain is capital intensive, and the semiconductor industry especially so. Semiconductor products are high in value and have limited market life cycles, which makes inventory management more difficult. Within high-tech supply chains (see Figure 1), distributors act as a bridge between upstream and downstream players and provide inventory docking services, so their entire business revolves around inventory. A distributor is responsible for ensuring that each requested component is transported to the right place at the right time. Even a small delay in one component can temporarily stop an entire production line and lead to massive revenue losses and dissatisfied clients. A naive response would be to store more inventory for unexpected future requests, but warehouse space is limited and stockpiling incurs holding costs. Developing practical ordering policies is therefore an important problem for high-tech distributors.

Figure 1: Schematic diagram of a high-tech supply chain

In this study, we work with the largest distributor of electronics manufacturing services (EMS), which aims to develop an optimal replenishment policy that reduces inventory holdings without compromising its service level. We formulate the inventory control problem as a sequential decision-making process, usually referred to as a Markov Decision Process (MDP). In general, an MDP can be solved by dynamic programming, which divides a problem into simpler sub-problems, but high-dimensional state and action spaces (the curse of dimensionality) make classical dynamic programming inefficient. The dimensionality issue is relieved by deep reinforcement learning (DRL), which combines deep learning and reinforcement learning (RL).

Since 2015, DRL has attracted public attention largely owing to AlphaGo, an artificial intelligence (AI) system empowered by DRL.

In addition to games, DRL also has broad applications in areas such as robotics, computer vision, and finance. We therefore propose a DRL approach for intelligent inventory control (IIC). Specifically, our DRL approach, based on the Deep Q-Network (DQN), uses neural nets as value function approximators, and we design a data-driven simulation environment by specifying relevant state and decision inputs. We make the following contributions:
(1) We provide a proof of concept that DRL, without making any a priori assumptions on demand structures, can cope with inventory problems in a real-world setting with stochastic demand and lost sales.
(2) By increasing the dimension of the state and action spaces, we build a simulation environment that comes close to reality.
(3) Our DRL agent outperforms the method currently used in the company (the benchmark) in training and testing by reducing total inventory and reaching a higher service level.
(4) We design a penalty mechanism on out-of-stock that not only reduces total inventory but also maintains a desirable service level and fill rate.
(5) In out-of-sample testing within the same item, the performance of DRL is promising and provides decent procurement policies for the distributor.
(6) For out-of-sample testing on unseen items, we propose an ensemble method that significantly surpasses our single DRL agent.
Overall, our study is a strong examination of whether DRL qualifies as a general-purpose technology for intractable inventory replenishment problems (Gijsbrechts et al., 2019). By contributing to the theory and practice of inventory control in high-tech supply chains, our purely data-driven modeling effort aims to be a prototypical example of developing prescriptive analytics based on state-of-the-art DRL techniques. The remainder of this paper is structured as follows.
Section 2 provides a brief summary of the relevant literature and our contribution. Details of the methodology and algorithms are presented in Section 3, and Section 4 reports the training performance of the DRL agent. Section 5 shows the results of transferring trained agents to test items and presents our ensemble model. In the last section, we summarize the research and discuss implications, limitations, and further directions for applying DRL to inventory problems.

2. Literature Review

2.1 Reinforcement Learning

Reinforcement learning (Sutton and Barto, 2018), a branch of machine learning, combines dynamic programming and supervised learning (Gosavi, 2009) and is considered a powerful framework for solving MDPs, a representative type of sequential decision-making problem under uncertainty. RL has been applied in several fields, such as game playing (Silver et al., 2017), energy efficiency optimization (Qi et al., 2016), and inventory control (Oroojlooyjadid et al., 2019b). The approach trains agents to learn the optimal policy through trial-and-error interactions between the agent and the environment (see Figure 2).

In each time step t, the agent observes the current state $s_t \in S$ (where $S$ is the set of possible states), chooses an action $a_t \in A(s_t)$ (where $A(s_t)$ is the set of possible actions in the current state), and gains a reward $r_t \in \mathbb{R}$; the environment then transits randomly to another state $s_{t+1} \in S$. The transition probability matrix $P_a(s, s')$ gives the probability of moving from state $s_t$ to $s_{t+1}$ by taking action $a$, i.e., $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$. Correspondingly, $R_a(s, s')$ denotes the reward matrix. The objective of RL algorithms is to maximize the expected sum of discounted rewards by determining a policy $\pi_t: S \rightarrow A$ that maximizes $\sum_{t=0}^{\infty} \gamma^t E[R_{a_t}(s_t, s_{t+1})]$, where $a_t = \pi_t(s_t)$ and $0 \leq \gamma \leq 1$ is the discount factor.

Figure 2: Agent and environment for RL
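To make the discounted objective above concrete, the following is a minimal sketch that evaluates the discounted sum for one finite realized reward sequence. The function name `discounted_return` and the truncation to a finite horizon are assumptions made for illustration; they are not taken from the thesis.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite sequence of realized rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Three unit rewards discounted at 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

A discount factor close to 1 weights later rewards almost as heavily as immediate ones, while a small one makes the agent short-sighted.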

Based on whether $P_a(s, s')$ is known, RL algorithms can be separated into two sub-fields: model-based and model-free methods. In model-based methods, we can obtain the optimal policy through dynamic programming, such as value iteration and policy iteration (Gosavi, 2009), or linear programming (Sutton and Barto, 2018). In reality, however, $P_a(s, s')$ is usually unknown or difficult to estimate. Instead of computing the value function directly, model-free methods learn from trial-and-error interactions with the environment and estimate the value function by stochastic approximation. Model-free approaches are generally preferable when there is no access to the underlying model of the environment (Arulkumaran et al., 2017).

Model-free RL algorithms can be further divided into two categories: on-policy and off-policy (Sutton and Barto, 2018). There are two policies in the RL framework. The one used to generate behavior is called the behavior policy, and the one being evaluated or improved is referred to as the target policy. The difference between on-policy and off-policy methods is whether the behavior and target policies are identical. On-policy methods evaluate and improve the same policy that generates behavior, whereas off-policy methods separate the two. This separation enables off-policy methods to keep sampling all possible actions while the target policy is deterministic (e.g., greedy). SARSA and Q-learning are representative on-policy and off-policy methods, respectively.

Q-learning is an important RL algorithm that solves an MDP by obtaining a policy $\pi$ that maximizes the Q-values for any $s \in S$ and $a = \pi(s)$, i.e.:

$$Q^*(s, a) = \max_{\pi} E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \mid s_t = s, a_t = a, \pi] \tag{1}$$

To compute $Q^*(s, a)$, we use the Bellman equation to transform Equation (1) into Equation (2) and solve the MDP for the best decision.

$$Q^*(s, a) = \max_{\pi} E[r_t + \gamma \max_{a} Q(s_{t+1}, a) \mid s_t = s, a_t = a, \pi] \tag{2}$$
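The on-policy versus off-policy distinction described above shows up in how the one-step bootstrapped target is formed. The sketch below contrasts the two under illustrative assumptions: `q_next` is a hypothetical array of current Q-value estimates for the next state, and the function names are not from the thesis.

```python
import numpy as np


def sarsa_target(reward, q_next, next_action, gamma):
    """On-policy: bootstrap from the action the behavior policy actually takes next."""
    return reward + gamma * q_next[next_action]


def q_learning_target(reward, q_next, gamma):
    """Off-policy: bootstrap from the greedy action, whatever is actually taken next."""
    return reward + gamma * np.max(q_next)


q_next = np.array([1.0, 3.0, 2.0])
print(sarsa_target(1.0, q_next, next_action=2, gamma=0.5))  # 2.0
print(q_learning_target(1.0, q_next, gamma=0.5))            # 2.5
```

The two targets coincide only when the behavior policy happens to pick the greedy action in the next state.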

Before learning, the RL agent initializes $Q(s, a)$ for all s and a, and then keeps updating it by Equation (3), a weighted average of the old Q-value and the newly learned Q-value:

$$Q(s_t, a_t) = (1 - \alpha_t) Q(s_t, a_t) + \alpha_t \left(r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)\right), \quad \forall t = 1, 2, \ldots \tag{3}$$

where $\max_a Q(s_{t+1}, a)$ is an estimate of the optimal future value and $\alpha_t$ is the learning rate at time step t. However, traditional RL approaches suffer from the curse of dimensionality when problems have large state and action spaces. Although other functions have been proposed to approximate the value function (Sutton and Barto, 2018), such as linear regression, non-linear functions, or neural networks, these approximators can produce unstable or diverging Q-values because of non-stationarity and correlations in the sequence of observations (Mnih et al., 2013). Mnih et al. (2015) therefore proposed the seminal deep Q-network (DQN) algorithm, which combines Q-learning with a deep neural network (DNN) and experience replay memory (Lin, 1992). DQN takes a DNN as the value function approximator to predict the optimal Q-values:

$$Q(s_t, a_t) \approx Q(s_t, a_t; \theta) \tag{4}$$

$$\theta^* = \arg\min_{\theta} \left[Q(s_t, a_t; \theta) - \left(r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta^-)\right)\right]^2, \quad \forall t = 1, 2, \ldots \tag{5}$$

where $\max_a Q(s_{t+1}, a; \theta^-)$ is approximated by a target network with weights $\theta^-$. The evaluation network with weights $\theta$ is trained with the objective function in Equation (5). Mnih et al. (2015) implement experience replay memory to decrease the correlation of time-dependent observations while the agent keeps training as its behavior evolves. In addition, they use an $\varepsilon$-greedy approach to enhance the agent's ability to explore: at time t, the agent chooses the greedy action, i.e., $a_{t+1} = \arg\max_a Q(s_{t+1}, a)$, with probability $1 - \varepsilon_t$, or a random action with probability $\varepsilon_t$.
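As an illustration of the tabular update in Equation (3) and the ε-greedy rule described above, here is a minimal sketch. The table shape, the function names `q_update` and `epsilon_greedy`, and the use of NumPy's random generator are assumptions for illustration only; the thesis itself replaces the table with a neural network.

```python
import numpy as np


def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Equation (3): weighted average of the old Q-value and the learned target."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q


def epsilon_greedy(Q, s, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy (argmax) action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))


rng = np.random.default_rng(0)
Q = np.zeros((2, 2))          # 2 states x 2 actions, initialized to zero
Q[1] = [1.0, 2.0]             # pretend state 1 already has estimates
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q[0, 1])                # 0.5 * 0 + 0.5 * (1 + 0.9 * 2)
print(epsilon_greedy(Q, s=1, epsilon=0.0, rng=rng))  # greedy choice
```

With `epsilon=0.0` the rule is purely greedy; during training, a larger value keeps the agent sampling actions it would otherwise never try.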

2.2 Deep Reinforcement Learning for Inventory Control

Inventory control is usually conducted by companies that want to maximize profit while holding the least possible inventory without compromising service level. In a supply chain network, there are different players, and each of them has to decide on its inventory replenishment policy (Giannoccaro and Pontrandolfo, 2002). Our study lies at the intersection of inventory control, machine learning, and RL. There is a vast number of prior studies in each of these domains. Hence, rather than reviewing each topic separately, we discuss only the most relevant literature in this section.

Giannoccaro and Pontrandolfo (2002) are the first to apply an RL algorithm to the inventory problem in serial supply chains. They use a semi-Markov average reward technique to make ordering decisions for a supply chain system with three agents facing stochastic supply lead time and demand. Chaharsooghi et al. (2008) consider an environment with four agents and a finite time horizon, and propose a Q-learning based algorithm for ordering decisions. Both studies report that RL algorithms deliver competitive results. However, classical RL algorithms are limited when the state and action spaces are large, so the application of RL to inventory control has drawn limited attention.

Recent advances, such as using deep neural nets as function approximators in DRL, relax this limitation and enhance the applicability of RL. To the best of our knowledge, only Oroojlooyjadid et al. (2019b) and Gijsbrechts et al. (2019) develop DRL agents for inventory control problems. Oroojlooyjadid et al. (2019b) replace one of the agents i in the beer game supply chain with a DQN agent while the other agents follow a base-stock policy (rational) or a Sterman formula (irrational).
Their simulation environment consists of state variables defined as the last m periods of observations of agent i, the d+x rule (where x is restricted to a bounded range) for the action space, and a reward function defined as the sum of the stockout and holding costs of all agents. Moreover, they propose a penalization procedure to inform the DQN agent about the local cost (of the DQN agent) and the global cost (of all agents). They show that the DQN agent outperforms other policies in minimizing the total supply chain cost. Gijsbrechts et al. (2019) show how DRL can be implemented to solve dual-sourcing, lost sales, and multi-echelon inventory problems. Taking inventory level and backlogs as state variables, all possible quantities from different suppliers as the action space, and the sum of sourcing and holding costs as the reward function, they propose an Asynchronous Advantage Actor-Critic (A3C) algorithm and compare its performance with notable policies and heuristics. Their results show that the DRL agent reaches reasonably small optimality gaps in example problems but does not appear to be the best performer among the tested policies and heuristics.

Although we share the objective of solving inventory control problems with DRL algorithms, our study differs substantially from Oroojlooyjadid et al. (2019b) and Gijsbrechts et al. (2019) in a few respects (see Table 1). First, both studies assume that demand is an independent and identically distributed (i.i.d.) random variable that follows a probability distribution, such as a gamma or uniform distribution. Such a distribution is hard to infer in the EMS sector. Consequently, we let the DRL agent learn directly from noisy demand signals rather than imposing any parametric assumptions on demand structures. Second, instead of restricting the action space to a few small numbers, we relax the constraint by allowing order quantities in the tens or hundreds of thousands, which creates serious challenges for model development. Third, with a limited training period retrieved directly from a real dataset, our DRL agent can interact with the environment for only 96 periods per episode, which makes agent training more difficult. Moreover, unlike Oroojlooyjadid et al. (2019b), lost sales (unmet demand in a specific period cannot be fulfilled in the future) is a central feature of our study.
This constraint implies that stockouts matter more than holding inventory and increases the difficulty of modeling. To solve a finite-horizon lost sales inventory problem with more realistic assumptions on demand and possible actions, we attempt to develop archetypal analytics using DRL techniques and assess the viability and value of deploying DRL agents for practitioners.

Table 1: DRL related literature review

3. A Deep Reinforcement Learning Agent

3.1 Problem Situation and Simulation Environment

From interviews with procurement executives, we learned some inherent characteristics of EMS. First, the multi-period inventory problem is an MDP in a finite-horizon lost sales setting. Second, demand from EMS clients is stochastic, and the supply lead time (the number of weeks needed to receive goods after placing an order) is stable. To train a DRL procurement agent, we first have to construct the simulation environment. Equations (6)-(11) show the physics of the inventory control system in the simulation program. Below we explain each of them.

Equation (6) shows that in period/week t, the begin on-hand inventory ($BOH_t$) is the end on-hand (EOH) inventory of the previous week, $EOH_{t-1}$. Since in each week the decision-maker decides on an order quantity $q_t$ that will arrive at the distributor's warehouse after LT (supply lead time) weeks, the order received in week t ($OR_t$) is exactly the quantity q placed in period t-LT-1 (Equation (7)). Then the random demand $d_t$ arrives and the company updates its EOH in Equation (8) and its out-of-stock (OOS) in Equation (9). Equation (10) shows that the begin-of-week backlog ($BBL_t$) equals the begin-of-week backlog in the previous week ($BBL_{t-1}$) minus the goods received in the previous week ($OR_{t-1}$), plus the order placed in the previous week ($q_{t-1}$). Finally, we compute the available stock ($AS_t$), or inventory position, in Equation (11), which equals the sum of EOH and the begin-of-week in-transit inventory $BBL_t$ minus the order received this period.

$$BOH_t = EOH_{t-1} \tag{6}$$

$$OR_t = q_{t-LT-1} \tag{7}$$

$$EOH_t = \max(BOH_t + OR_t - d_t, 0) \tag{8}$$

$$OOS_t = \max(d_t - BOH_t - OR_t, 0) \tag{9}$$

$$BBL_t = BBL_{t-1} - OR_{t-1} + q_{t-1} \tag{10}$$

$$AS_t = EOH_t + BBL_t - OR_t \tag{11}$$

The uncertainty in the inventory system arises from the unknown demand realization $d_t$. The stable yet long lead time, LT, aggravates the issue, because predicting all the uncertain demand realizations $d$ in Equation (12) in order to decide the exact optimal order quantity $q_t^*$ is by no means an easy task. Instead of imposing parametric assumptions on the random variable D, we use the stochastic demand time series as raw inputs to Equation (8) to create uncertainty in the decision-making environment. Figure 3 shows demand data for an example item used in our preliminary analysis. The non-stationarity of the time series makes it impossible to compute optimal order quantities in advance, so we propose a DRL approach to this decision-making problem.

$$q_t^* = d_{t+LT+1} - \max\left(0, AS_t - \sum_{s=t+1}^{t+LT} d_s\right) \tag{12}$$

Figure 3: Demand realizations of example item
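The inventory recursion in Equations (6)-(11) can be sketched as a short simulation loop. This is a minimal illustration under stated assumptions, not the thesis's simulation program: out-of-stock is taken as the non-negative shortfall max(d_t - BOH_t - OR_t, 0), orders placed before the first simulated week are assumed to be zero, the initial in-transit quantity is assumed to be zero, and the function and key names are hypothetical.

```python
def simulate_inventory(demand, orders, lead_time, initial_on_hand=0):
    """Roll the inventory system forward; orders[t] is the quantity q_t placed in week t."""
    eoh_prev = initial_on_hand
    bbl_prev = or_prev = q_prev = 0   # assumption: nothing in transit before week 0
    history = []
    for t, d in enumerate(demand):
        boh = eoh_prev                                    # Eq. (6)
        placed = t - lead_time - 1
        received = orders[placed] if placed >= 0 else 0   # Eq. (7)
        eoh = max(boh + received - d, 0)                  # Eq. (8)
        oos = max(d - boh - received, 0)                  # Eq. (9), shortfall as a non-negative amount
        bbl = bbl_prev - or_prev + q_prev                 # Eq. (10)
        avail = eoh + bbl - received                      # Eq. (11)
        history.append({"boh": boh, "received": received, "eoh": eoh,
                        "oos": oos, "bbl": bbl, "avail": avail})
        eoh_prev, bbl_prev, or_prev, q_prev = eoh, bbl, received, orders[t]
    return history


# Four weeks of demand 5, one order of 10 placed in week 0, lead time of 1 week.
for week in simulate_inventory([5, 5, 5, 5], [10, 0, 0, 0], lead_time=1, initial_on_hand=8):
    print(week)
```

In this toy run the order placed in week 0 arrives in week 2 (t + LT + 1), so week 1 suffers a stockout of 2 units that is lost rather than backordered.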

The goal of our DRL agent is to find a replenishment policy for choosing $q_t$ such that the sum of EOH and OOS over T periods is minimized, as shown in Equation (13). The parameter $\delta$ is a non-negative penalty term on OOS that helps the DRL agent strike a balance between holding stock and causing stockouts. In addition, we monitor the service level (SL) in Equation (14), where Ind( ) returns 1 if its input is true and 0 otherwise. $RFR$ denotes the required fill rate, i.e., the fraction of production demand that ought to be satisfied by $BOH_t$ and $OR_t$. SL informs decision-makers of the likelihood of being able to meet $RFR$ over T weeks.

$$\min \sum_{t=1}^{T} (EOH_t + \delta \, OOS_t) \tag{13}$$

$$SL = \frac{\sum_{t=1}^{T} Ind\left(\frac{BOH_t + OR_t}{d_t} \geq RFR\right)}{T} \tag{14}$$

3.2 Deep Q-Networks for Inventory Control

Having explained the problem situation and simulation environment, in this section we describe how we develop deep Q-networks (DQN) for order decisions. Each RL agent goes through trial-and-error interaction with the dynamic environment. Every time the agent chooses an action given its current state, it gains a reward from the environment. Then, because demand is realized randomly in our case, the environment transits into the next state. By collecting the history of state-action pairs and the corresponding rewards, the agent learns how to make decisions under various circumstances, i.e., different state values. The state and action spaces, the reward function, and the DRL algorithms are defined below.

State variables: We include the on-hand inventory in the current week, the inventory in transit ($q_{t-LT}, q_{t-LT+1}, \ldots, q_{t-1}$), demand realizations in recent weeks ($d_{t-LD+1}, \ldots, d_{t-1}, d_t$), and time effects ($season_t$ and time stamp t) in the state variables, and write $s_t$ as ($BOH_t, EOH_t, q_{t-LT}, q_{t-LT+1}, \ldots, q_{t-1}, d_{t-LD+1}, \ldots, d_{t-1}, d_t, season_t, t$), where LT is the supply lead time and LD is the length of the demand history.
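The objective in Equation (13) and the service level in Equation (14) can be evaluated from simulated weekly series as sketched below. The function names are hypothetical, and counting weeks with zero demand as satisfied is an assumption of this sketch, since the thesis does not state how the ratio is treated when $d_t = 0$.

```python
def total_cost(eoh, oos, delta):
    """Equation (13): summed end on-hand inventory plus delta-weighted out-of-stock."""
    return sum(e + delta * o for e, o in zip(eoh, oos))


def service_level(boh, received, demand, rfr):
    """Equation (14): share of weeks in which (BOH_t + OR_t) / d_t reaches the required fill rate."""
    hits = 0
    for b, o, d in zip(boh, received, demand):
        # Assumption: a week without demand cannot miss the fill rate.
        if d == 0 or (b + o) / d >= rfr:
            hits += 1
    return hits / len(demand)


print(total_cost(eoh=[3, 0, 5], oos=[0, 2, 0], delta=10))                          # 3 + 20 + 5
print(service_level(boh=[8, 3, 0], received=[0, 0, 10], demand=[5, 5, 5], rfr=1.0))  # 2 of 3 weeks
```

Raising `delta` makes the same stockout far more expensive than the inventory that would have prevented it, which is the lever the penalty term provides.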
Reward function: The reward function is the key that drives an agent's behavior. In each period/week t, the agent observes the state variables $s_t$ and takes action $a_t$; we set our reward structure $r_t$ to measure the goodness of the action, where $r_t = (EOH_{t+LT+1} + \delta * OOS_{t+LT+1})$ and $\delta$ is the penalty parameter for stockouts. Note that we use the values of EOH and OOS in period t+LT+1 because $a_t$ does not directly affect system performance until that period.

Action space: The number of possible actions, or order quantities, in this MDP is infinite. For the DQN to function appropriately, we discretize the action space and set an upper limit on quantity, $q_{max}$, and a granularity of action, interval, to generate the possible action choices as $a_1 = 0$, $a_2 = (q_{max}/(interval - 1)) * 1$, $a_3 = (q_{max}/(interval - 1)) * 2$, ..., $a_{interval} = q_{max}$. The larger the interval, the more degrees of freedom in possible actions. However, this comes at the cost of increasing the number of output nodes and parameters in the neural nets, making the DRL agent more difficult to train.

DQN and DDQN algorithms: After defining the state, action, and reward, we propose our DQN algorithm for inventory control in Figure 4. The objective of our DRL agent is to learn a near-optimal policy that minimizes (rather than maximizes) the accumulated discounted reward (see Equation (13)). At first, we initialize all weights of our evaluation network ($\theta$) and target network ($\theta^-$) to random numbers $W$, where $W \sim N(0, 0.1)$. Then, the algorithm begins with an outer for loop that requires the agent to learn how much to order over n episodes. For each episode, we reset the memory buffer and initialize all state variables to -1 (because no observations are available at period 0). After that, we enter an inner for loop over T periods/weeks. In each period t, the agent receives the previously ordered quantity ($OR_t$), fulfills the random demand $d_t$, and takes an action $a_t$. Subsequently, the environment transits to the new state $s_{t+1}$.
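The discretized action grid and the delayed reward defined above can be sketched in a few lines. This is an illustration under assumptions: the evenly spaced grid is generated with `numpy.linspace`, the EOH and OOS series are assumed to be already simulated and indexed by week, and the function names are hypothetical.

```python
import numpy as np


def action_grid(q_max, interval):
    """Order quantities a_1 = 0, ..., a_interval = q_max, spaced q_max / (interval - 1) apart."""
    return np.linspace(0.0, q_max, interval)


def delayed_reward(eoh, oos, t, lead_time, delta):
    """r_t = EOH_{t+LT+1} + delta * OOS_{t+LT+1}: the cost observed when the order arrives."""
    k = t + lead_time + 1
    return eoh[k] + delta * oos[k]


print(action_grid(q_max=1000, interval=5))   # 0, 250, 500, 750, 1000
print(delayed_reward(eoh=[3, 0, 5, 0], oos=[0, 2, 0, 0], t=0, lead_time=1, delta=10))
```

Because the grid size equals the number of output nodes of the Q-network, a finer grid gives the agent more precise order quantities at the price of a larger network to train.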
During the T periods, the agent keeps interacting with the environment and storing the normalized experience $e_t$, i.e., $[s_t / q_{max}, a_t, r_t / q_{max}, s_{t+1} / q_{max}]$, in the memory buffer $E$, where $E$ includes $\{e_{t-m}, \ldots, e_t\}$ and $m$ is the size of the memory buffer. Here, we normalize experience by $q_{max}$ to solve the scaling problem across different items and enhance the generalizability of our DRL agent. Note that to balance the exploration-exploitation trade-off in this MDP, we adopt an $\varepsilon$-greedy strategy so that the DQN agent chooses an action randomly (explores) with probability $\varepsilon$, and $\varepsilon$ decays with episodes as the agent evolves over time.

Algorithm: Deep Q-Networks for Inventory Control
procedure DQN
  for Episode = 1 : n do (simulate the inventory control process)
    Initialize Experience Replay Memory, E = [ ]
    Reset BOH, EOH, and Backlog
    for t = 1 : T do
      With probability $\varepsilon$ take random action $a_t$,
      otherwise set $a_t = \arg\min_a Q(s_t, a, \theta)$
      Execute action $a_t$, move into state $s_{t+1}$, and collect reward $r_t$
      Add $(s_t, a_t, r_t, s_{t+1})$ into E
      Get a mini-batch of experiences $(s_j, a_j, r_j, s_{j+1})$ from E
      Set $y_j = r_j + \gamma * \min_a Q(s, a; \theta^-)$ if t < T, else $y_j = r_j$
      Run one forward pass and backward propagation on the DNN with loss function $(y_j - Q(s_j, a_j; \theta))^2$ to update $\theta$
      Every C iterations, set $\theta^- = \theta$
    end for
    Adjust $\varepsilon$
  end for
end procedure
Figure 4: Deep Q-Networks for inventory control

When the memory buffer is full, the agent performs two tasks. One is to keep updating the memory buffer by the first in, first out (FIFO) method, and the other is to sample a mini-batch randomly to train the evaluation network ($\theta$) (Lin, 1992). Random sampling breaks the correlations among time-dependent observations, reduces the variance of the approximated values (Mnih et al., 2013), and prevents the neural network from becoming a diverging approximator (Mnih et al., 2015). Prioritized Experience Replay (Schaul et al., 2015) is another option for sampling experience, but, considering the continuous non-zero reward in our simulation, random sampling and prioritized sampling may make no difference.
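The FIFO memory buffer with normalized experience and uniform random mini-batch sampling described above can be sketched with a bounded deque. The class and method names are hypothetical, and dividing every state entry by `q_max` follows the normalization written in the text; it is a sketch rather than the thesis's implementation.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size experience buffer: oldest entries are evicted first (FIFO)."""

    def __init__(self, capacity, q_max):
        self.buffer = deque(maxlen=capacity)  # deque drops the oldest item when full
        self.q_max = q_max

    def add(self, state, action, reward, next_state):
        """Store one experience, scaling states and reward by q_max."""
        s = [x / self.q_max for x in state]
        s_next = [x / self.q_max for x in next_state]
        self.buffer.append((s, action, reward / self.q_max, s_next))

    def sample(self, batch_size):
        """Uniform random mini-batch without replacement."""
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)


memory = ReplayMemory(capacity=2, q_max=100.0)
memory.add([50.0], 0, 10.0, [60.0])
memory.add([60.0], 1, 20.0, [70.0])
memory.add([70.0], 2, 30.0, [80.0])   # evicts the first experience
print(len(memory.buffer), [e[1] for e in memory.buffer])
```

Sampling uniformly from the whole buffer, rather than replaying the most recent weeks in order, is what breaks the week-to-week correlation in the training batches.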

The evaluation network ($\theta$) tries to find weights $\theta$ that minimize the mean squared error (MSE) between $Q(s, a; \theta)$ and $y_j$, where $y_j$ is the prediction of the Q-value obtained from the target network ($\theta^-$). This is done by running one forward pass and backward propagation on the mini-batch sample. Over the iterated training process in each episode, the weights $\theta^-$ of the target network are replaced by the weights $\theta$ of the evaluation network every $C$ iterations.

Though the structure of DQN is innovative, the DRL literature (Van Hasselt et al., 2015) reports that DQN approximates the target Q-values, $y_j$, by a greedy strategy, which might cause underestimation of non-optimal actions' Q-values in our simulation. Thus, we make slight yet effective modifications to DQN following Van Hasselt et al. (2015), who propose a Double DQN (DDQN) algorithm that uses both the evaluation ($\theta$) and target ($\theta^-$) networks in determining $y_j$, as shown in Figures 5 and 6. Specifically, we first use the evaluation network to select the best action for the next state, i.e., $a_{min}(s_{j+1}; \theta) = \arg\min_{a} Q(s_{j+1}, a; \theta)$. Then we replace $y_j = r_j + \gamma \min_a Q(s_{j+1}, a; \theta^-)$ in Figure 4 with $y_j = r_j + \gamma\, Q(s_{j+1}, a_{min}(s_{j+1}; \theta); \theta^-)$. The remaining parts of the DDQN algorithm are identical to the DQN algorithm. If the Q-values converge and the MSE values stabilize after $n$ episodes, the DRL for inventory control in Figure 5 will result in an intelligent agent, the evaluation network ($\theta$), for procurement decision-making.

The flow chart of the research is consolidated in Figure 7. We have already explained our simulation and environment. In the next sections, we present our training performance and out-of-sample testing.
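The difference between the two target definitions is easiest to see on a small numerical example. The sketch below assumes that the Q-values of the next state are already available as plain Python lists, one entry per action; the function names `dqn_target` and `ddqn_target` are illustrative and do not come from the thesis code.

```python
def argmin(values):
    """Index of the smallest entry (first one on ties)."""
    return min(range(len(values)), key=lambda a: values[a])


def dqn_target(reward, q_next_target, gamma, terminal=False):
    """DQN: y = r + gamma * min_a Q(s', a; theta^-), or y = r at the last period."""
    if terminal:
        return reward
    return reward + gamma * min(q_next_target)


def ddqn_target(reward, q_next_eval, q_next_target, gamma, terminal=False):
    """DDQN: the evaluation net selects the action, the target net values it."""
    if terminal:
        return reward
    a_min = argmin(q_next_eval)  # a_min(s'; theta)
    return reward + gamma * q_next_target[a_min]


# Hypothetical Q-values for three actions in the next state.
q_eval = [4.0, 2.0, 3.0]    # Q(s', . ; theta)
q_target = [1.0, 5.0, 2.0]  # Q(s', . ; theta^-)

y_dqn = dqn_target(10.0, q_target, gamma=0.5)            # 10 + 0.5 * 1.0
y_ddqn = ddqn_target(10.0, q_eval, q_target, gamma=0.5)  # 10 + 0.5 * 5.0
```

Because the DDQN target evaluates a single action chosen by the other network, it can never fall below the DQN target computed from the same target-network values, which is how the decoupling counteracts the downward bias of the min operator.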

Algorithm: Double Deep Q-Networks for inventory control

    procedure DDQN
        for Episode = 1 : n do (simulate the inventory control process)
            Initialize experience replay memory, E = [ ]
            Reset BOH, EOH, and Backlog
            for t = 1 : T do
                With probability $\varepsilon$ take a random action $a_t$,
                otherwise set $a_t = \arg\min_a Q(s_t, a; \theta)$
                Execute action $a_t$, move into state $s_{t+1}$, and collect reward $r_t$
                Add $(s_t, a_t, r_t, s_{t+1})$ into E
                Get a mini-batch of experiences $(s_j, a_j, r_j, s_{j+1})$ from E
                Define $a_{min}(s_{j+1}; \theta) = \arg\min_a Q(s_{j+1}, a; \theta)$
                Set $y_j = r_j + \gamma\, Q(s_{j+1}, a_{min}(s_{j+1}; \theta); \theta^-)$ if $t < T$, else $y_j = r_j$
                Run one forward pass and backward propagation on the DNN with loss function $(y_j - Q(s_j, a_j; \theta))^2$ to update $\theta$
                Every C iterations, set $\theta^- = \theta$
            end for
            Adjust $\varepsilon$
        end for
    end procedure

Figure 5: Double Deep Q-Networks for inventory control

Figure 6: Training process of Double Deep Q-Networks

Figure 7: Flow chart of research

4. Training Performance and Comparisons

We use the first 70 weeks of demand of the selected item (see Figure 3) to train our DRL agent, and the remaining 26 weeks for out-of-sample testing. Our simulation environment is designed to be as close as possible to the realistic business situation, according to interviews with procurement executives.

4.1 RL Simulation Design and Neural Net Architecture Tuning

Based on our knowledge of and information about IIC, we set up our environment parameters as shown in Table 2. The minimal and maximal possible ordering quantities, $q_{min}$ and $q_{max}$, and the interval, i.e., the number of bins used to group the continuous ordering quantity, define the action space of DRL. In addition, the penalty parameter of stockouts, $\delta$, enables us to maintain desirable SL and FR. The size of the state variables of DRL is related to the length of the supply lead time (LT) and of the demand history (LD). The required fill rate, $R_{FR}$, has been defined in the previous section.

Table 2: RL simulation environment design

After building the simulation environment, we start to train our DRL agents. However, before devoting ourselves to tuning model performance, we want to ensure that our proposed algorithms are capable of learning near-optimal policies. Theoretically, our agents ought to select the action with the lowest reward frequently in their best episode. Therefore, we verify this by consistently assigning zero reward to one of the possible actions and plotting the training loss and reward per episode of both algorithms in Figure 8. The graph shows that both algorithms learn a near-optimal policy from our simulation environment. Considering the stability of the training loss and the issue of underestimation of non-optimal actions' Q-values, we finally choose DDQN over DQN and compare its performance with other policies in the following sections.

Figure 8: Performance of DQN and DDQN. (a) Training loss per episode. (b) Training reward per episode.
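The discretized action space defined by $q_{min}$, $q_{max}$ and the interval in Section 4.1 can be illustrated as follows. The sketch assumes that the interval counts the bins between $q_{min}$ and $q_{max}$, so that the agent chooses among interval + 1 evenly spaced order quantities; this reading, the function name `build_action_space`, and the numeric values are assumptions for the example rather than the settings of Table 2.

```python
def build_action_space(q_min, q_max, interval):
    """Return interval + 1 evenly spaced order quantities from q_min to q_max."""
    if interval < 1 or q_max <= q_min:
        raise ValueError("need interval >= 1 and q_max > q_min")
    step = (q_max - q_min) / interval
    return [q_min + k * step for k in range(interval + 1)]


# Hypothetical values: 20 bins between 0 and 2,000 units give 21 actions.
actions = build_action_space(q_min=0, q_max=2000, interval=20)
```

The network then outputs one Q-value per entry of `actions`, and the index returned by the argmin is mapped back to an order quantity through this list.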

To train our DRL agent, some hyperparameters of DDQN must be assigned beforehand. However, determining suitable hyperparameter values for DDQN is a nontrivial and time-consuming task, not to mention that evaluating all possible combinations is computationally expensive. Hence, we define a grid on each important hyperparameter and use grid search to find good hyperparameter sets in each experiment. Specifically, we first test the hyperparameters one at a time, holding the others fixed, to find those that significantly affect model performance (i.e., learning rate, update frequency and batch size). Then, we explore combinations of these three significant hyperparameters for a total of 288 scenarios, fixing the other hyperparameters (structure of neural network = S, activation function = ReLU, memory buffer size = 40, gamma = 0.99). After identifying the best scenarios among the 288 tested cases, we test different values of the fixed hyperparameters repeatedly to complete the grid search.

As there are infinitely many ways to design the neural net structure, we select four configurations, which we label small, medium, large and extra-large. Each DNN is a fully connected network whose hidden layers use the ReLU activation function. In the default setting of the simulation environment, we observe good performance and high efficiency with the hyperparameter set displayed in Table 3. To optimize the neural nets, we use the Adam optimizer (Kingma and Ba, 2017) with batch size 5 and learning rate 0.0005, and train our DRL agent for 8,000 episodes, starting with $\varepsilon = 0.9$ and reducing it linearly to 0.05 over the episodes. All of the following computations and experiments are run on nodes with 12 cores and 64 GB of memory with PyTorch 1.1.

Table 3: Hyperparameters of DDQN
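The linear $\varepsilon$ schedule and the enumeration of a hyperparameter grid can be sketched as below. The start and end values of $\varepsilon$ and the number of episodes are those reported above; everything else is an assumption for illustration. In particular, the grid values are placeholders (the thesis does not list its 288 scenarios here), and `linear_epsilon` and `grid` are hypothetical helper names.

```python
from itertools import product


def linear_epsilon(episode, n_episodes, eps_start=0.9, eps_end=0.05):
    """Epsilon for a 0-based episode index, decayed linearly to eps_end."""
    if n_episodes <= 1:
        return eps_end
    clipped = min(max(episode, 0), n_episodes - 1)
    frac = clipped / (n_episodes - 1)
    return eps_start + frac * (eps_end - eps_start)


def grid(search_space):
    """Yield one dict per combination of the listed hyperparameter values."""
    names = list(search_space)
    for combo in product(*(search_space[name] for name in names)):
        yield dict(zip(names, combo))


# Placeholder grid: 3 x 2 x 2 = 12 combinations, not the 288 of the study.
space = {
    "learning_rate": [0.0001, 0.0005, 0.001],
    "update_frequency": [10, 50],
    "batch_size": [5, 10],
}
configs = list(grid(space))
```

Each entry of `configs` would be passed to one training run, and the runs would then be ranked by whatever performance measure the experiment uses.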

4.2 Comparisons Between DRL and Benchmark

To evaluate the performance of our DRL agents, we choose a baseline policy (BS policy) currently implemented in the collaborating company as our benchmark, shown in Equation (15). In the comparison with the BS policy, we focus only on the average total inventory, average EOH, average OOS, SL (Equation (14)) and FR, which is defined in Equation (16).

$$q_t = \max\left( \frac{\sum_{i=t-m+1}^{t} d_i}{m} \times LT - AS_t,\ 0 \right) \tag{15}$$

$$FR = 1 - \frac{\sum_{t=LT+1}^{T-1} OOS_t}{\sum_{t=LT+1}^{T-1} d_t} \tag{16}$$

According to the company's internal forecast analysis, we set $m = 8$ in the BS policy. In Figure 9, from preliminary training results, we see that our DDQN agent outperforms the BS policy in a few ways. First, when the penalty parameter ($\delta$) is equal to 1, our DDQN agent reduces the average EOH and average OOS level by 89% and 34%, respectively, and achieves a higher FR than the BS policy. Second, as $\delta$ increases, our DDQN holds more average EOH to avoid stockouts. When $\delta$ equals 10, i.e., 1 unit of OOS is equal to 10 units of EOH, the DDQN agent holds less average EOH than the BS policy but achieves 89% SL with an 89% reduction of average OOS. Consequently, in the training stage, DDQN outperforms the BS policy and performs much better as $\delta$ increases.

Figure 9: Comparison between DDQN and BS Policy
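Equations (15) and (16) translate directly into code. The sketch below is a minimal illustration: it treats $AS_t$ as a single number supplied by the caller (named `available_stock` here, an assumed reading of the symbol), stores periods $1, \dots, T$ at list indices $0, \dots, T-1$, and uses made-up demand figures.

```python
def bs_order_quantity(demand_history, m, lead_time, available_stock):
    """Equation (15): m-period average demand times LT, net of AS_t, floored at 0."""
    recent = demand_history[-m:]
    if len(recent) < m:
        raise ValueError("need at least m demand observations")
    average_demand = sum(recent) / m
    return max(average_demand * lead_time - available_stock, 0)


def fill_rate(oos, demand, lead_time):
    """Equation (16): 1 - sum(OOS_t) / sum(d_t) over t = LT+1, ..., T-1."""
    if len(oos) != len(demand):
        raise ValueError("oos and demand must cover the same periods")
    horizon = len(demand)
    # Period t (1-based) sits at index t - 1, so t = LT+1..T-1 is LT..T-2.
    window = slice(lead_time, horizon - 1)
    total_demand = sum(demand[window])
    if total_demand == 0:
        raise ValueError("no demand in the evaluation window")
    return 1 - sum(oos[window]) / total_demand


# Hypothetical numbers: m = 8 as in the BS policy, LT = 5.
history = [10, 20, 30, 40, 50, 60, 70, 80]
q_t = bs_order_quantity(history, m=8, lead_time=5, available_stock=100)
fr = fill_rate(oos=[0, 0, 5, 0, 5, 0], demand=[10, 10, 10, 20, 30, 10], lead_time=2)
```

In this example the eight-week average demand is 45 units, so the BS policy targets 225 units over the lead time and orders the 125 units not yet covered by available stock.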

Besides, we run sensitivity analyses on the environment parameters, namely the supply lead time (LT), the length of demand history (LD) and the action space (Interval), and consolidate the results below. In Figure 10, as LT increases, the SL of DDQN with $\delta = 1$ decreases, but the SL of DDQN with $\delta = 10$ remains stable and exceeds that of the BS policy. However, LD seems to have no impact on the performance of DDQN.

Figure 10: Sensitivity analysis on LT, LD and $\delta$

In our default design of the simulation environment in Table 2, the number of possible actions of DDQN is only 21. In Table 4, we explore the potential of DRL by increasing the interval and compare its performance under different neural net structures. The results imply that the degree of freedom of actions is not a significant factor in DRL's performance. One possible reason is that the number of weights makes it hard for the neural nets to reach convergence.

Table 4: Sensitivity analysis on Interval and NN Structure while LT equals 5

We have seen that DDQNs outperform the BS Policy in the default environment setting. In Table 5, we further compare their performance under different LT and consolidate the five key indexes of the different ordering policies. As LT increases, DDQN with high $\delta$ consistently outperforms the BS Policy. At $LT = 12$, DDQN not only keeps both average cost and OOS at a low level (68% and 86% lower than the BS Policy) but also maintains a near 90% SL and 97% FR.

Table 5: Performance comparison between different policies

Overall, the above analysis shows that our DRL agent can be trained to make good ordering decisions. We then review the decisions made by the different policies when LT equals 12 by plotting the line chart in Figure 11. The Optimal Policy denotes the replenishment policy under perfect information about future demand; its order quantity at $t$ is defined in Equation (12). We observe that the yellow line (DDQN policy) is closer to the blue line (Optimal Policy). The correlation coefficients of DDQN and the BS Policy with the Optimal Policy are 0.025 and 0.3, respectively. It implies that DDQN is more like a near-optimal policy than the BS Policy.

Figure 11: Decision comparison between Optimal, BS and DDQN policies
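A correlation coefficient between two sequences of weekly order quantities can be computed as in the sketch below. It assumes the Pearson coefficient, which the text does not state explicitly, and the order sequences are invented for illustration; they are not the decisions plotted in Figure 11.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly order quantities of two policies.
optimal_orders = [100, 200, 300, 400]
scaled_orders = [200, 400, 600, 800]    # same shape, different scale
reversed_orders = [800, 600, 400, 200]  # opposite ordering pattern

r_same = pearson(optimal_orders, scaled_orders)
r_opposite = pearson(optimal_orders, reversed_orders)
```

The coefficient only measures how closely the timing pattern of two policies moves together; it is insensitive to the overall order level, so it complements rather than replaces the inventory and service metrics in Table 5.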

5. The Applicability of Transferring Trained Agents

In this section, we transfer our trained DRL agent to two situations: a later period of the same item and new unseen items. We also explore the potential and capability of DRL for the inventory control problem by adding ensemble techniques.

5.1 Transfer Learning Performance on Latter Period of the Same Item

In Figure 12, we display the test results of DDQN and the BS policy under different LT. As LT increases, the BS policy tends to hold more inventory to achieve a desirable SL. The average EOH of the BS policy increases by 39% and 237% when LT equals 9 and 12, respectively. However, our DRL agents outperform the BS policy under every LT in all aspects. DDQNs not only reduce the total inventory but also attain 100% SL and FR. Compared with the BS policy, DDQNs with $\delta = 1$ achieve 100% SL and FR and reduce average EOH by 13%, 62% and 393% when LT equals 5, 9 and 12, respectively. When $\delta$ equals 10, the testing outcomes are as good as the aforementioned results. As the results in Figure 12 show, our trained DDQNs can be successfully transferred to a later period of the same item, which implies that DRL can help practitioners figure out better inventory policies for different items once they have collected enough historical data.

Figure 12: Test performance on latter period of the same item

Furthermore, in Figure 13, we plot the order quantities chosen by the different policies when LT equals 12. We notice that the BS Policy orders only twice, each time in massive quantity, whereas DDQN tends to order adequate quantities frequently, which may explain why DDQN performs better than the BS Policy. Comparing the decisions made each week, the order quantities chosen by DDQN are closer to the optimal quantities of the Optimal Policy. The out-of-sample performance of DDQN is promising and provides decent procurement policies for the distributor.

Figure 13: Decisions made by different policies (LT=12)

5.2 Transfer Learning Performance on New Unseen Items

We also transfer our trained DRL agents to unseen items, such as the one whose demand data are shown in Figure 14. This item is a time series of 96 weeks and has a demand pattern totally different from that of the example item, which increases the difficulty of transfer learning.

Figure 14: Demand realizations of new unseen item

We consolidate the test results of DDQNs and the BS Policy under different LT in Figure 15. For the case of $LT = 5$, DDQNs perform better than the BS Policy on total inventory, but they achieve low SL and FR even when $\delta$ is large. The total inventory of the BS Policy increases by 61% and 156% when LT equals 9 and 12, but the BS Policy maintains stable SL and FR. Our DDQNs are more stable in total inventory as LT increases, but they attain low SL and FR. The testing results are not as good as the training results described in Section 4.

Figure 15: Test performance on new unseen item

We plot the line chart of the order quantities chosen by the different policies in Figure 16. The BS Policy orders less frequently but in massive quantities to satisfy future demand, whereas DDQN tends to order smaller quantities more frequently than the BS Policy. Judging by the gap in quantity between DDQN and the Optimal Policy, however, our DDQN orders too little to fulfill upcoming demand. The discrepancy between the BS Policy and DDQN shows up in both total inventory and SL. Overall, it reveals that transfer learning of DRL to unseen items is more difficult and challenging with a single DDQN.

Figure 16: Decisions made by different policies on unseen item (LT=12)

5.3 DDQN with Ensemble Learning Method

In Section 5.2, we trained our DRL agent with only one time series sequence, and the transfer of the trained agent is not ideal according to its out-of-sample performance. Though we trained our agent for thousands of episodes, the exploration of the simulated environment might not be enough for the DRL agent to perform well on unseen items. To increase the generalizability of our DDQN, we use the Maximum Entropy Bootstrap (MEBoot) (Vinod and López-de-Lacalle, 2009) to simulate 19 additional time series sequences based on the observations of the training item (see Figure 17). There are many sampling techniques, such as random sampling and stratified sampling, but we choose MEBoot owing to the time-dependent relationship within our demand sequence. MEBoot is an algorithm that avoids unnecessary distributional assumptions and creates an ensemble for time series inference. To reduce the variance of the Q-value estimates of DDQN and make our agent more robust, we train multiple (ensemble) DDQNs instead of a single DDQN.

Figure 17: Demand realizations of training item and simulated sequences

By simulating 19 sequences with MEBoot from the original sequence, we have 20 time series sequences. For each sequence, we use the default environment setting and DDQNs with the same hyperparameters as in Table 3. Then, we train one agent per time series sequence and store the neural network with the best performance on each sequence. The differences among the 20 sequences let each agent explore more possibilities in the simulated environment. When the agent makes ordering decisions during testing, for each action we take the average Q-value of the 20 DDQN agents and choose the action with the minimal average Q-value.

The out-of-sample performance of the single and ensemble DDQN is consolidated in Table 6. We already know that the testing results of the single DDQN are not as good as the training results described in Section 4, but the ensemble DDQN can outperform both the BS Policy and the single DDQN. When $\delta = 10$, the performance of the single DDQN is not ideal, but the ensemble DDQN outperforms the other policies. The ensemble DDQN holds nearly half of the average EOH of the BS Policy and incurs only 38% of the average OOS of the single DDQN, yet it achieves 82% SL and 90% FR. These results imply that we successfully transfer our trained agents to a new unseen item. We conclude that our ensemble technique is effective and that our ensemble DDQN probably reduces the variance of the optimal Q-value estimates and increases the generalizability and robustness of our DRL agent.

Table 6: Performance comparison between single and ensemble DDQN

6. Conclusion and Discussion

In this paper we formally assess how to train an intelligent procurement agent using DRL algorithms and noisy demand data. We address the inventory control problem as single-agent inventory control in a realistic setting with stochastic demand and lost sales. The proposed DDQN significantly outperforms the benchmark, i.e., the current procurement policy used by the distributor. While reducing the holding inventory level by nearly 50%, the DDQN is capable of achieving a much higher SL for the tested item. More importantly, seeing the unsatisfactory transfer learning of DDQN to unseen items, we develop an ensemble technique that substantially improves inventory performance. We posit that a DRL agent can help practitioners produce better replenishment policies and assist them in deciding order quantities.

Based on our preliminary test at the distributor, we show that DRL for IIC has the potential to solve practical inventory problems given the demand realizations of selected items. Compared with prior research, our study makes no parametric assumption on demand structures and lets the DRL agent learn from noisy demand signals directly. Moreover, we relax the constraint of small action and state spaces and consider a finite horizon with lost sales. Although this relaxation and the complex setting pose serious challenges for our model development, the performance of our DRL agent shows that our proposed algorithm can effectively handle high-dimensional state and action spaces and provide practical ordering policies for the IIC problem. By navigating the application of cutting-edge AI developments to inventory control in a real business setting, our modeling effort is expected to make non-trivial contributions to the theory and practice of procurement operations in high-tech supply chains. The nuts and bolts of developing our DRL procurement agent offer valuable lessons and fresh insights to both academic and industrial communities.

Finally, like other machine learning algorithms, our DRL solution has some limitations. First, tuning the neural network of DRL is effort-intensive and far from an easy task. In this study, we only tried grid search and did not explore other hyperparameter optimization techniques such as random search or genetic algorithms.
Another possible extension of this research is to increase the capability of handling continuous action spaces. We discretize the possible action space to calculate Q-values in DDQN, but some DRL algorithms, such as policy gradient methods, can handle continuous action spaces directly.
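For reference, the ensemble decision rule of Section 5.3 (average the Q-values of all trained agents for each action, then take the action with the minimal average) can be sketched as follows. The sketch assumes that each agent's Q-values for the current state are already available as a list with one entry per action; the function name `ensemble_action` and the numbers are illustrative only, and three agents stand in for the 20 used in the study.

```python
def ensemble_action(q_values_per_agent):
    """Average Q-values across agents per action and return the argmin.

    q_values_per_agent: one list of Q-values per agent, all of equal length.
    Returns (index of chosen action, list of averaged Q-values).
    """
    if not q_values_per_agent:
        raise ValueError("need at least one agent")
    n_agents = len(q_values_per_agent)
    n_actions = len(q_values_per_agent[0])
    if any(len(q) != n_actions for q in q_values_per_agent):
        raise ValueError("all agents must score the same set of actions")
    avg_q = [
        sum(q[a] for q in q_values_per_agent) / n_agents
        for a in range(n_actions)
    ]
    best = min(range(n_actions), key=lambda a: avg_q[a])
    return best, avg_q


# Hypothetical Q-values of three agents over three actions.
agent_q = [
    [1.0, 5.0, 3.0],  # this agent alone would pick action 0
    [6.0, 2.0, 3.0],  # this agent alone would pick action 1
    [5.0, 5.0, 3.0],  # this agent alone would pick action 2
]
chosen, averaged = ensemble_action(agent_q)
```

In this example the averages are 4.0, 4.0 and 3.0, so the ensemble selects action 2 even though two of the three agents individually prefer another action, which illustrates how averaging dampens the idiosyncratic estimates of single agents.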

References

Arulkumaran, K., Deisenroth, P. M., Brundage, M., & Bharath, A. A. (2017) A brief survey of deep reinforcement learning. ArXiv:1708.05866v2.

Chaharsooghi, S., Heydari, J., & Zegordi, S. (2008) A reinforcement learning model for supply chain ordering management: An application to the beer game. Decision Support Systems, 45(4): 949-959.

Chollet, F. (2017) Deep Learning with Python. Manning Publications.

Giannoccaro, I., & Pontrandolfo, P. (2002) Inventory management in supply chains: A reinforcement learning approach. International Journal of Production Economics, 78(2): 153-161.

Gijsbrechts, J., Boute, R., Zhang, D., & Van Mieghem, J. (2019) Can Deep Reinforcement Learning Improve Inventory Management? Performance on Dual Sourcing, Lost Sales and Multi-Echelon Problems. Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3302881.

Gosavi, A. (2009) Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21(3): 177-345.

Kingma, D. P., & Ba, J. (2017) Adam: A Method for Stochastic Optimization. ArXiv:1412.6980v9.

Lin, L. J. (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4): 293-321.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013) Playing Atari with deep reinforcement learning. ArXiv:1312.5602v1.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015) Human-level control through deep reinforcement learning. Nature, 518: 529-533.

Oroojlooyjadid, A., Snyder, L., & Takáč, M. (2019) Applying deep learning to the newsvendor problem. IISE Transactions, in press.

Oroojlooyjadid, A., Nazari, M., Snyder, L., & Takáč, M. (2019b) A deep Q-Network for the beer game: A deep reinforcement learning algorithm to solve inventory optimization problems. ArXiv:1708.05924v3.

Porteus, E. L. (2002) Foundations of Stochastic Inventory Theory. Stanford University Press, California.

Qi, X., Wu, G., Boriboonsomsin, K., Barth, M. J., & Gonder, J. (2016) Data-driven reinforcement learning-based real-time energy management system for plug-in hybrid electric vehicles. Transportation Research Record, 2572(1): 1-8.

Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015) Prioritized Experience Replay. ArXiv:1511.05952v4.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., & Hassabis, D. (2017) Mastering the game of Go without human knowledge. Nature, 550: 354-359.

Sutton, R. S., & Barto, A. G. (2018) Reinforcement Learning: An Introduction. Second edition, MIT Press, Cambridge.

Van Hasselt, H., Guez, A., & Silver, D. (2015) Deep Reinforcement Learning with Double Q-Learning. ArXiv:1509.06461.

Vinod, H. D., & López-de-Lacalle, J. (2009) Maximum entropy bootstrap for time series: The meboot R Package. Journal of Statistical Software, 29: 1-19.
