Conclusion and Discussion

National Chengchi University (國立政治大學)

sequence, respectively, and store the neural network with the best performance for each sequence.

The differences among the 20 time series sequences let each agent explore more possibilities within the simulated environment. When the ensemble makes ordering decisions during testing, for each action we average the Q-values of the 20 DDQN agents and choose the action with the minimal average Q-value.
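The averaging rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function name `ensemble_action`, the nested-list layout of the Q-values, and the toy numbers are all assumptions made for the example.

```python
def ensemble_action(q_values_per_agent):
    """Return (best_action, mean_q) for one state.

    q_values_per_agent is a hypothetical layout: one inner list per agent,
    holding that agent's Q-value estimate for every candidate action.
    """
    n_agents = len(q_values_per_agent)
    if n_agents == 0:
        raise ValueError("need at least one agent")
    n_actions = len(q_values_per_agent[0])
    if any(len(row) != n_actions for row in q_values_per_agent):
        raise ValueError("every agent must score the same set of actions")

    # Average the Q-values over agents, giving one mean value per action.
    mean_q = [
        sum(row[a] for row in q_values_per_agent) / n_agents
        for a in range(n_actions)
    ]
    # The text selects the action with the minimal average Q-value.
    best_action = min(range(n_actions), key=lambda a: mean_q[a])
    return best_action, mean_q


# Toy example with 3 agents and 3 actions (made-up numbers).
q_demo = [
    [5.0, 2.0, 4.0],
    [3.0, 4.0, 1.0],  # this agent alone would pick action 2
    [4.0, 2.0, 5.0],
]
action, mean_q = ensemble_action(q_demo)
```

In the toy data the second agent alone would prefer a different action from the one the ensemble selects, which illustrates how averaging can damp the estimation error of any single agent.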

The out-of-sample performance of single and ensemble DDQN is summarized in Table 6. We already know that the testing results of single DDQN are not as good as the training results described in Section 4, but ensemble DDQN outperforms both the BS Policy and single DDQN.

When 𝛿 = 10, the performance of single DDQN is not ideal, but ensemble DDQN outperforms the other policies. Ensemble DDQN holds nearly half the average EOH of the BS Policy and incurs only 38% of the average OOS of single DDQN, yet it still achieves 82% SL and 90% FR. These results imply that we successfully transfer our trained agents to a new, unseen item. We conclude that our ensemble technique is effective and that ensemble DDQN probably reduces the variance of the optimal Q-value estimates and increases the generalizability and robustness of our DRL agent.
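The variance-reduction argument can be made concrete with a standard identity, offered here as an illustrative sketch rather than a result derived in the study. Assume, hypothetically, that for a fixed state-action pair the estimates $Q_1, \dots, Q_n$ of the $n$ agents share a common variance $\sigma^2$ and a common pairwise correlation $\rho$; neither quantity is measured in the study.

```latex
\[
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} Q_i\right)
  = \frac{1}{n^{2}}\Bigl[\, n\sigma^{2} + n(n-1)\rho\sigma^{2} \Bigr]
  = \rho\,\sigma^{2} + \frac{(1-\rho)\,\sigma^{2}}{n}.
\]
```

Under these assumptions, with $n = 20$ agents the second term shrinks to one twentieth of its single-agent value, while the first term remains. Averaging therefore helps most when the agents' errors are weakly correlated, which is consistent with training each agent on a different time series sequence.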

Table 6: Performance comparison between single and ensemble DDQN

6. Conclusion and Discussion

In this paper, we formally assess how to train an intelligent procurement agent using DRL algorithms and noisy demand data. We formulate the problem as single-agent inventory control in a realistic setting with stochastic demand and lost sales. The proposed DDQN significantly outperforms the benchmark, which is the current procurement policy used by the distributor. While reducing the holding inventory level by nearly 50%, the DDQN achieves a much higher SL for the tested item. More importantly, having observed the unsatisfactory transfer of DDQN to unseen items, we develop an ensemble technique that substantially improves inventory performance. We posit that a DRL agent can help practitioners produce better replenishment policies and assist them in deciding order quantities.

Based on our preliminary test at the distributor, we show that DRL for IIC has the potential to solve practical inventory problems given the demand realization of selected items. Compared with prior research, our study makes no parametric assumption on demand structures and lets the DRL agent learn directly from noisy demand signals. Moreover, we relax the constraint of small action and state spaces, and we consider a finite horizon and lost sales. Although these relaxations and the more complex setting pose serious challenges for model development, the performance of our DRL agent shows that the proposed algorithm can effectively handle high-dimensional state and action spaces and provide practical ordering policies for the IIC problem. By applying cutting-edge AI developments to inventory control in a real business setting, we expect our modeling effort to make non-trivial contributions to the theory and practice of procurement operations in high-tech supply chains. The nuts and bolts of developing our DRL procurement agent offer valuable lessons and fresh insights to both academic and industrial communities.

Finally, like other machine learning algorithms, our DRL solution has some limitations.

First, tuning the neural network in DRL is effort-intensive. In this study, we tried only grid search rather than exploring other hyperparameter optimization techniques such as random search or genetic algorithms. Second, a possible extension of this research is to handle continuous action spaces. We discretize the action space to calculate Q-values in DDQN, but some DRL algorithms, such as policy gradient methods, can handle continuous action spaces directly.
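To illustrate what discretizing the action space involves, the following Python sketch builds a grid of candidate order quantities and maps an arbitrary quantity to its nearest grid point. The helper names, the upper bound of 100 units, and the step of 25 units are hypothetical values chosen for the example; the study does not report its grid here.

```python
def build_action_grid(max_order, step):
    """Return the discrete order quantities 0, step, 2*step, ... up to max_order.

    max_order and step are hypothetical parameters for illustration only.
    """
    if max_order < 0 or step <= 0:
        raise ValueError("max_order must be >= 0 and step must be > 0")
    return list(range(0, max_order + 1, step))


def nearest_action_index(grid, quantity):
    """Map a continuous order quantity to the index of the closest grid point."""
    return min(range(len(grid)), key=lambda i: abs(grid[i] - quantity))


# A Q-network with one output per grid point can only choose among these values.
grid = build_action_grid(max_order=100, step=25)
idx = nearest_action_index(grid, 60)   # 60 units is not on the grid
ordered = grid[idx]                    # the agent orders the closest grid value
```

A finer step reduces the rounding error in the order quantity but increases the number of Q-values the network must estimate. That trade-off is the motivation for methods that handle continuous actions directly.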


