Most Valuable Player - Leveraging Experience from Other Games to Accelerate Training on a New Game

We recorded how often each existing model was selected while training on each game, as shown in 4.3. The 13 existing models cover three different action spaces and fall roughly into two categories, shooting games and maze games. As shown in Table 4.4, Alien, Amidar, Bank Heist, and Venture are maze games; the rest are shooting games. Most of the shooting games are vertical shooters, except for Chopper Command and Star Gunner, which are horizontal shooters. We refer to the model selected most frequently in a game as that game's most valuable player (MVP).
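The bookkeeping behind this definition can be sketched in a few lines of Python. The selection log below is made-up illustrative data and `most_valuable_player` is a hypothetical helper name, not part of the original framework. The sketch only shows the idea of counting how often each existing model was selected for one game and taking the most frequent one.

```python
from collections import Counter


def most_valuable_player(selections):
    """Return (model_name, count) for the most frequently selected model.

    `selections` is a sequence of model names, one entry per selection
    event while training on a single target game (assumed log format).
    """
    if not selections:
        raise ValueError("no selections recorded")
    counts = Counter(selections)
    # most_common(1) returns a one-element list [(name, count)];
    # ties are resolved in favour of the name seen first.
    return counts.most_common(1)[0]


# Illustrative, made-up selection log for one target game.
log = [
    "Chopper Command", "Demon Attack", "Chopper Command",
    "Solaris", "Chopper Command", "Demon Attack",
]
mvp, times_selected = most_valuable_player(log)
print(mvp, times_selected)
```

Running the same function once per target game would yield the per-game MVPs discussed in this section.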

The MVP of a game is not always the model whose game looks most similar to it. For example, the MVP in both Air Raid and Carnival is Chopper Command. All three are shooting games, but Chopper Command is a horizontal shooter and the other two are not.

The MVPs of Alien, Amidar, and Bank Heist are not maze games, even though all three are maze games themselves.

Likewise, the MVPs of Air Raid, Amidar, Carnival, Centipede, and Chopper Command have a different action space from the game itself. This suggests that mapping between different action spaces gives the agent more useful options.

Environment        Number of actions
Air Raid           6
Alien              18
Amidar             10
Bank Heist         18
Battle Zone        18
Carnival           6
Centipede          18
Chopper Command    18
Demon Attack       6
Solaris            18
Space Invaders     6
Star Gunner        18
Venture            18

Table 4.4: The size of the action space for each game on the list of existing models (the screenshots from the original table are not reproduced).
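The comparisons made in this section can be checked mechanically. The following is a minimal sketch that transcribes the action-space sizes from Table 4.4 and the maze/shooting split given in the text. The helper names `genre` and `compare` are illustrative assumptions, and only the MVP pairs explicitly named in the text (Air Raid and Carnival, both with Chopper Command) are evaluated.

```python
# Action-space sizes transcribed from Table 4.4.
ACTIONS = {
    "Air Raid": 6, "Alien": 18, "Amidar": 10, "Bank Heist": 18,
    "Battle Zone": 18, "Carnival": 6, "Centipede": 18,
    "Chopper Command": 18, "Demon Attack": 6, "Solaris": 18,
    "Space Invaders": 6, "Star Gunner": 18, "Venture": 18,
}

# Genre split as stated in the text: these four are maze games,
# every other game on the list is a shooting game.
MAZE_GAMES = {"Alien", "Amidar", "Bank Heist", "Venture"}


def genre(game):
    """Return the coarse genre label used in this section."""
    return "maze" if game in MAZE_GAMES else "shooting"


def compare(target, mvp):
    """Report whether a target game and its MVP share genre and action space."""
    return {
        "same_genre": genre(target) == genre(mvp),
        "same_action_space": ACTIONS[target] == ACTIONS[mvp],
    }


# Only the MVP pairs named explicitly in the text are checked here.
for target in ("Air Raid", "Carnival"):
    print(target, compare(target, "Chopper Command"))
```

For both pairs the genre matches while the action-space size does not (6 versus 18), which is the situation the text describes.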

Chapter 5 Conclusion

In this work, we tried to leverage experience from other tasks. We used models trained on other games as policies for exploring a new environment. Experimental results show that, even though these models were trained with different goals and different perspectives, they explore the environment more efficiently than random attempts. The approach has only one limitation: we need to provide a common network structure for every task. As long as this requirement is met, the approach can be extended to other tasks without additional computational cost or changes to our framework. We hope that there will eventually be a way to design a learning path for an RL agent. Until then, we believe that there are still simple yet more efficient ways to explore a new environment.
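As a rough illustration of the idea summarized above, the sketch below draws an exploratory action from an existing model instead of sampling uniformly at random. Everything in it is an assumption made for illustration: the function names, the uniform choice among existing models, and the modulo mapping between action spaces are stand-ins, not the selection rule or the mapping used in this work.

```python
import random


def map_action(action, n_target_actions):
    # Assumed mapping: wrap the source action index into the target range.
    return action % n_target_actions


def exploratory_action(obs, source_policies, n_target_actions, rng):
    """Choose an exploratory action for the new environment.

    `source_policies` is a list of callables mapping an observation to an
    action index in the source game's own action space (hypothetical API).
    """
    # With no existing models, fall back to the uniform random baseline.
    if not source_policies:
        return rng.randrange(n_target_actions)
    # Assumed selection rule: pick one existing model uniformly at random.
    policy = rng.choice(source_policies)
    return map_action(policy(obs), n_target_actions)


# Stand-in "pretrained" policies; real ones would be trained networks.
def always_fire(obs):
    return 1


def always_last_of_18(obs):
    return 17


rng = random.Random(0)
action = exploratory_action(None, [always_fire, always_last_of_18], 6, rng)
print(action)  # a valid action index for a 6-action game
```

Because every existing model is called through the same interface, adding another source game only means appending one more callable to the list, which mirrors the requirement of a common network structure.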

