A long-recognized problem in RL is that a well-trained model, and even a problem definition, can usually be applied only to a single task. Since the 1990s, there have been many attempts to apply transfer learning to RL tasks, and a survey of transfer learning in RL was published in 2009 [32]. As described in [32], the underlying insight is that generalization can occur not only within a task but also across tasks, which can effectively speed up learning. However, after nearly two decades, the questions of what should be transferred, from which sources, and how to do so have not been well resolved.
To define what should be transferred, some researchers define an embedding for states, actions, or the task, such as [25, 4, 22]. Others, motivated by the student-teacher paradigm introduced by knowledge distillation [12], use existing models as teachers for a multi-task agent in order to shorten the time needed to adapt to a new environment, such as [28, 33, 29, 36, 5].
Whether the goal is to transfer something from one task to another or to train on multiple tasks jointly, defining the part that the tasks have in common remains a problem.
Furthermore, [8] points out the importance of prior knowledge, claiming that general priors, such as the importance of objects and visual consistency, are critical for efficient game play. This claim, which is supported by experiments, reveals that the problem is much more difficult than researchers had assumed.
One of the works most similar to ours is [19], which defines a framework for dynamically reusing source policies during training. However, like other transfer learning methods, it places significant restrictions on the source policies: they must have the same action space and the same state space as the new task. Even tasks as simple as Atari games do not satisfy this assumption. In fact, [19] only experiments on a series of navigation tasks in the same environment. To make this framework more versatile, we relax the restrictions by framing the problem as goal-oriented exploration rather than as the reuse of source policies.
We also want to take advantage of the models we have already trained, but we do not think it is necessary to extract common components between tasks before doing so.
Chapter 3 Method
Here, we view the process of collecting experience as a process of generating training data, as shown in Fig. 3.1. The process of evaluating the policy and updating the Q-function can then be viewed as training on the generated data.
As shown in Fig. 3.2, the training data generated by the traditional method comes only from a Q-function that is not yet well trained and from a random policy that has no target. As a result, many steps must be spent in the startup phase to obtain a few successful experiences. Most training data generated by the random policy carries little information, which makes it difficult for the agent to induce any rules. Instead, we want to spend most of the steps in the startup phase following policies from existing models, which were trained on other Atari games, rather than the random policy. These policies have different goals because they were trained for different games, and different perspectives because the screen varies from game to game.

To observe the influence of policies trained on other games, we directly load a model trained on one game, say game A, into another game B. We found that the policy from game A sometimes works on game B, and that the way the agent solves game B differs from the way a policy trained on game B solves it. We assume that a policy trained on another game leads the agent to make a different decision in the same situation, and that this decision is still reasonable because the agent is pursuing a goal. Whether or not this goal is similar to the real goal of the current game, pursuing a goal during play may yield more informative training data.
Figure 3.1: The Framework for Double DQN: We can view the process of interacting with the environment and collecting experiences as the process of generating data for training. Refer to Sec. 3 for details.
Inspired by [19], we use the existing policies to interact with the current environment directly, rather than loading one of the existing models as initial values. The method proposed in [19] has a strong limitation: each source policy must have the same action space and state space as the task currently being solved. We can nevertheless use the same framework as [19], because our reason for using the old policies is different.
[19] and many other works aim to leverage learned policies, based on the assumption that these policies contain common knowledge that is useful for the current task. In this work, by contrast, we only need policies that differ from both the random policy and the Q-function being updated. We assume that running certain policies with different goals in the same environment can produce different experiences.
These different experiences can increase the diversity of our training data and make training more effective. The experimental results in Sec. 4 show that training data generated by policies trained on other games makes training easier than data generated by the random policy.
Figure 3.2: The Training Data Produced: Here we show the policy that the agent follows when generating training data. The upper graph shows that as training time increases, the probability of selecting actions randomly decreases and the learned policy is selected instead. The lower graph shows that if we use the policies of the existing models during the startup phase, we obtain more diverse and goal-oriented training data. (Note that the two numbers 3e5 and 1e6 represent the required steps. More detailed settings are given in the experimental setting. Refer to Sec. 4.1 for details.)
3.1 System Framework
In our approach, we have one or more pre-trained models, each of which was trained on a single game. The only restriction on these games is that each must be one of the 59 Atari games supported by [6]. By using the Deep Q-Network algorithm (DQN, [24]), we can place all models under a similar architecture. One of the important contributions of DQN is that different games can be trained with one architecture and the same hyper-parameters. All models have 3 convolutional layers and 2 fully connected layers. The only difference between models is the size of the final fully connected layer, which depends on the size of the action space of the game the model was trained on (see Sec. 4.1). We introduce the method we use to overcome this difference in Sec. 3.3.
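As a rough illustration of why models trained on different games differ only in their last layer, the sketch below computes the layer shapes of a DQN-style network. The 84x84 input, the kernel sizes, strides, channel counts, and the 512-unit hidden layer are assumptions taken from the commonly used DQN configuration, not settings confirmed in this section; `conv_out` and `dqn_layer_shapes` are hypothetical helper names.

```python
def conv_out(size, kernel, stride):
    """Output width/height of an unpadded convolution."""
    return (size - kernel) // stride + 1


# Assumed convolution stack: (out_channels, kernel, stride) per layer.
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def dqn_layer_shapes(num_actions, input_size=84, hidden_units=512):
    """Return a list describing 3 conv layers and 2 fully connected layers.

    Only the last entry depends on the game, through num_actions.
    """
    shapes = []
    size = input_size
    channels = 0
    for channels, kernel, stride in CONV_LAYERS:
        size = conv_out(size, kernel, stride)
        shapes.append(("conv", channels, size, size))
    flat = channels * size * size  # flattened conv features
    shapes.append(("fc", flat, hidden_units))
    shapes.append(("fc", hidden_units, num_actions))
    return shapes


if __name__ == "__main__":
    # Two games with different action-space sizes (illustrative values).
    for n in (6, 18):
        print(n, dqn_layer_shapes(n))
```

Under these assumptions, every layer except the final fully connected one has the same shape for any game, which is what makes it possible to load a model into a different game once the action outputs are mapped.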
Figure 3.3: The Framework for Using Existing Models: Refer to Sec. 3.1 for details.
This shared network architecture, together with the mapping between different action spaces, allows us to load a model trained on game A and play game B with it directly. As illustrated in Fig. 3.3, our framework is almost the same as the one in [19]: we put all available models in a pool. Before each new episode of the game begins, the agent selects a model from the pool with probability p_t, which decreases over time. Otherwise, the agent follows the traditional Q-learning path (the right path in Fig. 3.3) and uses the Q-function being trained, as shown in Alg. 1. Once the agent has obtained a policy, either from an existing model or from the Q-function that is updated over time, a second probability ϵ is evaluated at every time step: with probability ϵ the agent selects an action at random, and otherwise it selects the best action according to the policy it obtained.
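The two levels of randomness described above (choosing a policy once per episode, then choosing an action at every step) can be sketched as follows. This is a minimal illustration, not the procedure of Alg. 1: policies are modeled as plain callables that already return a valid action index for the current game, and the linear decay schedule is an assumption, since the text only states that the pool-selection probability decreases over time. All function names are hypothetical.

```python
import random


def select_episode_policy(pool, q_policy, p_t, rng):
    """Once per episode: with probability p_t use a pre-trained policy
    drawn from the pool, otherwise use the Q-function being trained."""
    if pool and rng.random() < p_t:
        return rng.choice(pool)
    return q_policy


def select_action(policy, state, num_actions, epsilon, rng):
    """Once per time step: epsilon-greedy on top of the chosen policy."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)  # random action
    return policy(state)                   # best action under the policy


def linear_decay(step, start, end, decay_steps):
    """Assumed schedule: move linearly from start to end, then stay at end."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return start + frac * (end - start)


if __name__ == "__main__":
    rng = random.Random(0)
    pool = [lambda s: 0, lambda s: 1]  # stand-ins for models from other games
    q_policy = lambda s: 2             # stand-in for the Q-function policy
    p_t = linear_decay(step=0, start=1.0, end=0.0, decay_steps=100)
    policy = select_episode_policy(pool, q_policy, p_t, rng)
    print(select_action(policy, state=None, num_actions=4, epsilon=0.1, rng=rng))
```

Keeping the policy fixed for a whole episode, rather than re-sampling it at every step, is what lets a borrowed policy pursue its own goal long enough to produce a coherent trajectory.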
To train the agent with Q-learning, we use Double Q-learning (DDQN, [34]) and update the policy $\pi_{\text{qfunc}}$ by minimizing the loss
$$\left(R_{t+1} + \gamma_{t+1}\, q_{\theta^-}(S_{t+1}, a') - q_{\theta}(S_t, A_t)\right)^2,$$
where $t$ is a time step randomly selected from the memory, and $S_t$ and $A_t$ are the state and the action at time $t$, respectively. The parameters $\theta$ of the online network are used to select actions, so that in Double Q-learning $a' = \arg\max_{a} q_{\theta}(S_{t+1}, a)$. The parameters $\theta^-$ of the target network, which is a periodic copy of the online network, are not directly optimized.
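To make the roles of the two networks concrete, the following sketch evaluates this loss for a single sampled transition. It assumes the standard Double Q-learning rule in which the online network selects the next action and the target network evaluates it. Q-values are passed in as plain lists purely for illustration, and `ddqn_loss` is a hypothetical name; a real implementation would work on batches and differentiate only through the online estimate for the current state.

```python
def ddqn_loss(reward, discount, q_online_next, q_target_next,
              q_online_current, action):
    """Squared TD error for one transition (S_t, A_t, R_{t+1}, S_{t+1}).

    q_online_next:    online-network values  q_theta(S_{t+1}, .)
    q_target_next:    target-network values  q_{theta^-}(S_{t+1}, .)
    q_online_current: online-network values  q_theta(S_t, .)
    discount:         gamma_{t+1}; use 0.0 when S_{t+1} is terminal
    """
    # The online network selects the next action a'.
    a_prime = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # The target network evaluates that action.
    target = reward + discount * q_target_next[a_prime]
    td_error = target - q_online_current[action]
    return td_error ** 2


if __name__ == "__main__":
    loss = ddqn_loss(
        reward=1.0,
        discount=0.5,
        q_online_next=[0.0, 2.0],   # online network prefers action 1
        q_target_next=[4.0, 1.0],   # target network would prefer action 0
        q_online_current=[0.5, 3.0],
        action=0,
    )
    print(loss)
```

In this example the target uses the target network's value for action 1 (the online network's choice) rather than its own maximum, which is the decoupling that distinguishes Double Q-learning from the original DQN target.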