
Active Deep Q-learning with Demonstration

Si-An Chen¹,², Voot Tangkaratt², Hsuan-Tien Lin¹, and Masashi Sugiyama²,³

¹National Taiwan University, Taiwan
²RIKEN Center for Advanced Intelligence Project, Japan
³The University of Tokyo, Japan

Emails: r05922089@ntu.edu.tw, voot.tangkaratt@riken.jp, htlin@csie.ntu.edu.tw, sugi@k.u-tokyo.ac.jp

Abstract

Recent research has shown that although Reinforcement Learning (RL) can benefit from expert demonstration, it usually takes considerable effort to obtain enough demonstration. This effort prevents training decent RL agents with expert demonstration in practice. In this work, we propose Active Reinforcement Learning with Demonstration (ARLD), a new framework to streamline RL in terms of demonstration effort by allowing the RL agent to query for demonstration actively during training. Under this framework, we propose Active Deep Q-Network, a novel query strategy which adapts to the dynamically-changing distributions during the RL training process by estimating the uncertainty of recent states.

The expert demonstration data within Active DQN are then utilized by optimizing a supervised max-margin loss in addition to the temporal difference loss within usual DQN training. We propose two methods of estimating the uncertainty based on two state-of-the-art DQN models, namely the divergence of bootstrapped DQN and the variance of noisy DQN. The empirical results validate that both methods not only learn faster than other passive expert demonstration methods with the same amount of demonstration but also reach a super-expert level of performance across four different tasks.

1 Introduction

Sequential decision making is a common and important problem in the real world. For instance, to achieve its goal, a robot should produce a sequence of decisions or movements according to its observations over time. A recommender system should decide when and which item or advertisement to display to a customer in a sequential manner. For sequential decision making, reinforcement learning (RL) (Sutton and Barto, 1998) has been recognized as an effective framework which learns from interaction with the environment. Thanks to advances in deep learning and computational hardware, deep RL has achieved a number of successes in various fields such as end-to-end policy search for motor control (Watter et al., 2015), deep Q-networks for playing Atari games (Mnih et al., 2015), and combining RL and tree search to defeat the top human Go expert (Silver et al., 2016). These successes show the power of RL to solve various kinds of sequential decision making and control problems.

In contrast with these successes, deep RL algorithms are notorious for their substantial demands on simulation during training. Typically, these algorithms start from scratch and require millions of data samples to learn a locally op- timal policy, which is not a problem if unlimited simulation is available but is infeasible for many real-world applications such as robotic systems. To address this problem, several methods have been proposed to improve learning efficiency by leveraging prior knowledge from human experts. Imitation learning (Schaal, 1996), also known as learning from demonstration (LfD), is an attempt to learn the policy of an expert by observing the expert’s demonstrations. However, the performance of imitation learning is limited by the expert, since the agent only learns from the expert without regard to rewards given by the environ- ment. Another way is to improve RL by leveraging demonstrations given by the expert and rewards simultaneously. Recently, Deep Q-learning from Demonstra- tion (DQfD) (Hester et al., 2018) and Policy Optimization with Demonstrations (POfD) (Kang et al., 2018) have shown state-of-the-art results on several Atari games by training the agent with an objective that combines the rewards and the expert demonstrations.

Although expert demonstrations improve RL, the efforts made by the expert are not negligible. For instance, it takes a human expert thousands of steps to finish a mere 5 episodes for most Atari games (Hester et al., 2018). The huge effort makes it hard to collect a large number of demonstrations for DQfD in practice. In this paper, we introduce the concept of active learning to make more efficient use of the expert's efforts. In supervised learning, the goal of active learning is to achieve better performance with less labeling effort by interactively querying for new labels from unlabeled data (Settles, 2009). In RL, we can also actively ask the expert for a recommended action given the currently observed state. Videos of bootstrapped DQN (Osband et al., 2016) have shown that the behaviors of different well-performing policies agree at critical points but diverge at other less important states. This suggests that we could save much expert effort by querying only at critical states, in contrast to previous methods in which the expert's demonstration is collected for several entire episodes. In other words, we can achieve further improvement in RL with the same number of demonstrations.

In this work, we consider reinforcement learning problems that allow selective human demonstration during the learning process. We first propose a new framework called Active Reinforcement Learning with Demonstration (ARLD) for such learning problems. Then we propose Active DQN, which proactively asks for demonstration and leverages the demonstration data. The query criterion should decide when to query, i.e., identify states where the agent can indeed learn and improve by obtaining the demonstration. We propose two query methods based on the uncertainty of the Q-value estimation, named divergence and variance, which are derived from two state-of-the-art DQN methods, bootstrapped DQN and noisy DQN, respectively. The uncertainty terms are then thresholded dynamically to form querying decisions.

The dynamic nature allows the two methods to adapt to recent states observed by the agent and to be applied in various kinds of environments without exhaustive parameter tuning. Experimental results show that our method with both uncertainty measurements is effective in four different tasks. In this paper we thus offer three main contributions:

1. We propose a new framework, Active Reinforcement Learning with Demonstration, which, to the best of our knowledge, is the first work to reduce human effort in RL with demonstration;

2. We propose a novel uncertainty-based query strategy which can be applied to different tasks and is less sensitive to additional parameters;

3. We verify the effectiveness of two DQN uncertainty estimations with promising experimental results.

2 Related Work

Imitation learning (Schaal, 1996) is a classic technique for learning from human demonstration. Typically, imitation learning uses a supervised approach to imitate an expert's behaviors. DAGGER (Ross et al., 2011), a popular imitation algorithm, requests an action from the expert at each step, and takes an action sampled from a mixed distribution of the agent and the expert. It then aggregates the observed states and demonstrated actions to train the agent iteratively.

Deeply AggreVaTeD (Sun et al., 2017) is an extended version of DAGGER which works with deep neural networks and continuous action spaces. However, both require an expert to provide demonstration during the whole training phase.

To reduce the demand for human effort, the agent learns actively in active imitation learning (Shon et al., 2007; Judah et al., 2014) by requesting fewer expert demonstrations. The supervised setting of imitation learning makes it easier to apply techniques from traditional supervised active learning. However, although imitation learning can lead to no-regret performance in online learning settings, its performance is still limited by the expert given the use of only expert demonstration data for learning.

In contrast, it is possible for Reinforcement Learning (RL) to achieve better performance than the human expert by learning to interact with the environment and maximize the cumulative rewards. In RL, there exists a variety of methods that leverage demonstration to obtain improved performance.


For instance, some use expert advice or demonstration to shape rewards in the RL problem (Brys et al., 2015; Suay et al., 2016). Another approach is to ask for demonstration from a given state to another state to improve exploration (Subramanian et al., 2016). In contrast, the HAT algorithm summarizes the demonstrated knowledge via a decision tree and bootstraps the task with the learned policy to transfer it to the target agent (Taylor et al., 2011).

CHAT, an extension of HAT, measures the source policy's confidence to decide whether to take its advice (Wang and Taylor, 2017). The main difference between CHAT and our work is CHAT's use of another model to learn a source policy from demonstration offline and estimate the confidence of the source policy; in contrast, we estimate the uncertainty of the target learner and ask the expert directly.

Reinforcement Learning with Expert Demonstrations (RLED) (Piot et al., 2014) concerns a scenario in which the expert also receives rewards from the environment. In this case, DQfD (Hester et al., 2018), DDPGfD (Vecerik et al., 2017), and POfD (Kang et al., 2018) have shown state-of-the-art results on a variety of tasks by combining the original RL loss with a supervised loss on the expert's demonstrations. Then, the agent simultaneously learns its original objective and the behavior of the expert. In comparison to similar work such as Human Experience Replay (Hosu and Rebedea, 2016) and Replay Buffer Spiking (Lipton et al., 2016), RLED methods yield massive acceleration with a relatively small amount of demonstration data. Moreover, experiments show that these methods can also outperform the expert they learn from. In contrast to our work, these works collect demonstration data before training, and the expert must interact with the environment by completing the whole episode several times, whereas the proposed method requires the expert to demonstrate only at critical states given the learning progress of the agent.

Most previous works focus on how to improve RL from "passive" demonstration data. To the best of our knowledge, this is the first work to introduce the concept of active learning to leverage demonstration data. The most similar work is active imitation learning; however, as the mechanisms for supervised learning and reinforcement learning differ greatly, we cannot directly apply the same methods. Table 1 compares the different settings mentioned above.

                          no expert          offline/passive learning         online/active learning
Imitation Learning        -                  DAGGER, Deeply AggreVaTeD        Active Imitation Learning
Reinforcement Learning    DQN, DDPG, A3C     DQfD, DDPGfD, POfD, HAT, CHAT    ARLD (our work)

Table 1: Comparison between different settings


3 Background

3.1 Reinforcement Learning and Deep Q Network

The standard reinforcement learning framework consists of an agent interacting with an environment which can be modeled as a Markov decision process (MDP). An MDP is defined by a tuple M = ⟨S, A, R, P, γ⟩, where S is the state space, A the action space, R : S × A → ℝ the reward function, P(s′|s, a) the transition probability function, and γ ∈ [0, 1) the discount factor. At each step, the agent observes a state s ∈ S and takes an action a ∈ A according to a policy π. The policy π can be either deterministic, π : S → A, or stochastic, π : S → P(A). On taking an action, the agent receives a reward R(s, a) and reaches a new state s′ according to P(s′|s, a). The goal of the agent is to find the policy π which maximizes the discounted cumulative reward E_τ[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) ], where τ denotes the trajectory obtained with π and P. For problems with discrete actions, the most popular approach nowadays is the Deep Q-network (DQN) (Mnih et al., 2015). The key idea of DQN is to learn an approximation of the optimal value function Q*, which conforms to the Bellman optimality equation

Q*(s, a) = R(s, a) + γ E_{s′∼P(s′|s,a)}[ max_{a′∈A} Q*(s′, a′) ].

The optimal policy is then defined by Q* as π*(s) = argmax_{a′∈A} Q*(s, a′). The value function is approximated by a neural network Q(s, a; θ) with parameter θ, where the parameter is learned by minimizing the temporal difference loss:

Q_target = r + γ max_{a′∈A} Q(s′, a′; θ⁻),

L_TD(θ) = E_{(s,a,r,s′)∼D}[ (Q_target − Q(s, a; θ))² ],

where D is a distribution of transitions (s, a, r = R(s, a), s′ ∼ P(s′|s, a)) drawn from a replay buffer of previously observed transitions, and θ⁻ is the parameter of a separate target network which is copied from θ regularly to stabilize the target Q-values. Double Q-learning (van Hasselt et al., 2016) is an enhancement of DQN in which the target value is calculated by replacing max_{a′∈A} Q(s′, a′; θ⁻) with Q(s′, argmax_{a′∈A} Q(s′, a′; θ); θ⁻). This modification reduces the overestimation of the target value created by the original update rule.
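As a concrete illustration of the double Q-learning target described above, the following is a minimal NumPy sketch (function and variable names are illustrative, not from the paper): the online network selects the next action, and the target network evaluates it.

```python
import numpy as np

def double_dqn_targets(rewards, next_states, dones, gamma, q_online, q_target):
    """Compute Q_target for a batch of transitions with the double
    Q-learning rule: argmax under the online parameters theta,
    evaluation under the target parameters theta^-.

    q_online, q_target: callables mapping a batch of states to
    Q-value arrays of shape (batch, num_actions).
    """
    best_actions = np.argmax(q_online(next_states), axis=1)      # selection with theta
    evaluated = q_target(next_states)[np.arange(len(best_actions)), best_actions]
    # Terminal transitions receive no bootstrapped value.
    return rewards + gamma * (1.0 - dones) * evaluated
```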

3.2 Deep Q-learning from Demonstration

Deep Q-learning from Demonstration (DQfD) (Hester et al., 2018) is a state-of-the-art method to leverage demonstration data to accelerate the learning process of DQN. The agent is pre-trained on demonstration data to obtain better initial parameters before any interaction with the environment. It keeps demonstration data in a prioritized replay buffer (Schaul et al., 2016) permanently, and gives additional priority to demonstration data to increase the probability that they are sampled. DQfD applies a combination of four losses: the typical one-step double Q-learning loss (L_TD), the N-step double Q-learning loss (L_N), the supervised large margin classification loss (L_E), and an L2 regularization loss. The overall loss is thus

L(θ) = L_TD(θ) + λ₁ L_N(θ) + λ₂ L_E(θ) + λ₃ ||θ||²₂.

The typical one-step temporal difference loss and N-step temporal difference loss are used to obtain the optimal Q-value conforming to the Bellman equation, where the N-step loss is

L_N(θ) = E_{(s,a,r,s′)∼D}[ (Q_N − Q(s, a; θ))² ],

Q_N = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^{N−1} r_{t+N−1} + γ^N max_a Q(s_{t+N}, a).

The large margin classification loss (Piot et al., 2014) is defined as

L_E(θ) = max_{a∈A}[ Q(s, a; θ) + M · 1[a ≠ a_E] ] − Q(s, a_E),

where a_E represents the action that the expert took in state s, M is a positive constant specifying the margin, and 1[·] is the indicator function. The L2 regularization loss is applied to the parameters of the network to prevent overfitting on the demonstration data. All losses are applied in both the pre-training and reinforcement learning phases, whereas the supervised loss is only applied to demonstration data.
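The pieces of the DQfD objective above can be combined as in the following sketch; it assumes the per-batch loss terms are already available as scalars, and the margin M and weights λ_i are hyperparameters (the names below are illustrative).

```python
import numpy as np

def large_margin_loss(q_values, expert_action, margin):
    """Supervised loss L_E for one demonstrated state.
    q_values: Q(s, .; theta), shape (num_actions,); expert_action: a_E."""
    penalty = np.full_like(q_values, margin)
    penalty[expert_action] = 0.0                 # no margin added on the expert action
    return np.max(q_values + penalty) - q_values[expert_action]

def dqfd_loss(l_td, l_n, l_e, l2, lambda1, lambda2, lambda3):
    # Overall loss L(theta); L_E is only computed on demonstration transitions.
    return l_td + lambda1 * l_n + lambda2 * l_e + lambda3 * l2
```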

3.3 Deep Exploration via Bootstrapped DQN

Exploration is an important issue in RL. For example, ε-greedy is commonly used, but it does not exploit any information. Bootstrapped DQN (Osband et al., 2016) is a modification of DQN to improve exploration during training. In practice, the network is built with K ∈ ℕ outputs, each representing a Q-value function estimate Q_k(s, a; θ). Each output head is trained against its own target network Q_k(s, a; θ⁻) and is updated with its own bootstrapped data from the replay buffer. The parameters of each head are initialized independently, while the gradient of each update is normalized with 1/K. During training, a single head is sampled at the beginning of each episode, and the agent follows the optimal policy corresponding to the sampled Q-value approximation function for the duration of the episode. This allows the agent to conduct more consistent exploration compared to other common dithering strategies such as ε-greedy. To keep track of the bootstrapped data for each head, we attach to each transition in the replay buffer a boolean mask m ∈ {0, 1}^K indicating which heads are privy to this data. The masks are drawn independently from an identical Bernoulli distribution (m_i ∼ Ber(p), ∀i ∈ 1 . . . K). However, their experiments show that the performance of bootstrapped DQN is not influenced by different choices of p, and that all outperform the original DQN.

Hence in practice, to increase computational efficiency, we simply share all the data between each head (p = 1).
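A minimal PyTorch sketch of the multi-head architecture and the per-episode head sampling described above (layer sizes and names are illustrative; the shared-data case p = 1 is assumed, so no masks are stored).

```python
import random
import torch
import torch.nn as nn

class BootstrappedQNetwork(nn.Module):
    """Shared torso followed by K independent Q-value heads."""

    def __init__(self, state_dim, num_actions, num_heads=10, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Linear(hidden, num_actions) for _ in range(num_heads)
        )

    def forward(self, state):
        h = self.torso(state)
        # (K, batch, num_actions): Q_k(s, a) for every head k.
        return torch.stack([head(h) for head in self.heads])

# A single head is sampled at the start of an episode and followed greedily.
net = BootstrappedQNetwork(state_dim=8, num_actions=4)
active_head = random.randrange(len(net.heads))
observation = torch.zeros(1, 8)                  # placeholder state
action = int(net(observation)[active_head].argmax(dim=1))
```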


3.4 Noisy Networks for Exploration

NoisyNet (Fortunato et al., 2017) is an alternative approach to improve the efficiency of exploration in RL, where the parameters in the output layer of a network are perturbed by noise. The noisy parameter θ of Q(s, a; θ) is represented by θ = µ + Σ ⊙ ε, where ζ = (µ, Σ) is a set of vectors of learnable parameters, ε is a vector of zero-mean noise sampled from the standard normal distribution, and ⊙ stands for element-wise multiplication. A noisy linear layer with p inputs and q outputs is then represented by

y = (µ^w + σ^w ⊙ ε^w) x + µ^b + σ^b ⊙ ε^b,

where µ^w + σ^w ⊙ ε^w and µ^b + σ^b ⊙ ε^b replace the weight matrix and bias vector of typical linear regression. The parameters µ^w ∈ ℝ^{q×p}, µ^b ∈ ℝ^q, σ^w ∈ ℝ^{q×p}, σ^b ∈ ℝ^q are learnable, and ε^w ∈ ℝ^{q×p}, ε^b ∈ ℝ^q are noise variables. The agent samples a new set of noise variables after each update step and follows the optimal policy corresponding to the sampled Q-value function. The noise of the online network, the target network, and the online network in double DQN are sampled independently. The loss function for noisy double DQN is defined as

L̄(ζ) = E_{ε,ε′,ε″}[ E_{(s,a,r,s′)∼D}[ r + γ Q(s′, a*, ε′; ζ⁻) − Q(s, a, ε; ζ) ]² ],

a* = argmax_{a′∈A} Q(s′, a′, ε″; ζ).
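The noisy linear layer of Section 3.4 can be sketched in PyTorch as below; this follows the independent-Gaussian formulation given above (the factorized variant used in the experiments differs), and the initialization values are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """y = (mu_w + sigma_w * eps_w) x + mu_b + sigma_b * eps_b."""

    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))
        self.register_buffer("eps_w", torch.zeros(out_features, in_features))
        self.register_buffer("eps_b", torch.zeros(out_features))
        self.reset_noise()

    def reset_noise(self):
        # A fresh zero-mean, unit-variance sample is drawn after each update step.
        self.eps_w.normal_()
        self.eps_b.normal_()

    def forward(self, x):
        weight = self.mu_w + self.sigma_w * self.eps_w
        bias = self.mu_b + self.sigma_b * self.eps_b
        return F.linear(x, weight, bias)
```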

4 Active Deep Q-learning with Demonstration

In this section, we first describe a new problem setting, then propose an uncertainty-based query strategy to address the problem, after which we introduce two ways to estimate the uncertainty of a deep Q-network given an observed state, with bootstrapped DQN and noisy DQN separately.

4.1 Problem Setup

We propose a new framework named Active Reinforcement Learning with Demonstration (ARLD) to improve demonstration efficiency, which is not considered in previous RLED works. In ARLD, we consider the standard RL framework introduced in Section 3.1. In addition, we assume there is an expert π+ which performs well on the task we seek to learn. Notice that π+ does not need to be optimal or deterministic, which is common for human experts. The agent interacts with the expert by asking what action to take only when it is not sure what to do given the observed state at each step. The algorithm then decides whether to take the action given by the expert, or to just take the action given by the agent's policy. The main challenge of ARLD is to decide when to query the expert so that the agent can indeed benefit from the obtained demonstration. As with active learning, our goal is to improve RL by making as few queries as possible. More precisely, given a limited query budget, we seek to enable the agent to learn to solve the task in as few steps as possible. Below, we discuss uncertainty sampling (Lewis and Gale, 1994), which is one of the simplest and most commonly used query frameworks in active supervised learning.

4.2 Query Strategy with Adaptive Uncertainty Threshold

Uncertainty sampling is one of the simplest and most commonly used query frameworks in active learning (Settles, 2009). In this framework, an active learner estimates the uncertainty of a pool of unlabeled instances and submits queries for those it is least certain how to label. It is challenging to apply active learning with deep neural networks, as good deep models typically require large amounts of data. Recent work has shown that uncertainty can be estimated by taking advantage of specialized models such as Bayesian neural networks (Gal et al., 2017). However, to improve RL by requesting an expert demonstration, we require an online query strategy that takes advantage of uncertainty.

In our setting, at each step, before the agent takes an action, it decides whether to query the expert. A naive way to solve this problem is to make the decision with a fixed threshold: the agent asks the expert for demonstration once the uncertainty of an observed state exceeds the threshold; otherwise it takes the action which maximizes the estimated Q-value. However, it is difficult to find a proper threshold when the distribution of uncertainty keeps changing; the discrepancy between different tasks also makes this difficult. One proposal is to adjust the threshold with a fixed adjustment factor to work with the online Query by Committee (QBC) (Krawczyk and Wozniak, 2017), but it is still difficult to choose a good adjustment factor for all tasks, especially when the uncertainty measurement is unbounded and its magnitude unknown.

We propose an adaptive method which enables the agent to adjust its query policy during training without any prior knowledge of the task. Each time the agent makes a decision, it compares the uncertainty of the current state with that estimated in recent steps. If the current state's uncertainty is larger than that of a given proportion of recent steps, the agent queries the expert for a demonstration; otherwise it determines its own action. In this algorithm, we decide whether to ask the expert given the parameters Nr and tquery, representing the number of recent steps we consider and the proportion of recent steps for which the state uncertainty must be higher than the current state uncertainty. In practice, we use a deque to maintain the uncertainties of recent steps and a balanced binary search tree (BST) to keep these uncertainties in order, so that we can make the decision in O(log2 Nr) time per step. The pseudocode of the algorithm is provided in Algorithm 1.

The performance of the algorithm depends on the choice of uncertainty estimation. Below, we propose two methods to estimate the uncertainty: one based on bootstrapped DQN and the other on the noisy network.


Algorithm 1 Adaptive Query Strategy with Uncertainty

Input: uncertainty Ut, reference size Nr, proportion threshold tquery ∈ [0, 1], recent uncertainty deque D, recent uncertainty BST B
Output: asking ∈ {TRUE, FALSE}, D, B

1: idx ← size of D × tquery
2: Uthreshold ← B[idx]
3: if Ut > Uthreshold then
4:     asking ← TRUE
5: else
6:     asking ← FALSE
7: end if
8: if size of D ≥ Nr then
9:     Udel ← D.pop_left()
10:    remove Udel from B
11: end if
12: add Ut into D and B
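A minimal Python sketch of Algorithm 1, assuming the BST B is emulated with sortedcontainers.SortedList kept in ascending order so that B[⌊|D| × tquery⌋] is read literally as in the pseudocode; class and variable names are illustrative.

```python
from collections import deque
from sortedcontainers import SortedList  # ordered structure standing in for the BST B

class AdaptiveQueryStrategy:
    """Compare the current uncertainty U_t against the last N_r uncertainties."""

    def __init__(self, reference_size, t_query):
        self.reference_size = reference_size   # N_r
        self.t_query = t_query                 # proportion threshold in [0, 1]
        self.recent = deque()                  # D: recent uncertainties in arrival order
        self.ordered = SortedList()            # B: same values, kept sorted (assumed ascending)

    def should_query(self, u_t):
        if self.ordered:
            idx = min(int(len(self.ordered) * self.t_query), len(self.ordered) - 1)
            asking = u_t > self.ordered[idx]   # U_t > U_threshold
        else:
            asking = True                      # no reference yet: query (assumption)
        # Keep only the last N_r uncertainties.
        if len(self.recent) >= self.reference_size:
            self.ordered.remove(self.recent.popleft())
        self.recent.append(u_t)
        self.ordered.add(u_t)
        return asking
```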

4.3 Divergence of Bootstrapped DQN

Bootstrapping is a commonly used technique in statistics to estimate a sampling distribution. In bootstrapped DQN (Osband et al., 2016), multiple value function heads Q_k(s, a; θ) are used to approximate a distribution over Q-values.

There are several ways to estimate uncertainty with these bootstrapped heads, including calculating the entropy of the voting distribution or averaging the variance of action values predicted by each head. In this work, we consider each head as a distribution and estimate the uncertainty using Jensen-Shannon divergence, which is a well-known method to measure the similarities between multiple distributions.

Notice that while the agents in bootstrapped DQN tend to behave differently at less critical states, because the Q-values of each action might be close to each other, the JS divergence in this situation will be low since the distributions of Q-values are similar. Thus, we can avoid querying at these unimportant states by using the JS divergence as the uncertainty. For example, consider an environment with two actions and two bootstrapped value functions: if the two heads predict (1, 0) and (0, 1) in one state, intuition dictates that this state should be more uncertain than another state where the heads predict (0.5, 0.4) and (0.4, 0.5).

However, with the voting method, both states are considered equally uncertain, as the two heads vote for different actions in both cases; JS divergence, by contrast, distinguishes between them. Another advantage of JS divergence is that, as a bounded function, its value is more meaningful and easier to use as a threshold in different environments.

To measure the JS divergence between the bootstrapped heads, we first normalize the Q-values over actions using a softmax to obtain a policy distribution.


For each head Q_k(s, a; θ),

π_k(a|s; θ) = e^{Q_k(s,a;θ)} / Σ_{a′} e^{Q_k(s,a′;θ)}.

Given this policy distribution, we estimate the uncertainty by calculating the Jensen-Shannon divergence of the policy distributions between the heads, yielding

U_D = JS(π₁, π₂, ..., π_K) = H( (1/K) Σ_k π_k ) − (1/K) Σ_k H(π_k),

where H(π) is the Shannon entropy of distribution π and K the number of bootstrapped heads.
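A NumPy sketch of the divergence-based uncertainty U_D: softmax each head's Q-values into a policy, then take the JS divergence of the K policies (function names are illustrative).

```python
import numpy as np

def softmax(q):
    z = q - q.max(axis=-1, keepdims=True)       # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

def js_uncertainty(q_heads):
    """q_heads: array of shape (K, num_actions), the Q-values that the K
    bootstrapped heads predict for the current state."""
    policies = softmax(q_heads)                 # pi_k(.|s), shape (K, num_actions)
    mixture = policies.mean(axis=0)             # (1/K) * sum_k pi_k
    return entropy(mixture) - entropy(policies).mean()
```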

4.4 Predictive Variance of Noisy DQN

For our second estimate of uncertainty, we evaluate the predictive variance of a noisy network. Previous work has shown the effectiveness of estimating uncertainty by the predictive variance of a Bayesian convolutional network in active learning for classification (Gal et al., 2017). Other works have shown that injecting noise into the parameter space improves the exploration process in deep reinforcement learning (Fortunato et al., 2017; Plappert et al., 2017). Combining these two ideas, we use the noisy network as an exploration policy and estimate uncertainty using its predictive variance. We replace the fully connected output layer with a noisy fully connected layer. The corresponding noisy output layer can be seen as Bayesian linear regression:

Q(s, a; θ) = w_a^⊤ φ(s) + b_a,

w_a ∼ N(µ^{(w_a)}, Σ), Σ = diag((σ^{(w_a)})²),

b_a ∼ N(µ^{(b_a)}, (σ^{(b_a)})²),

where φ(s) ∈ ℝ^p is the input to the last layer, w_a ∈ ℝ^p and b_a ∈ ℝ are the variables corresponding to action a, and µ^{(w_a)}, σ^{(w_a)}, µ^{(b_a)}, and σ^{(b_a)} are the parameters actually learned by the model, representing the mean and noise level of w_a and b_a respectively.

Given the posterior distribution of the parameters, we derive the predictive variance as

Var[Q(s, a)] = Var[w_a^⊤ φ(s) + b_a] = Var[w_a^⊤ φ(s)] + Var[b_a] = φ(s)^⊤ Σ φ(s) + (σ^{(b_a)})².

The variance of each action measures the lack of confidence with respect to this action. We take the variance of the action with the largest Q-value as our uncertainty:

U_V = Var[ Q(s, argmax_a Q(s, a)) ],


which translates to the lack of confidence in the action the agent would take at that step. By querying at states with low confidence, we avoid bad moves that lead to task failure and explicitly teach the agent the proper action for those states.
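A NumPy sketch of U_V for a single state under the Bayesian-linear-regression view of the noisy output layer; the parameter layout (one row of weights per action) is an assumption made for illustration.

```python
import numpy as np

def predictive_variance_uncertainty(phi, mu_w, sigma_w, mu_b, sigma_b):
    """phi: phi(s), shape (p,);  mu_w, sigma_w: shape (num_actions, p);
    mu_b, sigma_b: shape (num_actions,)."""
    q_mean = mu_w @ phi + mu_b                       # mean Q-values per action
    greedy = int(np.argmax(q_mean))                  # action the agent would take
    # Var[Q(s,a)] = phi^T diag(sigma_w[a]^2) phi + sigma_b[a]^2
    var = (sigma_w ** 2) @ (phi ** 2) + sigma_b ** 2
    return var[greedy]
```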

5 Experiments

In this section, we describe the environments used for the evaluation as well as the experimental setup. To focus on the effectiveness of each query strategy, we show the experimental results of methods based on the bootstrapped and noisy network separately, after which we present results evaluating the effect of the query proportion threshold of the proposed method.

5.1 Experimental Setup

We use four different environments for evaluation: (1) Cart-Pole, (2) Acrobot, (3) Mountain Car, and (4) Lunar Lander, all of which are included in OpenAI Gym (Brockman et al., 2016). Among them, Cart-Pole is the simplest task and Lunar Lander the most complicated. The target score to mark each task as solved is listed in Table 3.

For each environment, we evaluated six different methods based on bootstrapped (Osband et al., 2016) or noisy (Fortunato et al., 2017) networks separately. The methods are:

1. DQN: prioritized double DQN trained without any demonstration;

2. DQfD: Deep Q-learning from Demonstration (Hester et al., 2018);

3. GDQN: greedy query strategy, which queries all states until the budget is spent;

4. BDQN: Bernoulli query strategy, which queries states with a fixed probability;

5. ADQN: Active DQN, which queries states according to the proposed query strategy and uncertainty estimation;

6. ADQNP: Active DQN with DQfD pretraining.

The key differences between the methods are summarized in Table 2.

We tuned the basic parameters such as the learning rate and discount factor for DQN on each environment to ensure reasonable learning progress and then fixed these parameters for all six methods, as listed in Table 4. The network structure applied in all environments was identical: two fully connected hidden layers with 64 neurons followed by another fully connected layer to the Q-values for each action. The layers all used rectified linear units (ReLU) for non-linearity. We trained the networks using Adam and an ε-greedy policy with ε annealed linearly from 0.9 to 0.01. We set the parameters of prioritized replay according to (Schaul et al., 2016).


Demonstration Pre-training Interaction Query criterion

DQN no no no no

DQfD yes yes no no

GDQN yes no yes greedy

BDQN yes no yes Bernoulli

ADQN yes no yes uncertainty

ADQNP yes yes yes uncertainty

Table 2: Comparison between methods

For bootstrapped DQN, we used 10 bootstrap heads with normalized gradients and shared all the data between heads, as in (Osband et al., 2016).

For the noisy networks, we used factorized noise and followed the initialization and hyperparameter values from (Fortunato et al., 2017).

For DQfD, we did not use the L2 regularization loss or the N-step temporal difference loss, as they brought no benefit to training in our experiments. We set the expert margin M = 0.8 as in (Hester et al., 2018) and tuned the supervised loss weight λ in {10^-5, 10^-4, 10^-3, ..., 1}. The number of demonstration data and pre-training steps were set to allow DQfD to learn a better initial policy than learning from scratch.

For each query, all ARLD methods receive five consecutive expert demonstrations, or fewer if the episode ends first. The query threshold tquery is tuned in {0.05, 0.1, 0.3, 0.5}. For ADQNP, the number of demonstrations for pretraining is half that of DQfD and the query budget is half that of ADQN, resulting in the same total number of demonstrations. All task-specific parameters are listed in Table 4.

Mean score/std Min. score Avg. steps Target score

Cart-Pole 166.77±39.14 93 166.77 195

Acrobot -128.25±66.86 -489 128.25 -100

Mountain Car -134.0±27.52 -158 134 -110

Lunar Lander 155.18±55.58 -16.63 784.92 200

Table 3: Expert statistics

To obtain an expert for each environment, we saved the prioritized double DQN models during training and evaluated each over 100 episodes. Then we chose as the environment expert a model that (a) did not perfectly solve the task, (b) had reached low performance variance, and (c) still solved the task before the end of the episode. This choice ensures that the experts are realistic rather than idealistic. These experts were used to collect demonstration data in DQfD and to perform interactive demonstration in ADQN. The evaluation statistics of these experts are shown in Table 3. In Section 5.5, we show the effect of different artificial expert settings on ADQN and DQfD.

All of the experiments were repeated 20 times with different random seeds.


Cart-Pole Acrobot Mountain Car Lunar Lander

Discount factor 0.9 0.99 0.99 0.99

Learning rate 0.0001 0.0001 0.001 0.001

Training steps 20000 200000 500000 500000

# of demo/budget 200 100 500 3000

Memory size 10000 100000 100000 100000

Pre-training steps 10000 10000 10000 30000

DQfD λ 0.00001 1 1 1

ADQN-B tquery 0.05 0.3 0.1 0.3

ADQN-N tquery 0.5 0.3 0.3 0.1

Table 4: Task specific parameters

The figures show the median of the results over 20 trials. The y-axis indicates the averaged test score, where the test scores in each trial were computed at a fixed frequency by executing 20 test episodes without exploration.

5.2 Comparison between ARLD Methods

We first compare the methods without pretraining, i.e., DQN, GDQN, BDQN, and ADQN. Table 5 lists the median number of steps that each method takes to solve the tasks over 20 trials. Among methods based on bootstrapped DQN, the proposed ADQN outperforms the other methods in all environments and yields significant improvements in Acrobot, Mountain Car, and Lunar Lander.

For methods based on the noisy network, ADQN also achieves the best performance in three out of four tasks, and improves the learning progress in Acrobot and Lunar Lander dramatically. The strength of ADQN over DQN again confirms the usefulness of interacting with demonstration. Most importantly, the advantage of ADQN over GDQN and BDQN validates that ADQN allows a more effective use of the demonstration efforts by querying strategically at the important moments.

Fig. 1 shows the learning curves of ARLD methods based on the bootstrapped and noisy networks separately. The results demonstrate that the methods with demonstration not only outperform the original DQN in general, but also achieve higher scores than the realistically-simulated experts. Among the methods with demonstration, ADQN is often the most competitive one, especially in the hardest task of Lunar Lander, which ADQN solved with fewer steps and a higher score.

5.3 Comparison with Pretraining Methods

Next, we compare DQfD with ADQN to understand the effect of collecting expert demonstration before or during training. We also design ADQNP as a simple mixture of the two. The results in Table 5 show that ADQN usually solves the tasks with fewer steps than DQfD or ADQNP, except for the simplest task of Cart-Pole.


Cart-Pole Acrobot Mountain Car Lunar Lander

DQN-B 8000 57000 210000 260000

GDQN-B 8000 45000 170000 310000

BDQN-B 7500 31000 100000 217500

ADQN-B 7000 25000 85000 205000

DQfD-B 7500 38000 140000 500000

ADQNP-B 6000 49000 135000 267500

DQN-N 13000 114000 190000 355000

GDQN-N 10000 73000 295000 500000

BDQN-N 10000 55000 170000 160000

ADQN-N 9500 8000 230000 47500

DQfD-N 8000 95000 300000 402500

ADQNP-N 8000 75000 255000 212500

Table 5: Median number of steps to solve the task. The bold numbers indicate the best performance on that task.

ADQNP also often improves over DQfD when the tasks get harder. The results justify the effectiveness of leveraging expert demonstration during training.

To simulate the real-world scenario where the demonstrating humans may not always be perfect, our simulated experts are designed to be realistic but imperfect. Methods with pretraining then need to take additional steps at the beginning to correct the policy learned from the imperfect experts. This situation explains the performance drop at the beginning of DQfD for Acrobot and Lunar Lander, as shown in Fig. 2. ADQN, on the other hand, demonstrates a better ability to leverage the imperfect demonstrations to aid RL.

5.4 Effect of Query Proportion Threshold

Active DQN uses two parameters: the number of recent steps over which we compare the uncertainty (Nr) and the proportion threshold that determines whether to make a query given recent steps (tquery). Since the uncertainty distribution usually changes smoothly, the value of Nr affects the performance little compared to the parameter tquery. In Fig. 3 we plot the performance for different values of tquery: we observe that in most cases, different choices of tquery perform similarly. Moreover, in Table 4 we see that the best values of tquery are either 0.1 or 0.3 in Acrobot, Mountain Car, and Lunar Lander; in Cart-Pole, the only exception, the results show the least variance between choices of tquery. Thus, the performance of Active DQN is not sensitive to the parameter tquery, and it is easy to choose a value between 0.1 and 0.3 that optimizes performance for all kinds of tasks.


5.5 Effect of Expert’s Quality on ADQN and DQfD

In this section, we compare three different approaches to obtaining the expert agents used in the experiment. First, the perfect experts are DQN agents trained on each task until convergence. These well-trained experts can solve their tasks perfectly and efficiently. Second, we obtain weaker experts by applying random noise to the perfect expert: each time the expert is going to make a demonstration, there is a probability that it will take a random action rather than following the perfect expert's policy. The random behaviors sometimes lead directly to the end of an episode; therefore, even though the expert is able to make optimal choices most of the time, it still might fail to solve the task at the beginning of an episode. Last, as described in Section 5.1 and Table 3, we saved the intermediate models in the process of training a DQN agent and selected one of them to be a weak expert. Compared to the noisily-acting weak experts, these policy-consistent weak experts act more consistently throughout an episode; hence their behavior is more similar to that of a human expert.

Figure 4 demonstrates the effect of the expert's quality on ADQN and DQfD in Lunar Lander. We experiment with both noisy and bootstrapped network structures along with the three experts mentioned above. The figure shows that both DQfD and ADQN can learn efficiently and solve the task in a small number of steps with a perfect expert. In contrast, both of their performances suffer with the noisy weak expert: they not only learn more slowly than with perfect experts, but also converge at a worse score. However, when working with the policy-consistent weak expert, though DQfD still performs poorly, ADQN converges at higher scores which are close to the ones achieved with perfect experts. As a result, we find that ADQN is able to take advantage of policy-consistent weak experts, which are similar to human experts.

6 Conclusion and Future Work

In this work, we propose Active DQN, which makes RL with demonstration more efficient with regard to human effort. We use DQfD to leverage demonstration data and propose a novel uncertainty-based query strategy which applies to diverse tasks. We provide two measurements of the uncertainty: the divergence of Bootstrapped DQN, and the predictive variance of Noisy DQN.

Experimental results show that both of the proposed methods yield better performance than the expert and learn faster with the same number of demonstrations in different tasks.

As an initial work on Active Reinforcement Learning with Demonstration, the proposed method has achieved promising performance. A possible extension of this work is to apply it to RL algorithms such as DDPG that work on continuous action spaces. A more difficult challenge is to work with on-policy methods such as policy gradient, which require a different way to learn with demonstration.


(a) Comparison between methods with bootstrapped DQN

(b) Comparison between methods with noisy DQN

Figure 1: Comparison between different ARLD methods. The dashed lines indicate the score of solving the task.


(a) ADQN, DQfD and ADQNP based on bootstrapped DQN

(b) ADQN, DQfD and ADQNP based on noisy DQN

Figure 2: Comparison between ADQN, DQfD and ADQNP. The dashed lines indicate the score of solving the task.


(a) Active DQN based on bootstrapped DQN

(b) Active DQN based on noisy DQN

Figure 3: Comparison between different query thresholds among {0.05, 0.1, 0.3, 0.5}.


(a) Effect of expert’s quality on DQfD-B and ADQN-B

(b) Effect of expert’s quality on DQfD-N and ADQN-N

Figure 4: Comparison between perfect expert, perfect expert with 40% random action, and weak expert.


Acknowledgement

MS was supported by KAKENHI 17H00757.

References

Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W (2016) OpenAI Gym. CoRR abs/1606.01540, URL http://arxiv.org/abs/1606.01540

Brys T, Harutyunyan A, Suay HB, Chernova S, Taylor ME, Nowé A (2015) Reinforcement learning from demonstration through shaping. In: IJCAI, AAAI Press, pp 3352–3358

Fortunato M, Azar MG, Piot B, Menick J, Osband I, Graves A, Mnih V, Munos R, Hassabis D, Pietquin O, Blundell C, Legg S (2017) Noisy networks for exploration. CoRR abs/1706.10295, URL http://arxiv.org/abs/1706.10295

Gal Y, Islam R, Ghahramani Z (2017) Deep Bayesian active learning with image data. In: Precup D, Teh YW (eds) Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, PMLR, Proceedings of Machine Learning Research, vol 70, pp 1183–1192, URL http://proceedings.mlr.press/v70/gal17a.html

van Hasselt H, Guez A, Silver D (2016) Deep reinforcement learning with double Q-learning. In: Schuurmans D, Wellman MP (eds) Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, AAAI Press, pp 2094–2100, URL http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12389

Hester T, Vecerik M, Pietquin O, Lanctot M, Schaul T, Piot B, Horgan D, Quan J, Sendonaris A, Osband I, Dulac-Arnold G, Agapiou J, Leibo JZ, Gruslys A (2018) Deep Q-learning from demonstrations. In: McIlraith SA, Weinberger KQ (eds) AAAI, AAAI Press, URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16976

Hosu I, Rebedea T (2016) Playing Atari games with deep reinforcement learning and human checkpoint replay. CoRR abs/1607.05077

Judah K, Fern AP, Dietterich TG, et al. (2014) Active imitation learning: formal and practical reductions to IID learning. The Journal of Machine Learning Research 15(1):3925–3963

Kang B, Jie Z, Feng J (2018) Policy optimization with demonstrations. In: Dy J, Krause A (eds) Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholmsmässan, Stockholm, Sweden, Proceedings of Machine Learning Research, vol 80, pp 2469–2478, URL http://proceedings.mlr.press/v80/kang18a.html

Krawczyk B, Wozniak M (2017) Online query by committee for active learning from drifting data streams. In: 2017 International Joint Conference on Neural Networks, IJCNN 2017, Anchorage, AK, USA, May 14-19, 2017, IEEE, pp 2120–2127, DOI 10.1109/IJCNN.2017.7966111, URL https://doi.org/10.1109/IJCNN.2017.7966111

Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. CoRR abs/cmp-lg/9407020, URL http://arxiv.org/abs/cmp-lg/9407020

Lipton ZC, Gao J, Li L, Li X, Ahmed F, Deng L (2016) Efficient exploration for dialog policy learning with deep BBQ networks & replay buffer spiking. CoRR abs/1608.05081

Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller MA, Fidjeland A, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, Hassabis D (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533, DOI 10.1038/nature14236, URL https://doi.org/10.1038/nature14236

Osband I, Blundell C, Pritzel A, Roy BV (2016) Deep exploration via bootstrapped DQN. In: Lee DD, Sugiyama M, von Luxburg U, Guyon I, Garnett R (eds) NIPS, pp 4026–4034, URL http://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn

Piot B, Geist M, Pietquin O (2014) Boosted Bellman residual minimization handling expert demonstrations. In: ECML/PKDD (2), Springer, Lecture Notes in Computer Science, vol 8725, pp 549–564

Plappert M, Houthooft R, Dhariwal P, Sidor S, Chen RY, Chen X, Asfour T, Abbeel P, Andrychowicz M (2017) Parameter space noise for exploration. CoRR abs/1706.01905, URL http://arxiv.org/abs/1706.01905

Ross S, Gordon GJ, Bagnell D (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In: AISTATS, JMLR.org, JMLR Proceedings, vol 15, pp 627–635

Schaal S (1996) Learning from demonstration. In: NIPS, MIT Press, pp 1040–1046

Schaul T, Quan J, Antonoglou I, Silver D (2016) Prioritized experience replay. In: International Conference on Learning Representations, Puerto Rico

Settles B (2009) Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison

Shon AP, Verma D, Rao RPN (2007) Active imitation learning. In: AAAI, AAAI Press, pp 756–762

Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap TP, Leach M, Kavukcuoglu K, Graepel T, Hassabis D (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489

Suay HB, Brys T, Taylor ME, Chernova S (2016) Learning from demonstration for shaping through inverse reinforcement learning. In: AAMAS, ACM, pp 429–437

Subramanian K, Jr CLI, Thomaz AL (2016) Exploration from demonstration for interactive reinforcement learning. In: AAMAS, ACM, pp 447–456

Sun W, Venkatraman A, Gordon GJ, Boots B, Bagnell JA (2017) Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In: ICML, PMLR, Proceedings of Machine Learning Research, vol 70, pp 3309–3318

Sutton RS, Barto AG (1998) Reinforcement learning - an introduction. Adaptive Computation and Machine Learning, MIT Press

Taylor ME, Suay HB, Chernova S (2011) Integrating reinforcement learning with human demonstrations of varying ability. In: AAMAS, IFAAMAS, pp 617–624

Vecerik M, Hester T, Scholz J, Wang F, Pietquin O, Piot B, Heess N, Rothörl T, Lampe T, Riedmiller MA (2017) Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. CoRR abs/1707.08817, URL http://arxiv.org/abs/1707.08817

Wang Z, Taylor ME (2017) Improving reinforcement learning with confidence-based demonstrations. In: IJCAI, ijcai.org, pp 3027–3033

Watter M, Springenberg JT, Boedecker J, Riedmiller MA (2015) Embed to control: A locally linear latent dynamics model for control from raw images. In: NIPS, pp 2746–2754
