Q-Learning
Applied Deep Learning
March 28th, 2020 http://adl.miulab.tw
Reinforcement Learning Approach
◉ Value-based RL
○ Estimate the optimal value function
◉ Policy-based RL
○ Search directly for optimal policy
◉ Model-based RL
○ Build a model of the environment
○ Plan (e.g. by lookahead) using model
The optimal policy 𝜋* is the policy achieving maximum future reward
The optimal value function 𝑄*(𝑠, 𝑎) is the maximum value achievable under any policy
RL Agent Taxonomy
(Taxonomy diagram: value-based methods learn a critic, policy-based methods learn an actor, actor-critic methods learn both, and each approach can be model-free or model-based.)
Learning a Critic
Value-Based Approach
Value Function
◉ A value function is a prediction of future reward (with action a in state s)
◉ Q-value function gives expected total reward
○ from state 𝑠 and action 𝑎
○ under policy 𝜋
○ with discount factor 𝛾
◉ Value functions decompose into a Bellman equation
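Concretely, the Q-value function and its Bellman decomposition are
𝑄𝜋(𝑠, 𝑎) = E[ 𝑟𝑡+1 + 𝛾 𝑟𝑡+2 + 𝛾² 𝑟𝑡+3 + … | 𝑠, 𝑎 ]
𝑄𝜋(𝑠, 𝑎) = E𝑠′,𝑎′[ 𝑟 + 𝛾 𝑄𝜋(𝑠′, 𝑎′) | 𝑠, 𝑎 ]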
Optimal Value Function
◉ An optimal value function is the maximum achievable value
◉ The optimal value function allows us to act optimally
◉ The optimal value informally maximizes over all decisions
◉ Optimal values decompose into a Bellman equation
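Concretely, 𝑄*(𝑠, 𝑎) = max𝜋 𝑄𝜋(𝑠, 𝑎), and the Bellman optimality equation is
𝑄*(𝑠, 𝑎) = E𝑠′[ 𝑟 + 𝛾 maxₐ′ 𝑄*(𝑠′, 𝑎′) | 𝑠, 𝑎 ]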
Value Function Approximation
◉ So far, value functions have been represented by a lookup table, which does not scale:
○ too many states and/or actions to store
○ too slow to learn the value of each entry individually
◉ Values can be estimated with function approximation
Q-Networks
◉ Q-networks represent value functions with weights
○ generalize from seen states to unseen states
○ update parameters for function approximation
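As an illustration, a Q-network for a low-dimensional state and a discrete action space might look like the following minimal PyTorch sketch (the layer sizes are placeholders, not from the slides):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),  # Q(s, a) for every action a
        )

    def forward(self, state):
        return self.net(state)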
Q-Learning
◉ Goal: estimate optimal Q-values
○ Optimal Q-values obey a Bellman equation
○ Value iteration algorithms solve the Bellman equation
(the right-hand side 𝑟 + 𝛾 maxₐ′ 𝑄(𝑠′, 𝑎′) serves as the learning target)
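A minimal sketch of the tabular version of this idea (hypothetical names; Q is a table indexed by discrete state and action):

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q[s, a] toward the Bellman target."""
    target = r + gamma * np.max(Q[s_next])   # learning target r + γ max_a' Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])    # move the current estimate toward the target
    return Q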
Critic = Value Function
◉ Idea: evaluate how good the actor is
◉ State value function 𝑉𝜋(𝑠): when using actor 𝜋, the expected total reward after seeing observation (state) 𝑠
◉ A critic does not determine the action
◉ An actor can be found from a critic
(𝑉𝜋(𝑠) is a scalar; it is larger for states from which 𝜋 expects high future reward and smaller otherwise.)
Monte-Carlo for Estimating 𝑉𝜋(𝑠)
◉ Monte-Carlo (MC)
○ The critic watches 𝜋 playing the game
○ MC learns directly from complete episodes: no bootstrapping
○ After seeing 𝑠𝑎, the cumulated reward until the end of the episode is 𝐺𝑎, so 𝑉𝜋(𝑠𝑎) should be close to 𝐺𝑎
○ After seeing 𝑠𝑏, the cumulated reward until the end of the episode is 𝐺𝑏, so 𝑉𝜋(𝑠𝑏) should be close to 𝐺𝑏
Idea: value = empirical mean return
Issue: long episodes delay learning
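For example, the MC targets can be computed from one finished episode as follows (a small sketch with hypothetical names; each returned G_t is then used as a regression target for 𝑉𝜋(𝑠𝑡)):

def monte_carlo_returns(rewards, gamma=0.99):
    """Compute the discounted return G_t for every step of one complete episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):      # accumulate from the end of the episode
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns                   # returns[t] is the MC target for V(s_t)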
Temporal-Difference for Estimating 𝑉𝜋(𝑠)
◉ Temporal-difference (TD)
○ The critic watches 𝜋 playing the game
○ TD learns directly from incomplete episodes by bootstrapping
○ TD updates a guess towards a guess
Idea: update the value toward the estimated return, i.e., make 𝑉𝜋(𝑠𝑡) close to 𝑟𝑡 + 𝛾 𝑉𝜋(𝑠𝑡+1)
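A minimal tabular TD(0) sketch (hypothetical names; in the deep setting V is a network and the same bootstrapped target appears in the loss):

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V[s] toward the bootstrapped target r + γ V[s_next]."""
    td_target = r + gamma * V[s_next]    # a guess built from the next state's guess
    V[s] += alpha * (td_target - V[s])
    return V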
MC vs. TD
◉ Monte-Carlo (MC)
○ Large variance
○ Unbiased
○ Does not exploit the Markov property
◉ Temporal-Difference (TD)
○ Small variance
○ Biased
○ Exploits the Markov property
…
MC vs. TD
Critic = Value Function
◉ State-action value function 𝑄𝜋(𝑠, 𝑎): when using actor 𝜋, the expected total reward after seeing observation (state) 𝑠 and taking action 𝑎
○ One network form takes 𝑠 and 𝑎 as input and outputs the scalar 𝑄𝜋(𝑠, 𝑎)
○ Another form takes 𝑠 as input and outputs 𝑄𝜋(𝑠, 𝑎) for every action at once, which works for discrete actions only
Q-Learning
◉ Given 𝑄𝜋(𝑠, 𝑎), find a new actor 𝜋′ that is “better” than 𝜋
○ The new actor is 𝜋′(𝑠) = arg maxₐ 𝑄𝜋(𝑠, 𝑎), so 𝜋′ does not have extra parameters (it depends only on the value function)
○ Taking the arg max over actions is not suitable for continuous actions
◉ The loop: 𝜋 interacts with the environment → learn 𝑄𝜋(𝑠, 𝑎) by TD or MC → find a new actor 𝜋′ “better” than 𝜋 → set 𝜋 = 𝜋′ and repeat
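In code, deriving the improved actor from the critic is just an arg max over the Q-network's outputs (sketch assuming the PyTorch QNetwork above and a 1-D state tensor; exploration is discussed later):

import torch

def greedy_action(q_network, state):
    """π'(s) = argmax_a Q^π(s, a): the actor is implicit in the critic."""
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))   # shape: (1, num_actions)
    return int(q_values.argmax(dim=1).item())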
Q-Learning
◉ Goal: estimate optimal Q-values
○ Optimal Q-values obey a Bellman equation
○ Value iteration algorithms solve the Bellman equation
(the right-hand side 𝑟 + 𝛾 maxₐ′ 𝑄(𝑠′, 𝑎′) serves as the learning target)
Deep Q-Networks (DQN)
◉ Estimate the value function by TD
◉ Represent the value function by a deep Q-network with weights 𝑤
◉ Objective: minimize the MSE loss by SGD
Deep Q-Networks (DQN)
◉ Objective: minimize the MSE loss by SGD
◉ This leads to the following Q-learning gradient
◉ Issue: naïve Q-learning oscillates or diverges when using a neural network, due to:
1) correlations between samples, and 2) non-stationary targets
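A sketch of the MSE objective (PyTorch-style, with hypothetical tensor names; target_q_net is the frozen network introduced below):

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_q_net, batch, gamma=0.99):
    """MSE between Q(s, a) and the Q-learning target r + γ max_a' Q_target(s', a')."""
    s, a, r, s_next, done = batch                               # a must be a long tensor
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s_t, a_t)
    with torch.no_grad():                                        # targets are not differentiated
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)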
Stability Issues with Deep RL
◉ Naive Q-learning oscillates or diverges with neural nets
1. Data is sequential
■ Successive samples are correlated, non-iid (independent and identically distributed)
2. Policy changes rapidly with slight changes to Q-values
■ Policy may oscillate
■ Distribution of data can swing from one extreme to another
3. Scale of rewards and Q-values is unknown
■ Naive Q-learning gradients can be unstable when backpropagated
Stable Solutions for DQN
◉ DQN provides stable solutions to deep value-based RL
1. Use experience replay
■ Break correlations in data, bringing us back to the iid setting
■ Learn from all past policies
2. Freeze the target Q-network
■ Avoid oscillation
■ Break correlations between the Q-network and the target
3. Clip rewards or normalize network adaptively to sensible range
■ Robust gradients
Stable Solution 1: Experience Replay
◉ To remove correlations, build a dataset from the agent’s experience
○ Take action 𝑎𝑡 according to an 𝜖-greedy policy (small probability of exploration)
○ Store the transition (𝑠𝑡, 𝑎𝑡, 𝑟𝑡+1, 𝑠𝑡+1) in replay memory D
○ Sample a random mini-batch of transitions from D
○ Optimize the MSE between the Q-network and the Q-learning targets
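A minimal replay memory sketch (names are illustrative; a deque or ring buffer with fixed capacity is a common choice):

import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s_next, done) and samples random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # drop the oldest experience if full

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # random draws break temporal correlations
        return list(zip(*batch))                         # tuple of columns (states, actions, ...)

    def __len__(self):
        return len(self.buffer)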
Exploration
◉ The policy is based on the Q-function, but always acting greedily w.r.t. 𝑄 is not good for data collection → inefficient learning
○ e.g., for a state 𝑠 with actions 𝑎1, 𝑎2, 𝑎3, the action that currently looks best is always sampled while the others are never explored
◉ Exploration algorithms
○ Epsilon greedy (𝜀 would decay during learning)
○ Boltzmann sampling
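A small sketch of 𝜀-greedy action selection with decay (the decay schedule shown is an illustrative choice, not from the slides):

import random
import torch

def epsilon_greedy_action(q_network, state, epsilon, num_actions):
    """With probability ε pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(q_network(state.unsqueeze(0)).argmax(dim=1).item())

# ε is typically annealed over training, e.g. epsilon = max(0.05, epsilon * 0.999) per step.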
Replay Buffer
◉ In the loop (𝜋 interacts with the environment → learn 𝑄𝜋(𝑠, 𝑎) → find a new actor 𝜋′ “better” than 𝜋 → 𝜋 = 𝜋′), every experience is put into a replay buffer
○ The experience in the buffer comes from different actors 𝜋
○ Drop the oldest experience if the buffer is full
Replay Buffer
◉ In each iteration:
1. Sample a batch of experience from the buffer
2. Update the Q-function with that batch
◉ This makes learning off-policy: the sampled experience was collected by earlier actors, not only by the current 𝜋
Stable Solution 2: Fixed Target Q-Network
◉ To avoid oscillations, fix parameters used in Q-learning target
○ Compute Q-learning targets w.r.t. old, fixed parameters
○ Optimize the MSE between the Q-network and the Q-learning targets
○ Periodically update the fixed (frozen) parameters
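In code, the fixed target network is typically a frozen copy that is refreshed every C gradient steps (sketch; q_net follows the earlier QNetwork example):

import copy

target_q_net = copy.deepcopy(q_net)        # frozen copy used only to compute targets
for p in target_q_net.parameters():
    p.requires_grad_(False)

def maybe_sync_target(step, sync_every=1000):
    """Every C steps, copy the online weights into the frozen target network."""
    if step % sync_every == 0:
        target_q_net.load_state_dict(q_net.state_dict())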
Stable Solution 3: Reward / Value Range
◉ To avoid oscillations, control the reward / value range
○ DQN clips the rewards to [−1, +1]
▪ Prevents Q-values from becoming too large
▪ Ensures gradients are well-conditioned
Typical Q-Learning Algorithm
◉ Initialize Q-function 𝑄 and target Q-function 𝑄̂ = 𝑄
◉ In each episode
○ For each time step 𝑡
■ Given state 𝑠𝑡, take action 𝑎𝑡 based on 𝑄 (epsilon greedy)
■ Obtain reward 𝑟𝑡, and reach new state 𝑠𝑡+1
■ Store (𝑠𝑡, 𝑎𝑡, 𝑟𝑡, 𝑠𝑡+1) into the buffer
■ Sample (𝑠𝑖, 𝑎𝑖, 𝑟𝑖, 𝑠𝑖+1) from the buffer (usually a batch)
■ Update the parameters of 𝑄 to make 𝑄(𝑠𝑖, 𝑎𝑖) close to 𝑟𝑖 + 𝛾 maxₐ 𝑄̂(𝑠𝑖+1, 𝑎)
■ Every 𝐶 steps reset 𝑄̂ = 𝑄
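Putting the pieces together, a compact and simplified training loop might look like this (a sketch assuming the classic Gym env.reset()/env.step() API; the helper functions are the sketches from earlier slides, and all hyperparameters are illustrative):

import numpy as np
import torch

def train_dqn(env, q_net, target_q_net, buffer, optimizer,
              episodes=500, batch_size=32, gamma=0.99,
              epsilon=1.0, sync_every=1000):
    step = 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy_action(q_net, torch.as_tensor(s, dtype=torch.float32),
                                      epsilon, env.action_space.n)
            s_next, r, done, _ = env.step(a)
            buffer.push(s, a, r, s_next, float(done))       # store the transition
            s = s_next
            step += 1
            epsilon = max(0.05, epsilon * 0.999)             # decay exploration
            if len(buffer) >= batch_size:
                cols = buffer.sample(batch_size)             # (states, actions, rewards, next_states, dones)
                states, actions, rewards, next_states, dones = [np.asarray(c) for c in cols]
                batch = (torch.as_tensor(states, dtype=torch.float32),
                         torch.as_tensor(actions, dtype=torch.long),
                         torch.as_tensor(rewards, dtype=torch.float32),
                         torch.as_tensor(next_states, dtype=torch.float32),
                         torch.as_tensor(dones, dtype=torch.float32))
                loss = dqn_loss(q_net, target_q_net, batch, gamma)   # MSE to the frozen target
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            if step % sync_every == 0:                       # every C steps reset Q̂ = Q
                target_q_net.load_state_dict(q_net.state_dict())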
Deep RL in Atari Games
DQN in Atari
◉ Goal: end-to-end learning of values Q(s, a) from pixels
○ Input: state is a stack of raw pixels from the last 4 frames
○ Output: Q(s, a) for all joystick/button positions a
○ Reward is the score change for that step
DQN Nature Paper [link] [code]
DQN in Atari
DQN Nature Paper [link] [code]
Concluding Remarks
◉ RL is a general-purpose framework for decision making through interactions between an agent and an environment
◉ Value-based RL measures how good each state and/or action is via a value function
○ Monte-Carlo (MC) vs. Temporal-Difference (TD)
Advanced DQN
Double DQN
◉ Q-values are usually over-estimated
Double DQN
◉ Nature DQN: the learning target takes a max over the estimated Q-values of the next state
◉ Issue: the max tends to select the action whose value is over-estimated
Hasselt et al.,“Deep Reinforcement Learning with Double Q-learning”, AAAI 2016.
Double DQN
◉ Nature DQN uses the same network both to select and to evaluate the maximizing action in the target
◉ Double DQN: remove the upward bias caused by maxₐ 𝑄(𝑠, 𝑎, 𝑤)
○ The current Q-network 𝑤 is used to select actions
○ The older Q-network 𝑤⁻ is used to evaluate actions
Hasselt et al.,“Deep Reinforcement Learning with Double Q-learning”, AAAI 2016.
If the current 𝑄 over-estimates an action 𝑎 so that it is selected, the older network would still give it a proper value.
What if the older network over-estimates an action? That action will not be selected by the current 𝑄.
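A sketch of the Double DQN target (same hypothetical names as before): the online network selects the action, the older/target network evaluates it.

import torch

def double_dqn_target(q_net, target_q_net, r, s_next, done, gamma=0.99):
    """Target = r + γ Q_target(s', argmax_a Q_online(s', a))."""
    with torch.no_grad():
        best_actions = q_net(s_next).argmax(dim=1, keepdim=True)         # select with the online net
        next_q = target_q_net(s_next).gather(1, best_actions).squeeze(1)  # evaluate with the target net
    return r + gamma * (1 - done) * next_q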
Dueling DQN
◉ Dueling Network: split Q-network into two channels
○ Action-independent value function
▪ Value function estimates how good the state is
○ Action-dependent advantage function
▪ Advantage function estimates the additional benefit of taking a particular action in that state
Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015.
Dueling DQN
(Architecture diagram: a standard Q-network maps the state directly to Q(s, a) for each action; the dueling network maps the state to a scalar V(s) and a per-action A(s, a), which are combined into Q(s, a).)
Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015.
Dueling DQN
◉ Worked example (columns are states, rows are actions), 𝑄(𝑠, 𝑎) = 𝑉(𝑠) + 𝐴(𝑠, 𝑎):

𝑉(𝑠):       2   0   4   1

𝐴(𝑠, 𝑎):    1   3  -1   0
           -1  -1   2   0
            0  -2  -1   0
(the sum of each column of 𝐴 is constrained to 0)

𝑄(𝑠, 𝑎):    3   3   3   1
            1  -1   6   1
            2  -2   3   1
Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015.
Dueling DQN
◉ Normalize 𝐴(𝑠, 𝑎) before adding it to 𝑉(𝑠), e.g., by subtracting the mean over actions: raw advantages (7, 3, 2) have mean 4, so they become (3, −1, −2)
Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015.
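A dueling head could be sketched as below (illustrative PyTorch; the mean-subtraction implements the normalization mentioned above):

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Splits a shared feature vector into V(s) and A(s, a), then combines them."""
    def __init__(self, feature_dim, num_actions, hidden=128):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, num_actions))

    def forward(self, features):
        v = self.value(features)                        # V(s), shape (B, 1)
        a = self.advantage(features)                    # A(s, a), shape (B, num_actions)
        return v + a - a.mean(dim=1, keepdim=True)      # Q(s, a) with zero-mean advantages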
Dueling DQN - Visualization
Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015.
Dueling DQN - Visualization
Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015.
Prioritized Replay
◉ Prioritized Replay: weight experience based on surprise
○ Store experience in priority queue according to the error
Schaul et al., “Prioritized Experience Replay”, arXiv preprint, 2015.
The data with larger TD error in previous training has higher probability to be sampled.
Parameter update procedure is also modified.
(TD error: 𝑟𝑡 + 𝛾 maxₐ 𝑄(𝑠𝑡+1, 𝑎) − 𝑄(𝑠𝑡, 𝑎𝑡))
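A simplified proportional-prioritization sketch (illustrative names; real implementations use a sum-tree and importance-sampling weights, as in Schaul et al.):

import numpy as np

class PrioritizedBuffer:
    """Samples transitions with probability proportional to their last TD error."""
    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-3):
        self.data, self.priorities = [], []
        self.capacity, self.alpha, self.eps = capacity, alpha, eps

    def push(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:          # drop the oldest if full
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        p = np.array(self.priorities); p = p / p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return idx, [self.data[i] for i in idx]      # indices allow priority updates later

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):             # refresh priorities after each update
            self.priorities[i] = (abs(e) + self.eps) ** self.alpha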
Multi-Step
◉ Idea: balance between MC and TD by storing 𝑁-step transitions and using an 𝑁-step cumulated reward as the learning target
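For example, an N-step target can be built from a short window of stored transitions (sketch with hypothetical names):

def n_step_target(rewards, q_bootstrap, gamma=0.99):
    """N-step return: r_t + γ r_{t+1} + ... + γ^{N-1} r_{t+N-1} + γ^N * bootstrap."""
    target = q_bootstrap                    # e.g. max_a Q_target(s_{t+N}, a)
    for r in reversed(rewards):             # rewards = [r_t, ..., r_{t+N-1}]
        target = r + gamma * target
    return target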
Distributional Q-function
◉ State-action value function
○ When using actor 𝜋, 𝑄𝜋(𝑠, 𝑎) is the expected cumulated reward obtained after seeing observation 𝑠 and taking action 𝑎
○ It is only an expectation: different reward distributions (e.g., spread over the range −10 to 10) can have the same expected value
Distributional Q-function
◉ Instead of a network with 3 outputs (one Q-value per action), use a network with 15 outputs: each action has 5 bins that model the distribution of its return
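A categorical (distributional) output head could be sketched like this (illustrative PyTorch; 3 actions × 5 bins as on the slide, with a softmax over each action's bins):

import torch
import torch.nn as nn

class DistributionalHead(nn.Module):
    """Outputs a probability distribution over return bins for every action."""
    def __init__(self, feature_dim, num_actions=3, num_bins=5):
        super().__init__()
        self.num_actions, self.num_bins = num_actions, num_bins
        self.out = nn.Linear(feature_dim, num_actions * num_bins)   # e.g. 15 outputs

    def forward(self, features):
        logits = self.out(features).view(-1, self.num_actions, self.num_bins)
        return torch.softmax(logits, dim=-1)   # each action's bins sum to 1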
Demo
https://youtu.be/yFBwyPuO2Vg
Rainbow
Hessel et al., “Rainbow: Combining Improvements in Deep Reinforcement Learning”, arXiv preprint, 2017.
Rainbow
Hessel et al., “Rainbow: Combining Improvements in Deep Reinforcement Learning”, arXiv preprint, 2017.
Concluding Remarks
◉ DQN training tips
○ Double DQN
○ Dueling DQN
○ Prioritized replay
○ Multi-step
○ Distributional DQN