Reinforcement Learning Approach
Value-based RL
◦Estimate the optimal value function
Policy-based RL
◦Search directly for optimal policy
Model-based RL
◦Build a model of the environment
◦Plan (e.g. by lookahead) using model
The optimal policy 𝜋∗ is the policy achieving maximum future reward
The optimal value 𝑄∗(𝑠, 𝑎) is the maximum value achievable under any policy
RL Agent Taxonomy
(Taxonomy diagram: value-based methods learn a critic, policy-based methods learn an actor, actor-critic methods learn both; each can be model-free or model-based)
Value-Based Approach
LEARNING A CRITIC
Value Function
A value function is a prediction of future reward (taking action 𝑎 in state 𝑠)
Q-value function gives expected total reward
◦from state 𝑠 and action 𝑎
◦under policy 𝜋
◦with discount factor 𝛾
Value functions decompose into a Bellman equation
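A minimal LaTeX sketch of the two statements above in standard notation; the slide's own formulas are not reproduced here, so the exact symbols (e.g. 𝑟 for reward, 𝛾 for the discount factor) are assumed:

```latex
% Q-value function: expected total discounted reward from state s and action a under policy \pi
Q^{\pi}(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \,\middle|\, s_t = s,\, a_t = a,\, \pi \right]

% Bellman expectation equation: immediate reward plus discounted value of the successor pair
Q^{\pi}(s, a) = \mathbb{E}_{s',\,a'}\left[ r + \gamma\, Q^{\pi}(s', a') \,\middle|\, s, a \right]
```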
Optimal Value Function
An optimal value function is the maximum achievable value
The optimal value function allows us to act optimally
The optimal value informally maximizes over all future decisions
Optimal values decompose into a Bellman equation
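A sketch of the corresponding optimality statements, again in standard (assumed) notation:

```latex
% Maximum achievable value over all policies
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)

% Acting optimally once Q^{*} is known
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)

% Bellman optimality equation
Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]
```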
Value Function Approximation
So far, value functions have been represented by a lookup table, which does not scale:
◦too many states and/or actions to store
◦too slow to learn the value of each entry individually
Values can be estimated with function approximation
Q-Networks
Q-networks represent the value function by a neural network with weights 𝑤, 𝑄(𝑠, 𝑎; 𝑤) ≈ 𝑄∗(𝑠, 𝑎)
◦generalize from seen states to unseen states
◦update the parameters 𝑤 of the function approximator
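A minimal sketch of such a Q-network, assuming PyTorch; the layer sizes, state dimension, and action count are illustrative, not from the slides:

```python
# Maps a state vector to one Q-value per discrete action: Q(s, a; w) for all a.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# The weights w generalize: unseen states near seen ones get similar Q-values.
q_net = QNetwork(state_dim=4, num_actions=2)
print(q_net(torch.zeros(1, 4)))  # Q-values for both actions in a dummy state
```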
Q-Learning
Goal: estimate optimal Q-values
◦Optimal Q-values obey a Bellman equation: 𝑄∗(𝑠, 𝑎) = 𝔼[𝑟 + 𝛾 max𝑎′ 𝑄∗(𝑠′, 𝑎′)]
◦The right-hand side 𝑟 + 𝛾 max𝑎′ 𝑄∗(𝑠′, 𝑎′) is used as the learning target
◦Value iteration algorithms solve the Bellman equation
Critic = Value Function
Idea: a critic evaluates how good the actor 𝜋 is
State value function 𝑉𝜋(𝑠): when using actor 𝜋, the expected cumulated reward obtained after seeing observation (state) 𝑠 (a scalar output: larger for more promising states, smaller for less promising ones)
A critic does not directly determine the action; an actor can be found from a critic
Monte-Carlo for Estimating 𝑉𝜋(𝑠)
Monte-Carlo (MC)
◦The critic watches 𝜋 playing the game
◦MC learns directly from complete episodes: no bootstrapping
◦After seeing 𝑠𝑎, the cumulated reward until the end of the episode is 𝐺𝑎, so the target for 𝑉𝜋(𝑠𝑎) is 𝐺𝑎
◦After seeing 𝑠𝑏, the cumulated reward until the end of the episode is 𝐺𝑏, so the target for 𝑉𝜋(𝑠𝑏) is 𝐺𝑏
Idea: value = empirical mean return
Issue: long episodes delay learning
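A tabular sketch of the Monte-Carlo idea; the slides use a network critic, so the dictionary of empirical mean returns below is only a stand-in for 𝑉𝜋, and the episode format is assumed:

```python
# Monte-Carlo value estimation: value = empirical mean of complete-episode returns.
from collections import defaultdict

def mc_evaluate(episodes, gamma=0.99):
    """episodes: list of [(state, reward), ...] trajectories collected by watching pi."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the discounted reward until the end of the episode.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_cnt[state] += 1
    # Only available after complete episodes: long episodes delay learning.
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}

print(mc_evaluate([[("s_a", 0.0), ("s_b", 1.0)], [("s_a", 1.0)]]))
```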
Temporal-Difference for Estimating 𝑉𝜋(𝑠)
Temporal-difference (TD)
◦The critic watches 𝜋 playing the game
◦TD learns directly from incomplete episodes by bootstrapping
◦TD updates a guess towards a guess
Idea: update the value 𝑉𝜋(𝑠𝑡) toward the estimated return 𝑟𝑡 + 𝛾𝑉𝜋(𝑠𝑡+1)
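A tabular sketch of a single TD(0) update under the same stand-in assumptions; the learning rate 𝛼 and discount 𝛾 are illustrative:

```python
# TD(0): update V(s_t) toward r_t + gamma * V(s_{t+1}) without waiting for the episode to end.
from collections import defaultdict

def td_update(V, s_t, r_t, s_next, gamma=0.99, alpha=0.1):
    td_target = r_t + gamma * V[s_next]   # a guess built from the next state's guess
    td_error = td_target - V[s_t]         # temporal-difference error
    V[s_t] += alpha * td_error            # bootstrapped update
    return V

V = defaultdict(float)
V = td_update(V, "s_t", r_t=1.0, s_next="s_t1")
print(V["s_t"])  # 0.1 after one update
```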
MC vs. TD
Monte-Carlo (MC)
◦Large variance
◦Unbiased
◦Does not exploit the Markov property
Temporal-Difference (TD)
◦Small variance
◦Biased
◦Exploits the Markov property
Critic = Value Function
State-action value function 𝑄𝜋(𝑠, 𝑎): when using actor 𝜋, the expected total reward after seeing observation (state) 𝑠 and taking action 𝑎
◦The network can take (𝑠, 𝑎) as input and output a scalar, or take 𝑠 as input and output 𝑄𝜋(𝑠, 𝑎) for every action (for discrete actions only)
Q-Learning
Given 𝑄𝜋(𝑠, 𝑎), find a new actor 𝜋′ “better” than 𝜋
The learning loop:
◦𝜋 interacts with the environment
◦Learn 𝑄𝜋(𝑠, 𝑎) by TD or MC
◦Find a new actor 𝜋′ “better” than 𝜋
◦Set 𝜋 = 𝜋′ and repeat
“Better” means 𝑉𝜋′(𝑠) ≥ 𝑉𝜋(𝑠) for all states 𝑠; the new actor is 𝜋′(𝑠) = arg max𝑎 𝑄𝜋(𝑠, 𝑎)
◦𝜋′ has no extra parameters (it depends only on the value function) and is not suitable for continuous actions
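A minimal sketch of deriving 𝜋′ from 𝑄𝜋 by taking the argmax, assuming PyTorch and a discrete action space; the tiny linear q_net is a hypothetical stand-in for the learned critic:

```python
import torch
import torch.nn as nn

# Hypothetical critic: 4-dim state -> Q-values for 3 discrete actions.
q_net = nn.Linear(4, 3)

def greedy_actor(state: torch.Tensor) -> int:
    """pi'(s) = argmax_a Q^pi(s, a): the actor has no parameters of its own."""
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))   # shape: (1, num_actions)
    return int(q_values.argmax(dim=1).item())

print(greedy_actor(torch.zeros(4)))  # enumerating actions is practical for discrete actions only
```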
Q-Learning
Goal: estimate optimal Q-values
◦Optimal Q-values obey a Bellman equation: 𝑄∗(𝑠, 𝑎) = 𝔼[𝑟 + 𝛾 max𝑎′ 𝑄∗(𝑠′, 𝑎′)]
◦The right-hand side 𝑟 + 𝛾 max𝑎′ 𝑄∗(𝑠′, 𝑎′) is used as the learning target
◦Value iteration algorithms solve the Bellman equation
Deep Q-Networks (DQN)
Estimate value function by TD
Represent the value function by a deep Q-network with weights 𝑤
Objective: minimize the MSE loss by SGD
Deep Q-Networks (DQN)
Objective is to minimize MSE loss by SGD
Leading to the following Q-learning gradient
Issue: naïve Q-learning oscillates or diverges with neural networks due to:
1) correlations between samples
2) non-stationary targets
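A sketch of this objective and its gradient, assuming PyTorch and an illustrative network: autograd supplies the Q-learning gradient of the MSE loss, and the target is detached (treated as fixed) inside the loss. The names and hyperparameters are our own.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

def q_learning_loss(s, a, r, s_next, done):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; w)
    with torch.no_grad():                                        # target treated as fixed
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)                  # MSE loss

# One SGD step on a dummy batch; backprop gives the Q-learning gradient.
s = torch.randn(8, 4); s_next = torch.randn(8, 4)
a = torch.randint(0, 2, (8,)); r = torch.randn(8); done = torch.zeros(8)
loss = q_learning_loss(s, a, r, s_next, done)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```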
Stability Issues with Deep RL
Naive Q-learning oscillates or diverges with neural nets
1. Data is sequential
◦Successive samples are correlated, non-iid (independent and identically distributed)
2. Policy changes rapidly with slight changes to Q-values
◦Policy may oscillate
◦Distribution of data can swing from one extreme to another
3. Scale of rewards and Q-values is unknown
◦Naive Q-learning gradients can be unstable when backpropagated
Stable Solutions for DQN
DQN provides a stable solution to deep value-based RL
1. Use experience replay
◦Break correlations in data, bring us back to iid setting
◦Learn from all past policies
2. Freeze target Q-network
◦Avoid oscillation
◦Break correlations between Q-network and target
3. Clip rewards or normalize network adaptively to sensible range
◦Robust gradients
Stable Solution 1: Experience Replay
To remove correlations, build a dataset from agent’s experience
◦Take action 𝑎𝑡 according to an 𝜖-greedy policy (small probability of a random action for exploration)
◦Store transition (𝑠𝑡, 𝑎𝑡, 𝑟𝑡, 𝑠𝑡+1) in replay memory 𝒟
◦Sample a random mini-batch of transitions from 𝒟
◦Optimize MSE between Q-network and Q-learning targets
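A minimal replay-memory sketch; the class name, capacity, and batch size are our own choices:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s', done) and serves random mini-batches."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is dropped when full

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int = 32):
        # Random sampling breaks the correlation between successive transitions.
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```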
Exploration
The policy is derived from the Q-function, e.g. 𝑎 = arg max𝑎 𝑄(𝑠, 𝑎)
◦Always taking the greedy action is not good for data collection: an action that happens to look best early (e.g. 𝑎2 in state 𝑠) is always sampled, while the other actions (𝑎1, 𝑎3) are never explored → inefficient learning
Exploration algorithms (see the sketch below)
◦Epsilon greedy: with small probability 𝜀 take a random action; 𝜀 would decay during learning
◦Boltzmann sampling: sample actions with probability proportional to exp(𝑄(𝑠, 𝑎))
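A sketch of both exploration schemes; the function names, decay schedule, and temperature are illustrative assumptions:

```python
import math
import random

def epsilon_greedy(q_values, epsilon: float) -> int:
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit: argmax_a Q(s, a)

def boltzmann(q_values, temperature: float = 1.0) -> int:
    weights = [math.exp(q / temperature) for q in q_values]           # softmax-style sampling
    return random.choices(range(len(q_values)), weights=weights, k=1)[0]

epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
for _ in range(3):
    a = epsilon_greedy([0.1, 0.9, 0.3], epsilon)
    epsilon = max(eps_min, epsilon * eps_decay)                       # epsilon decays during learning
```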
Replay Buffer
◦𝜋 interacts with the environment and puts each experience (𝑠𝑡, 𝑎𝑡, 𝑟𝑡, 𝑠𝑡+1) into the buffer
◦The experiences in the buffer come from different (older) actors 𝜋; the oldest experience is dropped when the buffer is full
◦In each iteration: 1. sample a batch from the buffer, 2. update the Q-function 𝑄𝜋(𝑠, 𝑎)
◦Find a new actor 𝜋′ “better” than 𝜋, set 𝜋 = 𝜋′, and repeat
◦Because the buffer contains data collected by other actors, this makes the method off-policy
Stable Solution 2: Fixed Target Q-Network
To avoid oscillations, fix parameters used in Q-learning target
◦Compute Q-learning targets w.r.t. old, fixed parameters
◦Optimize MSE between Q-network and Q-learning targets
◦Periodically update fixed parameters
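A sketch of the frozen target network, assuming PyTorch; the sync period 𝐶 is an illustrative choice:

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)          # w^- starts as a frozen copy of w
target_net.requires_grad_(False)
gamma, C = 0.99, 1000                      # sync period C is an illustrative choice

def td_target(r, s_next, done):
    # Targets are computed with the old, fixed parameters w^-.
    with torch.no_grad():
        return r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

def maybe_sync(step: int):
    if step % C == 0:                      # periodically: w^- <- w
        target_net.load_state_dict(q_net.state_dict())
```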
Stable Solution 3: Reward / Value Range
To avoid oscillations, control the reward / value range
◦DQN clips the rewards to [−1, +1]
▪Prevents too large Q-values
▪Ensures gradients are well-conditioned
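A one-line sketch of the clipping step; the helper name is ours:

```python
# DQN-style reward clipping to [-1, +1] keeps Q-values and gradients in a sensible range.
def clip_reward(r: float) -> float:
    return max(-1.0, min(1.0, r))

print(clip_reward(20.0), clip_reward(-3.5), clip_reward(0.5))  # 1.0 -1.0 0.5
```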
Typical Q-Learning Algorithm
Initialize Q-function 𝑄 and target Q-function 𝑄̂ = 𝑄
In each episode:
◦For each time step 𝑡
◦Given state 𝑠𝑡, take action 𝑎𝑡 based on 𝑄 (epsilon greedy)
◦Obtain reward 𝑟𝑡, and reach new state 𝑠𝑡+1
◦Store the transition (𝑠𝑡, 𝑎𝑡, 𝑟𝑡, 𝑠𝑡+1) into buffer
◦Sample transitions (𝑠𝑖, 𝑎𝑖, 𝑟𝑖, 𝑠𝑖+1) from buffer (usually a batch)
◦Update the parameters of 𝑄 to make 𝑄(𝑠𝑖, 𝑎𝑖) close to 𝑟𝑖 + 𝛾 max𝑎 𝑄̂(𝑠𝑖+1, 𝑎) (regression)
◦Every 𝐶 steps reset 𝑄̂ = 𝑄
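Putting the pieces together, a compact sketch of this loop, assuming PyTorch and a Gymnasium-style env.reset()/env.step() interface; all hyperparameters are illustrative, not from the slides:

```python
import copy, random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

def train(env, state_dim, num_actions, episodes=200, gamma=0.99, lr=1e-3,
          batch_size=32, buffer_size=10_000, eps=1.0, eps_min=0.05, eps_decay=0.995, C=500):
    q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
    q_hat = copy.deepcopy(q_net)                              # target Q-function: Q_hat = Q
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    buffer, step = deque(maxlen=buffer_size), 0

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Take a_t based on Q (epsilon-greedy), obtain r_t, reach s_{t+1}.
            if random.random() < eps:
                a = random.randrange(num_actions)
            else:
                a = int(q_net(torch.as_tensor(s, dtype=torch.float32)).argmax().item())
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            buffer.append((s, a, r, s_next, float(done)))     # store transition
            s, step, eps = s_next, step + 1, max(eps_min, eps * eps_decay)

            if len(buffer) >= batch_size:
                # Regress Q(s_i, a_i) toward r_i + gamma * max_a Q_hat(s_{i+1}, a).
                batch = random.sample(buffer, batch_size)
                bs, ba, br, bs2, bd = (torch.as_tensor(np.array(x), dtype=torch.float32)
                                       for x in zip(*batch))
                q_sa = q_net(bs).gather(1, ba.long().unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    target = br + gamma * (1 - bd) * q_hat(bs2).max(dim=1).values
                loss = nn.functional.mse_loss(q_sa, target)
                opt.zero_grad(); loss.backward(); opt.step()

            if step % C == 0:                                  # every C steps: Q_hat <- Q
                q_hat.load_state_dict(q_net.state_dict())
    return q_net
```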
Deep RL in Atari Games
DQN in Atari
Goal: end-to-end learning of values Q(s, a) from pixels
◦Input: state is stack of raw pixels from last 4 frames
◦Output: Q(s, a) for all joystick/button positions a
◦Reward is the score change for that step
DQN Nature Paper [link] [code]
Concluding Remarks
RL is a general-purpose framework for decision making through interaction between an agent and an environment
Value-based RL measures how good each state and/or action is via a value function
◦Monte-Carlo (MC) vs. Temporal-Difference (TD)