Review
Reinforcement Learning
RL is a general-purpose framework for decision making
◦RL is for an agent with the capacity to act
◦Each action influences the agent’s future state
◦Success is measured by a scalar reward signal
Big three: action, state, reward
Agent and Environment
(Figure: the agent-environment loop. At each step 𝑡, the agent receives observation 𝑜_𝑡 and reward 𝑟_𝑡 from the environment and takes action 𝑎_𝑡, e.g. MoveLeft or MoveRight, which changes the environment.)
Major Components in an RL Agent
An RL agent may include one or more of these components
◦Value function: how good is each state and/or action
◦Policy: agent’s behavior function
◦Model: agent’s representation of the environment
Reinforcement Learning Approach
Value-based RL
◦Estimate the optimal value function
Policy-based RL
◦Search directly for the optimal policy
Model-based RL
◦Build a model of the environment
◦Plan (e.g. by lookahead) using the model
◦The optimal policy 𝜋* is the policy achieving maximum future reward
◦The optimal value 𝑄*(𝑠, 𝑎) = max_𝜋 𝑄^𝜋(𝑠, 𝑎) is the maximum value achievable under any policy
RL Agent Taxonomy
Model-Free
◦Value-based: learning a critic
◦Policy-based: learning an actor
◦Actor-Critic: learning both an actor and a critic
Model-Based
◦The agent additionally builds a model of the environment
Deep Reinforcement Learning
Idea: deep learning for reinforcement learning
◦Use deep neural networks to represent
• Value function
• Policy
• Model
◦Optimize loss function by SGD
Value-Based Approach
LEARNING A CRITIC
Critic = Value Function
Idea: the critic evaluates how good the actor 𝜋 is
State value function 𝑉^𝜋(𝑠): when using actor 𝜋, the expected total reward (a scalar) after seeing observation (state) 𝑠; a larger 𝑉^𝜋(𝑠) means 𝑠 is a better state under 𝜋, a smaller one a worse state
Monte-Carlo for Estimating 𝑉^𝜋(𝑠)
Monte-Carlo (MC)
◦The critic watches 𝜋 playing the game
◦MC learns directly from complete episodes: no bootstrapping
◦After seeing 𝑠_𝑎, the cumulated reward until the end of the episode is 𝐺_𝑎
◦After seeing 𝑠_𝑏, the cumulated reward until the end of the episode is 𝐺_𝑏
Idea: value = empirical mean return
Temporal-Difference for Estimating 𝑉^𝜋(𝑠)
Temporal-difference (TD)
◦The critic watches 𝜋 playing the game
◦TD learns directly from incomplete episodes by bootstrapping
◦TD updates a guess towards a guess: 𝑉^𝜋(𝑠_𝑡) ← 𝑉^𝜋(𝑠_𝑡) + 𝛼(𝑟_𝑡 + 𝑉^𝜋(𝑠_{𝑡+1}) − 𝑉^𝜋(𝑠_𝑡))
Idea: update the value toward the estimated return 𝑟_𝑡 + 𝑉^𝜋(𝑠_{𝑡+1})
MC vs. TD
Monte-Carlo (MC)
◦Large variance
◦Unbiased
◦Does not exploit the Markov property
Temporal-Difference (TD)
◦Small variance
◦Biased
◦Exploits the Markov property
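To make the MC/TD contrast concrete, here is a minimal sketch (not from the slides: the 3-state chain, step size, and episode count are illustrative assumptions) estimating 𝑉^𝜋 both ways:

```python
# Monte-Carlo vs. TD(0) estimates of V^pi on a toy deterministic chain
# 0 -> 1 -> 2 (terminal), with reward +1 on the final transition.
GAMMA, ALPHA, TERMINAL = 1.0, 0.1, 2

def run_episode():
    # Fixed actor pi: always move right; reward only on entering terminal.
    return [(0, 0.0), (1, 1.0)]          # list of (state, reward) pairs

# Monte-Carlo: update V(s) toward the full return G observed after s.
V_mc = {0: 0.0, 1: 0.0}
for _ in range(1000):
    G = 0.0
    for s, r in reversed(run_episode()):
        G = r + GAMMA * G                # cumulated reward until episode end
        V_mc[s] += ALPHA * (G - V_mc[s])

# TD(0): update V(s) toward the bootstrapped target r + gamma * V(s').
V_td = {0: 0.0, 1: 0.0, TERMINAL: 0.0}
for _ in range(1000):
    for (s, r), s_next in zip(run_episode(), (1, TERMINAL)):
        V_td[s] += ALPHA * (r + GAMMA * V_td[s_next] - V_td[s])

print(V_mc, V_td)                        # both approach V(0) = V(1) = 1
```

On this noiseless chain both estimators converge to the same values; with stochastic rewards, the MC targets 𝐺 would fluctuate more (large variance), while early TD targets are biased by the initial value estimates.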
smaller may be
…
MC v.s. TD
Critic = Value Function
State-action value function 𝑄^𝜋(𝑠, 𝑎): when using actor 𝜋, the expected total reward (a scalar) after seeing observation (state) 𝑠 and taking action 𝑎
◦A network that outputs 𝑄^𝜋(𝑠, 𝑎) for every action at once works for discrete actions only
Q-Learning
Given 𝑄^𝜋(𝑠, 𝑎), find a new actor 𝜋′ “better” than 𝜋
◦𝜋 interacts with the environment
◦Learn 𝑄^𝜋(𝑠, 𝑎) by TD or MC
◦Find a new actor 𝜋′ “better” than 𝜋, set 𝜋 = 𝜋′, and repeat
Here 𝜋′(𝑠) = argmax_𝑎 𝑄^𝜋(𝑠, 𝑎): 𝜋′ has no extra parameters (it depends only on the value function), so this is not suitable for continuous actions
Q-Learning
Goal: estimate the optimal Q-values
◦Optimal Q-values obey a Bellman equation: 𝑄*(𝑠, 𝑎) = 𝔼[𝑟 + 𝛾 max_{𝑎′} 𝑄*(𝑠′, 𝑎′)]
◦Value iteration algorithms solve the Bellman equation, using the right-hand side as the learning target
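As a sketch of how value iteration solves the Bellman equation, consider a hypothetical 2-state MDP (the states, actions, rewards, and 𝛾 are all invented for illustration); each sweep applies the backup 𝑄(𝑠, 𝑎) ← 𝑟 + 𝛾 max_{𝑎′} 𝑄(𝑠′, 𝑎′):

```python
# Value iteration on a toy deterministic MDP.
GAMMA = 0.9
ACTIONS = ("stay", "go")

# Transition table: (state, action) -> (reward, next_state).
# State 1 is absorbing with zero reward; "go" from state 0 pays +1.
MDP = {
    (0, "stay"): (0.0, 0),
    (0, "go"):   (1.0, 1),
    (1, "stay"): (0.0, 1),
    (1, "go"):   (0.0, 1),
}

Q = {sa: 0.0 for sa in MDP}
for _ in range(200):  # sweep the Bellman optimality backup until converged
    Q = {(s, a): r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
         for (s, a), (r, s2) in MDP.items()}

print(Q)  # Q*(0, "go") = 1.0, Q*(0, "stay") = 0.9, Q*(1, .) = 0.0
```

The fixed point satisfies the Bellman equation exactly: "go" from state 0 earns the +1 reward, and "stay" is worth 𝛾 times the best value of staying put.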
Deep Q-Networks (DQN)
Estimate the value function by TD
Represent the value function by a deep Q-network 𝑄(𝑠, 𝑎; 𝑤) with weights 𝑤
The objective is to minimize the MSE loss 𝐿(𝑤) = 𝔼[(𝑟 + 𝛾 max_{𝑎′} 𝑄(𝑠′, 𝑎′; 𝑤) − 𝑄(𝑠, 𝑎; 𝑤))²] by SGD
Leading to the following Q-learning gradient (treating the target as fixed, up to a constant factor): ∂𝐿(𝑤)/∂𝑤 = −𝔼[(𝑟 + 𝛾 max_{𝑎′} 𝑄(𝑠′, 𝑎′; 𝑤) − 𝑄(𝑠, 𝑎; 𝑤)) ∂𝑄(𝑠, 𝑎; 𝑤)/∂𝑤]
Issue: naïve Q-learning oscillates or diverges with neural networks due to: 1) correlations between samples, 2) non-stationary targets
Stability Issues with Deep RL
Naïve Q-learning oscillates or diverges with neural nets
1. Data is sequential
◦Successive samples are correlated, non-iid (independent and identically distributed)
2. Policy changes rapidly with slight changes to Q-values
◦Policy may oscillate
◦Distribution of data can swing from one extreme to another
3. Scale of rewards and Q-values is unknown
◦Naïve Q-learning gradients can be unstable when backpropagated
Stable Solutions for DQN
DQN provides a stable solution to deep value-based RL
1. Use experience replay
◦Break correlations in data, bringing us back to the iid setting
◦Learn from all past policies
2. Freeze the target Q-network
◦Avoid oscillations
◦Break correlations between the Q-network and its target
3. Clip rewards or normalize the network adaptively to a sensible range
◦Robust gradients
Stable Solution 1: Experience Replay
To remove correlations, build a dataset from the agent's own experience
◦Take action 𝑎_𝑡 according to an 𝜖-greedy policy (a small probability 𝜖 of a random action, for exploration)
◦Store the transition (𝑠_𝑡, 𝑎_𝑡, 𝑟_𝑡, 𝑠_{𝑡+1}) in replay memory D
◦Sample a random minibatch of transitions from D
◦Optimize the MSE between the Q-network and the Q-learning targets
Stable Solution 2: Fixed Target Q-Network
To avoid oscillations, fix the parameters used in the Q-learning target
◦Compute the Q-learning targets w.r.t. old, frozen parameters
◦Optimize the MSE between the Q-network and the Q-learning targets
◦Periodically refresh the frozen parameters from the trained network
Stable Solution 3: Reward / Value Range
To avoid oscillations, control the reward / value range
◦DQN clips the rewards to [−1, +1]
◦This prevents Q-values from becoming too large
◦Ensures gradients are well-conditioned
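The three stabilizers can be sketched in one training loop. This is an illustrative toy (my assumptions: a tabular Q stands in for the deep network, and a 4-state chain with a +10 goal reward stands in for Atari); the structure of ε-greedy acting, replay memory, frozen target, and reward clipping mirrors the slides:

```python
import random
from collections import deque

random.seed(0)
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.3
N_STATES, ACTIONS = 4, (0, 1)             # chain 0-1-2-3; action 0: left, 1: right

def step(s, a):
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    raw = 10.0 if s2 == N_STATES - 1 else 0.0    # large raw reward at the goal
    return max(-1.0, min(1.0, raw)), s2          # Solution 3: clip to [-1, +1]

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
target_Q = dict(Q)                         # Solution 2: frozen target network
replay = deque(maxlen=1000)                # Solution 1: replay memory D

s = 0
for t in range(5000):
    if t % 10 == 0:
        s = random.randrange(N_STATES)     # occasional restarts keep data diverse
    # epsilon-greedy: small probability of a random exploratory action
    if random.random() < EPS:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda b: Q[(s, b)])
    r, s2 = step(s, a)
    replay.append((s, a, r, s2))           # store the transition in D
    # sample a random minibatch and regress Q toward the frozen target
    for ss, aa, rr, ss2 in random.sample(replay, min(8, len(replay))):
        y = rr + GAMMA * max(target_Q[(ss2, b)] for b in ACTIONS)
        Q[(ss, aa)] += ALPHA * (y - Q[(ss, aa)])
    if t % 100 == 0:
        target_Q = dict(Q)                 # periodically refresh the frozen copy
    s = s2

print(Q[(2, 1)], Q[(2, 0)])  # moving right near the goal should score higher
```

Removing the frozen copy (always bootstrapping from Q itself) or training only on the latest transition reintroduces exactly the moving targets and sample correlations described above.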
Deep RL in Atari Games
DQN in Atari
Goal: endtoend learning of values Q(s, a) from pixels
◦Input: state is stack of raw pixels from last 4 frames
◦Output: Q(s, a) for all joystick/button positions a
◦Reward is the score change for that step
Other Improvements: Double DQN
Nature DQN issue: the max operator tends to select actions whose values are overestimated
Double DQN: remove the upward bias caused by the max
◦The current Q-network is used to select actions
◦The older (target) Q-network is used to evaluate actions
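A small numeric sketch (the Q-values are hypothetical) of how the two targets differ for a single transition with reward 𝑟 = 1:

```python
# Nature DQN vs. Double DQN targets for one transition (r, s').
GAMMA, r = 0.9, 1.0
ACTIONS = ("left", "right")

# Hypothetical Q-values at the next state s' from the two networks.
q_current = {"left": 2.0, "right": 1.0}   # current net prefers "left"
q_target  = {"left": 0.5, "right": 3.0}   # target net overestimates "right"

# Nature DQN: the target network both selects and evaluates the action.
nature_target = r + GAMMA * max(q_target.values())    # 1 + 0.9 * 3.0

# Double DQN: the current net selects, the target net evaluates.
best = max(ACTIONS, key=lambda a: q_current[a])       # "left"
double_target = r + GAMMA * q_target[best]            # 1 + 0.9 * 0.5

print(nature_target, double_target)
```

When the target network overestimates an action the current network would not pick, the Double DQN target stays low, removing the upward bias of the plain max.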
Other Improvements: Prioritized Replay
Prioritized Replay: weight experience based on surprise
◦Store experience in priority queue according to DQN error
Other Improvements: Dueling Network
Dueling Network: split the Q-network into two channels
◦An action-independent value function: estimates how good the state is
◦An action-dependent advantage function: estimates the additional benefit of taking each action in that state
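The two channels are recombined into Q-values. A common aggregation (the mean subtraction is the identifiability trick used by the dueling architecture, not shown on the slide; the numbers are hypothetical) is 𝑄(𝑠, 𝑎) = 𝑉(𝑠) + 𝐴(𝑠, 𝑎) − mean_{𝑎′} 𝐴(𝑠, 𝑎′):

```python
# Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').
def dueling_q(v, advantages):
    mean_adv = sum(advantages.values()) / len(advantages)
    return {a: v + adv - mean_adv for a, adv in advantages.items()}

# Hypothetical outputs of the two channels for one state:
q = dueling_q(v=2.0, advantages={"left": 1.0, "right": -1.0, "fire": 0.0})
print(q)  # {'left': 3.0, 'right': 1.0, 'fire': 2.0}
```

Subtracting the mean advantage pins down the split between the two channels, since otherwise any constant could be moved between V and A without changing Q.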
Policy-Based Approach
LEARNING AN ACTOR
On-Policy vs. Off-Policy
On-policy: the agent being learned and the agent interacting with the environment are the same
Off-policy: the agent being learned and the agent interacting with the environment are different
Goodness of Actor
An episode is considered as a trajectory 𝜏
◦𝜏 = {𝑠_1, 𝑎_1, 𝑟_1, 𝑠_2, 𝑎_2, 𝑟_2, ⋯ , 𝑠_𝑇, 𝑎_𝑇, 𝑟_𝑇}
◦Reward: 𝑅(𝜏) = Σ_{𝑡=1}^{𝑇} 𝑟_𝑡
◦The actions are controlled by your actor; the states and rewards come from the environment and are not related to your actor
(Figure: the actor is a network that takes the observed state and outputs a probability for each action, e.g. left: 0.1, right: 0.2, fire: 0.7.)
Goodness of Actor
We define the goodness 𝑅̄_𝜃 as the expected value of the reward 𝑅(𝜏) when the actor 𝜋_𝜃 plays the game
◦If you use an actor to play the game, each 𝜏 has probability 𝑃(𝜏|𝜃) of being sampled
◦𝑅̄_𝜃 = Σ_𝜏 𝑅(𝜏) 𝑃(𝜏|𝜃), summing over all possible trajectories
•Use 𝜋_𝜃 to play the game N times, obtaining 𝜏^1, 𝜏^2, ⋯ , 𝜏^𝑁 (i.e., sampling 𝜏 from 𝑃(𝜏|𝜃) N times): 𝑅̄_𝜃 ≈ (1/𝑁) Σ_{𝑛=1}^{𝑁} 𝑅(𝜏^𝑛)
Deep Policy Networks
Represent the policy by a deep network with weights 𝜃
The objective is to maximize the total discounted reward by stochastic gradient ascent
Update the model parameters iteratively: 𝜃 ← 𝜃 + 𝜂∇𝑅̄_𝜃
Policy Gradient
Gradient ascent to maximize the expected reward: ∇𝑅̄_𝜃 = Σ_𝜏 𝑅(𝜏) ∇𝑃(𝜏|𝜃) = Σ_𝜏 𝑅(𝜏) 𝑃(𝜏|𝜃) ∇log 𝑃(𝜏|𝜃)
◦𝑅(𝜏) does not have to be differentiable; it can even be a black box
◦Use 𝜋_𝜃 to play the game N times, obtaining 𝜏^1, 𝜏^2, ⋯ , 𝜏^𝑁, so ∇𝑅̄_𝜃 ≈ (1/𝑁) Σ_{𝑛=1}^{𝑁} 𝑅(𝜏^𝑛) ∇log 𝑃(𝜏^𝑛|𝜃)
Policy Gradient
For an episode trajectory 𝜏: 𝑃(𝜏|𝜃) = 𝑝(𝑠_1) Π_𝑡 𝑝(𝑎_𝑡|𝑠_𝑡, 𝜃) 𝑝(𝑟_𝑡, 𝑠_{𝑡+1}|𝑠_𝑡, 𝑎_𝑡)
Ignoring the terms not related to 𝜃: ∇log 𝑃(𝜏|𝜃) = Σ_𝑡 ∇log 𝑝(𝑎_𝑡|𝑠_𝑡, 𝜃)
Policy Gradient
Gradient ascent for iteratively updating the parameters: ∇𝑅̄_𝜃 ≈ (1/𝑁) Σ_𝑛 Σ_𝑡 𝑅(𝜏^𝑛) ∇log 𝑝(𝑎_𝑡^𝑛|𝑠_𝑡^𝑛, 𝜃)
◦If in 𝜏^𝑛 the machine takes 𝑎_𝑡^𝑛 when seeing 𝑠_𝑡^𝑛: when 𝑅(𝜏^𝑛) is positive, tune 𝜃 to increase 𝑝(𝑎_𝑡^𝑛|𝑠_𝑡^𝑛); when it is negative, tune 𝜃 to decrease it
Policy Gradient
Given actor parameters 𝜃, alternate between data collection (play the game N times with 𝜋_𝜃) and model update (𝜃 ← 𝜃 + 𝜂∇𝑅̄_𝜃)
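The whole loop, collect N trajectories with 𝜋_𝜃 and then ascend (1/𝑁) Σ_𝑛 𝑅(𝜏^𝑛) ∇log 𝑝(𝑎|𝑠, 𝜃), can be sketched on a toy one-step game (everything here, the two-action softmax actor, the reward function, the learning rate, is an illustrative assumption):

```python
import math, random

random.seed(0)
theta = [0.0, 0.0]                      # one logit per action
ETA, N = 0.1, 20                        # learning rate, episodes per update

def policy(theta):
    z = [math.exp(t) for t in theta]    # softmax over the logits
    total = sum(z)
    return [p / total for p in z]

def play(a):                            # toy game: action 1 pays 1, action 0 pays 0
    return 1.0 if a == 1 else 0.0

for _ in range(200):                    # alternate data collection / model update
    grad = [0.0, 0.0]
    for _ in range(N):                  # play the game N times with pi_theta
        probs = policy(theta)
        a = 0 if random.random() < probs[0] else 1
        R = play(a)
        # grad of log pi(a|s) for a softmax: one-hot(a) - probs
        for i in range(2):
            grad[i] += R * ((1.0 if i == a else 0.0) - probs[i])
    theta = [t + ETA * g / N for t, g in zip(theta, grad)]

print(policy(theta))  # probability of the rewarded action approaches 1
```

Because every reward here is non-negative, only sampled actions get pushed up directly, which is exactly the sampling issue the baseline improvement below addresses.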
Improvement: Adding a Baseline
Ideally, since we use probabilities, increasing the probability of good actions automatically decreases that of the others; but with sampling, if 𝑅(𝜏) is always positive, every sampled action has its probability increased, while actions that are never sampled are pushed down only by normalization
Solution: subtract a baseline 𝑏, using (𝑅(𝜏^𝑛) − 𝑏) in the gradient, so that actions with below-average reward have their probability decreased
Actor-Critic Approach
LEARNING AN ACTOR & A CRITIC
Actor-Critic (Value-Based + Policy-Based)
Estimate the value functions 𝑄^𝜋(𝑠, 𝑎), 𝑉^𝜋(𝑠)
Update the policy based on the value-function evaluation
◦𝜋 interacts with the environment
◦Learn 𝑄^𝜋(𝑠, 𝑎), 𝑉^𝜋(𝑠) by TD or MC
◦Update the actor from 𝜋 → 𝜋′ based on 𝑄^𝜋(𝑠, 𝑎), 𝑉^𝜋(𝑠), then set 𝜋 = 𝜋′
Here 𝜋 is an actual learned function that maximizes the value, so this may work for continuous actions
Advantage Actor-Critic
Learning the policy (actor) using the value evaluated by the critic
◦Positive advantage function ↔ increase the probability of action 𝑎_𝑡^𝑛
◦Negative advantage function ↔ decrease the probability of action 𝑎_𝑡^𝑛
◦𝜋 interacts with the environment
◦Learn 𝑉^𝜋(𝑠) by TD or MC (the critic's evaluation)
◦Update the actor based on 𝑉^𝜋(𝑠), then set 𝜋 = 𝜋′
Advantage function: 𝐴 = 𝑟_𝑡^𝑛 + 𝑉^𝜋(𝑠_{𝑡+1}^𝑛) − 𝑉^𝜋(𝑠_𝑡^𝑛)
◦𝑟_𝑡^𝑛 is the reward we truly obtain when taking action 𝑎_𝑡^𝑛, while 𝑉^𝜋(𝑠_𝑡^𝑛) − 𝑉^𝜋(𝑠_{𝑡+1}^𝑛) is the expected reward 𝑟_𝑡^𝑛 we obtain if we use actor 𝜋; the baseline is thus built in
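A numeric sketch of the advantage (the values are hypothetical, and the discount factor is omitted as on the slide):

```python
# Advantage A = r_t + V(s_{t+1}) - V(s_t): the reward actually obtained plus
# the critic's value of the next state, minus the critic's baseline for the
# current state. Positive -> raise prob. of a_t; negative -> lower it.
V = {"s_t": 1.0, "s_t+1": 2.0}   # critic's (assumed) value estimates
r_t = 0.5                         # reward truly obtained after action a_t

advantage = r_t + V["s_t+1"] - V["s_t"]
print(advantage)  # 1.5 > 0, so the probability of a_t should be increased
```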
Advantage Actor-Critic
Tips
◦The parameters of the actor 𝜋(𝑠) and the critic 𝑉^𝜋(𝑠) can be shared: a common network processes the input state 𝑠, with one head outputting action probabilities (e.g. left, right, fire) and another outputting 𝑉^𝜋(𝑠)
◦Use the output entropy as regularization for 𝜋(𝑠)
◦Larger entropy is preferred → exploration
Asynchronous Advantage Actor-Critic (A3C)
Asynchronous: each worker, in parallel,
1. Copies the global parameters 𝜃 into local parameters 𝜃^1
2. Samples some data
3. Computes gradients ∆𝜃
4. Updates the global model: 𝜃 ← 𝜃 + 𝜂∆𝜃
(other workers also update the global model at the same time)
Pathwise Derivative Policy Gradient
The original actor-critic tells us whether a given action is good or bad; the pathwise derivative policy gradient tells us which action is good
◦Connect the actor to the critic: the actor's output 𝑎 = 𝜋(𝑠) is fed into 𝑄^𝜋(𝑠, 𝑎), forming one large network
◦Fix the 𝑄 part and use gradient ascent on the actor's parameters to maximize 𝑄^𝜋(𝑠, 𝜋(𝑠))
Deep Deterministic Policy Gradient (DDPG)
Idea
◦The critic estimates the value of the current policy by DQN
◦The actor updates the policy in the direction that improves Q: the critic provides the loss function for the actor
◦𝜋 interacts with the environment, adding noise to its actions for exploration, and stores experience in a replay buffer
◦Learn 𝑄^𝜋(𝑠, 𝑎) from the buffer, then update the actor 𝜋 → 𝜋′ based on 𝑄^𝜋(𝑠, 𝑎)
DDPG Algorithm
Initialize critic network 𝜃^𝑄 and actor network 𝜃^𝜋
Initialize target critic network 𝜃^{𝑄′} = 𝜃^𝑄 and target actor network 𝜃^{𝜋′} = 𝜃^𝜋
Initialize replay buffer R
In each iteration
◦ Use 𝜋(𝑠) + noise to interact with the environment; collect a set of (𝑠_𝑡, 𝑎_𝑡, 𝑟_𝑡, 𝑠_{𝑡+1}) and put them in R
◦ Sample N examples (𝑠_𝑛, 𝑎_𝑛, 𝑟_𝑛, 𝑠_{𝑛+1}) from R
◦ Update the critic 𝑄 to minimize (1/𝑁) Σ_𝑛 (𝑦_𝑛 − 𝑄(𝑠_𝑛, 𝑎_𝑛))², where the target 𝑦_𝑛 = 𝑟_𝑛 + 𝛾𝑄′(𝑠_{𝑛+1}, 𝜋′(𝑠_{𝑛+1})) uses the target networks
◦ Update the actor 𝜋 to maximize (1/𝑁) Σ_𝑛 𝑄(𝑠_𝑛, 𝜋(𝑠_𝑛))
◦ Update the target networks: the target networks update slower than the trained networks
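The "update slower" step is commonly implemented as a soft update 𝜃′ ← 𝜏𝜃 + (1 − 𝜏)𝜃′ with small 𝜏 (the soft update and 𝜏 = 0.01 are standard DDPG choices; they are assumptions here, since the slide only says the targets update slower):

```python
# Soft (Polyak) update of target-network parameters, sketched on plain
# lists of floats standing in for the flattened network weights.
TAU = 0.01

def soft_update(target_params, params, tau=TAU):
    # theta' <- tau * theta + (1 - tau) * theta'
    return [tau * p + (1 - tau) * tp for tp, p in zip(target_params, params)]

theta_Q = [1.0, -2.0]          # trained critic weights (hypothetical)
theta_Q_tgt = [0.0, 0.0]       # target critic starts elsewhere
for _ in range(100):           # each training iteration nudges the target
    theta_Q_tgt = soft_update(theta_Q_tgt, theta_Q)

print(theta_Q_tgt)             # slowly tracking [1.0, -2.0]
```

The same update is applied to the target actor; the slow tracking keeps the targets 𝑦_𝑛 nearly stationary between iterations, as in the frozen-target trick of DQN.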
DDPG in Simulated Physics
Goal: endtoend learning of control policy from pixels
◦Input: state is stack of raw pixels from last 4 frames
◦Output: two separate CNNs for Q and 𝜋
Model-Based Approach
AGENT’S REPRESENTATION OF THE ENVIRONMENT
ModelBased Deep RL
Goal: learn a transition model of the environment and plan based on the transition model
Objective is to maximize the measured goodness of model
Model-based deep RL is challenging, and has so far failed in Atari
Issues for ModelBased Deep RL
Compounding errors
◦Errors in the transition model compound over the trajectory
◦A long trajectory may result in totally wrong rewards
Deep networks for value/policy can “plan” implicitly
◦Each layer of the network performs an arbitrary computational step
◦An n-layer network can “lookahead” n steps
ModelBased Deep RL in Go
Monte-Carlo tree search (MCTS)
◦MCTS simulates future trajectories
◦Builds a large lookahead search tree with millions of positions
◦State-of-the-art Go programs use MCTS
Convolutional Networks
◦A 12-layer CNN trained to predict expert moves
◦The raw CNN (looking at 1 position, with no search at all) equals the performance of MoGo, the first strong Go program, with a 10^5-position search tree
OpenAI Universe
Software platform for measuring and training an AI's general intelligence via the OpenAI gym environment
Concluding Remarks
RL is a general purpose framework for decision making under interactions between agent and environment
An RL agent may include one or more of these components
◦Value function: how good is each state and/or action
◦Policy: agent’s behavior function
◦Model: agent’s representation of the environment
RL problems can be solved by end-to-end deep learning: value-based (learning a critic), policy-based (learning an actor), or actor-critic
Reinforcement Learning + Deep Learning = AI
References
Course materials by David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
ICLR 2015 Tutorial: http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silvericlr2015.pdf
ICML 2016 Tutorial: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf