Review
Reinforcement Learning
RL is a general purpose framework for decision making
◦RL is for an agent with the capacity to act
◦Each action influences the agent’s future state
◦Success is measured by a scalar reward signal
Big three: action, state, reward
Agent and Environment
[Figure: agent–environment interaction loop — at each step the agent receives observation o_t and reward r_t, and sends action a_t (e.g., MoveLeft / MoveRight) back to the environment]
Major Components in an RL Agent
An RL agent may include one or more of these components
◦Policy: agent’s behavior function
◦Value function: how good is each state and/or action
◦Model: agent’s representation of the environment
Reinforcement Learning Approach
Policy-based RL
◦Search directly for the optimal policy 𝜋*: the policy achieving maximum future reward
Value-based RL
◦Estimate the optimal value function Q*(s, a): the maximum value achievable under any policy
Model-based RL
◦Build a model of the environment
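A compact statement of the two optimality notions above, in standard notation (a sketch; the return G_t and discount factor γ follow the usual definitions, which are not spelled out on these slides):

```latex
% Discounted return from time t
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots
% Optimal value: the maximum value achievable under any policy
Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[\, G_t \mid s_t = s,\ a_t = a,\ \pi \,\right]
% Optimal policy: the policy achieving maximum future reward
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
```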
RL Agent Taxonomy
Deep Reinforcement Learning
Idea: deep learning for reinforcement learning
◦Use deep neural networks to represent
• Value function
• Policy
• Model
◦Optimize loss function by SGD
Value-Based Deep RL
Estimate How Good Each State and/or Action is
Value Function Approximation
Representing the value function with a lookup table does not scale:
◦too many states and/or actions to store
◦too slow to learn the value of each entry individually
Instead, values can be estimated with function approximation
Q-Networks
Q-networks represent the value function with a weight vector w: Q(s, a, w) ≈ Q*(s, a)
◦generalize from seen states to unseen states
◦update the parameters w for function approximation
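As a concrete illustration, a minimal Q-network for a small discrete-action problem might look like the sketch below (PyTorch is assumed; the state size, action count, and hidden width are placeholder choices, not values from the slides):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action: Q(s, ., w)."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, num_actions)

# Example: greedy action selection with the current network
q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)
greedy_action = q_net(state).argmax(dim=1)
```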
Q-Learning
Goal: estimate the optimal Q-values Q*(s, a)
◦Optimal Q-values obey the Bellman equation: Q*(s, a) = E[ r + γ max_{a′} Q*(s′, a′) | s, a ]
◦Value iteration algorithms solve the Bellman equation by repeatedly backing up the learning target r + γ max_{a′} Q*(s′, a′)
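A tabular Q-value-iteration sketch of this backup (assuming the MDP's transition probabilities P and expected rewards R are known; the array shapes and variable names are illustrative):

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.99, iters=1000):
    """P[s, a, s'] = transition probability, R[s, a] = expected reward.
    Repeatedly applies the Bellman optimality backup."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        # learning target: r + gamma * max_a' Q(s', a'), averaged over next states s'
        Q = R + gamma * P @ Q.max(axis=1)
    return Q
```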
Deep Q-Networks (DQN)
Represent the value function by a deep Q-network with weights w: Q(s, a, w) ≈ Q*(s, a)
The objective is to minimize the MSE loss by SGD:
L(w) = E[ (r + γ max_{a′} Q(s′, a′, w) − Q(s, a, w))² ]
leading to the following Q-learning gradient:
∂L(w)/∂w = E[ (r + γ max_{a′} Q(s′, a′, w) − Q(s, a, w)) ∂Q(s, a, w)/∂w ]
Issue: naïve Q-learning oscillates or diverges with neural networks due to:
1) correlations between samples  2) non-stationary targets
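To make the loss concrete, here is a minimal single-mini-batch update in this naive form (PyTorch is assumed; this is the unstable variant, before replay and target freezing are introduced below):

```python
import torch
import torch.nn.functional as F

def naive_q_update(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One SGD step on (r + gamma * max_a' Q(s',a';w) - Q(s,a;w))^2."""
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a, w)
    with torch.no_grad():                                         # treat the target as fixed
        target = r + gamma * (1 - done.float()) * q_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```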
Stability Issues with Deep RL
Naive Q-learning oscillates or diverges with neural nets
1. Data is sequential
◦Successive samples are correlated, non-iid (independent and identically distributed)
2. Policy changes rapidly with slight changes to Q-values
◦Policy may oscillate
◦Distribution of data can swing from one extreme to another
3. Scale of rewards and Q-values is unknown
◦Naive Q-learning gradients can be unstable when backpropagated
Stable Solutions for DQN
DQN provides a stable solution to deep value-based RL
1. Use experience replay
◦Break correlations in data, bring us back to iid setting
◦Learn from all past policies
2. Freeze target Q-network
◦Avoid oscillation
◦Break correlations between Q-network and target
3. Clip rewards or normalize network adaptively to sensible range
◦Robust gradients
Stable Solution 1: Experience Replay
To remove correlations, build a dataset from the agent's own experience
◦Take action a_t according to an 𝜖-greedy policy (small probability 𝜖 of a random action, for exploration)
◦Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
◦Sample a random mini-batch of transitions (s, a, r, s′) from D
◦Optimize MSE between the Q-network and the Q-learning targets
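A minimal replay memory along these lines (the capacity and the stored tuple layout are illustrative choices):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions (s, a, r, s_next, done) and samples uncorrelated mini-batches."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)   # random draw breaks temporal correlations
        return list(zip(*batch))                         # tuple of columns (states, actions, ...)

    def __len__(self):
        return len(self.buffer)
```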
Stable Solution 2: Fixed Target Q-Network
To avoid oscillations, fix the parameters used in the Q-learning target
◦Compute Q-learning targets w.r.t. old, fixed parameters w⁻: r + γ max_{a′} Q(s′, a′, w⁻)
◦Optimize MSE between the Q-network Q(s, a, w) and the Q-learning targets
◦Periodically update the fixed parameters: w⁻ ← w
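Combining the replay memory above with a frozen target network, one training step might look like this sketch (PyTorch assumed; it reuses the q_net from the earlier QNetwork sketch, and the sync interval is left to the caller):

```python
import copy
import torch
import torch.nn.functional as F

target_net = copy.deepcopy(q_net)          # w-: frozen copy of the Q-network

def dqn_step(batch, optimizer, gamma=0.99):
    """One SGD step on the MSE between Q(s, a, w) and the fixed-target Q-learning target."""
    s, a, r, s_next, done = (torch.as_tensor(x) for x in batch)
    q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # target computed with the old, fixed parameters w-
        target = r.float() + gamma * (1 - done.float()) * target_net(s_next.float()).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Periodically: w- <- w."""
    target_net.load_state_dict(q_net.state_dict())
```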
Stable Solution 3: Reward / Value Range
To avoid oscillations, control the reward / value range
◦DQN clips the rewards to [−1, +1]
•Prevents Q-values from becoming too large
•Ensures gradients are well-conditioned
Deep RL in Atari Games
DQN in Atari
Goal: end-to-end learning of values Q(s, a) from pixels
◦Input: state is stack of raw pixels from last 4 frames
◦Output: Q(s, a) for all joystick/button positions a
◦Reward is the score change for that step
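A sketch of such a pixel-based Q-network (the convolution sizes below follow the commonly reported DQN architecture, but are assumptions rather than values stated on these slides):

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Input: stack of 4 grayscale 84x84 frames. Output: Q(s, a) for every action."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(frames / 255.0))   # scale raw pixels to [0, 1]
```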
DQN in Atari
Other Improvements: Double DQN
Nature DQN target: r + γ max_{a′} Q(s′, a′, w⁻)
Double DQN: remove the upward bias caused by taking the max over estimated Q-values
◦Current Q-network w is used to select actions
◦Older Q-network w⁻ is used to evaluate actions
Double DQN target: r + γ Q(s′, argmax_{a′} Q(s′, a′, w), w⁻)
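As a sketch, the change amounts to replacing the target computation inside the earlier dqn_step (same tensor names assumed):

```python
with torch.no_grad():
    # select the action with the current network w ...
    best_actions = q_net(s_next).argmax(dim=1, keepdim=True)
    # ... but evaluate that action with the older, frozen network w-
    double_q = target_net(s_next).gather(1, best_actions).squeeze(1)
    target = r + gamma * (1 - done) * double_q
```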
Other Improvements: Prioritized Replay
Prioritized Replay: weight experience based on surprise
◦Store experience in priority queue according to DQN error
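One common way to realize this is proportional prioritization by absolute TD error (the exponent alpha and the small constant are assumed hyperparameters, not values from the slides):

```python
import numpy as np

def sample_prioritized(transitions, td_errors, batch_size, alpha=0.6):
    """Sample transitions with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + 1e-6) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(transitions), size=batch_size, p=probs)
    return [transitions[i] for i in idx], idx
```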
Other Improvements: Dueling Network
Dueling Network: split the Q-network into two channels
◦Action-independent value function V(s, v)
The value function estimates how good the state is
◦Action-dependent advantage function A(s, a, w)
The advantage function estimates the additional benefit of taking action a
Combined: Q(s, a) = V(s, v) + A(s, a, w)
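A sketch of a dueling head; the mean-subtraction on the advantage stream is an identifiability convention assumed here, not something stated on the slides:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), computed from a shared feature vector."""
    def __init__(self, feature_dim: int, num_actions: int):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)                 # action-independent V(s)
        self.advantage = nn.Linear(feature_dim, num_actions)   # action-dependent A(s, a)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)
        a = self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)
```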
Policy-Based Deep RL
Estimate How Good an Agent's Behavior Is
Deep Policy Networks
Represent the policy by a deep network with weights u
◦Stochastic policy: a ~ 𝜋(a | s, u)
◦Deterministic policy: a = 𝜋(s, u)
The objective is to maximize the total discounted reward by SGD:
L(u) = E[ r₁ + γ r₂ + γ² r₃ + … | 𝜋(·, u) ]
Policy Gradient
The gradient of a stochastic policy 𝜋(a | s, u) is given by
∂L(u)/∂u = E[ ∂log 𝜋(a | s, u)/∂u · Q^𝜋(s, a) ]
The gradient of a deterministic policy a = 𝜋(s, u) is given by
∂L(u)/∂u = E[ ∂Q^𝜋(s, a)/∂a · ∂𝜋(s, u)/∂u ]
The deterministic form shows how to deal with continuous actions: when a is continuous and Q is differentiable, the gradient flows through the action
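A REINFORCE-style sketch of the stochastic gradient, using the sampled return in place of Q^𝜋(s, a) (a standard estimator, and an assumption here; network and tensor names are illustrative):

```python
import torch

def policy_gradient_step(policy_net, optimizer, states, actions, returns):
    """Ascend E[ d log pi(a|s,u)/du * return ] by minimizing the negated objective."""
    logits = policy_net(states)                          # unnormalized action scores
    log_probs = torch.log_softmax(logits, dim=1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(log_pi_a * returns).mean()                  # minus sign: SGD minimizes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```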
Actor-Critic (Value-Based + Policy-Based)
Estimate the value function Q(s, a, w) ≈ Q^𝜋(s, a) (the critic)
Update the policy parameters u by SGD (the actor)
◦Stochastic policy: ∂L(u)/∂u = E[ ∂log 𝜋(a | s, u)/∂u · Q(s, a, w) ]
◦Deterministic policy: ∂L(u)/∂u = E[ ∂Q(s, a, w)/∂a · ∂𝜋(s, u)/∂u ]
Deterministic Deep Actor-Critic
Deep deterministic policy gradient (DDPG) is the continuous analogue of DQN
◦Experience replay: build dataset from agent’s experience
◦Critic estimates value of current policy by DQN
◦Actor updates policy in direction that improves Q
Critic provides loss function for actor
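A sketch of the DDPG actor update implied by "critic provides loss function for actor" (the actor, critic, and optimizer objects are assumed to exist):

```python
import torch

def ddpg_actor_step(actor, critic, actor_optimizer, states):
    """Move the policy in the direction that improves Q: maximize critic(s, actor(s))."""
    actions = actor(states)                       # deterministic a = pi(s, u)
    actor_loss = -critic(states, actions).mean()  # gradient flows through Q into the actor
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
```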
DDPG in Simulated Physics
Goal: end-to-end learning of control policy from pixels
◦Input: state is stack of raw pixels from last 4 frames
◦Output: two separate CNNs for Q and 𝜋
Model-Based Deep RL
Agent's Representation of the Environment
Model-Based Deep RL
Goal: learn a transition model of the environment, e.g. p(r, s′ | s, a), and plan based on the transition model
The objective is to maximize the measured goodness of the model, i.e. how well it fits the observed transitions
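A sketch of fitting such a transition model with a neural network, using MSE on the predicted next state and reward as the goodness measure (this choice of loss is an assumption; a likelihood-based objective is equally common):

```python
import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    """Predicts (next_state, reward) from (state, action)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),          # next state + scalar reward
        )

    def forward(self, state, action):
        out = self.net(torch.cat([state, action], dim=1))
        return out[:, :-1], out[:, -1]                 # (next_state, reward)

def model_loss(model, s, a, s_next, r):
    pred_s, pred_r = model(s, a)
    return nn.functional.mse_loss(pred_s, s_next) + nn.functional.mse_loss(pred_r, r)
```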
Issues for Model-Based Deep RL
Compounding errors
◦Errors in the transition model compound over the trajectory
◦A long trajectory may result in totally wrong rewards
Deep networks of value/policy can “plan” implicitly
◦Each layer of network performs arbitrary computational step
◦n-layer network can “lookahead” n steps
Model-Based Deep RL in Go
Monte-Carlo tree search (MCTS)
◦MCTS simulates future trajectories
◦Builds large lookahead search tree with millions of positions
◦State-of-the-art Go programs use MCTS
Convolutional Networks
◦12-layer CNN trained to predict expert moves
◦Raw CNN (evaluating a single position, with no search at all) equals the performance of MoGo, the first strong Go program, which searches a tree of 10⁵ positions
OpenAI Universe
Software platform for measuring and training an AI's general intelligence via the OpenAI Gym environments
Concluding Remarks
RL is a general-purpose framework for decision making through interaction between an agent and its environment
An RL agent may include one or more of these components
◦Policy: agent’s behavior function
◦Value function: how good is each state and/or action
◦Model: agent’s representation of the environment
RL problems can be solved by end-to-end deep learning
Reinforcement Learning + Deep Learning = AI
References
Course materials by David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
ICLR 2015 Tutorial: http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf
ICML 2016 Tutorial: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf