## Review

Reinforcement Learning

### Reinforcement Learning

**RL is a general purpose framework for decision making**

◦RL is for an agent with the capacity to act

◦Each action influences the agent’s future state

◦Success is measured by a scalar reward signal

Big three: action, state, reward

### Agent and Environment

[Figure: the agent-environment loop. At each time step the agent receives an observation $o_t$ and a scalar reward $r_t$ from the environment, and emits an action $a_t$ (e.g., MoveLeft / MoveRight), which in turn influences the next observation and reward.]

### Major Components in an RL Agent

An RL agent may include one or more of these components

◦**Policy: agent’s behavior function**

◦**Value function: how good is each state and/or action**

◦**Model: agent’s representation of the environment**

### Reinforcement Learning Approach

Policy-based RL

◦Search directly for optimal policy

Value-based RL

◦Estimate the optimal value function

Model-based RL

◦Build a model of the environment

◦The optimal policy $\pi^*$ is the policy achieving maximum future reward

◦The optimal value $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ is the maximum value achievable under any policy

### RL Agent Taxonomy

### Deep Reinforcement Learning

Idea: deep learning for reinforcement learning

◦Use deep neural networks to represent

• Value function

• Policy

• Model

◦Optimize loss function by SGD

## Value-Based Deep RL

Estimate How Good Each State and/or Action is

### Value Function Approximation

*Representing value functions by a lookup table does not scale:*

◦too many states and/or actions to store

◦too slow to learn the value of each entry individually

*Instead, values can be estimated with function approximation*

### Q-Networks

Q-networks represent the value function by a neural network with weights $w$, so that $Q(s, a, w) \approx Q^*(s, a)$

◦generalize from seen states to unseen states

◦learn by updating the parameters $w$ of the function approximator
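A minimal sketch of such a Q-network, assuming PyTorch and a small fully connected architecture (layer sizes are illustrative, not from the original slides):

```python
# Minimal Q-network sketch (PyTorch assumed; layer sizes are illustrative).
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Output shape: (batch, num_actions), one Q(s, a, w) per action.
        return self.net(state)
```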

### Q-Learning

Goal: estimate the optimal Q-values $Q^*(s, a)$

◦Optimal Q-values obey the Bellman equation

$$Q^*(s, a) = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \right]$$

◦*Value iteration* algorithms solve the Bellman equation by repeatedly backing up each estimate toward the learning target $r + \gamma \max_{a'} Q(s', a')$, as in the tabular sketch below
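A minimal tabular sketch of this backup, assuming small integer state/action spaces and illustrative values for $\gamma$ and the step size:

```python
# Tabular Q-learning sketch (NumPy): back up Q(s, a) toward the Bellman target
# r + gamma * max_a' Q(s', a'). Sizes, gamma, and the step size are illustrative.
import numpy as np

num_states, num_actions = 10, 4
gamma, alpha = 0.99, 0.1
Q = np.zeros((num_states, num_actions))

def q_learning_update(s: int, a: int, r: float, s_next: int, done: bool) -> None:
    """One-step backup of the tabular Q-value toward the learning target."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```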

### Deep Q-Networks (DQN)

Represent the value function by a deep Q-network with weights $w$: $Q(s, a, w) \approx Q^*(s, a)$

The objective is to minimize the MSE loss by SGD:

$$L(w) = \mathbb{E}\!\left[ \left( r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w) \right)^2 \right]$$

leading to the following Q-learning gradient:

$$\frac{\partial L(w)}{\partial w} = \mathbb{E}\!\left[ \left( r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w) \right) \frac{\partial Q(s, a, w)}{\partial w} \right]$$

Issue: naïve Q-learning oscillates or diverges when the Q-function is a neural network, due to 1) correlations between samples and 2) non-stationary targets
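A hedged sketch of this loss in PyTorch, reusing the QNetwork from the earlier sketch; this is the naïve version in which the same weights $w$ appear in both the prediction and the target (tensor shapes and names are assumptions):

```python
# Naive DQN loss sketch: MSE between Q(s, a, w) and the Q-learning target, with the
# same weights w used for both (the stability fixes below change this).
import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma: float = 0.99) -> torch.Tensor:
    # batch: states (float), actions (long), rewards (float), next states, done flags (float)
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)           # Q(s, a, w)
    with torch.no_grad():                                           # target treated as constant
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)
```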

### Stability Issues with Deep RL

Naive Q-learning oscillates or diverges with neural nets

1. Data is sequential

◦Successive samples are correlated, non-iid (independent and identically distributed)

2. Policy changes rapidly with slight changes to Q-values

◦Policy may oscillate

◦Distribution of data can swing from one extreme to another

3. Scale of rewards and Q-values is unknown

◦Naive Q-learning gradients can be unstable when backpropagated

### Stable Solutions for DQN

DQN provides a stable solution to deep value-based RL

1. Use experience replay

◦Break correlations in data, bring us back to iid setting

◦Learn from all past policies

2. Freeze target Q-network

◦Avoid oscillation

◦Break correlations between Q-network and target

3. Clip rewards or normalize network adaptively to sensible range

◦Robust gradients

### Stable Solution 1: Experience Replay

To remove correlations, build a dataset from the agent's own experience, as sketched below

◦Take action $a_t$ according to an $\epsilon$-greedy policy (small probability $\epsilon$ of exploration)

◦Store transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $D$

◦Sample a random mini-batch of transitions from $D$

◦Optimize MSE between the Q-network and the Q-learning targets
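A minimal sketch of the replay memory and the $\epsilon$-greedy action choice; capacity, batch size, and $\epsilon$ are illustrative assumptions:

```python
# Replay memory and epsilon-greedy sketch; capacity, batch size, and epsilon are assumptions.
import random
from collections import deque

class ReplayMemory:
    """Stores transitions and returns uniformly sampled mini-batches."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done) -> None:
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int = 32):
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_values, epsilon: float = 0.05) -> int:
    """With small probability epsilon pick a random action; otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```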

### Stable Solution 2: Fixed Target Q-Network

To avoid oscillations, fix the parameters used in the Q-learning target

◦Compute Q-learning targets w.r.t. old, fixed parameters $w^-$

◦Optimize MSE between the Q-network $Q(s, a, w)$ and the Q-learning targets

◦Periodically update the fixed parameters: $w^- \leftarrow w$ (see the sketch below)
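A hedged sketch of the fixed-target variant, reusing the Q-network shapes from the earlier sketches; the sync period is an assumption:

```python
# Fixed-target sketch: Q-learning targets use a frozen copy w^- of the online weights w,
# refreshed every `sync_every` gradient steps (the period is an assumption).
import copy
import torch
import torch.nn.functional as F

def make_target_net(q_net):
    """Frozen target network: a deep copy of the online network with gradients disabled."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def dqn_loss_fixed_target(q_net, target_net, batch, gamma: float = 0.99) -> torch.Tensor:
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q(s, a, w)
    with torch.no_grad():                                            # target uses old, fixed w^-
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

def maybe_sync(step: int, q_net, target_net, sync_every: int = 10_000) -> None:
    """Periodically copy the online weights into the frozen target network."""
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```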

### Stable Solution 3: Reward / Value Range

To avoid oscillations, control the reward / value range

◦DQN clips the rewards to [−1, +1]

Prevents too large Q-values

Ensures gradients are well-conditioned
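A one-line sketch of the clipping step applied to each per-step reward:

```python
def clip_reward(r: float) -> float:
    """Clip the per-step reward into [-1, +1] so Q-values and gradients stay bounded."""
    return max(-1.0, min(1.0, r))
```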

### Deep RL in Atari Games

### DQN in Atari

*Goal: end-to-end learning of values Q(s, a) from pixels*

◦Input: state is stack of raw pixels from last 4 frames

◦*Output: Q(s, a) for all joystick/button positions a*

◦Reward is the score change for that step


### Other Improvements: Double DQN

Nature DQN target: $r + \gamma \max_{a'} Q(s', a', w^-)$

Double DQN: remove the upward bias caused by taking the max over estimated Q-values

◦Current Q-network $w$ is used to select actions

◦Older Q-network $w^-$ is used to evaluate actions
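A hedged sketch of the Double DQN target, with the current network selecting the action and the older (target) network evaluating it:

```python
# Double DQN target sketch: the current network selects the argmax action, the older
# (target) network evaluates it, removing the upward bias of maxing over noisy estimates.
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma: float = 0.99):
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)         # select with current w
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)   # evaluate with older w^-
    return r + gamma * (1 - done) * q_eval
```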

### Other Improvements: Prioritized Replay

Prioritized Replay: weight experience based on surprise

◦Store experience in a priority queue ordered by the magnitude of its DQN (TD) error
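A minimal sketch of proportional prioritization, using a plain list in place of an efficient priority structure; the priority offset `eps` is an assumption:

```python
# Proportional prioritized replay sketch: sampling probability is proportional to the
# stored |TD error|; a plain list stands in for an efficient priority structure.
import random

class PrioritizedReplay:
    def __init__(self, eps: float = 1e-3):
        self.transitions, self.priorities = [], []
        self.eps = eps                               # keeps zero-error transitions sampleable

    def store(self, transition, td_error: float) -> None:
        self.transitions.append(transition)
        self.priorities.append(abs(td_error) + self.eps)

    def sample(self, batch_size: int = 32):
        idx = random.choices(range(len(self.transitions)),
                             weights=self.priorities, k=batch_size)
        return [self.transitions[i] for i in idx]
```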

### Other Improvements: Dueling Network

Dueling Network: split the Q-network into two channels

◦Action-independent value function $V(s)$

The value function estimates how good the state is

◦Action-dependent advantage function $A(s, a)$

The advantage function estimates the additional benefit of taking a particular action

The two streams are recombined into the Q-value, $Q(s, a) = V(s) + A(s, a)$, as sketched below
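A hedged PyTorch sketch of a dueling head; subtracting the mean advantage is the common identifiability trick, and layer sizes are illustrative:

```python
# Dueling head sketch: a shared trunk feeds a state-value stream V(s) and an advantage
# stream A(s, a); Q is recombined as V + (A - mean(A)) for identifiability.
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                   # action-independent V(s)
        self.advantage = nn.Linear(hidden, num_actions)     # action-dependent A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)          # Q(s, a)
```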

## Policy-Based Deep RL

Estimate How Good an Agent's Behavior Is

### Deep Policy Networks

Represent the policy by a deep network with weights $u$, either as a stochastic policy $a \sim \pi(a \mid s, u)$ or a deterministic policy $a = \pi(s, u)$

The objective is to maximize the total discounted reward by SGD:

$$L(u) = \mathbb{E}\!\left[ r_1 + \gamma r_2 + \gamma^2 r_3 + \dots \mid \pi(\cdot, u) \right]$$

### Policy Gradient

The gradient of a stochastic policy $\pi(a \mid s, u)$ is given by

$$\frac{\partial L(u)}{\partial u} = \mathbb{E}\!\left[ \frac{\partial \log \pi(a \mid s, u)}{\partial u} \, Q^\pi(s, a) \right]$$

The gradient of a deterministic policy $a = \pi(s, u)$ is given by

$$\frac{\partial L(u)}{\partial u} = \mathbb{E}\!\left[ \frac{\partial Q^\pi(s, a)}{\partial a} \frac{\partial a}{\partial u} \right]$$

The deterministic form shows how to deal with continuous actions: the Q-value is differentiated directly with respect to the action (see the sketch below)
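A minimal sketch of the stochastic (score-function) gradient in PyTorch, using sampled returns in place of $Q^\pi(s, a)$; network sizes and the learning rate are illustrative:

```python
# Stochastic policy-gradient sketch: ascend E[ d log pi(a|s,u)/du * Q(s,a) ], with
# sampled returns standing in for Q. Sizes and learning rate are illustrative.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_step(states, actions, returns) -> None:
    """states: (batch, 4) float; actions: (batch,) long; returns: (batch,) float."""
    log_probs = torch.log_softmax(policy(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi(a|s, u)
    loss = -(chosen * returns).mean()        # gradient ascent via minimizing the negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```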

### Actor-Critic (Value-Based + Policy-Based)

Estimate the value function $Q(s, a, w)$ (the critic)

Update the policy parameters $u$ (the actor) by SGD

◦Stochastic policy: $\dfrac{\partial L(u)}{\partial u} = \mathbb{E}\!\left[ \dfrac{\partial \log \pi(a \mid s, u)}{\partial u} \, Q(s, a, w) \right]$

◦Deterministic policy: $\dfrac{\partial L(u)}{\partial u} = \mathbb{E}\!\left[ \dfrac{\partial Q(s, a, w)}{\partial a} \dfrac{\partial a}{\partial u} \right]$

### Deterministic Deep Actor-Critic

Deep deterministic policy gradient (DDPG) is the continuous analogue of DQN

◦Experience replay: build dataset from agent’s experience

◦Critic estimates value of current policy by DQN

◦Actor updates policy in direction that improves Q

Critic provides loss function for actor
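A hedged sketch of one DDPG update, omitting experience replay and target networks for brevity; network sizes, learning rates, and tensor shapes are assumptions:

```python
# DDPG update sketch: the critic is trained on a one-step TD target (as in DQN), and the
# actor ascends the critic's value of its own action. Replay and target networks omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 8, 2   # illustrative sizes
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, done, gamma: float = 0.99) -> None:
    # Critic: regress Q(s, a, w) toward r + gamma * Q(s', pi(s', u), w).
    with torch.no_grad():
        next_q = critic(torch.cat([s_next, actor(s_next)], dim=1)).squeeze(1)
        target = r + gamma * (1 - done) * next_q
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = F.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: the critic provides the loss; move the policy in the direction that improves Q.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```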

### DDPG in Simulated Physics

Goal: end-to-end learning of control policy from pixels

◦Input: state is stack of raw pixels from last 4 frames

◦*Two separate convolutional networks are used for Q and 𝜋*

## Model-Based Deep RL

Agent's Representation of the Environment

### Model-Based Deep RL

Goal: learn a transition model of the environment and plan based on the transition model

The objective is to maximize the measured goodness of the model (e.g., how well it predicts observed transitions), as sketched below
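A minimal sketch of learning such a transition model by regression on observed transitions; architecture and loss are illustrative assumptions:

```python
# Transition-model sketch: a network predicts the next state and reward from (state, action),
# trained by regression on observed transitions. Architecture and loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 8, 2   # illustrative sizes
model = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                      nn.Linear(128, state_dim + 1))      # predicts [next state, reward]
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def model_update(s, a, r, s_next) -> None:
    pred = model(torch.cat([s, a], dim=1))
    pred_next, pred_r = pred[:, :-1], pred[:, -1]
    loss = F.mse_loss(pred_next, s_next) + F.mse_loss(pred_r, r)
    opt.zero_grad()
    loss.backward()
    opt.step()
```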

### Issues for Model-Based Deep RL

Compounding errors

◦Errors in the transition model compound over the trajectory

◦A long trajectory may result in totally wrong rewards

Deep networks of value/policy can “plan” implicitly

◦Each layer of network performs arbitrary computational step

◦n-layer network can “lookahead” n steps

### Model-Based Deep RL in Go

Monte-Carlo tree search (MCTS)

◦MCTS simulates future trajectories

◦Builds large lookahead search tree with millions of positions

◦State-of-the-art Go programs use MCTS

Convolutional Networks

◦12-layer CNN trained to predict expert moves

◦Raw CNN (looking at one position, with no search at all) equals the performance of MoGo, the first strong Go program, which searched a tree of $10^5$ positions

### OpenAI Universe

Software platform for measuring and training an AI's general intelligence via the OpenAI Gym interface

### Concluding Remarks

**RL is a general purpose framework for decision making** under interactions between an agent and its environment

An RL agent may include one or more of these components

◦Policy: agent’s behavior function

◦Value function: how good is each state and/or action

◦Model: agent’s representation of the environment

RL problems can be solved by end-to-end deep learning

Reinforcement Learning + Deep Learning = AI

### References

◦Course materials by David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

◦ICLR 2015 Tutorial (Silver): http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf

◦ICML 2016 Tutorial (Silver): http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf