
Slide credit from David Silver


(1)
(2)

Review

Reinforcement Learning

(3)

Reinforcement Learning

RL is a general purpose framework for decision making

RL is for an agent with the capacity to act

Each action influences the agent’s future state

Success is measured by a scalar reward signal

Big three: action, state, reward

(4)

Agent and Environment

[Figure: agent-environment interaction loop. At each time step t the agent receives observation o_t and reward r_t from the environment and selects an action a_t (e.g. MoveLeft / MoveRight).]

(5)

Major Components in an RL Agent

An RL agent may include one or more of these components

Policy: agent’s behavior function

Value function: how good is each state and/or action

Model: agent’s representation of the environment

(6)

Reinforcement Learning Approach

Policy-based RL

Search directly for the optimal policy 𝜋*

𝜋* is the policy achieving maximum future reward

Value-based RL

Estimate the optimal value function Q*(s, a)

Q*(s, a) is the maximum value achievable under any policy

Model-based RL

Build a model of the environment

(7)

RL Agent Taxonomy

(8)

Deep Reinforcement Learning

Idea: deep learning for reinforcement learning

Use deep neural networks to represent

Value function

Policy

Model

Optimize loss function by SGD

(9)

Value-Based Deep RL

Estimate How Good Each State and/or Action is

(10)

Value Function Approximation

Tabular case: value functions are represented by a lookup table

Problem: too many states and/or actions to store

Problem: too slow to learn the value of each entry individually

Solution: estimate values with function approximation

(11)

Q-Networks

Q-networks represent value functions with weights w

generalize from seen states to unseen states

update the parameters w for function approximation
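As a concrete illustration (not from the original slides), a minimal Q-network in PyTorch that maps a state vector to one Q-value per discrete action; the layer sizes and names are placeholders:

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Q(s, .; w): maps a state vector to one Q-value per discrete action."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Shared weights let the network generalize from seen to unseen states.
        return self.net(state)  # shape: (batch, num_actions)
```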

(12)

Q-Learning

Goal: estimate optimal Q-values

Optimal Q-values obey a Bellman equation

Value iteration algorithms solve the Bellman equation

The right-hand side of the Bellman equation serves as the learning target (both are written out below)
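The Bellman optimality equation and the value-iteration update the slide refers to, in standard notation:

```latex
% Bellman optimality equation for Q-values
Q^{*}(s, a) \;=\; \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \right]

% Value iteration: the bracketed term is the learning target
Q_{i+1}(s, a) \;\leftarrow\; \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q_{i}(s', a') \;\middle|\; s, a \right]
```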

(13)

Deep Q-Networks (DQN)

Represent the value function by a deep Q-network with weights w

Objective is to minimize the MSE loss by SGD

Leading to the following Q-learning gradient (loss and gradient are written out below)

Issue: naïve Q-learning oscillates or diverges using NN due to:

1) correlations between samples 2) non-stationary targets
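The loss and gradient referred to on this slide, written in the form used in the source tutorial (the target is treated as fixed when differentiating, and constant factors are absorbed into the learning rate):

```latex
% Mean-squared error between the Q-network and the Q-learning target
L(w) = \mathbb{E}\!\left[\Big( \underbrace{r + \gamma \max_{a'} Q(s', a'; w)}_{\text{target}} - Q(s, a; w) \Big)^{2}\right]

% Q-learning gradient (semi-gradient: the target is held fixed)
\frac{\partial L(w)}{\partial w}
  = \mathbb{E}\!\left[\Big( r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w) \Big)\,
      \frac{\partial Q(s, a; w)}{\partial w}\right]
```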

(14)

Stability Issues with Deep RL

Naive Q-learning oscillates or diverges with neural nets

1. Data is sequential

Successive samples are correlated, non-iid (independent and identically distributed)

2. Policy changes rapidly with slight changes to Q-values

Policy may oscillate

Distribution of data can swing from one extreme to another

3. Scale of rewards and Q-values is unknown

Naive Q-learning gradients can be unstable when backpropagated

(15)

Stable Solutions for DQN

DQN provides a stable solution to deep value-based RL

1. Use experience replay

Break correlations in data, bring us back to iid setting

Learn from all past policies

2. Freeze target Q-network

Avoid oscillation

Break correlations between Q-network and target

3. Clip rewards or normalize network adaptively to sensible range

Robust gradients

(16)

Stable Solution 1: Experience Replay

To remove correlations, build a dataset from agent’s experience

Take action a_t according to an 𝜖-greedy policy (with small probability 𝜖, pick a random action for exploration)

Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D

Sample a random mini-batch of transitions from D

Optimize MSE between Q-network and Q-learning targets (see the sketch below)
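A minimal sketch of this loop in Python; names such as ReplayBuffer and epsilon_greedy are illustrative, not from the slides, and a Q-network producing q_values is assumed to exist elsewhere:

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size memory D of transitions (s, a, r, s_next, done)."""

    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        # Random sampling breaks correlations between successive samples.
        batch = random.sample(list(self.memory), batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done


def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """With small probability epsilon act randomly (exploration), else greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))
```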

(17)

Stable Solution 2: Fixed Target Q-Network

To avoid oscillations, fix parameters used in Q-learning target

◦ Compute Q-learning targets w.r.t. old, fixed parameters w⁻

◦ Optimize MSE between Q-network and Q-learning targets

◦ Periodically update the fixed parameters (w⁻ ← w), as sketched below
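A sketch of the frozen-target mechanics in PyTorch, assuming a small MLP stands in for the Q-network; the frozen copy (w⁻) provides the targets and is refreshed only periodically:

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

# Any Q-value model; a tiny MLP stands in here (4-dim state, 2 actions).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)      # old, fixed parameters w^-
target_net.requires_grad_(False)

optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99


def dqn_update(s, a, r, s_next, done):
    """One MSE step toward targets computed with the frozen network.

    s, s_next: (B, 4) float tensors; a: (B,) long; r, done: (B,) float.
    """
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def sync_target():
    """Periodically copy the online weights into the fixed target network."""
    target_net.load_state_dict(q_net.state_dict())
```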

(18)

Stable Solution 3: Reward / Value Range

To avoid oscillations, control the reward / value range

◦ DQN clips the rewards to [−1, +1]

Prevents too large Q-values

Ensures gradients are well-conditioned
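The clipping step itself is a one-line transformation applied to each environment reward before it enters the Q-learning target (a sketch; the adaptive-normalization alternative mentioned above is not shown):

```python
import numpy as np

def clip_reward(r: float) -> float:
    """DQN-style reward clipping to [-1, +1], keeping target magnitudes bounded."""
    return float(np.clip(r, -1.0, 1.0))
```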

(19)

Deep RL in Atari Games

(20)

DQN in Atari

Goal: end-to-end learning of values Q(s, a) from pixels

Input: state is stack of raw pixels from last 4 frames

Output: Q(s, a) for all joystick/button positions a

Reward is the score change for that step

(21)

DQN in Atari

(22)

Other Improvements: Double DQN

Nature DQN computes its learning target by maximizing over the target network's own Q-value estimates

Double DQN: remove the upward bias caused by that maximization

Current Q-network w is used to select actions

Older Q-network w⁻ is used to evaluate actions
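A sketch of the Double DQN target in PyTorch, reflecting the split above: the current network selects the action, the older target network evaluates it (tensor shapes and names are illustrative):

```python
import torch

def double_dqn_target(r, done, s_next, q_net, target_net, gamma=0.99):
    """r, done: (B,) float tensors; s_next: (B, state_dim) batch of next states."""
    with torch.no_grad():
        # Current Q-network selects the greedy action ...
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)
        # ... older (target) Q-network evaluates that action.
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)
    return r + gamma * (1 - done) * q_eval
```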

(23)

Other Improvements: Prioritized Replay

Prioritized Replay: weight experience based on surprise

Store experience in priority queue according to DQN error
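A sketch of proportional prioritization, where the sampling probability of a transition grows with the magnitude of its DQN (TD) error; the exponent alpha and the small constant are conventional choices, not from the slides:

```python
import numpy as np

def sample_indices(td_errors: np.ndarray, batch_size: int,
                   alpha: float = 0.6, eps: float = 1e-6) -> np.ndarray:
    """Sample transition indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```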

(24)

Other Improvements: Dueling Network

Dueling Network: split Q-network into two channels

Action-independent value function

Value function estimates how good the state is

Action-dependent advantage function

Advantage function estimates the additional benefit of taking a particular action in that state
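A minimal dueling head in PyTorch. The two streams are recombined as Q = V + (A − mean A); the mean subtraction, which makes the split identifiable, comes from the dueling-network paper rather than this slide:

```python
import torch
import torch.nn as nn


class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                # action-independent V(s)
        self.advantage = nn.Linear(hidden, num_actions)  # action-dependent A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.features(state)
        v = self.value(h)                       # (B, 1)
        a = self.advantage(h)                   # (B, num_actions)
        return v + a - a.mean(dim=1, keepdim=True)
```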

(25)

Policy-Based Deep RL

Estimate How Good an Agent's Behavior is

(26)

Deep Policy Networks

Represent policy by a deep network with weights u

Objective is to maximize total discounted reward by SGD

The policy can be stochastic, a ∼ 𝜋(a|s; u), or deterministic, a = 𝜋(s; u)
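The objective referred to above, written out: the expected total discounted reward obtained by following the policy with weights u:

```latex
% Objective: expected total discounted reward under the policy pi(.; u)
L(u) = \mathbb{E}\!\left[\, r_1 + \gamma r_2 + \gamma^{2} r_3 + \cdots \;\middle|\; \pi(\cdot\,; u) \right]
```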

(27)

Policy Gradient

The gradient of a stochastic policy is given by

The gradient of a deterministic policy is given by

How to deal with continuous actions? The deterministic form handles them (both gradients are written out below)
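The two gradients in their standard forms: the likelihood-ratio (score-function) gradient for a stochastic policy, and the deterministic policy gradient, which handles continuous actions by differentiating Q with respect to the action:

```latex
% Stochastic policy gradient (likelihood-ratio / score-function form)
\frac{\partial L(u)}{\partial u}
  = \mathbb{E}\!\left[\, \frac{\partial \log \pi(a \mid s; u)}{\partial u}\; Q^{\pi}(s, a) \right]

% Deterministic policy gradient (chain rule through the action)
\frac{\partial L(u)}{\partial u}
  = \mathbb{E}\!\left[\, \frac{\partial Q^{\pi}(s, a)}{\partial a}\;
      \frac{\partial \pi(s; u)}{\partial u} \right]_{a = \pi(s; u)}
```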

(28)

Actor-Critic (Value-Based + Policy-Based)

Estimate the value function Q(s, a; w) (the critic)

Update the policy parameters u by SGD (the actor)

Stochastic policy: use the score-function gradient with Q(s, a; w) as the signal

Deterministic policy: differentiate Q(s, a; w) through the action (as in the policy-gradient forms above)

(29)

Deterministic Deep Actor-Critic

Deep deterministic policy gradient (DDPG) is the continuous analogue of DQN

Experience replay: build dataset from agent’s experience

Critic estimates value of current policy by DQN

Actor updates policy in direction that improves Q

Critic provides loss function for actor
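A condensed DDPG update in PyTorch reflecting the bullets above: the critic is regressed toward a Q-learning target computed with target networks (replay and target copies as in DQN), and the actor is updated in the direction that improves Q, with the critic acting as its loss. Network sizes, learning rates, and the omission of soft target updates and exploration noise are simplifications:

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, gamma = 8, 2, 0.99

# Actor pi(s; u) outputs a continuous action; critic Q(s, a; w) scores it.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)


def ddpg_update(s, a, r, s_next, done):
    """One update from a replay mini-batch (r, done: (B,) float tensors)."""
    # Critic: regress Q(s, a) toward the Q-learning target from the target nets.
    with torch.no_grad():
        a_next = actor_target(s_next)
        q_next = critic_target(torch.cat([s_next, a_next], dim=1)).squeeze(1)
        target = r + gamma * (1 - done) * q_next
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = F.mse_loss(q, target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: move the policy in the direction that improves Q
    # (the critic provides the loss function for the actor).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```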

(30)

DDPG in Simulated Physics

Goal: end-to-end learning of control policy from pixels

Input: state is stack of raw pixels from last 4 frames

Two separate CNNs are used for Q and 𝜋

(31)

Model-Based Deep RL

Agent's Representation of the Environment

(32)

Model-Based Deep RL

Goal: learn a transition model of the environment and plan based on the transition model

Objective is to maximize the measured goodness of the model
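A minimal sketch of the idea, assuming a deterministic neural transition model fitted by regression on observed transitions and used for short imagined rollouts; the slides do not prescribe a particular model class or loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 8, 2

# Transition model f(s, a) -> predicted next state.
model = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                      nn.Linear(128, state_dim))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)


def model_update(s, a, s_next):
    """Fit the model so its predictions match observed next states."""
    pred = model(torch.cat([s, a], dim=1))
    loss = F.mse_loss(pred, s_next)   # 'goodness of model' = prediction accuracy
    opt.zero_grad()
    loss.backward()
    opt.step()


def rollout(s, policy, horizon=5):
    """Plan by imagining a short trajectory with the learned model."""
    states = [s]
    for _ in range(horizon):
        a = policy(states[-1])
        states.append(model(torch.cat([states[-1], a], dim=1)))
    return states
```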

(33)

Issues for Model-Based Deep RL

Compounding errors

Errors in the transition model compound over the trajectory

A long trajectory may result in totally wrong rewards

Deep networks of value/policy can “plan” implicitly

Each layer of network performs arbitrary computational step

n-layer network can “lookahead” n steps

(34)

Model-Based Deep RL in Go

Monte-Carlo tree search (MCTS)

MCTS simulates future trajectories

Builds large lookahead search tree with millions of positions

State-of-the-art Go programs use MCTS

Convolutional Networks

12-layer CNN trained to predict expert moves

Raw CNN (looking at 1 position, with no search at all) equals the performance of MoGo, the first strong Go program, which searched a 10⁵-position tree

(35)

OpenAI Universe

Software platform for measuring and training an AI's general intelligence via the OpenAI gym environment

(36)

Concluding Remarks

RL is a general purpose framework for decision making under interactions between agent and environment

An RL agent may include one or more of these components

Policy: agent’s behavior function

Value function: how good is each state and/or action

Model: agent’s representation of the environment

RL problems can be solved by end-to-end deep learning

Reinforcement Learning + Deep Learning = AI

(37)

References

Course materials by David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

ICLR 2015 Tutorial: http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf

ICML 2016 Tutorial: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
