Slides credited from Dr. David Silver & Hung-Yi Lee

(2)

Reinforcement Learning Approach

Value-based RL

Estimate the optimal value function 𝑄∗(𝑠, 𝑎)

𝑄∗(𝑠, 𝑎) is the maximum value achievable under any policy

Policy-based RL

Search directly for the optimal policy 𝜋∗

𝜋∗ is the policy achieving maximum future reward

Model-based RL

Build a model of the environment

Plan (e.g. by lookahead) using the model

(3)

RL Agent Taxonomy

[Taxonomy diagram: Value-based methods learn a critic, Policy-based methods learn an actor, and Actor-Critic methods combine both; each family can be Model-Free or build a Model of the environment]

(4)

Value-Based Approach

LEARNING A CRITIC

(5)

Value Function

A value function is a prediction of future reward (with action a in state s)

Q-value function gives expected total reward

from state 𝑠 and action 𝑎

under policy 𝜋

with discount factor 𝛾

Value functions decompose into a Bellman equation
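For concreteness, the Q-value function and its Bellman decomposition can be written as follows (a reconstruction in standard notation; the slide's own equations did not survive extraction):

```latex
Q^{\pi}(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \mid s, a \right]
             = \mathbb{E}\left[ r_{t+1} + \gamma\, Q^{\pi}(s_{t+1}, a_{t+1}) \mid s, a \right]
```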

(6)

Optimal Value Function

An optimal value function is the maximum achievable value

The optimal value function allows us to act optimally

Informally, the optimal value maximizes over all future decisions

Optimal values decompose into a Bellman equation
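In standard notation (reconstructed, since the slide's formulas were lost in extraction), the optimal value function, the optimal policy it induces, and the Bellman optimality equation are:

```latex
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \qquad
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a), \qquad
Q^{*}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right]
```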

(7)

Value Function Approximation

If value functions are represented by a lookup table:

too many states and/or actions to store

too slow to learn the value of each entry individually

Values can be estimated with function approximation

(8)

Q-Networks

Q-networks represent value functions with weights 𝐰: 𝑄(𝑠, 𝑎; 𝐰) ≈ 𝑄𝜋(𝑠, 𝑎)

generalize from seen states to unseen states

update the parameters 𝐰 of the function approximator

(9)

Q-Learning

Goal: estimate optimal Q-values

Optimal Q-values obey a Bellman equation

Value iteration algorithms solve the Bellman equation

learning target: 𝑟 + 𝛾 max_{𝑎′} 𝑄(𝑠′, 𝑎′)
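A minimal tabular sketch of this update in Python (NumPy; the table size and hyperparameters are illustrative assumptions, not from the slides):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q[s, a] toward the learning target
    r + gamma * max_a' Q[s_next, a']."""
    target = r + gamma * np.max(Q[s_next])   # the learning target
    Q[s, a] += alpha * (target - Q[s, a])    # move the estimate toward it
    return Q

# Usage sketch with an illustrative table of 10 states and 4 actions.
Q = np.zeros((10, 4))
Q = q_learning_update(Q, s=0, a=2, r=1.0, s_next=3)
```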

(10)

Critic = Value Function

Idea: a critic evaluates how good the actor 𝜋 is

State value function 𝑉𝜋(𝑠): when using actor 𝜋, the expected total reward obtained after seeing observation (state) 𝑠

A critic does not determine the action. An actor can be found from a critic.

[Figure: the critic takes a state 𝑠 and outputs a scalar 𝑉𝜋(𝑠); the value is larger for states with more reward still to come and smaller otherwise]

(11)

Monte-Carlo for Estimating 𝑉𝜋(𝑠)

Monte-Carlo (MC)

The critic watches 𝜋 playing the game

MC learns directly from complete episodes: no bootstrapping

After seeing 𝑠𝑎, the cumulated reward until the end of the episode is 𝐺𝑎, so 𝑉𝜋(𝑠𝑎) should be close to 𝐺𝑎

After seeing 𝑠𝑏, the cumulated reward until the end of the episode is 𝐺𝑏, so 𝑉𝜋(𝑠𝑏) should be close to 𝐺𝑏

Idea: value = empirical mean return

Issue: long episodes delay learning
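A minimal sketch of every-visit Monte-Carlo value estimation in Python (the episode format here is an assumption for illustration):

```python
import numpy as np

def mc_value_estimate(episodes, gamma=0.99):
    """Estimate V^pi(s) as the empirical mean of the discounted return G
    observed after each visit to s, using complete episodes only.
    `episodes` is a list of trajectories, each a list of (state, reward) pairs."""
    returns = {}                        # state -> list of observed returns
    for episode in episodes:
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G      # cumulated reward until episode end
            returns.setdefault(state, []).append(G)
    return {s: float(np.mean(g)) for s, g in returns.items()}
```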

(12)

Temporal-Difference for Estimating 𝑉𝜋(𝑠)

Temporal-difference (TD)

The critic watches 𝜋 playing the game

TD learns directly from incomplete episodes by bootstrapping

TD updates a guess towards a guess

𝑉𝜋(𝑠𝑡) ← 𝑉𝜋(𝑠𝑡) + 𝛼(𝑟𝑡 + 𝛾𝑉𝜋(𝑠𝑡+1) − 𝑉𝜋(𝑠𝑡))

Idea: update the value 𝑉𝜋(𝑠𝑡) toward the estimated return 𝑟𝑡 + 𝛾𝑉𝜋(𝑠𝑡+1)
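The same update as a short Python sketch (the value table `V` and step size are illustrative assumptions):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """TD(0): move V[s] toward the bootstrapped target r + gamma * V[s_next],
    without waiting for the episode to finish."""
    td_target = r + gamma * V[s_next]    # a guess built from another guess
    V[s] += alpha * (td_target - V[s])
    return V

# Usage sketch with a dict-based value table.
V = {0: 0.0, 1: 0.0}
V = td0_update(V, s=0, r=1.0, s_next=1)
```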

(13)

MC vs. TD

Monte-Carlo (MC)

Large variance

Unbiased

Does not exploit the Markov property

Temporal-Difference (TD)

Smaller variance

May be biased

Exploits the Markov property

(14)

MC vs. TD

(15)

Critic = Value Function

State-action value function 𝑄𝜋(𝑠, 𝑎): when using actor 𝜋, the expected total reward obtained after seeing observation (state) 𝑠 and taking action 𝑎

[Figure: two network designs — input (𝑠, 𝑎) and output the scalar 𝑄𝜋(𝑠, 𝑎); or input 𝑠 and output one 𝑄𝜋(𝑠, 𝑎) value per action, which works for discrete actions only]

(16)

Q-Learning

Given 𝑄𝜋(𝑠, 𝑎), find a new actor 𝜋′ that is “better” than 𝜋

The loop: 𝜋 interacts with the environment → learn 𝑄𝜋(𝑠, 𝑎) by TD or MC → find a new actor 𝜋′ “better” than 𝜋 → set 𝜋 = 𝜋′ and repeat

“Better” means 𝑉𝜋′(𝑠) ≥ 𝑉𝜋(𝑠) for every state 𝑠; such a 𝜋′ is given by 𝜋′(𝑠) = arg max𝑎 𝑄𝜋(𝑠, 𝑎)

𝜋′ has no extra parameters of its own (it depends only on the value function), but taking the arg max is not suitable for continuous actions (see the sketch below)
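A minimal Python sketch of deriving the new actor from the critic (the Q-table format is an illustrative assumption):

```python
import numpy as np

def actor_from_critic(Q, s):
    """The improved actor pi' just picks arg max_a Q^pi(s, a).
    It has no parameters of its own, and enumerating the arg max is
    only feasible for a finite (discrete) set of actions."""
    return int(np.argmax(Q[s]))   # Q is a |S| x |A| array of Q^pi values

# Usage sketch: in state 2, act greedily w.r.t. the current critic.
Q = np.zeros((5, 3))
a = actor_from_critic(Q, s=2)
```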

(17)

Q-Learning

Goal: estimate optimal Q-values

Optimal Q-values obey a Bellman equation

Value iteration algorithms solve the Bellman equation

learning target: 𝑟 + 𝛾 max_{𝑎′} 𝑄(𝑠′, 𝑎′)

(18)

Deep Q-Networks (DQN)

Estimate value function by TD

Represent the value function by a deep Q-network 𝑄(𝑠, 𝑎; 𝐰) with weights 𝐰

Objective is to minimize the MSE loss by SGD

(19)

Deep Q-Networks (DQN)

Objective is to minimize MSE loss by SGD

Leading to the following Q-learning gradient
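Written out (a reconstruction in standard DQN notation; the slide's equations did not survive extraction), the MSE objective and the resulting SGD update are:

```latex
L(\mathbf{w}) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \mathbf{w}) - Q(s, a; \mathbf{w}) \right)^{2} \right]
```

```latex
\mathbf{w} \leftarrow \mathbf{w} + \alpha \left( r + \gamma \max_{a'} Q(s', a'; \mathbf{w}) - Q(s, a; \mathbf{w}) \right) \nabla_{\mathbf{w}} Q(s, a; \mathbf{w})
```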

Issue: naïve Q-learning oscillates or diverges when using a neural network, due to:

1) correlations between samples, 2) non-stationary targets

(20)

Stability Issues with Deep RL

Naive Q-learning oscillates or diverges with neural nets

1. Data is sequential

Successive samples are correlated, non-iid (independent and identically distributed)

2. Policy changes rapidly with slight changes to Q-values

Policy may oscillate

Distribution of data can swing from one extreme to another

3. Scale of rewards and Q-values is unknown

Naive Q-learning gradients can be unstable when backpropagated

(21)

Stable Solutions for DQN

DQN provides a stable solution to deep value-based RL

1. Use experience replay

Break correlations in data, bring us back to iid setting

Learn from all past policies

2. Freeze target Q-network

Avoid oscillation

Break correlations between Q-network and target

3. Clip rewards or normalize network adaptively to sensible range

Robust gradients

(22)

Stable Solution 1: Experience Replay

To remove correlations, build a dataset from agent’s experience

Take action 𝑎𝑡 according to an 𝜖-greedy policy (small probability of random exploration)

Store transition (𝑠𝑡, 𝑎𝑡, 𝑟𝑡, 𝑠𝑡+1) in replay memory D

Sample a random mini-batch of transitions from D

Optimize MSE between Q-network and Q-learning targets
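A minimal replay-memory sketch in Python (the capacity and transition format are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions.
    Sampling random mini-batches breaks the correlation between
    successive samples, moving training back toward the i.i.d. setting."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)   # oldest entries dropped when full

    def push(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```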

(23)

Exploration

The policy is based on the Q-function: always taking arg max𝑎 𝑄(𝑠, 𝑎) is not good for data collection → inefficient learning

[Example: in state 𝑠 with actions 𝑎1, 𝑎2, 𝑎3, once 𝑎1 happens to look best it is always sampled, while 𝑎2 and 𝑎3 are never explored]

Exploration algorithms (see the sketch below)

Epsilon greedy (𝜀 would decay during learning)

Boltzmann sampling
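Sketches of both exploration strategies in Python (the temperature parameter for Boltzmann sampling is an assumption for illustration):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With small probability epsilon pick a random action (exploration),
    otherwise pick the greedy action (exploitation)."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample actions with probability proportional to exp(Q / temperature),
    so actions with lower Q-values are still occasionally explored."""
    logits = np.asarray(q_values, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))
```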

(24)

[Diagram: the Q-learning loop again — 𝜋 interacts with the environment → learn 𝑄𝜋(𝑠, 𝑎) → find a new actor 𝜋′ “better” than 𝜋 → 𝜋 = 𝜋′ — now with a Replay Buffer between interaction and learning]

Replay Buffer

Put each experience (transition) into the buffer

The experiences in the buffer come from different (older) policies 𝜋

Drop the oldest experience when the buffer is full

(25)

[Diagram: the same loop, with 𝑄𝜋(𝑠, 𝑎) now learned from the Replay Buffer rather than from the latest interaction only]

In each iteration:

1. Sample a batch of experiences from the buffer

2. Update the Q-function on that batch

Because the sampled experiences were generated by older policies, this is off-policy learning

(26)

Stable Solution 2: Fixed Target Q-Network

To avoid oscillations, fix the parameters used in the Q-learning target

Compute Q-learning targets w.r.t. old, fixed (frozen) parameters 𝐰⁻

Optimize MSE between Q-network and Q-learning targets

Periodically update fixed parameters

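A short PyTorch-style sketch of using a frozen target network (PyTorch is an assumption; the slides do not name a framework):

```python
import torch

def q_learning_targets(rewards, next_states, dones, target_net, gamma=0.99):
    """Targets come from the frozen target network, so the regression
    target does not move while the online Q-network is being optimized."""
    with torch.no_grad():                              # no gradients through the target
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def sync_target(q_net, target_net):
    """Every C gradient steps, copy the online weights into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```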

(27)

Stable Solution 3: Reward / Value Range

To avoid oscillations, control the reward / value range

DQN clips the rewards to [−1, +1]

Prevents too large Q-values

Ensures gradients are well-conditioned

(28)

Typical Q-Learning Algorithm

Initialize Q-function 𝑄 and target Q-function 𝑄̂ = 𝑄

In each episode

For each time step 𝑡

Given state 𝑠𝑡, take action 𝑎𝑡 based on 𝑄 (epsilon greedy)

Obtain reward 𝑟𝑡, and reach new state 𝑠𝑡+1

Store the transition (𝑠𝑡, 𝑎𝑡, 𝑟𝑡, 𝑠𝑡+1) into the buffer

Sample transitions (𝑠𝑖, 𝑎𝑖, 𝑟𝑖, 𝑠𝑖+1) from the buffer (usually a batch)

Update the parameters of 𝑄 to make 𝑄(𝑠𝑖, 𝑎𝑖) close to the target 𝑟𝑖 + max𝑎 𝑄̂(𝑠𝑖+1, 𝑎) (regression)

Every 𝐶 steps reset 𝑄̂ = 𝑄 (the full loop is sketched below)
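A compact Python sketch of this loop (PyTorch and the classic Gym step API are assumptions; `q_net`, `env`, and all hyperparameters are illustrative):

```python
import copy, random
from collections import deque
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, episodes=500, batch_size=32, gamma=0.99,
              epsilon=0.1, lr=1e-4, sync_every=1000, capacity=10_000):
    """Epsilon-greedy acting, replay buffer, MSE regression toward the
    frozen-target Q-learning target, and a periodic target-network reset.
    `q_net` is assumed to map a state tensor to one Q-value per action."""
    target_net = copy.deepcopy(q_net)                  # Q_hat = Q at the start
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    buffer, step = deque(maxlen=capacity), 0           # old experiences dropped when full
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            q_values = q_net(torch.as_tensor(s, dtype=torch.float32))
            a = (random.randrange(q_values.shape[-1])  # explore with prob. epsilon
                 if random.random() < epsilon
                 else int(q_values.argmax()))          # otherwise exploit
            s_next, r, done, _ = env.step(a)
            buffer.append((s, a, max(-1.0, min(1.0, r)), s_next, float(done)))  # clipped reward
            s, step = s_next, step + 1
            if len(buffer) >= batch_size:
                batch = random.sample(buffer, batch_size)
                ss, aa, rr, ss2, dd = (torch.as_tensor(x, dtype=torch.float32)
                                       for x in zip(*batch))
                q_sa = q_net(ss).gather(1, aa.long().unsqueeze(1)).squeeze(1)
                with torch.no_grad():                  # targets from the frozen network
                    target = rr + gamma * (1.0 - dd) * target_net(ss2).max(1).values
                loss = F.mse_loss(q_sa, target)        # make Q(s, a) close to the target
                optimizer.zero_grad(); loss.backward(); optimizer.step()
            if step % sync_every == 0:                 # every C steps reset Q_hat = Q
                target_net.load_state_dict(q_net.state_dict())
```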

(29)

Deep RL in Atari Games

(30)

DQN in Atari

Goal: end-to-end learning of values Q(s, a) from pixels

Input: state is stack of raw pixels from last 4 frames

Output: Q(s, a) for all joystick/button positions a

Reward is the score change for that step

DQN Nature Paper [link] [code]
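The network described here can be sketched as follows (a PyTorch rendition of the architecture from the DQN Nature paper; the framework choice and 84×84 preprocessing are assumptions, not stated on the slide):

```python
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Input: a stack of the last 4 preprocessed 84x84 frames.
    Output: one Q(s, a) value for every joystick/button action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),       # Q-values for all actions
        )

    def forward(self, x):                    # x: (batch, 4, 84, 84), pixels scaled to [0, 1]
        return self.net(x)
```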

(31)

DQN in Atari

DQN Nature Paper [link] [code]

(32)

Concluding Remarks

RL is a general-purpose framework for decision making through interactions between an agent and an environment

Value-based RL measures how good each state and/or action is via a value function

Monte-Carlo (MC) vs. Temporal-Difference (TD) estimation
