
Slides credited from Dr. David Silver & Hung-Yi Lee


(1)
(2)

Review

Reinforcement Learning

(3)

Reinforcement Learning

RL is a general purpose framework for decision making

RL is for an agent with the capacity to act

Each action influences the agent’s future state

Success is measured by a scalar reward signal

Big three: action, state, reward

(4)

Agent and Environment

(Diagram) At each step the agent receives observation $o_t$ and reward $r_t$ from the environment and takes an action $a_t$ (e.g. MoveRight →, MoveLeft ←), which in turn changes the environment.

(5)

Major Components in an RL Agent

An RL agent may include one or more of these components

Value function: how good is each state and/or action

Policy: agent’s behavior function

Model: agent’s representation of the environment

(6)

Reinforcement Learning Approach

Value-based RL

Estimate the optimal value function

Policy-based RL

Search directly for optimal policy

Model-based RL

Build a model of the environment

Plan (e.g. by lookahead) using model

$\pi^*(s)$ is the policy achieving maximum future reward

$Q^*(s,a)$ is the maximum value achievable under any policy

(7)

RL Agent Taxonomy

(Taxonomy diagram) Value-based methods learn a critic, policy-based methods learn an actor, and actor-critic methods learn both; each family can be model-free or use a model of the environment.

(8)

Deep Reinforcement Learning

Idea: deep learning for reinforcement learning

Use deep neural networks to represent

Value function

Policy

Model

Optimize loss function by SGD

(9)

Value-Based Approach

LEARNING A CRITIC

(10)

Critic = Value Function

Idea: evaluate how good the actor is

State value function $V^\pi(s)$: when using actor $\pi$, the expected total reward obtained after seeing observation (state) $s$; the output is a scalar, larger for good states and smaller for bad ones.

(11)

Monte-Carlo for Estimating $V^\pi(s)$

Monte-Carlo (MC)

The critic watches 𝜋 playing the game

MC learns directly from complete episodes: no bootstrapping

After seeing $s_a$, the cumulative reward until the end of the episode is $G_a$, so the critic learns $V^\pi(s_a) \approx G_a$.

After seeing $s_b$, the cumulative reward until the end of the episode is $G_b$, so the critic learns $V^\pi(s_b) \approx G_b$.

Idea: value = empirical mean return
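To make the idea concrete, here is a minimal tabular sketch of Monte-Carlo value estimation, assuming episodes are recorded as lists of (state, reward) pairs while watching $\pi$ play; the discount factor, the every-visit variant, and the toy episodes are illustrative assumptions, not from the slides.

```python
from collections import defaultdict

def mc_estimate(episodes, gamma=0.99):
    """Estimate V_pi(s) as the empirical mean of the returns G observed after visiting s."""
    returns = defaultdict(list)
    for episode in episodes:                      # each episode: [(s_1, r_1), ..., (s_T, r_T)]
        G = 0.0
        for state, reward in reversed(episode):   # accumulate the return backwards from the end
            G = reward + gamma * G
            returns[state].append(G)              # every-visit MC: record G for each visit of s
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Toy usage with two hand-made episodes
episodes = [[("s_a", 0.0), ("s_b", 1.0), ("s_c", 0.0)],
            [("s_b", 0.0), ("s_c", 1.0)]]
print(mc_estimate(episodes))
```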

(12)

Temporal-Difference for Estimating $V^\pi(s)$

Temporal-difference (TD)

The critic watches 𝜋 playing the game

TD learns directly from incomplete episodes by bootstrapping

TD updates a guess towards a guess: $V^\pi(s_t)$ is moved toward the bootstrapped target $r_t + V^\pi(s_{t+1})$

Idea: update value toward estimated return
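A matching TD(0) sketch under the same tabular assumptions; the learning rate, discount factor, sweep count, and transition log are illustrative, and this is the standard bootstrapped update rather than the slides' exact formulation.

```python
from collections import defaultdict

def td0_estimate(transitions, alpha=0.1, gamma=0.99, n_sweeps=100):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    V = defaultdict(float)
    for _ in range(n_sweeps):                      # repeatedly sweep the logged transitions
        for s, r, s_next in transitions:           # s_next is None at the end of an episode
            target = r + (gamma * V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])        # update a guess towards a guess
    return dict(V)

transitions = [("s_a", 0.0, "s_b"), ("s_b", 1.0, None)]
print(td0_estimate(transitions))
```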

(13)

MC vs. TD

Monte-Carlo (MC): large variance, unbiased, does not exploit the Markov property

Temporal-Difference (TD): smaller variance, may be biased, exploits the Markov property

(14)

MC vs. TD

(15)

Critic = Value Function

State-action value function $Q^\pi(s,a)$: when using actor $\pi$, the expected total reward after seeing observation (state) $s$ and taking action $a$; the output is a scalar

(A variant network outputs $Q^\pi(s,a)$ for every action at once, which works for discrete actions only)

(16)

Q-Learning

Given $Q^\pi(s,a)$, find a new actor $\pi'$ that is "better" than $\pi$

The loop: $\pi$ interacts with the environment → learn $Q^\pi(s,a)$ by TD or MC → find a new actor $\pi'$ "better" than $\pi$ → set $\pi = \pi'$ and repeat

"Better" means $V^{\pi'}(s) \geq V^{\pi}(s)$ for all states $s$, obtained by $\pi'(s) = \arg\max_a Q^\pi(s,a)$

$\pi'$ has no extra parameters (it depends only on the value function), but taking the argmax is not suitable for continuous actions

(17)

Q-Learning

Goal: estimate the optimal Q-values $Q^*(s,a)$

Optimal Q-values obey a Bellman equation, $Q^*(s,a) = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q^*(s',a') \mid s, a \right]$, where $r + \gamma \max_{a'} Q^*(s',a')$ is the learning target

Value iteration algorithms solve the Bellman equation
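A hedged tabular sketch of a Q-learning update toward the Bellman learning target; the action set, learning rate, and discount are hypothetical choices, not values from the slides.

```python
from collections import defaultdict

Q = defaultdict(float)                  # tabular Q[(state, action)]
ACTIONS = ["left", "right", "fire"]     # hypothetical action set

def q_update(s, a, r, s_next, done=False, alpha=0.1, gamma=0.99):
    """Move Q(s,a) toward the Bellman learning target r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

q_update("s_a", "right", 1.0, "s_b")
print(Q[("s_a", "right")])
```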

(18)

Deep Q-Networks (DQN)

Estimate the value function by TD

Represent the value function by a deep Q-network with weights $w$: $Q(s, a, w) \approx Q^\pi(s, a)$

Objective is to minimize the MSE loss by SGD: $L(w) = \left( r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w) \right)^2$

(19)

Deep Q-Networks (DQN)

Objective is to minimize the MSE loss by SGD: $L(w) = \left( r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w) \right)^2$

Leading to the following Q-learning gradient: $\frac{\partial L(w)}{\partial w} = \mathbb{E}\!\left[\left( r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w) \right) \frac{\partial Q(s,a,w)}{\partial w}\right]$

Issue: naïve Q-learning oscillates or diverges when using neural networks, due to 1) correlations between samples and 2) non-stationary targets

(20)

Stability Issues with Deep RL

Naive Q-learning oscillates or diverges with neural nets

1. Data is sequential

Successive samples are correlated, non-iid (independent and identically distributed)

2. Policy changes rapidly with slight changes to Q-values

Policy may oscillate

Distribution of data can swing from one extreme to another

3. Scale of rewards and Q-values is unknown

Naive Q-learning gradients can be unstable when backpropagated

(21)

Stable Solutions for DQN

DQN provides stable solutions to deep value-based RL

1. Use experience replay

Break correlations in data, bring us back to iid setting

Learn from all past policies

2. Freeze target Q-network

Avoid oscillation

Break correlations between Q-network and target

3. Clip rewards or normalize network adaptively to sensible range

Robust gradients

(22)

Stable Solution 1: Experience Replay

To remove correlations, build a dataset from agent’s experience

Take action $a_t$ according to an $\epsilon$-greedy policy (a small probability of a random action, for exploration)

Store transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $D$

Sample a random mini-batch of transitions $(s, a, r, s')$ from $D$

Optimize MSE between the Q-network and the Q-learning targets: $L(w) = \mathbb{E}_{s,a,r,s' \sim D}\!\left[\left(r + \gamma \max_{a'} Q(s',a',w) - Q(s,a,w)\right)^2\right]$
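A minimal replay-memory sketch illustrating the store/sample mechanics above; the capacity and batch size are arbitrary choices, and this is not a full DQN.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of transitions; sampling random mini-batches breaks correlations."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)        # oldest transitions are dropped automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

# usage sketch: D = ReplayMemory(); D.store(s_t, a_t, r_t1, s_t1, done); batch = D.sample(32)
```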

(23)

Stable Solution 2: Fixed Target Q-Network

To avoid oscillations, fix the parameters used in the Q-learning target

Compute Q-learning targets w.r.t. old, fixed (frozen) parameters $w^-$: $r + \gamma \max_{a'} Q(s', a', w^-)$

Optimize MSE between the Q-network and the Q-learning targets: $L(w) = \mathbb{E}_{s,a,r,s' \sim D}\!\left[\left(r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w)\right)^2\right]$

Periodically update the frozen parameters: $w^- \leftarrow w$
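A small PyTorch sketch of the frozen-target trick, assuming a DQN-style setup; the network sizes are placeholders, and sync_target would be called every fixed number of updates.

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # placeholder sizes
target_net = copy.deepcopy(q_net)                                      # old, fixed parameters w^-
target_net.requires_grad_(False)

def q_learning_target(r, s_next, done, gamma=0.99):
    """Compute r + gamma * max_a' Q(s', a', w^-) with the frozen network (done is 0/1)."""
    with torch.no_grad():
        return r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

def sync_target():
    """Periodically refresh the frozen copy: w^- <- w."""
    target_net.load_state_dict(q_net.state_dict())
```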

(24)

Stable Solution 3: Reward / Value Range

To avoid oscillations, control the reward / value range

DQN clips the rewards to [−1, +1]

Prevents too large Q-values

Ensures gradients are well-conditioned

(25)

Deep RL in Atari Games

(26)

DQN in Atari

Goal: end-to-end learning of values Q(s, a) from pixels

Input: state is stack of raw pixels from last 4 frames

Output: Q(s, a) for all joystick/button positions a

Reward is the score change for that step

(27)

DQN in Atari

(28)

Other Improvements: Double DQN

Nature DQN target: $r + \gamma \max_{a'} Q(s', a', w^-)$

Issue: Q-learning tends to select actions whose values are over-estimated

(29)

Other Improvements: Double DQN

Nature DQN target: $r + \gamma \max_{a'} Q(s', a', w^-)$

Double DQN: remove the upward bias caused by $\max_{a'} Q(s', a', w^-)$

Current Q-network $w$ is used to select actions; older Q-network $w^-$ is used to evaluate actions: $r + \gamma\, Q\!\left(s', \arg\max_{a'} Q(s', a', w), w^-\right)$
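A hedged PyTorch sketch of the Double DQN target described above; q_net and target_net are assumed to map a batch of states to per-action values, and done is a 0/1 tensor.

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Current network selects a*, older (target) network evaluates it."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)          # action selection: w
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)    # action evaluation: w^-
        return r + gamma * (1.0 - done) * q_eval
```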

(30)

Other Improvements: Prioritized Replay

Prioritized Replay: weight experience based on surprise

Store experience in priority queue according to DQN error

(31)

Other Improvements: Dueling Network

Dueling Network: split Q-network into two channels

Action-independent value function $V(s, v)$: estimates how good the state is

Action-dependent advantage function $A(s, a, w)$: estimates the additional benefit of taking action $a$ in state $s$

Combined: $Q(s, a) = V(s, v) + A(s, a, w)$
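A sketch of a dueling head in PyTorch; subtracting the mean advantage is a common identifiability choice and an assumption here, since the slide only states that Q is the combination of V and A. Feature size and action count are placeholders.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Split the Q-network into a value stream and an advantage stream."""
    def __init__(self, feat_dim=64, n_actions=3):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)              # action-independent V(s)
        self.advantage = nn.Linear(feat_dim, n_actions)  # action-dependent A(s, a)

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)       # Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)

q_values = DuelingHead()(torch.randn(1, 64))             # toy forward pass
```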

(32)

Other Improvements: Dueling Network

(Diagram) A standard Q-network maps the state directly to action values, while the dueling network splits into a state-value stream and an action-advantage stream that are combined into the action values.

(33)

Policy-Based Approach

LEARNING AN ACTOR

(34)

On-Policy v.s. Off-Policy

On-policy: the agent being learned and the agent interacting with the environment are the same

Off-policy: the agent being learned and the agent interacting with the environment are different

(35)

Goodness of Actor

An episode is considered as a trajectory $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T\}$

Reward: $R(\tau) = \sum_{t=1}^{T} r_t$

The actions are controlled by your actor (e.g. the actor outputs probabilities 0.1 / 0.2 / 0.7 for left / right / fire); the states and rewards returned by the environment are not related to your actor

(36)

Goodness of Actor

An episode is considered as a trajectory $\tau$

Reward: $R(\tau) = \sum_{t=1}^{T} r_t$

We define $\bar{R}_\theta$ as the expected value of the reward: $\bar{R}_\theta = \sum_{\tau} R(\tau)\, P(\tau|\theta)$, a sum over all possible trajectories

If you use an actor to play the game, each $\tau$ has probability $P(\tau|\theta)$ of being sampled

Use $\pi_\theta$ to play the game $N$ times, obtaining $\tau^1, \tau^2, \cdots, \tau^N$; sampling $\tau$ from $P(\tau|\theta)$ $N$ times gives $\bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^n)$

(37)

Deep Policy Networks

Represent the policy by a deep network with weights $\theta$

Objective is to maximize the total discounted reward by SGD

Update the model parameters iteratively: $\theta^{new} \leftarrow \theta^{old} + \eta \nabla \bar{R}_{\theta^{old}}$

(38)

Policy Gradient

Gradient ascent to maximize the expected reward: $\nabla \bar{R}_\theta = \sum_\tau R(\tau)\, \nabla P(\tau|\theta) = \sum_\tau R(\tau)\, P(\tau|\theta)\, \nabla \log P(\tau|\theta) \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^n)\, \nabla \log P(\tau^n|\theta)$

$R(\tau)$ does not have to be differentiable; it can even be a black box

Use $\pi_\theta$ to play the game $N$ times to obtain $\tau^1, \tau^2, \cdots, \tau^N$

(39)

Policy Gradient

For an episode trajectory $\tau = \{s_1, a_1, r_1, \ldots, s_T, a_T, r_T\}$, $P(\tau|\theta) = p(s_1)\prod_{t=1}^{T} p(a_t|s_t,\theta)\, p(r_t, s_{t+1}|s_t, a_t)$, so $\nabla \log P(\tau|\theta) = \sum_{t=1}^{T} \nabla \log p(a_t|s_t, \theta)$ after ignoring the terms not related to $\theta$

(40)

Policy Gradient

Gradient ascent for iteratively updating the parameters: $\theta^{new} \leftarrow \theta^{old} + \eta \nabla \bar{R}_{\theta^{old}}$, with $\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p(a_t^n|s_t^n, \theta)$

If in $\tau^n$ the machine takes $a_t^n$ when seeing $s_t^n$: when $R(\tau^n)$ is positive, tune $\theta$ to increase $p(a_t^n|s_t^n)$; when $R(\tau^n)$ is negative, tune $\theta$ to decrease $p(a_t^n|s_t^n)$
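A minimal REINFORCE-style update in PyTorch for one sampled trajectory, matching the gradient above; the network shape, optimizer, and toy data are assumptions, and the whole-episode return $R(\tau^n)$ weights every step of that episode.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))  # logits over 3 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_step(states, actions, episode_return):
    """states: (T, 4) float tensor, actions: (T,) long tensor, episode_return: scalar R(tau^n)."""
    log_probs = torch.log_softmax(policy(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log p(a_t | s_t, theta)
    loss = -(episode_return * chosen).sum()   # ascent on R(tau^n) * sum_t log p(a_t^n | s_t^n)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# toy usage on random data standing in for one episode
policy_gradient_step(torch.randn(5, 4), torch.randint(0, 3, (5,)), episode_return=1.0)
```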

(41)

Policy Gradient

Given actor parameters $\theta$, training alternates between data collection (use $\pi_\theta$ to play the game and collect trajectories $\tau^1, \ldots, \tau^N$) and model update (apply the policy gradient to obtain new parameters $\theta$), then collect data again with the updated actor

(42)

Improvement: Adding Baseline

Ideally, actions that lead to higher reward should get higher probability

With sampling, however, if all rewards are positive, every sampled action has its probability increased, so an action that is simply not sampled ends up with decreased probability after normalization (it is a probability distribution)

Fix: subtract a baseline $b$, weighting each trajectory by $R(\tau^n) - b$ instead of $R(\tau^n)$: $\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \left(R(\tau^n) - b\right) \nabla \log p(a_t^n|s_t^n, \theta)$
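A tiny sketch of the baseline trick, using the average sampled return as the baseline $b$ (one common choice, assumed here rather than taken from the slide).

```python
import numpy as np

def baseline_weights(episode_returns):
    """Replace R(tau^n) by R(tau^n) - b, with b = average sampled return."""
    R = np.asarray(episode_returns, dtype=float)   # R(tau^1), ..., R(tau^N)
    b = R.mean()                                   # baseline b
    return R - b                                   # below-average episodes now get negative weight

print(baseline_weights([10.0, 12.0, 8.0]))         # -> [ 0.  2. -2.]
```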

(43)

Actor-Critic Approach

LEARNING AN ACTOR & A CRITIC

(44)

Actor-Critic (Value-Based + Policy-Based)

Estimate the value functions $Q^\pi(s,a)$ and $V^\pi(s)$

Update the policy $\pi$ based on the value-function evaluation

The loop: $\pi$ interacts with the environment → learn $Q^\pi(s,a)$, $V^\pi(s)$ by TD or MC → update the actor from $\pi$ to $\pi'$ based on $Q^\pi(s,a)$, $V^\pi(s)$ → set $\pi = \pi'$ and repeat

Here $\pi'$ is an actual function that maximizes the value, so this may work for continuous actions

(45)

Advantage Actor-Critic

Learning the policy (actor) using the value evaluated by the critic

Positive advantage function ↔ increase the probability of action $a_t^n$

Negative advantage function ↔ decrease the probability of action $a_t^n$

The loop: $\pi$ interacts with the environment → learn $V^\pi(s)$ by TD or MC → update the actor based on $V^\pi(s)$ → set $\pi = \pi'$ and repeat

Advantage function (evaluated by the critic): $A(s_t^n, a_t^n) = r_t^n + V^\pi(s_{t+1}^n) - V^\pi(s_t^n)$, the reward $r_t^n$ we truly obtain when taking action $a_t^n$ (plus the value of the next state) minus the expected reward we obtain when using actor $\pi$ from $s_t^n$; this is how the baseline is added
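A one-line sketch of the advantage term defined above; following the slide, no discount factor is applied, and the critic values are assumed to come from TD or MC learning.

```python
def advantage(r_t, v_s_t, v_s_next, done=False):
    """A(s_t, a_t) = r_t + V(s_{t+1}) - V(s_t), as written on the slide."""
    bootstrap = 0.0 if done else v_s_next
    return r_t + bootstrap - v_s_t     # > 0: increase prob of a_t;  < 0: decrease it

print(advantage(1.0, v_s_t=0.5, v_s_next=0.8))     # -> 1.3
```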

(46)

Advantage Actor-Critic

Tips

◦ The parameters of the actor $\pi(s)$ and the critic $V^\pi(s)$ can be shared: a common network reads the state $s$, with one output head for the action distribution (e.g. left / right / fire) and another for $V^\pi(s)$

◦ Use the output entropy as regularization for $\pi(s)$

◦ Larger entropy is preferred → exploration

(47)

Asynchronous Advantage Actor-Critic (A3C)

Asynchronous: each worker

1. Copies the global parameters $\theta^1$
2. Samples some data
3. Computes gradients $\Delta\theta$
4. Updates the global model: $\theta^1 \leftarrow \theta^1 + \eta\,\Delta\theta$

(other workers also update the model at the same time)
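A highly simplified, single-process sketch of one A3C worker step (real A3C runs many workers asynchronously); the network, the stand-in MSE loss, and the toy data are placeholders, not the actual actor-critic objective.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

global_net = nn.Linear(4, 3)                                   # stand-in for the global actor-critic
optimizer = torch.optim.Adam(global_net.parameters(), lr=1e-3)

def worker_step(states, targets):
    local_net = copy.deepcopy(global_net)                      # 1. copy global parameters
    loss = F.mse_loss(local_net(states), targets)              # 2./3. use sampled data, get a loss
    grads = torch.autograd.grad(loss, local_net.parameters())  # 3. compute local gradients
    optimizer.zero_grad()
    for p, g in zip(global_net.parameters(), grads):           # 4. apply them to the global model
        p.grad = g
    optimizer.step()                                           # other workers do the same asynchronously

worker_step(torch.randn(8, 4), torch.randn(8, 3))              # toy update
```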

(48)

Pathwise Derivative Policy Gradient

The original actor-critic tells whether a given action is good or bad; the pathwise derivative policy gradient tells which action is good

(49)

Pathwise Derivative Policy Gradient

(Diagram) The actor's output $a = \pi(s)$ is fed into a fixed Q-network $Q^\pi(s,a)$; actor and critic together form one large network, and gradient ascent is applied only to the actor's parameters so as to maximize the fixed critic's output

(50)

Deep Deterministic Policy Gradient (DDPG)

Idea

Critic estimates the value of the current policy by DQN

Actor $a = \pi(s)$ updates the policy in the direction that improves Q; the critic provides the loss function for the actor

The loop: $\pi$ interacts with the environment (with noise added to its actions for exploration, and transitions stored in a replay buffer) → learn $Q^\pi(s,a)$ → update the actor $\pi \rightarrow \pi'$ based on $Q^\pi(s,a)$

(51)

DDPG Algorithm

Initialize critic network $\theta^Q$ and actor network $\theta^\pi$

Initialize target critic network $\theta^{Q'} = \theta^Q$ and target actor network $\theta^{\pi'} = \theta^\pi$

Initialize replay buffer $R$

In each iteration:

Use $\pi(s)$ + noise to interact with the environment, collect a set of $(s_t, a_t, r_t, s_{t+1})$, and put them in $R$

Sample $N$ examples $(s_n, a_n, r_n, s_{n+1})$ from $R$

Update the critic $Q$ to minimize $\frac{1}{N}\sum_n \left(\hat{y}_n - Q(s_n, a_n)\right)^2$, where $\hat{y}_n = r_n + \gamma\, Q'\!\left(s_{n+1}, \pi'(s_{n+1})\right)$ is computed using the target networks

Update the actor $\pi$ to maximize $\frac{1}{N}\sum_n Q\!\left(s_n, \pi(s_n)\right)$

Update the target networks slowly: $\theta^{Q'} \leftarrow m\,\theta^{Q} + (1-m)\,\theta^{Q'}$, $\theta^{\pi'} \leftarrow m\,\theta^{\pi} + (1-m)\,\theta^{\pi'}$ (the target networks update slower)
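A sketch of the slow target-network update performed at the end of each DDPG iteration; the mixing rate m is an illustrative hyperparameter, and the same function would be applied to both the critic and the actor targets.

```python
import torch

def soft_update(net, target_net, m=0.005):
    """theta' <- m * theta + (1 - m) * theta'  (target networks update slower)."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - m).add_(m * p)
```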

(52)

DDPG in Simulated Physics

Goal: end-to-end learning of control policy from pixels

Input: state is stack of raw pixels from last 4 frames

Output: two separate CNNs for Q and 𝜋

(53)

Model-Based

Agent's Representation of the Environment

(54)

Model-Based Deep RL

Goal: learn a transition model of the environment and plan based on the transition model

Objective is to maximize the measured goodness of model

Model-based deep RL is challenging, and so far has failed in Atari

(55)

Issues for Model-Based Deep RL

Compounding errors

Errors in the transition model compound over the trajectory

A long trajectory may result in totally wrong rewards

Deep networks of value/policy can “plan” implicitly

Each layer of the network performs an arbitrary computational step

An n-layer network can "look ahead" n steps

(56)

Model-Based Deep RL in Go

Monte-Carlo tree search (MCTS)

MCTS simulates future trajectories

Builds large lookahead search tree with millions of positions

State-of-the-art Go programs use MCTS

Convolutional Networks

12-layer CNN trained to predict expert moves

A raw CNN (looking at 1 position, no search at all) equals the performance of MoGo, the first strong Go program, with a $10^5$-position search tree

(57)

OpenAI Universe

Software platform for measuring and training an AI's general intelligence via the OpenAI gym environment

(58)

Concluding Remarks

RL is a general purpose framework for decision making under interactions between agent and environment

An RL agent may include one or more of these components

RL problems can be solved by end-to-end deep learning

Reinforcement Learning + Deep Learning = AI

(Taxonomy recap) Value-based: learning a critic; policy-based: learning an actor; actor-critic: learning both

Value function: how good is each state and/or action

Policy: agent’s behavior function

Model: agent’s representation of the environment

(59)

References

Course materials by David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

ICLR 2015 Tutorial: http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf

ICML 2016 Tutorial: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
