(1)

Q-Learning

Applied Deep Learning

March 28th, 2020 http://adl.miulab.tw

(2)

Reinforcement Learning Approach

◉ Value-based RL

○ Estimate the optimal value function

◉ Policy-based RL

○ Search directly for optimal policy

◉ Model-based RL

○ Build a model of the environment

○ Plan (e.g. by lookahead) using model

π* is the policy achieving maximum future reward

Q*(s, a) is the maximum value achievable under any policy

2

(3)

RL Agent Taxonomy

[Diagram: RL agent taxonomy. Value-based methods learn a critic, policy-based methods learn an actor, and actor-critic methods learn both; each can be model-free or model-based.]

3

(4)

Learning a Critic

Value-Based Approach

4

(5)

Value Function

A value function is a prediction of future reward (with action a in state s)

◉ Q-value function gives the expected total reward from state s and action a, under policy π, with discount factor γ:

Q^π(s, a) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s, a ]

◉ Value functions decompose into a Bellman equation:

Q^π(s, a) = E_{s', a'}[ r + γ Q^π(s', a') | s, a ]

5

(6)

Optimal Value Function

◉ An optimal value function is the maximum achievable value:

Q*(s, a) = max_π Q^π(s, a) = Q^{π*}(s, a)

◉ The optimal value function allows us to act optimally:

π*(s) = arg max_a Q*(s, a)

◉ The optimal value informally maximizes over all future decisions

◉ Optimal values decompose into a Bellman equation:

Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

6

(7)

Value Function Approximation

◉ Naively, value functions are represented by a lookup table, which does not scale:

○ too many states and/or actions to store

○ too slow to learn the value of each entry individually

◉ Instead, values can be estimated with function approximation: Q(s, a; w) ≈ Q^π(s, a)

7

(8)

Q-Networks

◉ Q-networks represent value functions with weights w: Q(s, a; w) ≈ Q^π(s, a)

○ generalize from seen states to unseen states

○ update the parameters w for function approximation

8

(9)

Q-Learning

◉ Goal: estimate optimal Q-values

○ Optimal Q-values obey a Bellman equation:

Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

○ Value iteration algorithms solve the Bellman equation iteratively:

Q_{i+1}(s, a) = E_{s'}[ r + γ max_{a'} Q_i(s', a') | s, a ]

where r + γ max_{a'} Q_i(s', a') is the learning target

9

(10)

Critic = Value Function

◉ Idea: the critic evaluates how good the actor π is

◉ State value function V^π(s): when using actor π, the expected total reward after seeing observation (state) s

◉ A critic does not determine the action. An actor can be found from a critic.

[Figure: V^π(s) outputs a scalar: larger for promising states, smaller for unpromising ones.]

10

(11)

Monte-Carlo for Estimating V^π(s)

◉ Monte-Carlo (MC)

○ The critic watches 𝜋 playing the game

MC learns directly from complete episodes: no bootstrapping

○ After seeing s_a, the cumulated reward until the end of the episode is G_a, so the critic is trained so that V^π(s_a) ≈ G_a

○ After seeing s_b, the cumulated reward until the end of the episode is G_b, so that V^π(s_b) ≈ G_b

◉ Idea: value = empirical mean return (a Python sketch follows below)

◉ Issue: long episodes delay learning

11
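As a concrete illustration of the MC idea above, here is a minimal Python sketch (not from the slides); every name is illustrative, and episodes are assumed to be recorded as lists of (state, reward) pairs produced by letting the actor π play to the end.

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.99):
    """Every-visit Monte-Carlo estimate of V^pi(s) = empirical mean return.

    episodes: list of episodes, each a list of (state, reward) pairs
              collected by watching the actor pi play until the episode ends.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so the cumulated reward after each visit can be
        # built incrementally: G_t = r_t + gamma * G_{t+1}.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns_sum[state] += g
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```

Because the return G is only known once the episode finishes, no update can happen before the end, which is exactly the long-episode issue noted above.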

(12)

Temporal-Difference for Estimating V^π(s)

◉ Temporal-difference (TD)

○ The critic watches 𝜋 playing the game

TD learns directly from incomplete episodes by bootstrapping

○ TD updates a guess towards a guess: V^π(s_t) is updated toward the target r_t + γ V^π(s_{t+1})

◉ Idea: update the value toward the estimated return (a sketch follows below)

12
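A minimal sketch of a single TD(0) update under the same assumptions (a tabular V stored in a dict, illustrative names): the value of s_t is nudged toward the bootstrapped target r_t + γ·V(s_{t+1}) instead of waiting for the full return.

```python
def td0_update(V, s_t, r_t, s_next, gamma=0.99, lr=0.1, done=False):
    """One temporal-difference update: V(s_t) <- V(s_t) + lr * td_error."""
    bootstrap = 0.0 if done else V.get(s_next, 0.0)   # a guess of the future value
    target = r_t + gamma * bootstrap                  # estimated return ("a guess")
    td_error = target - V.get(s_t, 0.0)
    V[s_t] = V.get(s_t, 0.0) + lr * td_error          # update a guess towards a guess
    return td_error
```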

(13)

MC vs. TD

◉ Monte-Carlo (MC)

○ Large variance

○ Unbiased

○ Does not exploit the Markov property

◉ Temporal-Difference (TD)

○ Smaller variance

○ Biased (the bootstrapped target is itself an estimate)

○ Exploits the Markov property

13

(14)

MC vs. TD

14

(15)

Critic = Value Function

State-action value function Q^π(s, a): when using actor π, the expected total reward after seeing observation (state) s and taking action a

[Figure: two network forms. One takes s and a as input and outputs a scalar Q^π(s, a); the other takes only s and outputs Q^π(s, a) for every action, which works for discrete actions only.]

15

(16)

Q-Learning

◉ Given Q^π(s, a), find a new actor π′ that is "better" than π

◉ Policy iteration loop: π interacts with the environment → learn Q^π(s, a) (by TD or MC) → find a new actor π′ "better" than π → set π = π′ and repeat

◉ π′ has no extra parameters of its own; it depends only on the value function

◉ This scheme is not suitable for continuous actions

16

(17)

Q-Learning

◉ Goal: estimate optimal Q-values

○ Optimal Q-values obey a Bellman equation:

Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

○ Value iteration algorithms solve the Bellman equation iteratively:

Q_{i+1}(s, a) = E_{s'}[ r + γ max_{a'} Q_i(s', a') | s, a ]

where r + γ max_{a'} Q_i(s', a') is the learning target

17

(18)

Deep Q-Networks (DQN)

◉ Estimate the value function by TD

◉ Represent the value function by a deep Q-network with weights w: Q(s, a; w)

◉ Objective: minimize the MSE loss by SGD:

L(w) = E[ ( r + γ max_{a'} Q(s', a'; w) − Q(s, a; w) )² ]

18

(19)

Deep Q-Networks (DQN)

◉ Objective: minimize the MSE loss by SGD:

L(w) = E[ ( r + γ max_{a'} Q(s', a'; w) − Q(s, a; w) )² ]

◉ Leading to the following Q-learning gradient:

∂L(w)/∂w = E[ ( r + γ max_{a'} Q(s', a'; w) − Q(s, a; w) ) ∂Q(s, a; w)/∂w ]

Issue: naïve Q-learning oscillates or diverges using NN due to:

1) correlations between samples 2) non-stationary targets

19
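A sketch of this objective in PyTorch, under the assumption that q_net maps a batch of states to per-action Q-values and that the batch tensors are named as below (all names are illustrative). Note that the target here is computed from the same network, which is precisely the non-stationary-target problem listed above; the stable solutions on the following slides address it.

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, batch, gamma=0.99):
    """MSE between Q(s, a; w) and the target r + gamma * max_a' Q(s', a'; w)."""
    s, a, r, s_next, done = batch            # a: int64 actions, done: 0/1 floats
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)         # Q(s, a; w)
    with torch.no_grad():                    # treat the target as a constant
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

# One SGD step (optimizer assumed to exist):
#   loss = q_learning_loss(q_net, batch)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```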

(20)

Stability Issues with Deep RL

◉ Naive Q-learning oscillates or diverges with neural nets

1. Data is sequential

Successive samples are correlated, non-iid (independent and identically distributed)

2. Policy changes rapidly with slight changes to Q-values

■ Policy may oscillate

■ Distribution of data can swing from one extreme to another

3. Scale of rewards and Q-values is unknown

■ Naive Q-learning gradients can be unstable when backpropagated

20

(21)

Stable Solutions for DQN

◉ DQN provides a stable solution to deep value-based RL

1. Use experience replay

■ Break correlations in data, bring us back to the iid setting

■ Learn from all past policies

2. Freeze target Q-network

■ Avoid oscillation

■ Break correlations between Q-network and target

3. Clip rewards or normalize network adaptively to sensible range

■ Robust gradients

21

(22)

Stable Solution 1: Experience Replay

◉ To remove correlations, build a dataset from the agent's own experience

○ Take action a_t according to an ε-greedy policy (ε: small probability of exploration)

○ Store the transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D

○ Sample a random mini-batch of transitions (s, a, r, s′) from D

○ Optimize the MSE between the Q-network and the Q-learning targets

22
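A minimal replay-memory sketch of the idea above; the class name and default capacity are illustrative assumptions, not part of the original DQN code.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s', done) and samples random mini-batches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # the oldest entry is dropped when full

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # random draw breaks correlations
        return list(zip(*batch))   # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from past transitions is what brings the training data closer to the iid setting assumed by SGD.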

(23)

Exploration

◉ The policy is based on the Q-function, which by itself is not good for data collection → inefficient learning

○ e.g. in a state s with actions a_1, a_2, a_3: once a_1 happens to look best, it is always sampled while a_2 and a_3 are never explored

◉ Exploration algorithms

○ Epsilon greedy (ε would decay during learning)

○ Boltzmann sampling

23
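Illustrative sketches of the two exploration schemes named above; the decay schedule and temperature values are assumptions.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """With small probability epsilon explore, otherwise exploit argmax Q."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

def boltzmann_sample(q_values, temperature=1.0):
    """Sample each action with probability proportional to exp(Q / temperature)."""
    z = np.asarray(q_values, dtype=np.float64) / temperature
    probs = np.exp(z - z.max())               # subtract max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))

# epsilon typically decays during learning, e.g.:
#   epsilon = max(0.05, epsilon * 0.999)
```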

(24)

Replay Buffer

[Diagram: the policy-iteration loop (π interacts with the environment → learn Q^π(s, a) → find a new actor π′ "better" than π → π = π′) with a replay buffer added: each experience is put into the buffer, the buffer holds experiences collected by different π, and the oldest experience is dropped when the buffer is full.]

24

(25)

Replay Buffer

◉ In each iteration:

1. Sample a batch of experiences from the buffer

2. Update the Q-function with that batch

◉ Because the buffer contains experiences collected by earlier policies, this makes Q-learning off-policy

[Diagram: the same loop as on the previous slide, with the replay buffer supplying sampled batches for learning Q^π(s, a).]

25

(26)

Stable Solution 2: Fixed Target Q-Network

◉ To avoid oscillations, fix the parameters used in the Q-learning target

○ Compute Q-learning targets w.r.t. old, fixed ("frozen") parameters w⁻: r + γ max_{a'} Q(s', a'; w⁻)

○ Optimize the MSE between the Q-network and the Q-learning targets:

L(w) = E[ ( r + γ max_{a'} Q(s', a'; w⁻) − Q(s, a; w) )² ]

○ Periodically update the fixed parameters: w⁻ ← w

26

(27)

Stable Solution 3: Reward / Value Range

◉ To avoid oscillations, control the reward / value range

○ DQN clips the rewards to [−1, +1]

○ Prevents Q-values from becoming too large

○ Ensures gradients are well-conditioned

27

(28)

Typical Q-Learning Algorithm

◉ Initialize the Q-function Q and the target Q-function Q̂ = Q

◉ In each episode, for each time step t:

○ Given state s_t, take action a_t based on Q (epsilon greedy)

○ Obtain reward r_t and reach new state s_{t+1}

○ Store (s_t, a_t, r_t, s_{t+1}) into the buffer

○ Sample (s_i, a_i, r_i, s_{i+1}) from the buffer (usually a batch)

○ Update the parameters of Q so that Q(s_i, a_i) moves toward the target y = r_i + γ max_a Q̂(s_{i+1}, a)

○ Every C steps reset Q̂ = Q

(A Python sketch of this loop follows below.)

28
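A sketch of the loop above, assuming a classic Gym-style environment with vector observations (reset() returns the state, step() returns (state, reward, done, info)) and reusing the ReplayBuffer and epsilon_greedy helpers sketched earlier; all hyper-parameters are illustrative.

```python
import copy
import torch

def train_dqn(env, q_net, optimizer, episodes=500, gamma=0.99,
              batch_size=32, C=1000, epsilon=1.0):
    target_net = copy.deepcopy(q_net)                 # target Q-function: Q_hat = Q
    buffer, step = ReplayBuffer(), 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            q_values = q_net(torch.as_tensor(s, dtype=torch.float32))
            a = epsilon_greedy(q_values.detach().numpy(), epsilon)   # act based on Q
            s_next, r, done, _ = env.step(a)          # obtain r_t, reach s_{t+1}
            buffer.push(s, a, r, s_next, float(done)) # store into the buffer
            s, step = s_next, step + 1
            epsilon = max(0.05, epsilon * 0.999)      # decaying exploration
            if len(buffer) >= batch_size:             # sample a batch from the buffer
                S, A, R, S2, D = (torch.as_tensor(x, dtype=torch.float32)
                                  for x in buffer.sample(batch_size))
                q_sa = q_net(S).gather(1, A.long().unsqueeze(1)).squeeze(1)
                with torch.no_grad():                 # frozen target network
                    y = R + gamma * (1 - D) * target_net(S2).max(dim=1).values
                loss = torch.nn.functional.mse_loss(q_sa, y)
                optimizer.zero_grad(); loss.backward(); optimizer.step()
            if step % C == 0:                         # every C steps reset Q_hat = Q
                target_net.load_state_dict(q_net.state_dict())
```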

(29)

Deep RL in Atari Games

29

(30)

DQN in Atari

Goal: end-to-end learning of values Q(s, a) from pixels

Input: state is stack of raw pixels from last 4 frames

Output: Q(s, a) for all joystick/button positions a

Reward is the score change for that step

DQN Nature Paper [link] [code]

30

(31)

DQN in Atari

DQN Nature Paper [link] [code]

31

(32)

Concluding Remarks

◉ RL is a general-purpose framework for decision making, based on interactions between an agent and an environment

◉ Value-based RL measures how good each state and/or action is via a value function

◉ A value function can be estimated by Monte-Carlo (MC) or Temporal-Difference (TD) learning

32

(33)

Advanced DQN

33

(34)

Double DQN

◉ Q values are usually over-estimated

34

(35)

Double DQN

◉ Nature DQN target: r_t + γ max_a Q̂(s_{t+1}, a)

◉ Issue: the max tends to select the action whose value is over-estimated

Hasselt et al.,“Deep Reinforcement Learning with Double Q-learning”, AAAI 2016.

35

(36)

Double DQN

◉ Nature DQN target: r_t + γ max_a Q̂(s_{t+1}, a)

◉ Double DQN target: r_t + γ Q̂(s_{t+1}, arg max_a Q(s_{t+1}, a))

○ Removes the upward bias caused by taking the max over estimated Q-values

○ Current Q-network Q is used to select actions

○ Older (target) Q-network Q̂ is used to evaluate actions

Hasselt et al., "Deep Reinforcement Learning with Double Q-learning", AAAI 2016.

If Q over-estimates an action so that it gets selected, Q̂ will still give it a proper value. What if Q̂ over-estimates an action? Then that action will not be selected by Q.

36
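A sketch of the double-DQN target under the same assumptions as the earlier loss sketch: the current network selects the action and the frozen target network evaluates it; variable names are illustrative.

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Target = r + gamma * Q_hat(s', argmax_a' Q(s', a'))."""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)        # selection: current Q
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)  # evaluation: frozen Q_hat
        return r + gamma * (1 - done) * q_eval
```

Swapping this target into the training loop sketched earlier is the only change double DQN requires.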

(37)

Dueling DQN

◉ Dueling network: split the Q-network into two channels

○ Action-independent value function V(s): estimates how good the state is

○ Action-dependent advantage function A(s, a): estimates the additional benefit of taking action a in state s

○ Q(s, a) = V(s) + A(s, a)

Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015.

37

(38)

Dueling DQN

[Figure: standard Q-network (state in, one Q-value per action out) vs. dueling architecture (state in, separate V(s) and A(s, a) streams combined into Q(s, a) per action).]

Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015.

38

(39)

Dueling DQN

Worked example (columns = states, rows = actions):

Q(s, a):
 3   3   3   1
 1  -1   6   1
 2  -2   3   1

= V(s):
 2   0   4   1

+ A(s, a):
 1   3  -1   0
-1  -1   2   0
 0  -2  -1   0

V(s) is the average of each column of Q(s, a), so each column of A(s, a) sums to 0.

Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015.

39

(40)

Dueling DQN

[Figure: numeric example illustrating that A(s, a) is normalized (e.g. its mean over actions is subtracted) before being added to V(s).]

Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015.

40
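A PyTorch sketch of the dueling architecture described above; the layer sizes are illustrative. The advantage stream is mean-normalized before being added to V(s), matching the normalization step on the slide.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # action-independent V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # action-dependent A(s, a)

    def forward(self, s):
        h = self.feature(s)
        v = self.value(h)                              # shape (batch, 1)
        a = self.advantage(h)                          # shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)     # normalize A before adding V
```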

(41)

Dueling DQN - Visualization

Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015.

41

(42)

Dueling DQN - Visualization

Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015.

42

(43)

Prioritized Replay

◉ Prioritized replay: weight experience according to how surprising it is (its TD error)

○ Store experience in a priority queue according to the error

○ Data with a larger TD error in previous training has a higher probability of being sampled

○ The parameter update procedure is also modified accordingly

Schaul et al., "Prioritized Experience Replay", arXiv preprint, 2015.

43
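A minimal proportional-prioritization sketch of the idea above: sampling probability grows with |TD error|^α. The α exponent follows the cited paper; everything else (class name, capacity, the small constant) is an illustrative assumption, and a full implementation would also apply importance-sampling weights in the modified update procedure mentioned above.

```python
import numpy as np

class PrioritizedReplay:
    """Samples transitions with probability proportional to |TD error|^alpha."""

    def __init__(self, capacity=100_000, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []

    def push(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:            # drop the oldest if full
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + 1e-5) ** self.alpha)

    def sample(self, batch_size=32):
        p = np.asarray(self.priorities); p = p / p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):       # refresh after each training step
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + 1e-5) ** self.alpha
```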

(44)

Multi-Step

◉ Idea: balance between MC and TD by accumulating rewards for N steps before bootstrapping from the (target) Q-function

44
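A sketch of an N-step target, which sits between TD (N = 1) and MC (N = episode length); the helper name and its inputs are illustrative assumptions.

```python
def n_step_target(rewards, q_boot, gamma=0.99):
    """Target = r_t + gamma*r_{t+1} + ... + gamma^(N-1)*r_{t+N-1} + gamma^N * q_boot.

    rewards: the next N rewards observed after (s_t, a_t)
    q_boot:  bootstrap value, e.g. max_a Q_hat(s_{t+N}, a), or 0 if the episode ended
    """
    target = q_boot
    for r in reversed(rewards):          # fold the rewards back-to-front
        target = r + gamma * target
    return target
```

Larger N means less bias from bootstrapping but more variance from the sampled rewards, which is the MC/TD trade-off above.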

(45)

Distributional Q-function

◉ State-action value function Q^π(s, a)

○ When using actor π, the cumulated reward expected to be obtained after seeing observation s and taking action a

○ Q^π(s, a) is only an expectation: different reward distributions can have the same expected value

[Figure: two different reward distributions over the range [-10, 10] that share the same mean.]

45

(46)

Distributional Q-function

[Figure: an ordinary Q-network with 3 outputs (one Q-value per action) vs. a distributional Q-network with 15 outputs (each of the 3 actions has 5 bins representing a distribution over its return).]

46
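A PyTorch sketch matching the slide's example: instead of one scalar per action, the network outputs a distribution over a fixed set of value bins for each action (3 actions × 5 bins = 15 outputs). The bin range and sizes are illustrative, following the categorical/distributional DQN idea.

```python
import torch
import torch.nn as nn

class DistributionalQNet(nn.Module):
    """Outputs a probability distribution over value bins for each action."""

    def __init__(self, state_dim, n_actions=3, n_bins=5, hidden=128):
        super().__init__()
        self.n_actions, self.n_bins = n_actions, n_bins
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions * n_bins))
        self.support = torch.linspace(-10.0, 10.0, n_bins)   # the value each bin stands for

    def forward(self, s):
        logits = self.net(s).view(-1, self.n_actions, self.n_bins)
        probs = torch.softmax(logits, dim=-1)          # one distribution per action
        q_values = (probs * self.support).sum(dim=-1)  # expected value per action
        return q_values, probs
```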

(47)

Demo

https://youtu.be/yFBwyPuO2Vg

47

(48)

Rainbow

Hessel et al., “Rainbow: Combining Improvements in Deep Reinforcement Learning”, arXiv preprint, 2017.

48

(49)

Rainbow

Hessel et al., “Rainbow: Combining Improvements in Deep Reinforcement Learning”, arXiv preprint, 2017.

49

(50)

Concluding Remarks

◉ DQN training tips

○ Double DQN

○ Dueling DQN

○ Prioritized replay

○ Multi-step

○ Distributional DQN

50
