Deep Reinforcement Learning

(1)

Deep Reinforcement Learning

Applied Deep Learning

May 10th, 2021 http://adl.miulab.tw

(2)

Outline

◉ Machine Learning

○ Supervised Learning vs. Reinforcement Learning

○ Reinforcement Learning vs. Deep Learning

◉ Introduction to Reinforcement Learning

○ Agent and Environment

○ Action, State, and Reward

◉ Reinforcement Learning Approach

○ Value-Based

○ Policy-Based

○ Model-Based

(4)

Machine Learning

[Figure: Machine Learning comprises Supervised Learning, Unsupervised Learning, and Reinforcement Learning]

(5)

Supervised vs. Reinforcement

◉ Supervised Learning

○ Training based on supervisor/label/annotation

○ Feedback is instantaneous

○ Time does not matter

◉ Reinforcement Learning

○ Training only based on reward signal

○ Feedback is delayed

○ Time matters

○ Agent actions affect subsequent data

(6)

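Supervised vs. Reinforcement

◉ Supervised: learning from teacher

○ Input “Hello”: the teacher says the answer is “Hi”

○ Input “Bye bye”: the teacher says the answer is “Good bye”

◉ Reinforcement: learning from critics

○ The agent carries on a whole dialogue (“Hello ☺” ……) and is only told afterwards that it went “Bad”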

(7)

Reinforcement Learning

◉ RL is a general purpose framework for decision making

○ RL is for an agent with the capacity to act

○ Each action influences the agent’s future state

○ Success is measured by a scalar reward signal

○ Goal: select actions to maximize future reward

(8)

Deep Learning

◉ DL is a general purpose framework for representation learning

○ Given an objective

○ Learn representation that is required to achieve objective

○ Directly from raw inputs

○ Use minimal domain knowledge

[Figure: a neural network mapping an input vector x = (x1, x2, …, xN) to an output vector y = (y1, y2, …, yM)]

(9)

Deep Reinforcement Learning

◉ AI is an agent that can solve human-level tasks

○ RL defines the objective

○ DL gives the mechanism

○ RL + DL = general intelligence

(10)

Deep RL AI Examples

◉ Play games: Atari, poker, Go, …

◉ Explore worlds: 3D worlds, …

◉ Control physical systems: manipulate, …

◉ Interact with users: recommend, optimize, personalize, …

(11)

Introduction to RL

(13)

Reinforcement Learning

◉ RL is a general purpose framework for decision making

○ RL is for an agent with the capacity to act

○ Each action influences the agent’s future state

○ Success is measured by a scalar reward signal

Big three: action, state, reward

(14)

Agent and Environment

[Figure: the agent sends action a_t (e.g. MoveRight or MoveLeft) to the environment; the environment returns observation o_t and reward r_t to the agent]

(15)

Agent and Environment

◉ At time step t

○ The agent

■ Executes action a_t

■ Receives observation o_t

■ Receives scalar reward r_t

○ The environment

■ Receives action a_t

■ Emits observation o_{t+1}

■ Emits scalar reward r_{t+1}

○ t increments at each environment step
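The interaction protocol above can be sketched as a short loop. This is a minimal illustration, not part of the original slides: `CorridorEnv`, `random_agent`, and `run_episode` are hypothetical names, and the corridor world with a reward of 1 at the right end is an assumed toy environment.

```python
import random


class CorridorEnv:
    """Hypothetical environment: positions 0..4, start in the middle, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = length // 2

    def reset(self):
        self.position = self.length // 2
        return self.position  # first observation

    def step(self, action):
        """Receive action a_t; emit observation o_{t+1}, reward r_{t+1}, and a done flag."""
        move = 1 if action == "MoveRight" else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def random_agent(observation, rng):
    """Choose an action given the observation (this agent ignores it and acts randomly)."""
    return rng.choice(["MoveLeft", "MoveRight"])


def run_episode(env, agent, rng, max_steps=100):
    observation = env.reset()
    history = []  # experience: (observation, action, reward) per time step
    for _ in range(max_steps):
        action = agent(observation, rng)                   # agent executes a_t
        next_observation, reward, done = env.step(action)  # environment emits o_{t+1}, r_{t+1}
        history.append((observation, action, reward))
        observation = next_observation                     # t increments here
        if done:
            break
    return history


history = run_episode(CorridorEnv(), random_agent, random.Random(0))
print(len(history), history[-1])
```

The `history` list is the experience: the sequence of observations, actions, and rewards collected during the episode.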

(16)

State

◉ Experience is the sequence of observations, actions, rewards

◉ State is the information used to determine what happens next

○ what happens depends on the history of experience

• The agent selects actions

• The environment selects observations/rewards

◉ The state is a function of the history of experience

(17)

Environment State

◉ The environment state 𝑠_𝑡^𝑒 is the environment’s private representation

○ whatever data the environment uses to pick the next observation/reward

○ may not be visible to the agent

○ may contain irrelevant information

(18)

Agent State

◉ The agent state 𝑠_𝑡^𝑎 is the agent’s internal representation

○ whatever data the agent uses to pick the next action → information used by RL algorithms

○ can be any function of experience

(19)

Information State

◉ An information state (a.k.a. Markov state) contains all useful information from history

◉ The future is independent of the past given the present

○ Once the state is known, the history may be thrown away

○ The state is a sufficient statistic of the future

A state is Markov iff
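In the usual notation (an assumption here, with s_t denoting the state at time step t), the Markov condition says that conditioning on the full history adds nothing beyond the current state:

```latex
\[
P\left(s_{t+1} \mid s_t\right) = P\left(s_{t+1} \mid s_1, \ldots, s_t\right)
\]
```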

(20)

Fully Observable Environment

◉ Full observability: agent directly observes environment state

information state = agent state = environment state

This is a Markov decision process (MDP)

(21)

Partially Observable Environment

◉ Partial observability: agent indirectly observes environment

agent state ≠ environment state

◉ Agent must construct its own state representation 𝑠_𝑡^𝑎

○ Complete history:

○ Beliefs of environment state:

○ Hidden state (from RNN):

This is a partially observable Markov decision process (POMDP)
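The three constructions of the agent state can be sketched as small update functions. This is only an illustration under stated assumptions: all function names are hypothetical, the belief update assumes the hidden environment state does not change between steps and that observation likelihoods are known, and the recurrent update is a one-unit tanh cell rather than any particular RNN from the lecture.

```python
import math


def history_state(history, observation):
    """Complete history: the agent state is every observation seen so far."""
    return history + [observation]


def belief_state(belief, likelihoods):
    """Beliefs of environment state: one Bayes update.

    belief[i] is the prior probability of hidden state i; likelihoods[i] is the
    (assumed known) probability of the new observation under hidden state i.
    """
    unnormalized = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def recurrent_state(hidden, observation, w_hidden, w_obs):
    """Hidden state of a one-unit recurrent cell (hypothetical weights)."""
    return math.tanh(w_hidden * hidden + w_obs * observation)


print(history_state([0, 1], 2))
print(belief_state([0.5, 0.5], [0.9, 0.1]))
print(recurrent_state(0.0, 1.0, w_hidden=0.5, w_obs=1.0))
```

Each function maps the previous agent state and the newest observation to the next agent state, which is what makes all three valid "functions of experience".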

(22)

Reward

◉ Reinforcement learning is based on the reward hypothesis

◉ A reward r_t is a scalar feedback signal

○ Indicates how well agent is doing at step t

Reward hypothesis:

all agent goals can be described by maximizing expected cumulative reward

(23)

Sequential Decision Making

◉ Goal: select actions to maximize total future reward

○ Actions may have long-term consequences

○ Reward may be delayed

○ It may be better to sacrifice immediate reward to gain more long-term reward
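A small numeric sketch makes the trade-off concrete. The reward sequences below are made up for illustration, and the `discount` parameter is an assumption that does not appear on the slide; with the default of 1.0 the function is the plain sum of future rewards.

```python
def cumulative_reward(rewards, discount=1.0):
    """Total future reward; a discount below 1 (an assumption) weights later rewards less."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


greedy_now = [1, 0, 0, 0]     # take a small reward immediately
invest_first = [-1, 0, 0, 5]  # pay a cost first, collect more later

print(cumulative_reward(greedy_now))    # 1.0
print(cumulative_reward(invest_first))  # 4.0
```

The second sequence gives up immediate reward yet ends with the larger total, which is the situation the last bullet describes.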

(24)

Scenario of Reinforcement Learning

[Figure: the agent observes the state of the environment and takes an action that changes the environment; the environment responds with a negative reward (“Don’t do that”)]

(25)

Scenario of Reinforcement Learning

[Figure: the agent observes the state of the environment and takes an action that changes the environment; the environment responds with a positive reward (“Thank you.”)]

Agent learns to take actions maximizing expected reward.

(26)

Machine Learning ≈ Looking for a Function

[Figure: the actor/policy is a function whose input is the observation and whose output is the action; the reward from the environment is used to pick the best function]

Action = π(Observation)
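Read this way, learning amounts to scoring candidate functions by the reward they collect and keeping the best one. The sketch below is a toy illustration: the three candidate policies and the one-dimensional target-chasing environment in `total_reward` are hypothetical.

```python
def always_left(observation):
    return "left"


def always_right(observation):
    return "right"


def toward_target(observation):
    """The observation is a (position, target) pair."""
    position, target = observation
    return "right" if position < target else "left"


def total_reward(policy, start=0, target=3, steps=5):
    """Hypothetical environment: reward 1 for every step that ends on the target."""
    position, total = start, 0
    for _ in range(steps):
        action = policy((position, target))  # Action = π(Observation)
        position += 1 if action == "right" else -1
        total += 1 if position == target else 0
    return total


candidates = [always_left, always_right, toward_target]
best = max(candidates, key=total_reward)  # reward is used to pick the best function
print(best.__name__)
```

A real RL method searches a much larger function space (for example the weights of a neural network), but the selection criterion is the same.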

(27)

Learning to Play Go

[Figure: the agent observes the board, its action is the next move, and the environment returns a reward]

(28)

Learning to Play Go

[Figure: observation, action, and reward exchanged between the agent and the environment]

Agent learns to take actions maximizing expected reward.

If win, reward = 1; if loss, reward = -1; reward = 0 in most cases
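The reward scheme can be written down directly. The function name `go_reward` and the 200-move game are hypothetical; the sketch only shows how sparse the signal is.

```python
def go_reward(game_over, agent_won):
    """Reward after a move: 0 while the game continues, +1 for a win, -1 for a loss."""
    if not game_over:
        return 0
    return 1 if agent_won else -1


# A hypothetical 200-move game that the agent wins on its last move.
rewards = [go_reward(game_over=(move == 199), agent_won=True) for move in range(200)]
print(sum(rewards), rewards.count(0))
```

Only one of the 200 moves carries a non-zero reward, so the agent has to work out which of the earlier moves deserve the credit.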

(29)

Learning to Play Go

◉ Supervised: learning from teacher

○ Next move: “5-5”

○ Next move: “3-3”

◉ Reinforcement Learning: learning from experience

○ First move …… many moves …… Win!

○ (Two agents play with each other.)

AlphaGo uses supervised learning + reinforcement learning.

(30)

Learning a Chatbot

◉ Machine obtains feedback from user

○ User: “How are you?” Machine: “Bye bye ☺” → reward -10

○ User: “Hello” Machine: “Hi ☺” → reward 3

Chatbot learns to maximize the expected reward

(31)

Learning a Chatbot

◉ Let two agents talk to each other (sometimes generate good dialogue, sometimes bad)

○ “How old are you?” “See you.”

○ “How old are you?” “I am 16.” “I thought you were 12.” “What makes you think so?”

(32)

Learning a Chatbot

◉ By this approach, we can generate a lot of dialogues.

◉ Use pre-defined rules to evaluate the goodness of a dialogue

[Figure: Dialogue 1 through Dialogue 8, each evaluated by the rules]

Machine learns from the evaluation as rewards

(33)

Learning to Play Video Game

◉ Space Invaders: the game terminates when all aliens are killed or your spaceship is destroyed

[Figure: game screen showing the score (reward), the aliens to kill, the shields, and the fire action]

Play yourself: http://www.2600online.com/spaceinvaders.html

How about machine: https://gym.openai.com/evaluations/eval_Eduozx4HRyqgTCVk9ltw

(34)

Learning to Play Video Game

◉ Start with observation 𝑠₁

◉ Action 𝑎₁: “right”, obtain reward 𝑟₁ = 0, then observation 𝑠₂

◉ Action 𝑎₂: “fire” (kill an alien), obtain reward 𝑟₂ = 5, then observation 𝑠₃

Usually there is some randomness in the environment

(35)

Learning to Play Video Game

◉ Start with observation 𝑠₁, then observation 𝑠₂, observation 𝑠₃, ……

◉ After many turns: action 𝑎_𝑇, obtain reward 𝑟_𝑇, Game Over (spaceship destroyed)

This is an episode.

Learn to maximize the expected cumulative reward per episode
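Because the environment is random, the cumulative reward differs from episode to episode, so the quantity to maximize is an average over many episodes. The sketch below estimates it by simulation for a made-up shooter game: the kill probability of 0.5, the destruction probability of 0.1 per turn, and the names `play_episode` and `expected_return` are all assumptions for illustration.

```python
import random


def play_episode(fire_probability, rng, max_turns=50):
    """Hypothetical shooter: each turn the agent fires with the given probability.

    A shot kills an alien (reward 5) half of the time; after every turn the
    spaceship is destroyed with probability 0.1, which ends the episode.
    """
    total = 0
    for _ in range(max_turns):
        if rng.random() < fire_probability and rng.random() < 0.5:
            total += 5
        if rng.random() < 0.1:
            break  # game over: spaceship destroyed
    return total


def expected_return(fire_probability, episodes=2000, seed=0):
    """Monte Carlo estimate of the expected cumulative reward per episode."""
    rng = random.Random(seed)
    returns = [play_episode(fire_probability, rng) for _ in range(episodes)]
    return sum(returns) / episodes


print(expected_return(0.2), expected_return(0.9))
```

Comparing the two estimates shows which behavior is better on average, even though a single unlucky episode can favor the worse one.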

(36)

More Applications

◉ Flying Helicopter

○ https://www.youtube.com/watch?v=0JL04JJjocc

◉ Driving

○ https://www.youtube.com/watch?v=0xo1Ldx3L5Q

◉ Robot

○ https://www.youtube.com/watch?v=370cT-OAzzM

◉ Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI

○ http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai

◉ Text Generation

○ https://www.youtube.com/watch?v=pbQ4qe8EwLo

(37)

Reinforcement Learning

(39)

Major Components in an RL Agent

◉ An RL agent may include one or more of these components

○ Value function: how good is each state and/or action

○ Policy: agent’s behavior function

○ Model: agent’s representation of the environment

(40)

Reinforcement Learning Approach

◉ Value-based RL

○ Estimate the optimal value function Q*(s, a)

○ Q*(s, a) is the maximum value achievable under any policy

◉ Policy-based RL

○ Search directly for the optimal policy π*

○ π* is the policy achieving maximum future reward

◉ Model-based RL

○ Build a model of the environment

○ Plan (e.g. by lookahead) using model
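As one concrete instance of the value-based approach, the sketch below runs tabular Q-learning on a tiny corridor. Q-learning is not named on the slide; it is used here only as a familiar example of estimating the optimal value function. The corridor, the hyperparameters, and the function names are assumptions.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # move left, move right


def step(state, action):
    """Hypothetical corridor: reward -1 per step, the episode ends at GOAL."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return next_state, -1.0, next_state == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:  # explore
                action = rng.choice(ACTIONS)
            else:                       # exploit the current value estimates
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward the one-step lookahead target.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q = q_learning()
greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(greedy)
```

The policy is never stored explicitly; it is read off the learned values by acting greedily, which is what makes the method value-based.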

(41)

Maze Example

◉ Rewards: -1 per time-step

◉ Actions: N, E, S, W

◉ States: agent’s location

(42)

Maze Example: Value Function

◉ Actions: N, E, S, W

◉ Numbers represent the value V_π(s) of each state s

(43)

Maze Example: Model

◉ Actions: N, E, S, W

◉ Grid layout represents transition model P

◉ Numbers represent immediate reward R from each state s (same for all a)
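A model of this kind can be written as two functions, one for P and one for R. The 3x3 layout below is a hypothetical stand-in for the maze in the figure, and the deterministic "bump and stay in place" rule is an assumption.

```python
# Hypothetical 3x3 maze: '#' is a wall, 'G' is the goal, '.' is a free cell.
MAZE = ["...",
        ".#.",
        "..G"]
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def transition(state, action):
    """Transition model P: the next state for a (row, col) state and an action.

    Moving into a wall or off the grid leaves the agent where it is.
    """
    row, col = state
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < len(MAZE) and 0 <= new_col < len(MAZE[0])
    if inside and MAZE[new_row][new_col] != "#":
        return (new_row, new_col)
    return state


def reward(state, action):
    """Reward model R: -1 per time-step from every state, the same for all actions."""
    return -1


print(transition((0, 0), "E"), transition((0, 1), "S"), reward((0, 0), "E"))
```

With P and R available as functions, the agent can plan by lookahead without acting in the real environment.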

(44)

Maze Example: Policy

◉ Actions: N, E, S, W

◉ Arrows represent policy π(s) for each state s
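The link between the value function and the policy arrows can be sketched on a small grid: compute the values by repeated one-step lookahead, then point each arrow at the neighbor with the highest value. The 3x3 layout is a hypothetical stand-in for the maze in the figure, and value iteration is used here as one standard way to obtain the values, not necessarily the method behind the slide.

```python
# Hypothetical 3x3 maze: '#' is a wall, 'G' is the goal, '.' is a free cell.
MAZE = ["...",
        ".#.",
        "..G"]
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def free_cells():
    return [(r, c) for r, row in enumerate(MAZE) for c, ch in enumerate(row) if ch != "#"]


def transition(state, action):
    """Deterministic move; walls and borders leave the agent in place."""
    row, col = state
    d_row, d_col = MOVES[action]
    nr, nc = row + d_row, col + d_col
    if 0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0]) and MAZE[nr][nc] != "#":
        return (nr, nc)
    return state


def is_goal(state):
    return MAZE[state[0]][state[1]] == "G"


def value_iteration(sweeps=50):
    """Values under reward -1 per step, no discounting; the goal has value 0."""
    values = {s: 0.0 for s in free_cells()}
    for _ in range(sweeps):
        for s in values:
            if not is_goal(s):
                values[s] = max(-1.0 + values[transition(s, a)] for a in MOVES)
    return values


def greedy_policy(values):
    """One arrow per non-goal state: the action leading to the best next state."""
    return {s: max(MOVES, key=lambda a: values[transition(s, a)])
            for s in values if not is_goal(s)}


values = value_iteration()
policy = greedy_policy(values)
for r in range(len(MAZE)):
    print(" ".join(policy.get((r, c), MAZE[r][c]) for c in range(len(MAZE[0]))))
```

Each value equals minus the number of steps to the goal, so following the arrows always moves the agent one step closer.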

(45)

Categorizing RL Agents

◉ Value-Based

○ No Policy (implicit)

○ Value Function

◉ Policy-Based

○ Policy

○ No Value Function

◉ Actor-Critic

○ Policy

○ Value Function

◉ Model-Free

○ Policy and/or Value Function

○ No Model

◉ Model-Based

○ Policy and/or Value Function

○ Model

(46)

RL Agent Taxonomy

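[Figure: taxonomy of RL agents. Value-based agents learn a critic, policy-based agents learn an actor, and actor-critic agents learn both; each kind can be model-free or use a model]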

(47)

Concluding Remarks

◉ RL is a general purpose framework for decision making under interactions between agent and environment

○ RL is for an agent with the capacity to act (action)

○ Each action influences the agent’s future state (state)

○ Success is measured by a scalar reward signal (reward)

○ Goal: select actions to maximize future reward

◉ An RL agent may include one or more of these components

○ Value function: how good is each state and/or action

○ Policy: agent’s behavior function

○ Model: agent’s representation of the environment

(48)

References

◉ Course materials by David Silver:

http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

◉ ICLR 2015 Tutorial:

http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf

◉ ICML 2016 Tutorial:

http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
