(1)

Deep Reinforcement Learning

Applied Deep Learning

May 10th, 2021 http://adl.miulab.tw

(2)

Outline

Machine Learning

Supervised Learning vs. Reinforcement Learning

Reinforcement Learning vs. Deep Learning

Introduction to Reinforcement Learning

Agent and Environment

Action, State, and Reward

Reinforcement Learning Approach

Value-Based

Policy-Based

Model-Based

(4)

Machine Learning

[Figure: Machine Learning branches into Supervised Learning, Unsupervised Learning, and Reinforcement Learning]

(5)

Supervised vs. Reinforcement

Supervised Learning

Training based on supervisor/label/annotation

Feedback is instantaneous

Time does not matter

Reinforcement Learning

Training based only on the reward signal

Feedback is delayed

Time matters

Agent actions affect subsequent data

(6)

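Supervised vs. Reinforcement

Supervised: learning from a teacher

The teacher gives the correct response for each input: for “Hello”, say “Hi”; for “Bye bye”, say “Good bye”

Reinforcement: learning from critics

The agent carries out a whole dialogue (starting from “Hello ☺”) and only then receives an overall evaluation, e.g. “Bad”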

(7)

Reinforcement Learning

RL is a general purpose framework for decision making

RL is for an agent with the capacity to act

Each action influences the agent’s future state

Success is measured by a scalar reward signal

Goal: select actions to maximize future reward

(8)

Deep Learning

DL is a general purpose framework for representation learning

Given an objective

Learn representation that is required to achieve objective

Directly from raw inputs

Use minimal domain knowledge

[Figure: a neural network mapping the input vector x = (x1, x2, …, xN) to the output vector y = (y1, y2, …, yM)]

(9)

Deep Reinforcement Learning

AI is an agent that can solve human-level tasks

RL defines the objective

DL gives the mechanism

RL + DL = general intelligence

(10)

Deep RL AI Examples

Play games: Atari, poker, Go, …

Explore worlds: 3D worlds, …

Control physical systems: manipulate, …

Interact with users: recommend, optimize, personalize, …


(11)

Reinforcement Learning

Introduction to RL


(13)

Reinforcement Learning

RL is a general purpose framework for decision making

RL is for an agent with the capacity to act

Each action influences the agent’s future state

Success is measured by a scalar reward signal

Big three: action, state, reward

(14)

Agent and Environment

[Figure: the agent receives observation o_t and reward r_t from the environment and sends back action a_t; example actions: MoveLeft (←) and MoveRight (→)]

(15)

Agent and Environment

At time step t

The agent

Executes action a_t

Receives observation o_t

Receives scalar reward r_t

The environment

Receives action a_t

Emits observation o_{t+1}

Emits scalar reward r_{t+1}

t increments at each environment step
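
The interaction at each time step can be sketched as a loop. This is a minimal illustration, not the course's code: the corridor environment, the class name CorridorEnv, the random agent, and the reward of 1.0 at the goal are all assumptions made for the example. Only the MoveLeft/MoveRight actions come from the slides.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, goal at position 4."""

    def __init__(self):
        self.position = 0  # environment state, private to the environment

    def step(self, action):
        """Receive action a_t; emit observation o_{t+1} and reward r_{t+1}."""
        move = 1 if action == "MoveRight" else -1
        self.position = max(0, min(4, self.position + move))
        observation = self.position  # fully observable in this toy case
        reward = 1.0 if self.position == 4 else 0.0
        done = self.position == 4
        return observation, reward, done


def random_agent(observation, rng):
    """Agent: maps the latest observation to an action (here, at random)."""
    return rng.choice(["MoveLeft", "MoveRight"])


def run_episode(seed=0, max_steps=1000):
    """Run the loop until the goal is reached or max_steps have elapsed."""
    rng = random.Random(seed)
    env = CorridorEnv()
    observation, total_reward = 0, 0.0
    for t in range(max_steps):
        action = random_agent(observation, rng)       # agent executes a_t
        observation, reward, done = env.step(action)  # env emits o_{t+1}, r_{t+1}
        total_reward += reward
        if done:
            return t + 1, total_reward
    return max_steps, total_reward


steps, total = run_episode()
print(f"episode length: {steps}, total reward: {total}")
```

The agent only ever sees what step returns; the position attribute plays the role of the environment's private state.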

(16)

State

Experience is the sequence of observations, actions, and rewards

State is the information used to determine what happens next

What happens next depends on the history of experience

The agent selects actions

The environment selects observations/rewards

The state is a function of the history of experience

(17)

Environment State

The environment state s_t^e is the environment’s private representation

whatever data the environment uses to pick the next observation/reward

may not be visible to the agent

may contain irrelevant information

(18)

Agent State

The agent state s_t^a is the agent’s internal representation

whatever data the agent uses to pick the next action → information used by RL algorithms

can be any function of experience

(19)

Information State

An information state (a.k.a. Markov state) contains all useful information from history

The future is independent of the past given the present

Once the state is known, the history may be thrown away

The state is a sufficient statistic of the future

A state s_t is Markov iff $P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_1, \dots, s_t)$

(20)

Fully Observable Environment

Full observability: agent directly observes environment state

information state = agent state = environment state


This is a Markov decision process (MDP)

(21)

Partially Observable Environment

Partial observability: agent indirectly observes environment

agent state ≠ environment state

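Agent must construct its own state representation s_t^a

Complete history: s_t^a is the entire experience so far

Beliefs of environment state: s_t^a is a probability distribution over the possible environment states

Hidden state (from RNN): s_t^a is computed recurrently from s_{t-1}^a and o_t

This is a partially observable Markov decision process (POMDP)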
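
As a rough sketch of the last option, a recurrent agent state can be updated from the previous state and the newest observation. The tanh nonlinearity, the fixed weight matrices, and the function name update_agent_state are assumptions chosen for illustration; a real RNN would learn its weights.

```python
import math


def update_agent_state(prev_state, observation, w_state, w_obs):
    """One recurrent step: s_t = tanh(W_s s_{t-1} + W_o o_t), row by row."""
    new_state = []
    for i in range(len(prev_state)):
        total = sum(w_state[i][j] * prev_state[j] for j in range(len(prev_state)))
        total += sum(w_obs[i][k] * observation[k] for k in range(len(observation)))
        new_state.append(math.tanh(total))
    return new_state


# Hypothetical fixed weights: 2-dimensional state, 1-dimensional observation.
W_STATE = [[0.5, 0.0], [0.0, 0.5]]
W_OBS = [[1.0], [-1.0]]

state = [0.0, 0.0]
for obs in ([1.0], [0.0], [1.0]):
    # The agent never stores the history; the state summarizes it.
    state = update_agent_state(state, obs, W_STATE, W_OBS)
print(state)
```

The design point is that the history is folded into a fixed-size vector, so the agent does not need to keep every past observation.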

(22)

Reward

Reinforcement learning is based on the reward hypothesis

A reward r_t is a scalar feedback signal

Indicates how well the agent is doing at step t

Reward hypothesis:

all agent goals can be described by maximizing expected cumulative reward

(23)

Sequential Decision Making

Goal: select actions to maximize total future reward

Actions may have long-term consequences

Reward may be delayed

It may be better to sacrifice immediate reward to gain more long-term reward
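
A small numeric sketch of this trade-off follows. The two reward sequences are made up, and the discount factor is an assumption that the slides do not introduce; with discount = 1.0 the function is a plain sum of rewards.

```python
def cumulative_reward(rewards, discount=1.0):
    """Return r_1 + discount * r_2 + discount^2 * r_3 + ... for one sequence."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


# Hypothetical reward sequences for two ways of acting.
greedy_now = [5, 0, 0, 0]   # take a reward immediately, nothing afterwards
patient = [-1, 0, 0, 10]    # pay a small cost first, large reward later

print(cumulative_reward(greedy_now))   # 5.0
print(cumulative_reward(patient))      # 9.0
print(cumulative_reward(patient, discount=0.5))  # 0.25
```

Without discounting, the patient sequence wins (9.0 against 5.0). If future reward is discounted heavily (0.5 here), the same sequence is worth only 0.25, so how much the agent values the future changes which behavior is better.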

(24)

Scenario of Reinforcement Learning

[Figure: the agent observes the state of the environment (Observation), takes an Action that changes the environment, and receives a Reward; here the feedback is negative: “Don’t do that”]

(25)

Scenario of Reinforcement Learning

[Figure: the same loop of Observation, Action (which changes the environment), and Reward; here the feedback is positive: “Thank you.”]

Agent learns to take actions maximizing expected reward.

(26)

Machine Learning ≈ Looking for a Function

Actor/Policy: Action = π(Observation)

The function π takes the observation as input and outputs the action

The reward is used to pick the best function
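
To make "looking for a function" concrete, the sketch below scores two hand-written candidate policies by their total reward and keeps the better one. The one-dimensional world, the cost of 1 per step, and both candidate functions are hypothetical; real RL searches a much larger function space, typically by gradient-based learning.

```python
GOAL = 3  # hypothetical goal position on a number line


def policy_always_left(observation):
    """Candidate function 1: ignores the observation."""
    return "left"


def policy_toward_goal(observation):
    """Candidate function 2: moves toward the goal position."""
    return "right" if observation < GOAL else "left"


def total_reward(policy, start=0, max_steps=10):
    """Run one episode with reward -1 per step until the goal or the step limit."""
    position, total = start, 0
    for _ in range(max_steps):
        if position == GOAL:
            break
        action = policy(position)  # Action = pi(Observation)
        position += 1 if action == "right" else -1
        total -= 1
    return total


candidates = {"always_left": policy_always_left, "toward_goal": policy_toward_goal}
# The reward is the only signal used to choose between the functions.
best_name = max(candidates, key=lambda name: total_reward(candidates[name]))
print(best_name)  # toward_goal
```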

(27)

Learning to Play Go

[Figure: agent and environment for Go, with Observation, Action (the next move), and Reward]

(28)

Learning to Play Go

Agent learns to take actions maximizing expected reward.

If win, reward = 1

If loss, reward = -1

reward = 0 in most cases

(29)

Learning to Play Go

Supervised: learning from a teacher

For each board position the teacher gives the next move, e.g. “5-5” for one position and “3-3” for another

Reinforcement learning: learning from experience

First move …… many moves …… Win!

(Two agents play with each other.)

AlphaGo uses supervised learning + reinforcement learning.

(30)

Learning a Chatbot

Machine obtains feedback from user

User: “How are you?” Machine: “Bye bye ☺” Reward: -10

User: “Hello” Machine: “Hi ☺” Reward: 3

Chatbot learns to maximize the expected reward

(31)

Learning a Chatbot

Let two agents talk to each other (sometimes generate good dialogue, sometimes bad)

Bad dialogue: “How old are you?” “See you.” “See you.” “See you.”

Good dialogue: “How old are you?” “I am 16.” “I thought you were 12.” “What makes you think so?”

(32)

Learning a Chatbot

By this approach, we can generate a lot of dialogues (Dialogue 1, Dialogue 2, …, Dialogue 8, …).

Use pre-defined rules to evaluate the goodness of a dialogue

Machine learns from the evaluation as rewards
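
A rule-based evaluation of this kind might look like the sketch below. The slides do not say which rules are used, so both rules here (penalize repeating the previous turn, give a bonus when no utterance is repeated) are hypothetical, as is the function name dialogue_reward.

```python
def dialogue_reward(turns):
    """Score a dialogue with two hypothetical hand-written rules."""
    normalized = [turn.strip().lower() for turn in turns]
    reward = 0
    for previous, current in zip(normalized, normalized[1:]):
        if current == previous:
            reward -= 1  # rule 1: penalize repeating the previous turn
        else:
            reward += 1  # otherwise credit a turn that adds something new
    if len(set(normalized)) == len(normalized):
        reward += 2      # rule 2: bonus if no utterance is ever repeated
    return reward


dialogue_a = ["How old are you?", "See you.", "See you.", "See you."]
dialogue_b = [
    "How old are you?",
    "I am 16.",
    "I thought you were 12.",
    "What makes you think so?",
]

print(dialogue_reward(dialogue_a))  # -1
print(dialogue_reward(dialogue_b))  # 5
```

The two scores would then serve as the rewards for the dialogues that the two agents generated.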

(33)

Learning to Play Video Game

Space Invaders: the game terminates when all aliens are killed or your spaceship is destroyed

[Figure: game screen showing the score (the reward, earned by killing aliens), the shields, and the fire action]

Play yourself: http://www.2600online.com/spaceinvaders.html

How about machine: https://gym.openai.com/evaluations/eval_Eduozx4HRyqgTCVk9ltw

(34)

Learning to Play Video Game

Start with observation s_1

Action a_1: “right”, obtain reward r_1 = 0

Observation s_2

Action a_2: “fire” (kill an alien), obtain reward r_2 = 5

Observation s_3

Usually there is some randomness in the environment

(35)

Learning to Play Video Game

Start with observation s_1, then observation s_2, observation s_3, …

After many turns: action a_T, obtain reward r_T

Game over (spaceship destroyed)

This is an episode.

Learn to maximize the expected cumulative reward per episode
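
Because the environment is random, the expected cumulative reward per episode can only be estimated, for example by averaging over many episodes. The sketch below uses a made-up stand-in for the game: the hit and destruction probabilities are assumptions, and only the reward of 5 for killing an alien comes from the slides.

```python
import random


def play_episode(rng, hit_probability=0.3, destroyed_probability=0.1):
    """Hypothetical stand-in for one game: returns the rewards r_1, ..., r_T."""
    rewards = []
    while True:
        # Each turn, "fire" kills an alien (reward 5) with some probability.
        rewards.append(5 if rng.random() < hit_probability else 0)
        # The episode ends when the spaceship is destroyed.
        if rng.random() < destroyed_probability:
            return rewards


def estimate_expected_return(num_episodes=1000, seed=0):
    """Average the cumulative reward over many sampled episodes."""
    rng = random.Random(seed)
    returns = [sum(play_episode(rng)) for _ in range(num_episodes)]
    return sum(returns) / len(returns)


print(estimate_expected_return())
```

Under these assumed probabilities an episode lasts 10 turns on average and each turn is worth 1.5 in expectation, so the estimate should land near 15.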

(36)

More Applications

Flying Helicopter

https://www.youtube.com/watch?v=0JL04JJjocc

Driving

https://www.youtube.com/watch?v=0xo1Ldx3L5Q

Robot

https://www.youtube.com/watch?v=370cT-OAzzM

Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI

http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai

Text Generation

https://www.youtube.com/watch?v=pbQ4qe8EwLo

(37)

Reinforcement Learning


(39)

Major Components in an RL Agent

An RL agent may include one or more of these components

Value function: how good is each state and/or action

Policy: agent’s behavior function

Model: agent’s representation of the environment

(40)

Reinforcement Learning Approach

Value-based RL

Estimate the optimal value function Q*(s, a)

Q*(s, a) is the maximum value achievable under any policy

Policy-based RL

Search directly for the optimal policy π*

π* is the policy achieving maximum future reward

Model-based RL

Build a model of the environment

Plan (e.g. by lookahead) using the model

(41)

Maze Example

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location


(42)

Maze Example: Value Function

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location

Numbers represent value V^π(s) of each state s
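
State values of this kind can be computed by value iteration. The sketch below does so for a hypothetical 3×3 maze, not the maze in the figure, and it computes the optimal values rather than the values of one particular policy π. With a reward of -1 per time-step, the optimal value of a state is minus the number of steps on the shortest path to the goal.

```python
# Hypothetical maze: S = start, G = goal, # = wall, . = free cell.
MAZE = [
    "S.#",
    ".##",
    "..G",
]
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def next_state(state, action):
    """Deterministic transition: bumping into a wall leaves the agent in place."""
    row, col = state
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < len(MAZE) and 0 <= new_col < len(MAZE[0])
    if inside and MAZE[new_row][new_col] != "#":
        return (new_row, new_col)
    return state


def value_iteration(sweeps=50):
    """Repeatedly apply V(s) = max_a [-1 + V(s')] to every non-goal state."""
    states = [
        (r, c)
        for r, line in enumerate(MAZE)
        for c, cell in enumerate(line)
        if cell != "#"
    ]
    values = {state: 0.0 for state in states}
    for _ in range(sweeps):
        for state in states:
            if MAZE[state[0]][state[1]] == "G":
                continue  # the goal is terminal, its value stays 0
            values[state] = max(-1.0 + values[next_state(state, a)] for a in MOVES)
    return values


values = value_iteration()
for state in sorted(values):
    print(state, values[state])
```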

(43)

Maze Example: Model

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location

Grid layout represents transition model P

Numbers represent immediate reward R from each state s (same for all a)

(44)

Maze Example: Policy

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location


Arrows represent policy π(s) for each state s
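
One way to obtain such arrows is to act greedily with respect to a value table. The sketch below assumes a hypothetical 2×3 grid without walls, a goal in the top-right corner, and values equal to minus the distance to the goal; none of these come from the figure.

```python
GOAL = (0, 2)       # hypothetical goal cell
ROWS, COLS = 2, 3   # hypothetical grid size, no walls
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

# Assumed value table: minus the number of steps needed to reach the goal.
VALUES = {
    (r, c): -float(abs(r - GOAL[0]) + abs(c - GOAL[1]))
    for r in range(ROWS)
    for c in range(COLS)
}


def step(state, action):
    """Move one cell; leaving the grid keeps the agent in place."""
    row, col = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    return (row, col) if 0 <= row < ROWS and 0 <= col < COLS else state


def greedy_policy(values):
    """For each non-goal state pick the action maximizing -1 + V(next state)."""
    policy = {}
    for state in values:
        if state == GOAL:
            continue
        policy[state] = max(MOVES, key=lambda a: -1.0 + values[step(state, a)])
    return policy


policy = greedy_policy(VALUES)
for state in sorted(policy):
    print(state, policy[state])
```

Where two actions are equally good, max keeps the first one in the order N, E, S, W, so ties are broken arbitrarily but deterministically.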

(45)

Categorizing RL Agents

Value-Based: No Policy (implicit), Value Function

Policy-Based: Policy, No Value Function

Actor-Critic: Policy, Value Function

Model-Free: Policy and/or Value Function, No Model

Model-Based: Policy and/or Value Function, Model

(46)

RL Agent Taxonomy

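[Figure: diagram relating Value, Policy, and Model. Learning a critic uses a value function, learning an actor uses a policy, and actor-critic uses both; agents without a model are model-free]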

(47)

Concluding Remarks

RL is a general purpose framework for decision making under interactions between agent and environment

RL is for an agent with the capacity to act

Each action influences the agent’s future state

Success is measured by a scalar reward signal

Goal: select actions to maximize future reward

An RL agent may include one or more of these components

Value function: how good is each state and/or action

Policy: agent’s behavior function

Model: agent’s representation of the environment

(48)

References

◉ Course materials by David Silver:

http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

◉ ICLR 2015 Tutorial:

http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf

◉ ICML 2016 Tutorial:

http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf
