
Slide credit from David Silver


Academic year: 2022





Machine Learning

Supervised Learning vs. Reinforcement Learning

Reinforcement Learning vs. Deep Learning

Introduction to Reinforcement Learning

Agent and Environment

Action, State, and Reward

Markov Decision Process

Reinforcement Learning Approach




Problems within RL



Learning and Planning

Exploration and Exploitation


Machine Learning


Supervised vs. Reinforcement

Supervised Learning

Training based on labeled examples from a supervisor


Feedback is instantaneous

Time does not matter

Reinforcement Learning

Training only based on reward signal

Feedback is delayed

Time matters

Agent actions affect subsequent data


Reinforcement Learning

RL is a general purpose framework for decision making

RL is for an agent with the capacity to act

Each action influences the agent’s future state

Success is measured by a scalar reward signal

Goal: select actions to maximize future reward


Deep Learning

DL is a general purpose framework for representation learning

Given an objective

Learn representation that is required to achieve objective

Directly from raw inputs

Use minimal domain knowledge



[Figure: deep network mapping an input vector x to an output vector y]


Deep Reinforcement Learning

AI is an agent that can solve human-level tasks

RL defines the objective

DL gives the mechanism

RL + DL = general intelligence


Deep RL AI Examples

Play games: Atari, poker, Go, …

Explore worlds: 3D worlds, …

Control physical systems: manipulate, …

Interact with users: recommend, optimize, personalize, …


Introduction to RL

Reinforcement Learning





Reinforcement Learning

RL is a general purpose framework for decision making

RL is for an agent with the capacity to act

Each action influences the agent’s future state

Success is measured by a scalar reward signal

Big three: action, state, reward


Agent and Environment


[Figure: agent-environment loop. The agent receives observation $o_t$ and reward $r_t$ and emits action $a_t$, e.g. MoveLeft or MoveRight.]



Agent and Environment

At time step $t$

The agent

Executes action $a_t$

Receives observation $o_t$

Receives scalar reward $r_t$

The environment

Receives action $a_t$

Emits observation $o_{t+1}$

Emits scalar reward $r_{t+1}$

$t$ increments at each environment step
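The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: the `CorridorEnv` class, its reward scheme, and the `run_episode` helper are hypothetical and do not come from the slides.

```python
import random

class CorridorEnv:
    """Hypothetical 1-D corridor: positions 0..4, goal at position 4."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        # The environment receives a_t and emits o_{t+1} and r_{t+1}.
        move = 1 if action == "MoveRight" else -1
        self.position = min(4, max(0, self.position + move))
        done = self.position == 4
        reward = 0.0 if done else -1.0
        return self.position, reward, done

def run_episode(env, policy, max_steps=1000):
    """Run the loop and return the experience as (o_t, a_t, r_{t+1}) triples."""
    observation, experience = env.position, []
    for _ in range(max_steps):
        action = policy(observation)                        # agent executes a_t
        next_observation, reward, done = env.step(action)   # environment responds
        experience.append((observation, action, reward))
        observation = next_observation                      # t increments here
        if done:
            break
    return experience

random.seed(0)
random_policy = lambda o: random.choice(["MoveLeft", "MoveRight"])
experience = run_episode(CorridorEnv(), random_policy)
print(len(experience), experience[-1])
```

The agent only sees what `step` returns; the environment's internal `position` plays the role of a private state that happens to be fully exposed in this toy case.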




Experience is the sequence of observations, actions, and rewards

State is the information used to determine what happens next

What happens next depends on the history of experience

The agent selects actions

The environment selects observations and rewards

The state is a function of the history of experience



Environment State

The environment state $s_t^e$ is the environment’s private representation

whatever data the environment uses to pick the next observation/reward

may not be visible to the agent

may contain irrelevant information



Agent State

The agent state $s_t^a$ is the agent’s internal representation

whatever data the agent uses to pick the next action

information used by RL algorithms

can be any function of experience


Information State

An information state (a.k.a. Markov state) contains all useful information from history

The future is independent of the past given the present

Once the state is known, the history may be thrown away

A state $s_t$ is Markov iff $P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_1, \ldots, s_t)$


Fully Observable Environment

Full observability: agent directly observes environment state

information state = agent state = environment state

This is a Markov decision process (MDP)


Partially Observable Environment

Partial observability: agent indirectly observes environment

agent state ≠ environment state

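Agent must construct its own state representation $s_t^a$

Complete history: $s_t^a = (o_1, r_1, a_1, \ldots, a_{t-1}, o_t, r_t)$

Beliefs of environment state: $s_t^a = (P(s_t^e = s^1), \ldots, P(s_t^e = s^n))$

Hidden state (from RNN): $s_t^a = \sigma(W_s s_{t-1}^a + W_o o_t)$

This is a partially observable Markov decision process (POMDP)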



Reinforcement learning is based on the reward hypothesis

A reward $r_t$ is a scalar feedback signal

Indicates how well the agent is doing at step $t$

Reward hypothesis: all agent goals can be described by maximizing expected cumulative reward


Sequential Decision Making

Goal: select actions to maximize total future reward

Actions may have long-term consequences

Reward may be delayed

It may be better to sacrifice immediate reward to gain more long-term reward


Markov Decision Process

Fully Observable Environment





Markov Process

A Markov process is a memoryless random process

i.e. a sequence of random states $S_1, S_2, \ldots$ with the Markov property

Student Markov chain

Sample episodes starting from $S_1$ = C1

• C1 C2 C3 Pass Sleep

• C1 FB FB C1 C2 Sleep

• C1 C2 C3 Pub C2 C3 Pass Sleep

• C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
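Episodes like those listed above can be sampled from a transition table. The sketch below assumes a set of transition probabilities for the student chain; the extracted text does not give them, so the numbers in `TRANSITIONS` are illustrative assumptions only.

```python
import random

# Assumed transition probabilities (not stated in the extracted slides).
TRANSITIONS = {
    "C1": [("C2", 0.5), ("FB", 0.5)],
    "C2": [("C3", 0.8), ("Sleep", 0.2)],
    "C3": [("Pass", 0.6), ("Pub", 0.4)],
    "Pass": [("Sleep", 1.0)],
    "Pub": [("C1", 0.2), ("C2", 0.4), ("C3", 0.4)],
    "FB": [("C1", 0.1), ("FB", 0.9)],
}

def sample_episode(start="C1", terminal="Sleep", rng=random):
    """Sample states until the terminal state is reached.

    The next state is drawn using only the current state (Markov property).
    """
    state, episode = start, [start]
    while state != terminal:
        next_states, probs = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=probs)[0]
        episode.append(state)
    return episode

rng = random.Random(0)
for _ in range(3):
    print(" ".join(sample_episode(rng=rng)))
```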


Student MRP

Markov Reward Process (MRP)

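A Markov reward process is a Markov chain with values

The return $G_t$ is the total discounted reward from time-step $t$: $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$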
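As a small worked example, the return can be computed from a finite list of rewards. The function name and the two reward sequences below are hypothetical; they only illustrate how discounting trades off early against late reward.

```python
def discounted_return(rewards, gamma):
    """Total discounted reward: rewards[0] + gamma * rewards[1] + gamma^2 * rewards[2] + ...

    `rewards` lists the rewards received after time-step t, in order.
    """
    g = 0.0
    # Accumulate backwards so each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical sequences: a small immediate reward vs. a larger delayed one.
immediate = [1.0, 0.0, 0.0]
delayed = [0.0, 0.0, 10.0]
print(discounted_return(immediate, 0.9))
print(discounted_return(delayed, 0.9))   # about 8.1, larger despite the delay
```

With a discount factor of 0.9 the delayed sequence still has the larger return, which is the situation where sacrificing immediate reward pays off.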


Markov Decision Process (MDP)

A Markov decision process is an MRP with decisions

It is an environment in which all states are Markov

Student MDP


Markov Decision Process (MDP)

$S$: finite set of states/observations

$A$: finite set of actions

$P$: transition probability

$R$: immediate reward

$\gamma$: discount factor

Goal is to choose a policy $\pi$ at time $t$ that maximizes the expected overall return: $\mathbb{E}_\pi[G_t]$


Reinforcement Learning





Major Components in an RL Agent

An RL agent may include one or more of these components

Policy: agent’s behavior function

Value function: how good is each state and/or action

Model: agent’s representation of the environment



A policy is the agent’s behavior

A policy maps from state to action

Deterministic policy: $a = \pi(s)$

Stochastic policy: $\pi(a \mid s) = P(a \mid s)$
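The difference between the two kinds of policy can be made concrete in code. The state names, the lookup table, and the probabilities below are hypothetical and chosen only for illustration.

```python
import random

ACTIONS = ["N", "E", "S", "W"]

def deterministic_policy(state):
    """Deterministic policy: each state maps to exactly one action."""
    table = {"start": "E", "corridor": "E", "corner": "S"}  # hypothetical table
    return table[state]

def stochastic_policy(state):
    """Stochastic policy: each state maps to a probability for every action."""
    if state == "corner":
        return {"N": 0.0, "E": 0.1, "S": 0.8, "W": 0.1}
    return {"N": 0.1, "E": 0.7, "S": 0.1, "W": 0.1}

def sample_action(state, rng=random):
    """Draw one action according to the stochastic policy."""
    probs = stochastic_policy(state)
    return rng.choices(list(probs), weights=list(probs.values()))[0]

print(deterministic_policy("corner"))
print(sample_action("corner", random.Random(0)))
```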


Value Function

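A value function is a prediction of future reward (with action $a$ in state $s$)

Q-value function gives the expected total reward from state $s$ and action $a$, under policy $\pi$, with discount factor $\gamma$:

$Q^\pi(s, a) = \mathbb{E}[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s, a]$

Value functions decompose into a Bellman equation:

$Q^\pi(s, a) = \mathbb{E}_{s', a'}[r + \gamma Q^\pi(s', a') \mid s, a]$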
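One way to see the Bellman decomposition at work is to apply it repeatedly until the Q-values stop changing. The two-state MDP, the policy, and the function name below are hypothetical; this is a sketch of iterative policy evaluation, not a method prescribed by the slides.

```python
GAMMA = 0.9

# Hypothetical MDP: MDP[s][a] = list of (probability, next_state, reward).
MDP = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(1.0, "s1", 1.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
# Hypothetical deterministic policy to evaluate.
POLICY = {"s0": "go", "s1": "stay"}

def evaluate_q(mdp, policy, gamma, sweeps=500):
    """Repeatedly apply Q(s, a) <- E[r + gamma * Q(s', pi(s'))]."""
    q = {s: {a: 0.0 for a in mdp[s]} for s in mdp}
    for _ in range(sweeps):
        q = {
            s: {
                a: sum(p * (r + gamma * q[s2][policy[s2]]) for p, s2, r in mdp[s][a])
                for a in mdp[s]
            }
            for s in mdp
        }
    return q

q_pi = evaluate_q(MDP, POLICY, GAMMA)
print(q_pi)
```

For this toy problem the fixed point can be checked by hand: staying in `s1` forever yields 2 / (1 - 0.9) = 20, and going from `s0` yields 1 + 0.9 * 20 = 19.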


Optimal Value Function

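An optimal value function is the maximum achievable value: $Q^*(s, a) = \max_\pi Q^\pi(s, a) = Q^{\pi^*}(s, a)$

The optimal value function allows us to act optimally: $\pi^*(s) = \arg\max_a Q^*(s, a)$

The optimal value informally maximizes over all decisions: $Q^*(s, a) = r_{t+1} + \gamma \max_{a_{t+1}} r_{t+2} + \gamma^2 \max_{a_{t+2}} r_{t+3} + \cdots$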
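The link between the optimal value function and optimal behavior can be sketched with Q-value iteration followed by a greedy read-out. The three-state MDP and both function names are hypothetical illustrations.

```python
GAMMA = 0.5

# Hypothetical MDP: MDP[s][a] = list of (probability, next_state, reward).
MDP = {
    "left": {"wait": [(1.0, "left", 0.0)], "move": [(1.0, "mid", -1.0)]},
    "mid": {"wait": [(1.0, "mid", 0.0)], "move": [(1.0, "right", 4.0)]},
    "right": {"wait": [(1.0, "right", 0.0)], "move": [(1.0, "right", 0.0)]},
}

def q_value_iteration(mdp, gamma, sweeps=200):
    """Repeatedly apply Q(s, a) <- E[r + gamma * max over a' of Q(s', a')]."""
    q = {s: {a: 0.0 for a in mdp[s]} for s in mdp}
    for _ in range(sweeps):
        q = {
            s: {
                a: sum(p * (r + gamma * max(q[s2].values())) for p, s2, r in mdp[s][a])
                for a in mdp[s]
            }
            for s in mdp
        }
    return q

def greedy_policy(q):
    """Act optimally by picking, in each state, the action with the largest Q-value."""
    return {s: max(q[s], key=q[s].get) for s in q}

q_star = q_value_iteration(MDP, GAMMA)
print(q_star["left"])
print(greedy_policy(q_star)["left"])
```

In state `left`, moving costs 1 now but leads to a reward of 4 one step later, so its value is -1 + 0.5 * 4 = 1, which beats waiting (0.5 * 1 = 0.5).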




A model predicts what the environment will do next

P predicts the next state

R predicts the next immediate reward
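A model of this kind can be estimated from observed transitions by counting. The `TabularModel` class and the sample transitions are hypothetical; this sketch assumes small, discrete state and action sets.

```python
from collections import Counter, defaultdict

class TabularModel:
    """Count-based estimates of next-state probabilities and expected reward."""

    def __init__(self):
        self.next_counts = defaultdict(Counter)   # (s, a) -> Counter of next states
        self.reward_sums = defaultdict(float)     # (s, a) -> sum of observed rewards
        self.visits = Counter()                   # (s, a) -> number of observations

    def update(self, s, a, r, s_next):
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sums[(s, a)] += r
        self.visits[(s, a)] += 1

    def transition_probs(self, s, a):
        """Estimated distribution over next states (requires at least one visit)."""
        n = self.visits[(s, a)]
        return {s2: c / n for s2, c in self.next_counts[(s, a)].items()}

    def expected_reward(self, s, a):
        """Estimated immediate reward (requires at least one visit)."""
        return self.reward_sums[(s, a)] / self.visits[(s, a)]

model = TabularModel()
observed = [("A", "E", -1, "B"), ("A", "E", -1, "B"), ("A", "E", -1, "A"), ("B", "E", 0, "goal")]
for s, a, r, s_next in observed:
    model.update(s, a, r, s_next)
print(model.transition_probs("A", "E"))
print(model.expected_reward("A", "E"))
```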


Reinforcement Learning Approach

Policy-based RL

Search directly for the optimal policy $\pi^*$

$\pi^*$ is the policy achieving maximum future reward

Value-based RL

Estimate the optimal value function $Q^*(s, a)$

$Q^*(s, a)$ is the maximum value achievable under any policy

Model-based RL

Build a model of the environment

Plan (e.g. by lookahead) using the model


Maze Example

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location


Maze Example: Policy



Maze Example: Value Function


Numbers represent the value $V^\pi(s)$ of each state $s$
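With a reward of -1 per time-step and no discounting, the value of a state under a shortest-path policy is minus the number of steps to the goal. The sketch below computes such values and the matching policy for a small maze; the layout in `MAZE` is hypothetical, since the maze figure is not part of the extracted text, and no discounting is an assumption.

```python
# Hypothetical maze: '#' is a wall, 'S' the start, 'G' the goal.
MAZE = [
    "#######",
    "#S  # #",
    "# # # #",
    "# #  G#",
    "#######",
]
ACTIONS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

def solve_maze(maze):
    """Return state values and a greedy policy for reward -1 per step (no discount)."""
    cells = {(r, c) for r, row in enumerate(maze) for c, ch in enumerate(row) if ch != "#"}
    goal = next(cell for cell in cells if maze[cell[0]][cell[1]] == "G")

    def move(cell, delta):
        nxt = (cell[0] + delta[0], cell[1] + delta[1])
        return nxt if nxt in cells else cell   # walking into a wall leaves the agent in place

    values = {cell: 0.0 for cell in cells}
    # Each sweep applies V(s) <- max over actions of (-1 + V(s')), with V(goal) = 0.
    for _ in range(len(cells)):
        values = {
            s: 0.0 if s == goal else max(-1.0 + values[move(s, d)] for d in ACTIONS.values())
            for s in cells
        }
    policy = {
        s: max(ACTIONS, key=lambda a: values[move(s, ACTIONS[a])])
        for s in cells if s != goal
    }
    return values, policy

values, policy = solve_maze(MAZE)
print(values[(1, 1)], policy[(1, 1)])   # start cell: value and first move
```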




Categorizing RL Agents


Value-based

No Policy (implicit)

Value Function

Policy-based

Policy

No Value Function

Actor-critic

Policy

Value Function

Model-free

Policy and/or Value Function

No Model

Model-based

Policy and/or Value Function

Model



RL Agent Taxonomy


Problems within RL





Learning and Planning

In sequential decision making

Reinforcement learning

The environment is initially unknown

The agent interacts with the environment

The agent improves its policy


Planning

A model of the environment is known

The agent performs computations with its model (w/o any external interaction)

The agent improves its policy (a.k.a. deliberation, reasoning, introspection, pondering, thought, search)


Atari Example: Reinforcement Learning

Rules of the game are unknown

Learn directly from interactive game-play

Pick actions on joystick, see pixels and scores
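Learning purely from interaction can be illustrated with tabular Q-learning, a standard algorithm that these slides do not name. In the sketch below the corridor environment, the hyperparameters, and all function names are hypothetical; the agent never reads the rules inside `step`, it only observes the returned state and reward.

```python
import random

N_STATES, GOAL = 5, 4      # hypothetical corridor: states 0..4, goal at 4
ACTIONS = [-1, +1]         # move left / move right

def step(state, action):
    """Environment rules, unknown to the agent; it only sees the outputs."""
    next_state = min(N_STATES - 1, max(0, state + action))
    done = next_state == GOAL
    return next_state, (0.0 if done else -1.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Mostly act greedily, occasionally try a random action.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward the observed reward plus discounted best next value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q = q_learning()
learned_policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(learned_policy)
```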


Atari Example: Planning

Rules of the game are known

Query emulator based on the perfect model inside agent’s brain

If I take action a from state s:

what would the next state be?

what would the score be?

Plan ahead to find the optimal policy, e.g. by tree search
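Planning with a known model can be sketched as an exhaustive depth-limited tree search: the agent queries the model for the next state and reward instead of acting in the real environment. The corridor `model` and the `lookahead` function below are hypothetical illustrations.

```python
ACTIONS = [-1, +1]          # move left / move right
N_STATES, GOAL = 5, 4       # hypothetical corridor: states 0..4, goal at 4

def model(state, action):
    """Known model: answers 'if I take this action from this state, what happens?'"""
    next_state = min(N_STATES - 1, max(0, state + action))
    reward = 0.0 if next_state == GOAL else -1.0
    return next_state, reward

def lookahead(state, depth):
    """Best total reward reachable within `depth` steps, and the first action achieving it."""
    if depth == 0 or state == GOAL:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for action in ACTIONS:
        next_state, reward = model(state, action)     # query the model, no real interaction
        future, _ = lookahead(next_state, depth - 1)
        if reward + future > best_value:
            best_value, best_action = reward + future, action
    return best_value, best_action

print(lookahead(0, depth=6))
```

The search cost grows exponentially with depth, which is why practical planners prune or sample the tree rather than enumerate it.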


Exploration and Exploitation

Reinforcement learning is like trial-and-error learning

The agent should discover a good policy from its experience without losing too much reward along the way

Exploration finds more information about the environment

Exploitation exploits known information to maximize reward

When to try?

It is usually important to explore as well as exploit
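A common way to balance the two is an epsilon-greedy rule, shown here on a multi-armed bandit. The slides do not specify this rule; the function name, the Gaussian reward noise, and the arm means are assumptions made for illustration.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Pull arms for `steps` rounds; return pull counts, value estimates, mean reward."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                         # explore: random arm
        else:
            arm = max(range(n_arms), key=lambda i: estimates[i])  # exploit: best estimate
        reward = rng.gauss(true_means[arm], 1.0)                # noisy reward (assumed Gaussian)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running average
        total += reward
    return counts, estimates, total / steps

counts, estimates, mean_reward = epsilon_greedy_bandit([0.0, 0.0, 2.0])
print(counts, mean_reward)
```

Setting `epsilon` to 0 removes exploration entirely, so the agent can lock onto whichever arm looked best early on; setting it to 1 ignores what has been learned.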


Concluding Remarks

RL is a general purpose framework for decision making under interactions between agent and environment

RL is for an agent with the capacity to act

Each action influences the agent’s future state

Success is measured by a scalar reward signal

Goal: select actions to maximize future reward

An RL agent may include one or more of these components

Policy: agent’s behavior function

Value function: how good is each state and/or action

Model: agent’s representation of the environment

Big three: action, state, reward



Course materials by David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

ICLR 2015 Tutorial: http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf

ICML 2016 Tutorial: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf


