Outline
Machine Learning
◦ Supervised Learning vs. Reinforcement Learning
◦ Reinforcement Learning vs. Deep Learning
Introduction to Reinforcement Learning
◦ Agent and Environment
◦ Action, State, and Reward
Markov Decision Process
Reinforcement Learning Approach
◦ Policy-Based
◦ Value-Based
◦ Model-Based
Problems within RL
◦ Learning and Planning
◦ Exploration and Exploitation
Machine Learning
Supervised vs. Reinforcement
Supervised Learning
◦ Training based on supervisor/label/annotation
◦ Feedback is instantaneous
◦ Time does not matter
Reinforcement Learning
◦Training only based on reward signal
◦Feedback is delayed
◦Time matters
◦Agent actions affect subsequent data
Reinforcement Learning
RL is a general purpose framework for decision making
◦RL is for an agent with the capacity to act
◦Each action influences the agent’s future state
◦Success is measured by a scalar reward signal
◦Goal: select actions to maximize future reward
Deep Learning
DL is a general purpose framework for representation learning
◦ Given an objective
◦ Learn the representation required to achieve the objective
◦ Directly from raw inputs
◦ Using minimal domain knowledge
[Figure: a deep network mapping an input vector x through hidden layers to an output vector y]
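As a rough illustration of the x to y mapping in the figure, here is a minimal two-layer network sketch in numpy; all sizes, weights, and the ReLU choice are my own assumptions, not taken from the slides.

```python
# Minimal sketch (assumed example): a two-layer network mapping a raw input
# vector x to an output vector y, with the hidden layer as the learned representation.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)   # raw input (8-dim) -> representation (16-dim)
W2, b2 = rng.normal(size=(4, 16)), np.zeros(4)    # representation -> output y (4-dim)

def forward(x):
    h = np.maximum(0.0, W1 @ x + b1)   # hidden representation (ReLU)
    return W2 @ h + b2                 # output vector y

y = forward(rng.normal(size=8))
print(y.shape)  # (4,)
```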
Deep Reinforcement Learning
AI is an agent that can solve human-level tasks
◦RL defines the objective
◦DL gives the mechanism
◦RL + DL = general intelligence
Deep RL AI Examples
Play games: Atari, poker, Go, …
Explore worlds: 3D worlds, …
Control physical systems: manipulate, …
Interact with users: recommend, optimize, personalize, …
Introduction to RL
Reinforcement Learning
Outline
Machine Learning
◦ Supervised Learning vs. Reinforcement Learning
◦ Reinforcement Learning vs. Deep Learning
Introduction to Reinforcement Learning
◦ Agent and Environment
◦ Action, State, and Reward
Markov Decision Process
Reinforcement Learning Approach
◦ Policy-Based
◦ Value-Based
◦ Model-Based
Problems within RL
◦ Learning and Planning
◦ Exploration and Exploitation
Reinforcement Learning
RL is a general purpose framework for decision making
◦RL is for an agent with the capacity to act
◦Each action influences the agent’s future state
◦Success is measured by a scalar reward signal
Big three: action, state, reward
Agent and Environment
[Figure: agent-environment loop. The agent receives observation o_t and reward r_t and executes action a_t (e.g. MoveLeft / MoveRight); the environment receives the action and emits the next observation and reward.]
Agent and Environment
At time step t
◦ The agent
◦ Executes action a_t
◦ Receives observation o_t
◦ Receives scalar reward r_t
◦ The environment
◦ Receives action a_t
◦ Emits observation o_{t+1}
◦ Emits scalar reward r_{t+1}
◦ t increments at each environment step
State
Experience is the sequence of observations, actions, and rewards: H_t = o_1, r_1, a_1, …, a_{t-1}, o_t, r_t
State is the information used to determine what happens next
◦ what happens next depends on the history of experience
• The agent selects actions
• The environment selects observations/rewards
The state is a function of the history: s_t = f(H_t)
Environment State
The environment state s_t^e is the environment's private representation
◦ i.e. whatever data the environment uses to pick the next observation/reward
◦ may not be visible to the agent
◦ may contain irrelevant information
Agent State
The agent state s_t^a is the agent's internal representation
◦ i.e. whatever data the agent uses to pick the next action
◦ i.e. the information used by RL algorithms
◦ it can be any function of the history: s_t^a = f(H_t)
Information State
An information state (a.k.a. Markov state) contains all useful information from the history
The future is independent of the past given the present
◦ Once the state is known, the history may be thrown away
A state s_t is Markov if and only if P[s_{t+1} | s_t] = P[s_{t+1} | s_1, …, s_t]
Fully Observable Environment
Full observability: the agent directly observes the environment state: o_t = s_t^a = s_t^e
information state = agent state = environment state
This is a Markov decision process (MDP)
Partially Observable Environment
Partial observability: the agent indirectly observes the environment
agent state ≠ environment state
The agent must construct its own state representation s_t^a, e.g.
◦ Complete history: s_t^a = H_t
◦ Beliefs of the environment state: s_t^a = (P[s_t^e = s^1], …, P[s_t^e = s^n])
◦ Hidden state (from an RNN): s_t^a = σ(W_s s_{t-1}^a + W_o o_t)
This is a partially observable Markov decision process (POMDP)
Reward
Reinforcement learning is based on the reward hypothesis
A reward r_t is a scalar feedback signal
◦ Indicates how well the agent is doing at step t
Reward hypothesis: all agent goals can be described by the maximization of expected cumulative reward
Sequential Decision Making
Goal: select actions to maximize total future reward
◦Actions may have long-term consequences
◦Reward may be delayed
◦It may be better to sacrifice immediate reward to gain more long-term reward
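A small sketch of the point about delayed reward (the reward sequences below are made up for illustration): with discounting, an action that sacrifices immediate reward can still yield the larger return.

```python
# Illustrative sketch: discounted return G_t of a reward sequence.
def discounted_return(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

greedy_now   = [1.0, 0.0, 0.0, 0.0]   # take the immediate reward
patient_plan = [0.0, 0.0, 0.0, 2.0]   # sacrifice now, collect more later

print(discounted_return(greedy_now))    # 1.0
print(discounted_return(patient_plan))  # ~1.94, larger despite the delay
```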
Markov Decision Process
Fully Observable Environment
Outline
Machine Learning
◦ Supervised Learning vs. Reinforcement Learning
◦ Reinforcement Learning vs. Deep Learning
Introduction to Reinforcement Learning
◦ Agent and Environment
◦ Action, State, and Reward
Markov Decision Process
Reinforcement Learning Approach
◦ Policy-Based
◦ Value-Based
◦ Model-Based
Problems within RL
Markov Process
A Markov process is a memoryless random process
◦ i.e. a sequence of random states S_1, S_2, … with the Markov property
Student Markov chain
Sample episodes from S1=C1
• C1 C2 C3 Pass Sleep
• C1 FB FB C1 C2 Sleep
• C1 C2 C3 Pub C2 C3 Pass Sleep
• C1 FB FB C1 C2 C3 Pub
• C1 FB FB FB C1 C2 C3 Pub C2 Sleep
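A sketch of how such episodes can be sampled from a Markov chain. The transition probabilities below are placeholders chosen for illustration; the actual numbers are in the slide's diagram and are not reproduced here.

```python
# Illustrative sketch: sampling episodes from a Student-style Markov chain.
# Transition probabilities are assumed values, not the ones from the slide.
import random

P = {
    "C1":   [("C2", 0.5), ("FB", 0.5)],
    "C2":   [("C3", 0.8), ("Sleep", 0.2)],
    "C3":   [("Pass", 0.6), ("Pub", 0.4)],
    "FB":   [("FB", 0.9), ("C1", 0.1)],
    "Pub":  [("C1", 0.2), ("C2", 0.4), ("C3", 0.4)],
    "Pass": [("Sleep", 1.0)],
}

def sample_episode(start="C1", terminal="Sleep"):
    state, episode = start, [start]
    while state != terminal:
        next_states, probs = zip(*P[state])
        state = random.choices(next_states, weights=probs)[0]
        episode.append(state)
    return episode

print(sample_episode())  # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
```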
Student MRP
Markov Reward Process (MRP)
A Markov reward process is a Markov chain with values
◦ The return G_t is the total discounted reward from time-step t: G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^∞ γ^k r_{t+k+1}
Markov Decision Process (MDP)
A Markov decision process is an MRP with decisions
◦ It is an environment in which all states are Markov
Student MDP
Markov Decision Process (MDP)
S : finite set of states/observations
A : finite set of actions
P : transition probability
R : immediate reward
γ : discount factor
The goal is to choose the policy π that maximizes the expected overall return: E[ Σ_{t≥0} γ^t r_t ]
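To make the tuple concrete, here is a minimal container for (S, A, P, R, γ); the class and field names are my own and only mirror the definition above.

```python
# Minimal sketch of an MDP container (names are assumptions, not from the slides).
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states:  List[str]                                   # S
    actions: List[str]                                   # A
    P: Dict[Tuple[str, str], List[Tuple[str, float]]]    # (s, a) -> [(s', prob), ...]
    R: Dict[Tuple[str, str], float]                      # (s, a) -> immediate reward
    gamma: float = 0.99                                  # discount factor
```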
Reinforcement Learning
Outline
Machine Learning
◦ Supervised Learning vs. Reinforcement Learning
◦ Reinforcement Learning vs. Deep Learning
Introduction to Reinforcement Learning
◦ Agent and Environment
◦ Action, State, and Reward
Markov Decision Process
Reinforcement Learning Approach
◦ Policy-Based
◦ Value-Based
◦ Model-Based
Problems within RL
Major Components in an RL Agent
An RL agent may include one or more of these components
◦Policy: agent’s behavior function
◦Value function: how good is each state and/or action
◦Model: agent’s representation of the environment
Policy
A policy is the agent's behavior
A policy maps from state to action
◦ Deterministic policy: a = π(s)
◦ Stochastic policy: π(a|s) = P[a_t = a | s_t = s]
Value Function
A value function is a prediction of future reward (given action a in state s)
The Q-value function gives the expected total reward
◦ from state s and action a
◦ under policy π
◦ with discount factor γ
Q^π(s, a) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s, a ]
Value functions decompose into a Bellman equation:
Q^π(s, a) = E_{s', a'}[ r + γ Q^π(s', a') | s, a ]
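A minimal sketch of what the Bellman expectation equation buys you computationally: on a made-up two-state MDP, repeatedly applying the backup converges to Q^π for a fixed policy. Everything here (states, dynamics, rewards, policy) is invented for illustration.

```python
# Sketch: evaluating Q^pi by iterating the Bellman expectation backup on a toy MDP.
gamma = 0.9
states, actions = ["s0", "s1"], ["a0", "a1"]
P = {("s0", "a0"): [("s0", 1.0)], ("s0", "a1"): [("s1", 1.0)],
     ("s1", "a0"): [("s0", 1.0)], ("s1", "a1"): [("s1", 1.0)]}   # (s, a) -> [(s', prob)]
R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0,
     ("s1", "a0"): 0.0, ("s1", "a1"): 2.0}                       # (s, a) -> reward
pi = {s: {"a0": 0.5, "a1": 0.5} for s in states}                 # uniform random policy

Q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(200):  # synchronous sweeps until (approximately) converged
    Q = {(s, a): R[(s, a)] + gamma * sum(p * sum(pi[s2][a2] * Q[(s2, a2)] for a2 in actions)
                                         for s2, p in P[(s, a)])
         for s in states for a in actions}
print(Q)  # Q^pi(s, a): expected discounted reward from (s, a) under pi
```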
Optimal Value Function
The optimal value function is the maximum value achievable under any policy: Q*(s, a) = max_π Q^π(s, a)
The optimal value function allows us to act optimally: π*(s) = argmax_a Q*(s, a)
The optimal value informally maximizes over all future decisions:
Q*(s, a) = r_{t+1} + γ max_{a_{t+1}} r_{t+2} + γ² max_{a_{t+2}} r_{t+3} + …
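For contrast with the expectation backup above, a sketch of the Bellman optimality backup (Q-value iteration) on another made-up MDP: iterating the max backup converges to Q*, and acting greedily with respect to Q* gives π*. The toy dynamics and rewards are assumptions for illustration only.

```python
# Sketch: Q-value iteration (Bellman optimality backup) on a toy deterministic MDP.
gamma = 0.9
states, actions = ["s0", "s1"], ["stay", "go"]
step = {("s0", "stay"): ("s0", 0.0), ("s0", "go"): ("s1", 1.0),
        ("s1", "stay"): ("s1", 2.0), ("s1", "go"): ("s0", 0.0)}  # (s, a) -> (s', reward)

Q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(200):
    Q = {(s, a): step[(s, a)][1] + gamma * max(Q[(step[(s, a)][0], a2)] for a2 in actions)
         for s in states for a in actions}

pi_star = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}  # greedy w.r.t. Q*
print(pi_star)  # {'s0': 'go', 's1': 'stay'}
```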
Model
A model predicts what the environment will do next
◦P predicts the next state
◦R predicts the next immediate reward
Reinforcement Learning Approach
Policy-based RL
◦ Search directly for the optimal policy π*
◦ π* is the policy achieving maximum future reward
Value-based RL
◦ Estimate the optimal value function Q*(s, a)
◦ Q*(s, a) is the maximum value achievable under any policy
Model-based RL
◦ Build a model of the environment
◦ Plan (e.g. by lookahead) using the model
Maze Example
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent’s location
Maze Example: Policy
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent’s location
Maze Example: Value Function
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent’s location
Numbers represent the value v^π(s) of each state s
Maze Example: Value Function
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent’s location
Categorizing RL Agents
Value-Based
◦No Policy (implicit)
◦Value Function
Policy-Based
◦Policy
◦No Value Function
Actor-Critic
◦Policy
◦Value Function
Model-Free
◦Policy and/or Value Function
◦No Model
Model-Based
◦Policy and/or Value Function
◦Model
RL Agent Taxonomy
Problems within RL
Outline
Machine Learning
◦ Supervised Learning vs. Reinforcement Learning
◦ Reinforcement Learning vs. Deep Learning
Introduction to Reinforcement Learning
◦ Agent and Environment
◦ Action, State, and Reward
Markov Decision Process
Reinforcement Learning Approach
◦ Policy-Based
◦ Value-Based
◦ Model-Based
Problems within RL
Learning and Planning
In sequential decision making
◦Reinforcement learning
• The environment is initially unknown
• The agent interacts with the environment
• The agent improves its policy
◦Planning
• A model of the environment is known
• The agent performs computations with its model (w/o any external interaction)
• The agent improves its policy (a.k.a. deliberation, reasoning, introspection, pondering, thought, search)
Atari Example: Reinforcement Learning
Rules of the game are unknown
Learn directly from interactive game-play
Pick actions on joystick, see pixels and scores
Atari Example: Planning
Rules of the game are known
Query emulator based on the perfect model inside agent’s brain
◦ If I take action a from state s:
• what would the next state be?
• what would the score be?
Plan ahead to find the optimal policy, e.g. by tree search (sketched below)
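A sketch of the kind of lookahead this describes: exhaustive depth-limited tree search with a known model. Here model(s, a) -> (next_state, reward) stands in for the emulator; the toy model at the bottom is an assumption purely for the demo.

```python
# Sketch: planning by depth-limited lookahead with a known (perfect) model.
def plan(state, model, actions, depth=3, gamma=0.99):
    """Return (best value, best first action) by exhaustive tree search."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        next_state, reward = model(state, a)          # "if I take action a from state s ..."
        future, _ = plan(next_state, model, actions, depth - 1, gamma)
        value = reward + gamma * future               # "... what would the score be?"
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

# Toy deterministic model on integer states: moving right earns reward 1 on reaching state 3.
def toy_model(s, a):
    s_next = s + 1 if a == "right" else s - 1
    return s_next, (1.0 if s_next == 3 else 0.0)

print(plan(0, toy_model, ["left", "right"], depth=4))  # best value and first action
```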
Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy from its experience without losing too much reward along the way
Exploration finds more information about the environment
Exploitation exploits known information to maximize reward
When to try?
It is usually important to explore as well as exploit
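One common way to balance the two, shown here only as an assumed illustration (the slide does not name a specific method), is ε-greedy action selection: explore with probability ε, otherwise exploit the current best estimate.

```python
# Sketch: epsilon-greedy action selection (illustrative; the values below are made up).
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping action -> estimated value in the current state."""
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore: try a random action
    return max(q_values, key=q_values.get)        # exploit: pick the best-known action

print(epsilon_greedy({"MoveLeft": 0.2, "MoveRight": 0.7}))
```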
Concluding Remarks
RL is a general purpose framework for decision making based on interaction between an agent and its environment
◦RL is for an agent with the capacity to act
◦Each action influences the agent’s future state
◦Success is measured by a scalar reward signal
◦Goal: select actions to maximize future reward
An RL agent may include one or more of these components
◦Policy: agent’s behavior function
◦Value function: how good is each state and/or action
◦Model: agent’s representation of the environment
References
Course materials by David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
ICLR 2015 Tutorial: http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf
ICML 2016 Tutorial: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf