### Outline

Machine Learning

◦ Supervised Learning vs. Reinforcement Learning

◦ Reinforcement Learning vs. Deep Learning

Introduction to Reinforcement Learning

◦ Agent and Environment

◦ Action, State, and Reward

Markov Decision Process

Reinforcement Learning Approach

◦ Value-Based

◦ Policy-Based

◦ Model-Based

Problems within RL

◦ Learning and Planning

◦ Exploration and Exploitation

RL for Unsupervised Model



### Machine Learning

Figure: machine learning branches into supervised learning, unsupervised learning, and reinforcement learning.

### Supervised vs. Reinforcement

Supervised Learning

◦Training based on supervisor/label/annotation

◦Feedback is instantaneous

◦Time does not matter

Reinforcement Learning

◦Training only based on reward signal

◦Feedback is delayed

◦Time matters

◦Agent actions affect subsequent data

### Supervised vs. Reinforcement

Figure: supervised learning is learning from a teacher (input "Hello" → labeled response: say "Hi"; input "Bye bye" → say "Good bye"), while reinforcement learning is learning from critics: the agent only receives evaluative feedback (e.g. "Bad") on its utterances.

### Reinforcement Learning

**RL is a general purpose framework for decision making**

◦RL is for an agent with the capacity to act

◦Each action influences the agent’s future state

◦Success is measured by a scalar reward signal

◦Goal: select actions to maximize future reward

### Deep Learning

**DL is a general purpose framework for representation learning**

◦Given an objective

◦Learn the representation required to achieve the objective

◦Directly from raw inputs

◦Use minimal domain knowledge


Figure: a deep network maps an input vector **x** = (x_{1}, …, x_{N}) through hidden layers to an output vector **y** = (y_{1}, …, y_{M}).

### Deep Reinforcement Learning

AI is an agent that can solve human-level tasks

◦RL defines the objective

◦DL gives the mechanism

◦RL + DL = general intelligence

### Deep RL AI Examples

Play games: Atari, poker, Go, …

Explore worlds: 3D worlds, …

Control physical systems: manipulate, …

Interact with users: recommend, optimize, personalize, …


## Introduction to RL

Reinforcement Learning

### Outline

Machine Learning

◦ Supervised Learning vs. Reinforcement Learning

◦ Reinforcement Learning vs. Deep Learning

Introduction to Reinforcement Learning

◦ Agent and Environment

◦ Action, State, and Reward

Markov Decision Process

Reinforcement Learning Approach

◦ Value-Based

◦ Policy-Based

◦ Model-Based

Problems within RL

◦ Learning and Planning

◦ Exploration and Exploitation

RL for Unsupervised Model


### Reinforcement Learning

**RL is a general purpose framework for decision making**

◦RL is for an agent with the capacity to act

◦Each action influences the agent’s future state

◦Success is measured by a scalar reward signal

Big three: action, state, reward

### Agent and Environment

Figure: the agent-environment loop. The agent receives observation o_{t} and reward r_{t} from the environment and sends back action a_{t} (e.g. MoveLeft ←, MoveRight →).

### Agent and Environment

*At time step t*

◦The agent

◦ *Executes action a*_{t}

◦ *Receives observation o*_{t}

◦ *Receives scalar reward r*_{t}

◦The environment

◦ *Receives action a*_{t}

◦ *Emits observation o*_{t+1}

◦ *Emits scalar reward r*_{t+1}

◦*t increments at each environment step*


### State

Experience is the sequence of observations, actions, and rewards: $h_{t} = (o_{1}, r_{1}, a_{1}, \dots, a_{t-1}, o_{t}, r_{t})$

State is the information used to determine what happens next

◦what happens next depends on the history:

• The agent selects actions

• The environment selects observations/rewards

The state is a function of the history: $s_{t} = f(h_{t})$


### Environment State

The environment state 𝑠_{𝑡}^{𝑒} is the

*environment’s private representation*

◦whatever data the environment uses to pick the next observation/reward

◦may not be visible to the agent

◦may contain irrelevant information


### Agent State

The agent state 𝑠_{𝑡}^{𝑎} is the agent’s
*internal representation*

◦whatever data the agent uses to pick the next action, i.e. the information used by RL algorithms

◦can be any function of experience


### Information State

An information state (a.k.a. Markov state) contains all useful information from history

The future is independent of the past given the present

◦Once the state is known, the history may be thrown away

◦The state is a sufficient statistic of the future

A state $s_{t}$ is Markov iff $\mathbb{P}[s_{t+1} \mid s_{t}] = \mathbb{P}[s_{t+1} \mid s_{1}, \dots, s_{t}]$

### Fully Observable Environment

*Full observability: the agent directly observes the environment state*: $o_{t} = s^{a}_{t} = s^{e}_{t}$

information state = agent state = environment state


This is a Markov decision process (MDP)

### Partially Observable Environment

*Partial observability: the agent indirectly observes the environment state*

agent state ≠ environment state

Agent must construct its own state representation 𝑠_{𝑡}^{𝑎}

◦Complete history: $s^{a}_{t} = (o_{1}, r_{1}, \dots, o_{t}, r_{t})$

◦Beliefs of environment state: $s^{a}_{t} = (\mathbb{P}[s^{e}_{t} = s^{1}], \dots, \mathbb{P}[s^{e}_{t} = s^{n}])$

◦Hidden state (from RNN): $s^{a}_{t} = \sigma(W_{s} s^{a}_{t-1} + W_{o} o_{t})$

This is a partially observable Markov decision process (POMDP)

### Reward

Reinforcement learning is based on the reward hypothesis

A reward $r_{t}$ is a scalar feedback signal

◦It indicates how well the agent is doing at step $t$


Reward hypothesis: all agent goals can be described as the maximization of expected cumulative reward

### Sequential Decision Making

Goal: select actions to maximize total future reward

◦Actions may have long-term consequences

◦Reward may be delayed

◦It may be better to sacrifice immediate reward to gain more long-term reward

### Scenario of Reinforcement Learning

Figure: the agent observes the state of the environment, takes an action that changes the environment, and receives a reward (e.g. "Don't do that" as negative feedback, "Thank you." as positive feedback).

### Machine Learning ≈ Looking for a Function

Figure: the actor/policy is a function whose input is the observation and whose output is the action, Action = π(Observation); the reward from the environment is used to pick the best function.
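To make the "looking for a function" view concrete, here is a minimal sketch: each candidate policy is just a function, and average reward is used to pick among them. The `env` object and its Gym-style `reset`/`step` API, as well as the toy candidate policies, are assumptions for illustration.

```python
# "Looking for a function": each policy is a function Observation -> Action,
# and total reward is the criterion used to pick the best one.
def evaluate(policy, env, episodes=10):
    total = 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            obs, reward, done, _ = env.step(policy(obs))  # Action = pi(Observation)
            total += reward
    return total / episodes

# Two toy candidate policies (illustrative only).
candidates = [lambda obs: 0, lambda obs: 1]
# best = max(candidates, key=lambda pi: evaluate(pi, env))
```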

### Learning to Play Go

Figure: the observation is the board position; the action is the next move.

The agent learns to take actions maximizing expected reward.

◦If win, reward = 1

◦If loss, reward = -1

◦reward = 0 in most cases (feedback is delayed until the game ends)

### Learning to Play Go

Supervised: learning from a teacher

◦ Next move: "5-5"; next move: "3-3"

Reinforcement Learning: learning from experience

◦ First move …… many moves …… Win!

◦ (Two agents play with each other.)

### Learning a Chatbot

The machine obtains feedback from the user

Figure: two dialogues with user feedback as reward, e.g. "How are you?" → "Bye bye" (reward -10) and "Hello" → "Hi" (reward 3).

The chatbot learns to maximize the expected reward

### Learning a Chatbot

Let two agents talk to each other (sometimes they generate good dialogues, sometimes bad ones)

◦ "How old are you?" / "See you." / "See you." (bad)

◦ "How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?" (good)

### Learning a Chatbot

With this approach, we can generate a lot of dialogues (Dialogue 1, Dialogue 2, …, Dialogue 8).

Use pre-defined rules to evaluate the goodness of each dialogue

The machine learns from the evaluation as rewards

### Learning to Play Video Game

Space Invaders: the episode terminates when all aliens are killed, or your spaceship is destroyed

Figure: the game screen, showing the spaceship, shields, the aliens to kill, and the score as the reward; "fire" is one of the actions.

### Learning to Play Video Game

Start with observation s_{1} → action a_{1}: "right" → obtain reward r_{1} = 0

Observation s_{2} → action a_{2}: "fire" (kill an alien) → obtain reward r_{2} = 5

Observation s_{3} → …

Usually there is some randomness in the environment

### Learning to Play Video Game

After many turns (observations s_{1}, s_{2}, s_{3}, …), the game is over (the spaceship is destroyed).

**This is an episode.**

Learn to maximize the expected cumulative reward per episode

### More applications

Flying Helicopter

◦ https://www.youtube.com/watch?v=0JL04JJjocc

Driving

◦ https://www.youtube.com/watch?v=0xo1Ldx3L5Q

Robot

◦ https://www.youtube.com/watch?v=370cT-OAzzM

Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI

◦ http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai

Text Generation

◦ https://www.youtube.com/watch?v=pbQ4qe8EwLo

### Markov Decision Process

Fully Observable Environment

### Outline

Machine Learning

◦ Supervised Learning vs. Reinforcement Learning

◦ Reinforcement Learning vs. Deep Learning

Introduction to Reinforcement Learning

◦ Agent and Environment

◦ Action, State, and Reward

Markov Decision Process

Reinforcement Learning Approach

◦ Value-Based

◦ Policy-Based

◦ Model-Based

Problems within RL

◦ Learning and Planning

◦ Exploration and Exploitation

RL for Unsupervised Model


### Markov Process

A Markov process is a memoryless random process

◦i.e. a sequence of random states S_{1}, S_{2}, ... with the Markov property

Student Markov chain

Sample episodes from S_{1}=C1

• C1 C2 C3 Pass Sleep

• C1 FB FB C1 C2 Sleep

• C1 C2 C3 Pub C2 C3 Pass Sleep

• C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep

### Markov Reward Process (MRP)

Figure: the Student MRP

A Markov reward process is a Markov chain with values

◦The return $G_{t}$ is the total discounted reward from time-step $t$:

$$G_{t} = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$$

A Markov decision process is an MRP with decisions

◦It is an environment in which all states are Markov
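As a quick worked example of the return, a short Python sketch (the reward sequence and γ are illustrative):

```python
# Discounted return: fold backwards using G_t = r_{t+1} + gamma * G_{t+1}.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: 0 + 0.9*5 + 0.81*0 + 0.729*10 = 11.79
print(discounted_return([0, 5, 0, 10], gamma=0.9))
```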

### Markov Decision Process (MDP)

Figure: the Student MDP

### Markov Decision Process (MDP)

An MDP is a tuple (S, A, P, R, γ):

◦ S: finite set of states/observations

◦ A: finite set of actions

◦ P: transition probability

◦ R: immediate reward

◦ γ: discount factor

Goal: choose the policy π that maximizes the expected overall return: $\max_{\pi} \mathbb{E}\big[\sum_{t \ge 0} \gamma^{t} r_{t} \mid \pi\big]$
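To make the tuple concrete, a minimal sketch of a toy two-state MDP as plain Python data; every state, action, and number below is illustrative:

```python
# A tiny illustrative MDP as plain Python data structures.
states  = ["s0", "s1"]
actions = ["stay", "go"]
gamma   = 0.9                      # discount factor

# P[(s, a)] maps each next state to its transition probability.
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
```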


### Reinforcement Learning

### Outline

Machine Learning

◦ Supervised Learning vs. Reinforcement Learning

◦ Reinforcement Learning vs. Deep Learning

Introduction to Reinforcement Learning

◦ Agent and Environment

◦ Action, State, and Reward

Markov Decision Process

Reinforcement Learning

◦ Value-Based

◦ Policy-Based

◦ Model-Based

Problems within RL

◦ Learning and Planning

◦ Exploration and Exploitation

RL for Unsupervised Model


### Major Components in an RL Agent

An RL agent may include one or more of these components

◦**Value function: how good is each state and/or action**

◦**Policy: agent’s behavior function**

◦**Model: agent’s representation of the environment**

### Value Function

A value function is a prediction of future reward

The Q-value function gives the expected total reward

◦from state $s$ and action $a$

◦under policy $\pi$

◦with discount factor $\gamma$

$$Q^{\pi}(s,a) = \mathbb{E}\left[r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \mid s, a\right]$$

Value functions decompose into a Bellman equation:

$$Q^{\pi}(s,a) = \mathbb{E}_{s',a'}\left[r + \gamma Q^{\pi}(s',a') \mid s, a\right]$$

### Optimal Value Function

An optimal value function is the maximum achievable value:

$$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) = Q^{\pi^{*}}(s,a)$$

The optimal value function allows us to act optimally:

$$\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)$$

The optimal value informally maximizes over all decisions:

$$Q^{*}(s,a) = r_{t+1} + \gamma \max_{a_{t+1}} r_{t+2} + \gamma^{2} \max_{a_{t+2}} r_{t+3} + \cdots$$

Optimal values decompose into a Bellman equation:

$$Q^{*}(s,a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^{*}(s',a') \mid s, a\right]$$

### Value Function Approximation

*If value functions are represented by a lookup table*, they do not scale:

◦too many states and/or actions to store

◦too slow to learn the value of each entry individually

*Values can instead be estimated with function approximation*


### Q-Networks

Q-networks represent value functions with a weight vector $w$: $Q(s, a; w) \approx Q^{*}(s, a)$

◦generalize from seen states to unseen states

◦update the parameters $w$ of the function approximation
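A minimal Q-network sketch in PyTorch, assuming a small vector-valued state; the layer sizes and dimensions are illustrative:

```python
# A minimal Q-network: maps a state vector to one Q-value per action.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, num_actions)

q = QNetwork(state_dim=4, num_actions=2)
print(q(torch.zeros(1, 4)))  # Q-values for both actions in a dummy state
```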

### Q-Learning

Goal: estimate the optimal Q-values

◦Optimal Q-values obey the Bellman equation

$$Q^{*}(s,a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^{*}(s',a') \mid s, a\right]$$

◦*Value iteration* algorithms solve the Bellman equation by regressing $Q(s, a; w)$ toward the learning target $r + \gamma \max_{a'} Q(s', a'; w)$
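A minimal tabular Q-learning update implementing this target; the states, actions, and hyperparameters are illustrative placeholders:

```python
# Tabular Q-learning update: a minimal sketch.
from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)], defaults to 0
alpha, gamma = 0.1, 0.99    # learning rate and discount factor
actions = [0, 1]

def q_update(s, a, r, s_next):
    # Move Q(s, a) toward the learning target r + gamma * max_a' Q(s', a').
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```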

### Policy

A policy is the agent's behavior: a map from state to action

◦Deterministic policy: $a = \pi(s)$

◦Stochastic policy: $\pi(a \mid s) = \mathbb{P}[a \mid s]$

### Policy Networks

Represent the policy by a network with weights $u$:

$$a = \pi(s; u) \;\;\text{(deterministic policy)} \qquad \pi(a \mid s; u) = \mathbb{P}[a \mid s; u] \;\;\text{(stochastic policy)}$$

The objective is to maximize the total discounted reward by SGD:

$$L(u) = \mathbb{E}\left[r_{1} + \gamma r_{2} + \gamma^{2} r_{3} + \cdots \mid \pi(\cdot; u)\right]$$

### Policy Gradient

The gradient of a stochastic policy $\pi(a \mid s; u)$ is given by

$$\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[\frac{\partial \log \pi(a \mid s; u)}{\partial u}\, Q^{\pi}(s,a)\right]$$

The gradient of a deterministic policy $a = \pi(s; u)$ is given by

$$\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[\frac{\partial Q^{\pi}(s,a)}{\partial a}\, \frac{\partial a}{\partial u}\right]$$
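A minimal REINFORCE-style sketch of the stochastic policy gradient in PyTorch; it uses the sampled return $G_t$ as an unbiased estimate of $Q^{\pi}(s,a)$. The Gym-style `env` with 4-dimensional observations and 2 discrete actions is an assumption:

```python
# REINFORCE-style policy gradient sketch (Gym-like env API assumed).
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_episode(env):
    log_probs, rewards = [], []
    obs, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    # Discounted returns G_t for each step, computed backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    # Ascend E[grad log pi(a|s; u) * G_t] by minimizing the negative.
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```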

### Model


A model predicts what the environment will do next

◦$\mathcal{P}$ predicts the next state: $\mathcal{P}^{a}_{ss'} = \mathbb{P}[s_{t+1} = s' \mid s_{t} = s, a_{t} = a]$

◦$\mathcal{R}$ predicts the next immediate reward: $\mathcal{R}^{a}_{s} = \mathbb{E}[r_{t+1} \mid s_{t} = s, a_{t} = a]$
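One simple way to build such a model is to estimate P and R from counts of experienced transitions; a minimal sketch (the helper names are illustrative):

```python
# Learning a model from experience by counting transitions.
from collections import Counter, defaultdict

counts = defaultdict(Counter)     # counts[(s, a)][s'] = visit count
reward_sum = defaultdict(float)   # running reward sums per (s, a)
visits = defaultdict(int)

def record(s, a, r, s_next):
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

def P_hat(s, a, s_next):
    # Estimated transition probability (assumes (s, a) was visited).
    return counts[(s, a)][s_next] / visits[(s, a)]

def R_hat(s, a):
    # Estimated mean immediate reward.
    return reward_sum[(s, a)] / visits[(s, a)]
```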

### Reinforcement Learning Approach

Value-based RL

◦Estimate the optimal value function $Q^{*}(s,a)$, the maximum value achievable under any policy

Policy-based RL

◦Search directly for the optimal policy $\pi^{*}$, the policy achieving maximum future reward

Model-based RL

◦Build a model of the environment

◦Plan (e.g. by lookahead) using the model

### Maze Example

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location


### Maze Example: Value Function

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location

*Numbers represent the value $v_{\pi}(s)$ of each state $s$*

### Maze Example: Policy

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location


*Arrows represent policy π(s) for each state s*

### Maze Example: Model

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location

*Grid layout represents transition model P*

### Categorizing RL Agents

Value-Based

◦No Policy (implicit)

◦Value Function

Policy-Based

◦Policy

◦No Value Function

Actor-Critic

◦Policy

◦Value Function

Model-Free

◦Policy and/or Value Function

◦No Model

Model-Based

◦Policy and/or Value Function

◦Model


### RL Agent Taxonomy

### Problems within RL


### Outline

Machine Learning

◦ Supervised Learning vs. Reinforcement Learning

◦ Reinforcement Learning vs. Deep Learning

Introduction to Reinforcement Learning

◦ Agent and Environment

◦ Action, State, and Reward

Markov Decision Process

Reinforcement Learning

◦ Value-Based

◦ Policy-Based

◦ Model-Based

Problems within RL

◦ Learning and Planning

◦ Exploration and Exploitation

### Learning and Planning

In sequential decision making

◦Reinforcement learning

• The environment is initially unknown

• The agent interacts with the environment

• The agent improves its policy

◦Planning (a.k.a. deliberation, reasoning, introspection, pondering, thought, search)

• A model of the environment is known

• The agent performs computations with its model (without any external interaction)

• The agent improves its policy


### Atari Example: Reinforcement Learning

Rules of the game are unknown

Learn directly from interactive game-play

Pick actions on the joystick; see pixels and scores

### Atari Example: Planning


Rules of the game are known

Query the emulator based on the perfect model inside the agent's brain

◦ If I take action a from state s:

• what would the next state be?

• what would the score be?

Plan ahead to find the optimal policy, e.g. by tree search
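A minimal depth-limited tree-search sketch of this kind of planning, assuming a deterministic emulator with hypothetical `legal_actions`, `step`, and `is_terminal` methods:

```python
# Depth-limited lookahead with a known (deterministic) model.
def plan(emulator, state, depth, gamma=0.99):
    """Return (best_value, best_action) by exhaustive tree search."""
    if depth == 0 or emulator.is_terminal(state):
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in emulator.legal_actions(state):
        # "If I take action a from state s: what would the next state/score be?"
        next_state, reward = emulator.step(state, a)
        future, _ = plan(emulator, next_state, depth - 1, gamma)
        value = reward + gamma * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```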

### Exploration and Exploitation

Reinforcement learning is like trial-and-error learning: the agent should discover a good policy from its experience without losing too much reward along the way

*Exploration finds more information about the environment*

*Exploitation exploits known information to maximize reward*

When should the agent try something new?

It is usually important to explore as well as exploit
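A standard way to balance the two is ε-greedy action selection; a minimal sketch, reusing the tabular `Q` dictionary from the Q-learning example:

```python
# Epsilon-greedy: explore with probability epsilon, otherwise exploit.
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(state, a)])   # exploit
```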

### Outline

Machine Learning

◦ Supervised Learning vs. Reinforcement Learning

◦ Reinforcement Learning vs. Deep Learning

Introduction to Reinforcement Learning

◦ Agent and Environment

◦ Action, State, and Reward

Markov Decision Process

Reinforcement Learning

◦ Policy-Based

◦ Value-Based

◦ Model-Based

Problems within RL

◦ Learning and Planning

◦ Exploration and Exploitation

RL for Unsupervised Model


### RL for Unsupervised Model: Modularizing Unsupervised Sense Embeddings (MUSE)

### Word2Vec Polysemy Issue

Words are polysemous

◦ An apple a day, keeps the doctor away.

◦ Smartphone companies including apple, …

If words are polysemous, are their embeddings?

◦ No

◦ What's the problem?

Figure: a word2vec embedding space (tree, trees, rock, rocks, …) in which "apple" has a single vector, even in contexts like "Smartphone companies including blackberry, and sony will be invited."

### Modular Framework

Two key mechanisms

◦ **Sense selection given a text context**

◦ **Sense representation to embed statistical characteristics of sense identity**

Figure: the word "apple" goes through sense selection to a sense identity (apple-1 or apple-2); the sense representation module learns a sense embedding for each identity; the two modules are connected by reinforcement learning.

### Sense Selection Module

Input: a text context $\bar{C}_{t} = (C_{t-m}, \dots, C_{t} = w_{i}, \dots, C_{t+m})$

Output: the fitness of each sense $z_{i1}, \dots, z_{i3}$

Model architecture: Continuous Bag-of-Words (CBOW), for efficiency

Sense selection can be

◦ Policy-based

◦ Value-based (Q-value)

Figure: sense selection for the target word $C_{t} = w_{i}$ ("apple" in "companies including apple blackberry and …"): context words are embedded through matrix $P$, and matrix $Q_{i}$ produces a fitness score $q(z_{ik} \mid \bar{C}_{t})$ for each sense.
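A rough sketch of this CBOW-style sense scoring; all shapes and randomly initialized weights are illustrative, and the actual MUSE model's details may differ:

```python
# Sketch of CBOW-style sense scoring for word w_i.
import numpy as np

vocab_size, d, n_senses = 5000, 100, 3
P = np.random.randn(vocab_size, d)    # context-word embedding matrix
Q_i = np.random.randn(n_senses, d)    # per-sense selection matrix for w_i

def sense_fitness(context_word_ids):
    ctx = P[context_word_ids].mean(axis=0)  # CBOW: average context embeddings
    return Q_i @ ctx                        # one fitness score per sense z_ik
```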

### Sense Representation Module

Input: a sense collocation $(z_{ik}, z_{jl})$

Output: the collocation likelihood estimate

Model architecture: skip-gram

Sense representation learning

Figure: the module estimates likelihoods such as $P(z_{j2} \mid z_{i1})$ from sense embedding matrices $U$ and $V$.
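A rough sketch of the skip-gram-style collocation likelihood over senses; the shapes and randomly initialized weights are illustrative:

```python
# Sketch of skip-gram-style collocation likelihood P(z_jl | z_ik).
import numpy as np

n_total_senses, d = 15000, 100
U = np.random.randn(n_total_senses, d)   # "output" sense embeddings
V = np.random.randn(n_total_senses, d)   # "input" sense embeddings

def collocation_likelihood(z_ik, z_jl):
    logits = U @ V[z_ik]                  # score every candidate sense
    probs = np.exp(logits - logits.max()) # numerically stable softmax
    return (probs / probs.sum())[z_jl]
```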

### A Summary of MUSE

Corpus: { Smartphone companies including apple blackberry, and sony will be invited. }

Figure: the full pipeline over this corpus. (1) The sense selection module picks a sense for the target word $C_{t} = w_{i}$ ("apple"), scoring $q(z_{i1} \mid \bar{C}_{t}), q(z_{i2} \mid \bar{C}_{t}), q(z_{i3} \mid \bar{C}_{t})$ via matrices $P$ and $Q_{i}$. (2) A collocated sense is sampled by sense selection for the collocated word $C_{t'} = w_{j}$ ("blackberry"), via matrices $P$ and $Q_{j}$. (3) The sense representation module estimates the collocation likelihood $P(z_{j2} \mid z_{i1})$ with negative sampling over matrices $U$ and $V$, and returns it as the reward signal for sense selection.

The first purely sense-level embedding learning with efficient sense selection.

### Qualitative Analysis

| Context | k-NN |
| --- | --- |
| … braves finish the **season in tie with the** los angeles dodgers … | scoreless otl shootout 6-6 hingis 3-3 7-7 0-0 |
| … his later years proudly **wore tie with the chinese** characters for … | pants trousers shirt juventus blazer socks anfield |
| … of the mulberry or the **blackberry and minos** sent him to … | cranberries maple vaccinium apricot apple |
| … of the large number of **blackberry users in the us** federal … | smartphones sap microsoft ipv6 smartphone |

### Demonstration

### Concluding Remarks

**RL is a general-purpose framework for decision making** under interactions between an agent and its environment

◦RL is for an agent with the capacity to act

◦Each action influences the agent's future state

◦Success is measured by a scalar reward signal

◦Goal: select actions to maximize future reward

An RL agent may include one or more of these components

◦Value function: how good is each state and/or action

◦Policy: agent’s behavior function

◦Model: agent’s representation of the environment


### References

Course materials by David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

ICLR 2015 Tutorial: http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf

ICML 2016 Tutorial: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf