Outline
Machine Learning
◦ Supervised Learning vs. Reinforcement Learning
◦ Reinforcement Learning vs. Deep Learning
Introduction to Reinforcement Learning
◦ Agent and Environment
◦ Action, State, and Reward
Markov Decision Process
Reinforcement Learning Approach
◦ Value-Based
◦ Policy-Based
◦ Model-Based
Problems within RL
◦ Learning and Planning
◦ Exploration and Exploitation
RL for Unsupervised Model
Machine Learning
Machine Learning
(Figure: machine learning encompasses supervised learning, unsupervised learning, and reinforcement learning.)
Supervised vs. Reinforcement
Supervised Learning
◦ Training is based on a supervisor/label/annotation
◦ Feedback is instantaneous
◦ Time does not matter
Reinforcement Learning
◦ Training is based only on a reward signal
◦ Feedback is delayed
◦ Time matters
◦ Agent actions affect subsequent data
Supervised vs. Reinforcement
(Figure: supervised learning is learning from a teacher: hearing "Hello", the learner is told to say "Hi"; hearing "Bye bye", to say "Good bye". Reinforcement learning is learning from critics: only after a whole exchange does the learner receive overall feedback such as "Bad".)
Reinforcement Learning
RL is a general-purpose framework for decision making
◦RL is for an agent with the capacity to act
◦Each action influences the agent’s future state
◦Success is measured by a scalar reward signal
◦Goal: select actions to maximize future reward
Deep Learning
DL is a general-purpose framework for representation learning
◦ Given an objective
◦ Learn the representation required to achieve the objective
◦ Directly from raw inputs
◦ Using minimal domain knowledge
(Figure: a deep neural network as a function mapping an input vector x = (x1, …, xN) through multiple hidden layers to an output vector y = (y1, …, yM).)
Deep Reinforcement Learning
AI is an agent that can solve human-level tasks
◦ RL defines the objective
◦ DL gives the mechanism
◦ RL + DL = general intelligence
Deep RL AI Examples
Play games: Atari, poker, Go, …
Explore worlds: 3D worlds, …
Control physical systems: manipulate, …
Interact with users: recommend, optimize, personalize, …
Introduction to RL
Reinforcement Learning
RL is a general-purpose framework for decision making
◦RL is for an agent with the capacity to act
◦Each action influences the agent’s future state
◦Success is measured by a scalar reward signal
Big three: action, state, reward
Agent and Environment
(Figure: at each step the agent receives observation ot and reward rt from the environment and executes action at, e.g., MoveLeft or MoveRight.)
Agent and Environment
At time step t
◦ The agent
 • Executes action at
 • Receives observation ot
 • Receives scalar reward rt
◦ The environment
 • Receives action at
 • Emits observation ot+1
 • Emits scalar reward rt+1
◦ t increments at the environment step
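To make the interaction protocol concrete, here is a minimal sketch of this loop in Python; the `env.reset()`/`env.step()` interface and the `RandomAgent` are illustrative assumptions, not a specific library's API.

```python
import random

class RandomAgent:
    """Picks actions uniformly at random; a learning agent would improve its policy."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, observation, reward):
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=1000):
    """One episode of the agent-environment loop described above."""
    observation, reward, done = env.reset(), 0.0, False
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation, reward)       # agent executes a_t
        observation, reward, done = env.step(action)  # env emits o_{t+1}, r_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward
```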
State
Experience is the sequence of observations, actions, and rewards: $h_t = (o_1, r_1, a_1, \dots, a_{t-1}, o_t, r_t)$
State is the information used to determine what happens next
◦ what happens next depends on the history of experience
 • The agent selects actions
 • The environment selects observations/rewards
The state is a function of the history: $s_t = f(h_t)$
Environment State
The environment state $s_t^e$ is the environment's private representation
◦ whatever data the environment uses to pick the next observation/reward
◦ may not be visible to the agent
◦ may contain irrelevant information
Agent State
The agent state $s_t^a$ is the agent's internal representation
◦ whatever data the agent uses to pick the next action, i.e., the information used by RL algorithms
◦ can be any function of the history: $s_t^a = f(h_t)$
Information State
An information state (a.k.a. Markov state) contains all useful information from the history
The future is independent of the past given the present
◦ Once the state is known, the history may be thrown away
◦ The state is a sufficient statistic of the future
A state $s_t$ is Markov iff $\mathbb{P}[s_{t+1} \mid s_t] = \mathbb{P}[s_{t+1} \mid s_1, \dots, s_t]$
Fully Observable Environment
Full observability: the agent directly observes the environment state, $o_t = s_t^a = s_t^e$
information state = agent state = environment state
This is a Markov decision process (MDP)
Partially Observable Environment
Partial observability: the agent indirectly observes the environment
agent state ≠ environment state
The agent must construct its own state representation $s_t^a$, e.g.
◦ Complete history: $s_t^a = h_t$
◦ Beliefs of environment state: $s_t^a = (\mathbb{P}[s_t^e = s^1], \dots, \mathbb{P}[s_t^e = s^n])$
◦ Hidden state (from an RNN): $s_t^a = \sigma(W_s s_{t-1}^a + W_o o_t)$
This is a partially observable Markov decision process (POMDP)
Reward
Reinforcement learning is based on the reward hypothesis
A reward rt is a scalar feedback signal
◦ Indicates how well the agent is doing at step t
Reward hypothesis: all agent goals can be described by maximizing expected cumulative reward
Sequential Decision Making
Goal: select actions to maximize total future reward
◦Actions may have long-term consequences
◦Reward may be delayed
◦It may be better to sacrifice immediate reward to gain more long-term reward
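To make "total future reward" concrete, a common choice is the discounted return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$. A minimal sketch in Python; the reward sequences are invented for illustration:

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Sacrificing immediate reward can pay off in the long run:
greedy  = [1, 0, 0, 0]   # grab +1 now, nothing later
patient = [0, 0, 0, 10]  # wait, then collect +10
print(discounted_return(greedy), discounted_return(patient))
```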
Scenario of Reinforcement Learning
(Figure: the agent receives an observation (state) from the environment, takes an action that changes the environment, and obtains a reward, e.g., "Don't do that" after a bad action or "Thank you." after a good one; the cycle then repeats.)
Machine Learning ≈ Looking for a Function
(Figure: the actor/policy is a function from observation to action, Action = π(Observation); the observation from the environment is the function input, the action is the function output, and the reward is used to pick the best function.)
Learning to Play Go
(Figure: the observation is the board position; the action is the next move.)
Learning to Play Go
Agent learns to take actions maximizing expected reward.
◦ If win, reward = 1
◦ If loss, reward = -1
◦ reward = 0 in most cases
Learning to Play Go
Supervised: learning from a teacher. Given a board position, the teacher provides the next move, e.g., "5-5" or "3-3".
Reinforcement Learning: learning from experience. Play from the first move through many moves until the game is won; two agents play against each other.
Learning a Chatbot
Machine obtains feedback from the user
(Figure: answering "How are you?" with "Bye bye" earns reward -10; answering "Hello" with "Hi" earns reward 3.)
Chatbot learns to maximize the expected reward
Learning a Chatbot
Let two agents talk to each other (sometimes they generate good dialogues, sometimes bad ones)
(Example dialogues: "How old are you?" "See you." "See you." versus "How old are you?" "I am 16." "I thought you were 12." "What makes you think so?")
Learning a Chatbot
With this approach, we can generate a lot of dialogues (Dialogue 1, Dialogue 2, …, Dialogue 8).
Use pre-defined rules to evaluate the goodness of each dialogue.
Machine learns from the evaluation as rewards
Learning to Play Video Game
Space Invaders: the game terminates when all aliens are killed or your spaceship is destroyed
(Figure: game screen showing the score (reward), the shield, and the aliens to kill; actions include moving and firing.)
Learning to Play Video Game
Start with observation $s_1$; take action $a_1$ "right" and obtain reward $r_1 = 0$ and observation $s_2$; take action $a_2$ "fire" (killing an alien) and obtain reward $r_2 = 5$ and observation $s_3$.
Usually there is some randomness in the environment
Learning to Play Video Game
Starting from observation $s_1$, the interaction continues for many turns until Game Over (the spaceship is destroyed). This whole sequence is an episode.
Learn to maximize the expected cumulative reward per episode
More applications
Flying Helicopter
◦ https://www.youtube.com/watch?v=0JL04JJjocc
Driving
◦ https://www.youtube.com/watch?v=0xo1Ldx3L5Q
Robot
◦ https://www.youtube.com/watch?v=370cT-OAzzM
Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI
◦ http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
Text Generation
◦ https://www.youtube.com/watch?v=pbQ4qe8EwLo
Markov Decision Process
Fully Observable Environment
Markov Process
A Markov process is a memoryless random process
◦ i.e., a sequence of random states S1, S2, … with the Markov property
Student Markov chain: sample episodes starting from S1 = C1
• C1 C2 C3 Pass Sleep
• C1 FB FB C1 C2 Sleep
• C1 C2 C3 Pub C2 C3 Pass Sleep
• C1 FB FB C1 C2 C3 Pub
• C1 FB FB FB C1 C2 C3 Pub C2 Sleep
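As a concrete illustration of memorylessness, here is a sketch that samples such episodes; the transition probabilities follow David Silver's student Markov chain example as best I recall them, so treat the exact numbers as an assumption:

```python
import random

# Transition probabilities for the student Markov chain (values assumed
# from David Silver's example; Sleep is the terminal state).
P = {
    "C1":   [("C2", 0.5), ("FB", 0.5)],
    "C2":   [("C3", 0.8), ("Sleep", 0.2)],
    "C3":   [("Pass", 0.6), ("Pub", 0.4)],
    "FB":   [("FB", 0.9), ("C1", 0.1)],
    "Pub":  [("C1", 0.2), ("C2", 0.4), ("C3", 0.4)],
    "Pass": [("Sleep", 1.0)],
}

def sample_episode(start="C1"):
    """Memoryless sampling: the next state depends only on the current state."""
    state, episode = start, [start]
    while state != "Sleep":
        nexts, probs = zip(*P[state])
        state = random.choices(nexts, weights=probs)[0]
        episode.append(state)
    return episode

print(sample_episode())  # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
```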
Markov Reward Process (MRP)
A Markov reward process is a Markov chain with values
◦ The return Gt is the total discounted reward from time-step t: $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
(Figure: student MRP)
Markov Decision Process (MDP)
A Markov decision process is an MRP with decisions
◦ It is an environment in which all states are Markov
(Figure: student MDP)
An MDP is defined by
◦ S: finite set of states/observations
◦ A: finite set of actions
◦ P: transition probability
◦ R: immediate reward
◦ γ: discount factor
Goal is to choose the policy π that maximizes the expected overall return: $\max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi\right]$
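As a data-structure sketch, the $(S, A, P, R, \gamma)$ tuple could be encoded like this in Python; the names and the dictionary encoding are illustrative assumptions:

```python
from typing import Dict, List, NamedTuple, Tuple

class MDP(NamedTuple):
    """The (S, A, P, R, gamma) tuple; the encoding is illustrative."""
    states: List[str]                                  # S
    actions: List[str]                                 # A
    P: Dict[Tuple[str, str], List[Tuple[str, float]]]  # (s, a) -> [(s', prob), ...]
    R: Dict[Tuple[str, str], float]                    # (s, a) -> immediate reward
    gamma: float                                       # discount factor

# A policy is then any mapping pi: S -> A, chosen to maximize
# the expected discounted return.
```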
Reinforcement Learning Approach
Major Components in an RL Agent
An RL agent may include one or more of these components
◦Value function: how good is each state and/or action
◦Policy: agent’s behavior function
◦Model: agent’s representation of the environment
Value Function
A value function is a prediction of future reward (from taking action a in state s)
The Q-value function gives the expected total reward
◦ from state s and action a
◦ under policy π
◦ with discount factor γ
$Q^{\pi}(s, a) = \mathbb{E}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s, a\right]$
Value functions decompose into a Bellman equation: $Q^{\pi}(s, a) = \mathbb{E}_{s', a'}\left[r + \gamma Q^{\pi}(s', a') \mid s, a\right]$
Optimal Value Function
An optimal value function is the maximum achievable value: $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$
The optimal value function allows us to act optimally: $\pi^*(s) = \arg\max_a Q^*(s, a)$
The optimal value informally maximizes over all future decisions: $Q^*(s, a) = r_{t+1} + \gamma \max_{a_{t+1}} r_{t+2} + \gamma^2 \max_{a_{t+2}} r_{t+3} + \dots$
Optimal values decompose into a Bellman equation: $Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$
Value Function Approximation
Representing the value function as a lookup table does not scale
◦ too many states and/or actions to store
◦ too slow to learn the value of each entry individually
Values can be estimated with function approximation instead: $\hat{Q}(s, a, \mathbf{w}) \approx Q^{\pi}(s, a)$
Q-Networks
Q-networks represent value functions with a neural network with weights w: $Q(s, a, \mathbf{w}) \approx Q^*(s, a)$
◦ generalize from seen states to unseen states
◦ update the parameters w for function approximation
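For concreteness, a minimal Q-network sketch in PyTorch; the layer sizes, `state_dim`, and `n_actions` are illustrative assumptions rather than anything prescribed by the slides:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action: Q(s, ., w)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # Q(s, a, w) for every action a
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q = QNetwork(state_dim=4, n_actions=2)
print(q(torch.zeros(1, 4)))  # Q-values for both actions in a dummy state
```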
Q-Learning
Goal: estimate optimal Q-values
◦ Optimal Q-values obey a Bellman equation: $Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$
◦ Value iteration algorithms solve the Bellman equation
◦ Treat the right-hand side $r + \gamma \max_{a'} Q(s', a', \mathbf{w})$ as the learning target and minimize the squared error to $Q(s, a, \mathbf{w})$
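A hedged sketch of the resulting one-step, tabular Q-learning update; the `env` interface and hyperparameters are assumptions:

```python
from collections import defaultdict
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy (see Exploration and Exploitation)
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a: Q[(s, a)])
            s2, r, done = env.step(a)
            # learning target: r + gamma * max_a' Q(s', a'), zero beyond terminal
            target = r + gamma * max(Q[(s2, a2)] for a2 in env.actions) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```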
Policy
A policy is the agent's behavior: a map from state to action
◦ Deterministic policy: $a = \pi(s)$
◦ Stochastic policy: $\pi(a \mid s) = \mathbb{P}[a \mid s]$
Policy Networks
Represent the policy by a network with weights u
◦ Stochastic policy: $\pi(a \mid s; \mathbf{u})$
◦ Deterministic policy: $a = \pi(s; \mathbf{u})$
Objective is to maximize the total discounted reward by SGD: $L(\mathbf{u}) = \mathbb{E}\left[r_1 + \gamma r_2 + \gamma^2 r_3 + \dots \mid \pi(\cdot\,; \mathbf{u})\right]$
Policy Gradient
The gradient of a stochastic policy $\pi(a \mid s; \mathbf{u})$ is given by
$$\frac{\partial L(\mathbf{u})}{\partial \mathbf{u}} = \mathbb{E}\left[\frac{\partial \log \pi(a \mid s; \mathbf{u})}{\partial \mathbf{u}} \, Q^{\pi}(s, a)\right]$$
The gradient of a deterministic policy $a = \pi(s; \mathbf{u})$ is given by
$$\frac{\partial L(\mathbf{u})}{\partial \mathbf{u}} = \mathbb{E}\left[\frac{\partial Q^{\pi}(s, a)}{\partial a} \frac{\partial \pi(s; \mathbf{u})}{\partial \mathbf{u}}\right]$$
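A minimal REINFORCE-style sketch of the stochastic policy gradient in PyTorch, using the Monte-Carlo return $G_t$ as the estimate of $Q^{\pi}(s, a)$; the environment interface and hyperparameters are assumptions:

```python
import torch

def reinforce_episode(policy_net, optimizer, env, gamma=0.99):
    """One REINFORCE update: grad log pi(a|s; u) weighted by the return."""
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(torch.softmax(logits, dim=-1))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())
        rewards.append(r)
    # Monte-Carlo returns G_t as the estimate of Q^pi(s_t, a_t)
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()  # ascend the gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```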
Model
A model predicts what the environment will do next
◦ P predicts the next state
◦ R predicts the next immediate reward
Reinforcement Learning Approach
Value-based RL
◦ Estimate the optimal value function $Q^*(s, a)$: the maximum value achievable under any policy
Policy-based RL
◦ Search directly for the optimal policy $\pi^*$: the policy achieving maximum future reward
Model-based RL
◦ Build a model of the environment
◦ Plan (e.g., by lookahead) using the model
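To illustrate what "plan by lookahead using the model" can mean, a hedged one-step lookahead sketch; the `model.P`/`model.R` interface and the state-value table `V` are assumptions:

```python
def greedy_one_step_plan(model, V, s, actions, gamma=0.99):
    """Pick the action whose model-predicted outcome looks best:
    argmax_a R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')."""
    def lookahead(a):
        return model.R(s, a) + gamma * sum(
            p * V[s2] for s2, p in model.P(s, a)  # model.P returns [(s', prob), ...]
        )
    return max(actions, key=lookahead)
```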
Maze Example
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location
Maze Example: Value Function
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location
(Figure: numbers represent the value V(s) of each state s.)
Maze Example: Policy
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location
(Figure: arrows represent the policy π(s) for each state s.)
Maze Example: Model
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location
(Figure: the grid layout represents the agent's internal transition model P.)
Categorizing RL Agents
Value-Based
◦No Policy (implicit)
◦Value Function
Policy-Based
◦Policy
◦No Value Function
Actor-Critic
◦Policy
◦Value Function
Model-Free
◦Policy and/or Value Function
◦No Model
Model-Based
◦Policy and/or Value Function
◦Model
RL Agent Taxonomy
(Figure: Venn diagram relating model-free, model-based, value-based, policy-based, and actor-critic agents.)
Problems within RL
Learning and Planning
In sequential decision making
◦ Reinforcement learning
 • The environment is initially unknown
 • The agent interacts with the environment
 • The agent improves its policy
◦ Planning (a.k.a. deliberation, reasoning, introspection, pondering, thought, search)
 • A model of the environment is known
 • The agent performs computations with its model (without any external interaction)
 • The agent improves its policy
Atari Example: Reinforcement Learning
Rules of the game are unknown
Learn directly from interactive game-play
Pick actions on the joystick, see pixels and scores
Atari Example: Planning
Rules of the game are known
Query the emulator based on the perfect model inside the agent's brain
◦ If I take action a from state s:
 • what would the next state be?
 • what would the score be?
Plan ahead to find the optimal policy, e.g., by tree search
Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy from its experience without losing too much reward along the way
◦ Exploration finds more information about the environment
◦ Exploitation exploits known information to maximize reward
It is usually important to explore as well as exploit
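A common way to balance the two is an ε-greedy rule: explore with probability ε, otherwise exploit the current value estimates. A minimal sketch (names are illustrative):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore (random action);
    otherwise exploit (best action under the current estimates Q[(s, a)])."""
    if random.random() < epsilon:
        return random.choice(actions)             # exploration
    return max(actions, key=lambda a: Q[(s, a)])  # exploitation
```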
RL for Unsupervised Model:
Modularizing Unsupervised Sense Embeddings (MUSE)
Word2Vec Polysemy Issue
Words are polysemous
◦ An apple a day, keeps the doctor away.
◦ Smartphone companies including apple, …
If words are polysemous, are their embeddings polysemous?
◦ No
◦ What's the problem?
(Figure: in the embedding space, "tree"/"trees" and "rock"/"rocks" form sensible neighborhoods, but a single vector for "apple" must cover both the fruit sense and the company sense, as in "Smartphone companies including blackberry, and sony will be invited.")
Modular Framework
Two key mechanisms
◦ Sense selection given a text context
◦ Sense representation to embed statistical characteristics of sense identity
(Figure: the word "apple" maps to the sense identities apple-1 and apple-2 through a sense selection module; each sense identity gets a sense embedding from a sense representation module; the two modules are connected by reinforcement learning.)
Sense Selection Module
Input: a text context $\bar{C}_t = (C_{t-m}, \dots, C_t = w_i, \dots, C_{t+m})$
Output: the fitness for each sense $z_{i1}, \dots, z_{i3}$
Model architecture: Continuous Bag-of-Words (CBOW) for efficiency
Sense selection can be
◦ Policy-based
◦ Value-based (Q-value)
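A hedged sketch of the CBOW-style sense scoring; the matrix names $P$ and $Q_i$ follow the figure below, but the dimensions and the plain averaging are my assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, n_senses = 10000, 300, 3

P = rng.normal(size=(vocab_size, dim))  # context word embeddings (matrix P)
Q_i = rng.normal(size=(n_senses, dim))  # sense selection matrix for word w_i (matrix Q_i)

def sense_fitness(context_word_ids):
    """Score each sense z_i1..z_i3 of w_i against the averaged context (CBOW)."""
    context_vec = P[context_word_ids].mean(axis=0)
    return Q_i @ context_vec  # one fitness score per sense

scores = sense_fitness([17, 42, 256, 1024])  # ids of surrounding words (illustrative)
chosen_sense = int(np.argmax(scores))        # greedy selection; MUSE explores via RL
```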
Sense Selection Module
(Figure: sense selection for the target word $C_t = w_i$ (apple) in the context "companies including apple blackberry and"; the context words feed the context matrix $P$, and the per-word matrix $Q_i$ yields the fitness scores $q(z_{i1} \mid \bar{C}_t)$, $q(z_{i2} \mid \bar{C}_t)$, $q(z_{i3} \mid \bar{C}_t)$.)
Sense Representation Module
Input: a sense collocation $(z_{ik}, z_{jl})$
Output: collocation likelihood estimation
Model architecture: skip-gram architecture for sense representation learning
(Figure: the sense embedding of $z_{i1}$ in matrix $U$ is trained to predict collocated senses, e.g., $P(z_{j2} \mid z_{i1})$ and $P(z_{uv} \mid z_{i1})$, through matrix $V$.)
A Summary of MUSE
Corpus: "Smartphone companies including apple blackberry, and sony will be invited."
(Figure: (1) the sense selection module picks a sense for the target word $C_t = w_i$, e.g., apple; (2) a collocated word $C_{t'} = w_j$, e.g., blackberry, gets its sense selected and a sense collocation is sampled; the sense representation module estimates the collocation likelihood $P(z_{j2} \mid z_{i1})$ with negative sampling over matrices $U$ and $V$; (3) the likelihood is fed back as the reward signal for sense selection.)
The first purely sense-level embedding learning with efficient sense selection.
Qualitative Analysis
Each sense's k-NN words, given the context in which it was selected:
Context: … braves finish the season in tie with the los angeles dodgers …
k-NN: scoreless otl shootout 6-6 hingis 3-3 7-7 0-0
Context: … his later years proudly wore tie with the chinese characters for …
k-NN: pants trousers shirt juventus blazer socks anfield
Context: … of the mulberry or the blackberry and minos sent him to …
k-NN: cranberries maple vaccinium apricot apple
Context: … of the large number of blackberry users in the us federal …
k-NN: smartphones sap microsoft ipv6 smartphone
Demonstration
Concluding Remarks
RL is a general-purpose framework for decision making through interaction between an agent and an environment
◦RL is for an agent with the capacity to act
◦Each action influences the agent’s future state
◦Success is measured by a scalar reward signal
◦Goal: select actions to maximize future reward
An RL agent may include one or more of these components
◦Value function: how good is each state and/or action
◦Policy: agent’s behavior function
◦Model: agent’s representation of the environment
References
Course materials by David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
ICLR 2015 Tutorial: http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf
ICML 2016 Tutorial: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf