Goal: estimate optimal Q-values
◦Optimal Q-values obey a Bellman equation
◦Value iteration algorithms solve the Bellman equation
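As a minimal sketch of how value iteration solves the Bellman optimality equation, consider a hypothetical 2-state, 2-action MDP (the transition and reward tables below are made up purely for illustration):

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = transition probability, R[s, a] = reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    V = Q.max(axis=1)                    # greedy value of each next state
    Q_new = R + gamma * (P @ V)          # Bellman optimality backup
    if np.abs(Q_new - Q).max() < 1e-8:   # stop once we reach a fixed point
        break
    Q = Q_new

print(Q)  # approximate optimal Q-values
```

At convergence, Q satisfies the Bellman optimality equation Q(s, a) = R(s, a) + γ Σ_s' P(s'|s, a) max_a' Q(s', a').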
Policy
A policy is the agent's behavior: it maps from state to action
◦Deterministic policy: a = π(s)
◦Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
Policy Networks
Represent the policy by a network with weights θ
The objective is to maximize the expected total discounted reward by SGD
Policy Gradient
The gradient of a stochastic policy π_θ(a|s) is given by
∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a|s) Q^{π_θ}(s, a) ]
The gradient of a deterministic policy a = μ_θ(s) is given by
∇_θ J(θ) = E[ ∇_a Q^μ(s, a) |_{a=μ_θ(s)} ∇_θ μ_θ(s) ]
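A sketch of the stochastic policy gradient in action: the score-function (REINFORCE) update applied to a hypothetical one-state, 3-armed bandit with a softmax policy. The reward values and learning rate below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.zeros(3)                       # policy parameters
true_rewards = np.array([0.0, 1.0, 0.2])  # hypothetical arm payoffs
alpha = 0.1

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    r = true_rewards[a] + rng.normal(0, 0.1)
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                 # d/dtheta of log softmax(theta)[a]
    theta += alpha * grad_log_pi * r      # stochastic policy-gradient ascent

print(softmax(theta))  # probability mass should concentrate on action 1
```

The update is an unbiased sample of E[∇_θ log π_θ(a) r], the one-state special case of the stochastic policy gradient above.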
Model
(Figure: agent-environment loop with observation o_t, action a_t, reward r_t)
A model predicts what the environment will do next
◦P predicts the next state
◦R predicts the next immediate reward
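A minimal sketch of such a model, assuming a hypothetical tabular MDP where the transition model P̂ and reward model R̂ are estimated from observed transitions by counting:

```python
import numpy as np

# Count-based model learning for a hypothetical 3-state, 2-action MDP.
n_states, n_actions = 3, 2
counts = np.zeros((n_states, n_actions, n_states))
reward_sum = np.zeros((n_states, n_actions))

# Illustrative observed transitions: (s, a, s', r)
transitions = [(0, 1, 1, 0.0), (1, 1, 2, 1.0), (0, 1, 1, 0.0), (1, 0, 0, 0.0)]
for s, a, s2, r in transitions:
    counts[s, a, s2] += 1
    reward_sum[s, a] += r

n_sa = counts.sum(axis=2, keepdims=True)
# P_hat predicts the next-state distribution; unvisited (s, a) fall back to uniform
P_hat = np.divide(counts, n_sa,
                  out=np.full_like(counts, 1.0 / n_states), where=n_sa > 0)
# R_hat predicts the expected immediate reward
R_hat = np.divide(reward_sum, n_sa[..., 0],
                  out=np.zeros_like(reward_sum), where=n_sa[..., 0] > 0)

print(P_hat[0, 1])  # estimated next-state distribution after (s=0, a=1)
print(R_hat[1, 1])  # estimated immediate reward for (s=1, a=1)
```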
Reinforcement Learning Approach
Value-based RL
◦Estimate the optimal value function
Policy-based RL
◦Search directly for optimal policy
Model-based RL
◦Build a model of the environment
◦Plan (e.g. by lookahead) using model
◦The optimal policy π* is the policy achieving maximum future reward
◦The optimal value Q*(s, a) is the maximum value achievable under any policy
Maze Example
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location
Maze Example: Value Function
Numbers represent the value Q(s) of each state s
Maze Example: Policy
Arrows represent policy π(s) for each state s
Maze Example: Model
Grid layout represents transition model P
Categorizing RL Agents
Value-Based
◦No Policy (implicit)
◦Value Function
Policy-Based
◦Policy
◦No Value Function
Actor-Critic
◦Policy
◦Value Function
Model-Free
◦Policy and/or Value Function
◦No Model
Model-Based
◦Policy and/or Value Function
◦Model
RL Agent Taxonomy
Problems within RL
Outline
Machine Learning
◦ Supervised Learning vs. Reinforcement Learning
◦ Reinforcement Learning vs. Deep Learning
Introduction to Reinforcement Learning
◦ Agent and Environment
◦ Action, State, and Reward
Markov Decision Process
Reinforcement Learning
◦ Value-Based
◦ Policy-Based
◦ Model-Based
Problems within RL
◦ Learning and Planning
◦ Exploration and Exploitation
Learning and Planning
In sequential decision making
◦Reinforcement learning
• The environment is initially unknown
• The agent interacts with the environment
• The agent improves its policy
◦Planning
• A model of the environment is known
• The agent performs computations with its model (w/o any external interaction)
• The agent improves its policy (a.k.a. deliberation, reasoning, introspection, pondering, thought, search)
Atari Example: Reinforcement Learning
Rules of the game are unknown
Learn directly from interactive game-play
Pick actions on the joystick, see pixels and scores
Atari Example: Planning
Rules of the game are known
Query emulator based on the perfect model inside agent’s brain
◦ If I take action a from state s:
• what would the next state be?
• what would the score be?
Plan ahead to find the optimal policy, e.g. by tree search
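The plan-ahead idea can be sketched as an exhaustive depth-limited tree search, assuming a hypothetical deterministic emulator `model(s, a) -> (next_state, reward)`; everything below is illustrative, not an Atari emulator.

```python
# Planning with a known model: no interaction with the real environment,
# only queries to the model inside the agent's "brain".
def model(s, a):
    # toy emulator: the state is an integer, action 1 earns reward 1
    return s + a, float(a == 1)

def lookahead(s, depth, gamma=0.9):
    """Best achievable discounted return from s within `depth` steps."""
    if depth == 0:
        return 0.0
    return max(r + gamma * lookahead(s2, depth - 1, gamma)
               for s2, r in (model(s, a) for a in (0, 1)))

def plan(s, depth=3, gamma=0.9):
    """Pick the action whose subtree has the highest estimated return."""
    best_a, best_v = None, float("-inf")
    for a in (0, 1):
        s2, r = model(s, a)
        v = r + gamma * lookahead(s2, depth - 1, gamma)
        if v > best_v:
            best_a, best_v = a, v
    return best_a

print(plan(0))  # action 1 dominates in this toy emulator
```

Real planners (e.g. Monte-Carlo tree search) sample the tree rather than expanding it exhaustively, but the query pattern is the same: "if I take action a from state s, what happens?"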
Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy from its experience without losing too much reward along the way
Exploration finds more information about the environment
Exploitation exploits known information to maximize reward
When should the agent try something new?
It is usually important to explore as well as exploit
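The trade-off in a minimal sketch: epsilon-greedy action selection on a hypothetical 3-armed bandit, where eps controls how often the agent explores instead of exploiting its current best estimate. The arm means and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

true_means = np.array([0.1, 0.5, 0.9])  # hypothetical arm payoffs (unknown to agent)
est = np.zeros(3)                       # estimated value of each arm
counts = np.zeros(3)
eps = 0.1

for t in range(5000):
    # explore with probability eps, otherwise exploit the best-looking arm
    arm = rng.integers(3) if rng.random() < eps else est.argmax()
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    est[arm] += (reward - est[arm]) / counts[arm]   # incremental mean

print(est)  # estimates should rank the arms correctly despite noisy rewards
```

With eps = 0, the agent can lock onto a suboptimal arm forever; with eps = 1, it never cashes in on what it has learned. The small constant eps is the simplest compromise.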
Outline
Machine Learning
◦ Supervised Learning vs. Reinforcement Learning
◦ Reinforcement Learning vs. Deep Learning
Introduction to Reinforcement Learning
◦ Agent and Environment
◦ Action, State, and Reward
Markov Decision Process
Reinforcement Learning
◦ Policy-Based
◦ Value-Based
◦ Model-Based
Problems within RL
◦ Learning and Planning
◦ Exploration and Exploitation
RL for Unsupervised Model
RL for Unsupervised Model: Modularizing Unsupervised Sense Embeddings (MUSE)
Word2Vec Polysemy Issue
Words are polysemous
◦ An apple a day, keeps the doctor away.
◦ Smartphone companies including apple, …
If words are polysemous, are their embeddings polysemous too?
◦ No
◦ What’s the problem?
(Figure: a single embedding space containing tree, trees, rock, rocks; example context: "Smartphone companies including blackberry, and sony will be invited.")
Modular Framework
Two key mechanisms
◦ Sense selection given a text context
◦ Sense representation to embed statistical characteristics of sense identity
(Figure: sense selection maps the word "apple" to a sense identity, apple-1 or apple-2; the sense representation module learns a sense embedding for each identity, and the two modules are trained jointly via reinforcement learning.)
Sense Selection Module
Input: a text context C̄_t = (C_{t-m}, ..., C_t = w_i, ..., C_{t+m})
Output: the fitness of each sense z_{i1}, ..., z_{i3}
Model architecture: Continuous Bag-of-Words (CBOW) for efficiency
Sense selection
◦ Policy-based
◦ Value-based (Q-value)
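A hypothetical sketch of the CBOW-style scoring: context words are embedded via a shared matrix P, averaged, and scored against the sense vectors in Q_i for the target word. All names, sizes, and initializations below are illustrative, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(3)

vocab_size, dim, n_senses = 100, 16, 3
P = rng.normal(0, 0.1, (vocab_size, dim))    # context word embeddings (matrix P)
Q_i = rng.normal(0, 0.1, (n_senses, dim))    # sense vectors for word w_i (matrix Q_i)

def sense_fitness(context_ids):
    """Score each sense of w_i against the averaged context (CBOW-style)."""
    h = P[context_ids].mean(axis=0)          # context representation
    return Q_i @ h                           # fitness of z_i1, z_i2, z_i3

scores = sense_fitness([4, 17, 42, 7])       # hypothetical context word ids
chosen = int(scores.argmax())                # greedy (Q-value style) selection
print(chosen, scores)
```

A policy-based variant would instead sample a sense from softmax(scores) rather than taking the argmax.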
(Figure: sense selection for the target word C_t = w_i ("apple"): the context words "companies", "including", "blackberry", "and" are embedded via matrix P and scored against matrix Q_i to produce q(z_{i1}|C̄_t), q(z_{i2}|C̄_t), q(z_{i3}|C̄_t).)
Sense Representation Module
Input: a sense collocation (z_{ik}, z_{jl})
Output: a collocation likelihood estimate
Model architecture: skip-gram architecture
Sense representation learning
(Figure: given sense z_{i1}, the module estimates collocation likelihoods P(z_{j2}|z_{i1}), P(z_{uv}|z_{i1}), ... using an input embedding matrix U and an output embedding matrix V.)
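A hypothetical sketch of skip-gram-style collocation likelihood with negative sampling over senses; the matrix names U and V follow the figure, but the sizes, updates, and sampling scheme below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)

n_senses_total, dim = 50, 16
U = rng.normal(0, 0.1, (n_senses_total, dim))   # input sense embeddings (matrix U)
V = rng.normal(0, 0.1, (n_senses_total, dim))   # output sense embeddings (matrix V)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(z_ik, z_jl, n_neg=5, lr=0.1):
    """One negative-sampling update for the collocation (z_ik, z_jl):
    push P(z_jl | z_ik) = sigmoid(V[z_jl] . U[z_ik]) up, random senses down."""
    negs = rng.integers(n_senses_total, size=n_neg)
    for z, label in [(z_jl, 1.0)] + [(z, 0.0) for z in negs]:
        p = sigmoid(V[z] @ U[z_ik])
        g = (label - p) * lr
        V[z] += g * U[z_ik]
        U[z_ik] += g * V[z]

before = sigmoid(V[7] @ U[3])                   # likelihood of a true pair
for _ in range(50):
    train_pair(3, 7)
after = sigmoid(V[7] @ U[3])
print(before, after)                            # the likelihood should rise
```

In MUSE this collocation likelihood doubles as the reward signal sent back to the sense selection modules.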
A Summary of MUSE
Corpus: "Smartphone companies including apple blackberry, and sony will be invited."
(Figure: (1) the sense selection module selects a sense for the target word C_t = w_i ("apple"); (2) a second sense selection is made for the collocated word C_t' = w_j ("blackberry"), and the sense collocation is sampled; (3) the sense representation module estimates the collocation likelihood with negative sampling over matrices U and V, and sends it back to the sense selection modules as a reward signal.)
MUSE is the first purely sense-level embedding learning framework with efficient sense selection.
Qualitative Analysis
Context: … braves finish the season in tie with the los angeles dodgers …
k-NN: scoreless otl shootout 6-6 hingis 3-3 7-7 0-0
Context: … his later years proudly wore tie with the chinese characters for …
k-NN: pants trousers shirt juventus blazer socks anfield
Context: … of the mulberry or the blackberry and minos sent him to …
k-NN: cranberries maple vaccinium apricot apple
Context: … of the large number of blackberry users in the us federal …
k-NN: smartphones sap microsoft ipv6 smartphone
Demonstration
Concluding Remarks
RL is a general-purpose framework for decision making through interaction between an agent and its environment
◦RL is for an agent with the capacity to act
◦Each action influences the agent’s future state
◦Success is measured by a scalar reward signal
◦Goal: select actions to maximize future reward
An RL agent may include one or more of these components
◦Value function: how good is each state and/or action
◦Policy: agent’s behavior function
◦Model: agent’s representation of the environment