Goal: estimate optimal Q-values
◦Optimal Q-values obey a Bellman equation
◦Value iteration algorithms solve the Bellman equation
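As a minimal sketch of how value iteration solves the Bellman optimality equation, consider a hypothetical 2-state, 2-action MDP (the transition and reward tables below are made up purely for illustration):

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = transition probability, R[s, a] = reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    V = Q.max(axis=1)                    # greedy value of each next state
    Q_new = R + gamma * (P @ V)          # Bellman optimality backup
    if np.abs(Q_new - Q).max() < 1e-8:   # stop once we reach a fixed point
        break
    Q = Q_new

print(Q)  # approximate optimal Q-values
```

At convergence, Q satisfies the Bellman optimality equation Q(s, a) = R(s, a) + γ Σ_s' P(s'|s, a) max_a' Q(s', a').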
Policy
A policy is the agent's behavior: it maps from state to action
◦Deterministic policy: a = π(s)
◦Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
Policy Networks
Represent the policy by a network with weights θ
The objective is to maximize the expected total discounted reward by SGD
Policy Gradient
The gradient of a stochastic policy π_θ(a|s) is given by
∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a|s) Q^{π_θ}(s, a) ]
The gradient of a deterministic policy a = μ_θ(s) is given by
∇_θ J(θ) = E[ ∇_a Q^μ(s, a) |_{a=μ_θ(s)} ∇_θ μ_θ(s) ]
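A sketch of the stochastic policy gradient in action: the score-function (REINFORCE) update applied to a hypothetical one-state, 3-armed bandit with a softmax policy. The reward values and learning rate below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.zeros(3)                       # policy parameters
true_rewards = np.array([0.0, 1.0, 0.2])  # hypothetical arm payoffs
alpha = 0.1

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    r = true_rewards[a] + rng.normal(0, 0.1)
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                 # d/dtheta of log softmax(theta)[a]
    theta += alpha * grad_log_pi * r      # stochastic policy-gradient ascent

print(softmax(theta))  # probability mass should concentrate on action 1
```

The update is an unbiased sample of E[∇_θ log π_θ(a) r], the one-state special case of the stochastic policy gradient above.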
Model
(Figure: agent-environment loop with observation o_t, action a_t, reward r_t)
A model predicts what the environment will do next
◦P predicts the next state
◦R predicts the next immediate reward
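A minimal sketch of such a model, assuming a hypothetical tabular MDP where the transition model P̂ and reward model R̂ are estimated from observed transitions by counting:

```python
import numpy as np

# Count-based model learning for a hypothetical 3-state, 2-action MDP.
n_states, n_actions = 3, 2
counts = np.zeros((n_states, n_actions, n_states))
reward_sum = np.zeros((n_states, n_actions))

# Illustrative observed transitions: (s, a, s', r)
transitions = [(0, 1, 1, 0.0), (1, 1, 2, 1.0), (0, 1, 1, 0.0), (1, 0, 0, 0.0)]
for s, a, s2, r in transitions:
    counts[s, a, s2] += 1
    reward_sum[s, a] += r

n_sa = counts.sum(axis=2, keepdims=True)
# P_hat predicts the next-state distribution; unvisited (s, a) fall back to uniform
P_hat = np.divide(counts, n_sa,
                  out=np.full_like(counts, 1.0 / n_states), where=n_sa > 0)
# R_hat predicts the expected immediate reward
R_hat = np.divide(reward_sum, n_sa[..., 0],
                  out=np.zeros_like(reward_sum), where=n_sa[..., 0] > 0)

print(P_hat[0, 1])  # estimated next-state distribution after (s=0, a=1)
print(R_hat[1, 1])  # estimated immediate reward for (s=1, a=1)
```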
Reinforcement Learning Approach
Value-based RL
◦Estimate the optimal value function
Policy-based RL
◦Search directly for optimal policy
Model-based RL
◦Build a model of the environment
◦Plan (e.g. by lookahead) using model
◦The optimal policy π* is the policy achieving maximum future reward
◦The optimal value Q*(s, a) is the maximum value achievable under any policy
Maze Example
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location
Maze Example: Value Function
Numbers represent the value Q(s) of each state s
Maze Example: Policy
Arrows represent policy π(s) for each state s
Maze Example: Model
Grid layout represents transition model P
Categorizing RL Agents
Value-Based
◦No Policy (implicit)
◦Value Function
Policy-Based
◦Policy
◦No Value Function
Actor-Critic
◦Policy
◦Value Function
Model-Free
◦Policy and/or Value Function
◦No Model
Model-Based
◦Policy and/or Value Function
◦Model
RL Agent Taxonomy
Problems within RL
Outline
Machine Learning
◦ Supervised Learning vs. Reinforcement Learning
◦ Reinforcement Learning vs. Deep Learning
Introduction to Reinforcement Learning
◦ Agent and Environment
◦ Action, State, and Reward
Markov Decision Process
Reinforcement Learning
◦ Value-Based
◦ Policy-Based
◦ Model-Based
Problems within RL
◦ Learning and Planning
◦ Exploration and Exploitation
Learning and Planning
In sequential decision making
◦Reinforcement learning
• The environment is initially unknown
• The agent interacts with the environment
• The agent improves its policy
◦Planning
• A model of the environment is known
• The agent performs computations with its model (w/o any external interaction)
• The agent improves its policy (a.k.a. deliberation, reasoning, introspection, pondering, thought, search)
Atari Example: Reinforcement Learning
Rules of the game are unknown
Learn directly from interactive game-play
Pick actions on the joystick, see pixels and scores
Atari Example: Planning
Rules of the game are known
Query emulator based on the perfect model inside agent’s brain
◦ If I take action a from state s:
• what would the next state be?
• what would the score be?
Plan ahead to find the optimal policy, e.g. by tree search
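The plan-ahead idea can be sketched as an exhaustive depth-limited tree search, assuming a hypothetical deterministic emulator `model(s, a) -> (next_state, reward)`; everything below is illustrative, not an Atari emulator.

```python
# Planning with a known model: no interaction with the real environment,
# only queries to the model inside the agent's "brain".
def model(s, a):
    # toy emulator: the state is an integer, action 1 earns reward 1
    return s + a, float(a == 1)

def lookahead(s, depth, gamma=0.9):
    """Best achievable discounted return from s within `depth` steps."""
    if depth == 0:
        return 0.0
    return max(r + gamma * lookahead(s2, depth - 1, gamma)
               for s2, r in (model(s, a) for a in (0, 1)))

def plan(s, depth=3, gamma=0.9):
    """Pick the action whose subtree has the highest estimated return."""
    best_a, best_v = None, float("-inf")
    for a in (0, 1):
        s2, r = model(s, a)
        v = r + gamma * lookahead(s2, depth - 1, gamma)
        if v > best_v:
            best_a, best_v = a, v
    return best_a

print(plan(0))  # action 1 dominates in this toy emulator
```

Real planners (e.g. Monte-Carlo tree search) sample the tree rather than expanding it exhaustively, but the query pattern is the same: "if I take action a from state s, what happens?"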
Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy from its experience without losing too much reward along the way
Exploration finds more information about the environment
Exploitation exploits known information to maximize reward
When should the agent try something new?
It is usually important to explore as well as exploit
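The trade-off in a minimal sketch: epsilon-greedy action selection on a hypothetical 3-armed bandit, where eps controls how often the agent explores instead of exploiting its current best estimate. The arm means and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

true_means = np.array([0.1, 0.5, 0.9])  # hypothetical arm payoffs (unknown to agent)
est = np.zeros(3)                       # estimated value of each arm
counts = np.zeros(3)
eps = 0.1

for t in range(5000):
    # explore with probability eps, otherwise exploit the best-looking arm
    arm = rng.integers(3) if rng.random() < eps else est.argmax()
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    est[arm] += (reward - est[arm]) / counts[arm]   # incremental mean

print(est)  # estimates should rank the arms correctly despite noisy rewards
```

With eps = 0, the agent can lock onto a suboptimal arm forever; with eps = 1, it never cashes in on what it has learned. The small constant eps is the simplest compromise.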
Outline
Machine Learning
◦ Supervised Learning vs. Reinforcement Learning
◦ Reinforcement Learning vs. Deep Learning
Introduction to Reinforcement Learning
◦ Agent and Environment
◦ Action, State, and Reward
Markov Decision Process
Reinforcement Learning
◦ Policy-Based
◦ Value-Based
◦ Model-Based
Problems within RL
◦ Learning and Planning
◦ Exploration and Exploitation
RL for Unsupervised Model
RL for Unsupervised Model: Modularizing Unsupervised Sense Embeddings (MUSE)
Word2Vec Polysemy Issue
Words are polysemous
◦ An apple a day, keeps the doctor away.
◦ Smartphone companies including apple, …
If words are polysemous, are their embeddings polysemous too?
◦ No
◦ What’s the problem?
(Figure: a single embedding space containing tree, trees, rock, rocks; example context: "Smartphone companies including blackberry, and sony will be invited.")
Modular Framework
Two key mechanisms
◦ Sense selection given a text context
◦ Sense representation to embed statistical characteristics of sense identity
(Figure: sense selection maps the word "apple" to a sense identity, apple-1 or apple-2; the sense representation module learns a sense embedding for each identity, and the two modules are trained jointly via reinforcement learning.)
Sense Selection Module
Input: a text context C̄_t = (C_{t-m}, ..., C_t = w_i, ..., C_{t+m})
Output: the fitness of each sense z_{i1}, ..., z_{i3}
Model architecture: Continuous Bag-of-Words (CBOW) for efficiency
Sense selection
◦ Policy-based
◦ Value-based (Q-value)
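A hypothetical sketch of the CBOW-style scoring: context words are embedded via a shared matrix P, averaged, and scored against the sense vectors in Q_i for the target word. All names, sizes, and initializations below are illustrative, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(3)

vocab_size, dim, n_senses = 100, 16, 3
P = rng.normal(0, 0.1, (vocab_size, dim))    # context word embeddings (matrix P)
Q_i = rng.normal(0, 0.1, (n_senses, dim))    # sense vectors for word w_i (matrix Q_i)

def sense_fitness(context_ids):
    """Score each sense of w_i against the averaged context (CBOW-style)."""
    h = P[context_ids].mean(axis=0)          # context representation
    return Q_i @ h                           # fitness of z_i1, z_i2, z_i3

scores = sense_fitness([4, 17, 42, 7])       # hypothetical context word ids
chosen = int(scores.argmax())                # greedy (Q-value style) selection
print(chosen, scores)
```

A policy-based variant would instead sample a sense from softmax(scores) rather than taking the argmax.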
(Figure: sense selection for the target word C_t = w_i ("apple"): the context words "companies", "including", "blackberry", "and" are embedded via matrix P and scored against matrix Q_i to produce q(z_{i1}|C̄_t), q(z_{i2}|C̄_t), q(z_{i3}|C̄_t).)
Sense Representation Module
Input: a sense collocation (z_{ik}, z_{jl})
Output: a collocation likelihood estimate
Model architecture: skip-gram architecture
Sense representation learning
(Figure: given sense z_{i1}, the module estimates collocation likelihoods P(z_{j2}|z_{i1}), P(z_{uv}|z_{i1}), ... using an input embedding matrix U and an output embedding matrix V.)
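A hypothetical sketch of skip-gram-style collocation likelihood with negative sampling over senses; the matrix names U and V follow the figure, but the sizes, updates, and sampling scheme below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)

n_senses_total, dim = 50, 16
U = rng.normal(0, 0.1, (n_senses_total, dim))   # input sense embeddings (matrix U)
V = rng.normal(0, 0.1, (n_senses_total, dim))   # output sense embeddings (matrix V)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(z_ik, z_jl, n_neg=5, lr=0.1):
    """One negative-sampling update for the collocation (z_ik, z_jl):
    push P(z_jl | z_ik) = sigmoid(V[z_jl] . U[z_ik]) up, random senses down."""
    negs = rng.integers(n_senses_total, size=n_neg)
    for z, label in [(z_jl, 1.0)] + [(z, 0.0) for z in negs]:
        p = sigmoid(V[z] @ U[z_ik])
        g = (label - p) * lr
        V[z] += g * U[z_ik]
        U[z_ik] += g * V[z]

before = sigmoid(V[7] @ U[3])                   # likelihood of a true pair
for _ in range(50):
    train_pair(3, 7)
after = sigmoid(V[7] @ U[3])
print(before, after)                            # the likelihood should rise
```

In MUSE this collocation likelihood doubles as the reward signal sent back to the sense selection modules.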
A Summary of MUSE
Corpus: "Smartphone companies including apple blackberry, and sony will be invited."
(Figure: (1) the sense selection module selects a sense for the target word C_t = w_i ("apple"); (2) a second sense selection is made for the collocated word C_t' = w_j ("blackberry"), and the sense collocation is sampled; (3) the sense representation module estimates the collocation likelihood with negative sampling over matrices U and V, and sends it back to the sense selection modules as a reward signal.)
MUSE is the first purely sense-level embedding learning framework with efficient sense selection.
Qualitative Analysis
Context: … braves finish the season in tie with the los angeles dodgers …
k-NN: scoreless otl shootout 6-6 hingis 3-3 7-7 0-0
Context: … his later years proudly wore tie with the chinese characters for …
k-NN: pants trousers shirt juventus blazer socks anfield
Context: … of the mulberry or the blackberry and minos sent him to …
k-NN: cranberries maple vaccinium apricot apple
Context: … of the large number of blackberry users in the us federal …
k-NN: smartphones sap microsoft ipv6 smartphone
Demonstration
Concluding Remarks
RL is a general-purpose framework for decision making through interaction between an agent and its environment
◦RL is for an agent with the capacity to act
◦Each action influences the agent’s future state
◦Success is measured by a scalar reward signal
◦Goal: select actions to maximize future reward
An RL agent may include one or more of these components
◦Value function: how good is each state and/or action
◦Policy: agent’s behavior function
◦Model: agent’s representation of the environment