
Goal: estimate optimal Q-values

Optimal Q-values obey a Bellman equation

Value iteration algorithms solve the Bellman equation

$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$, where $r + \gamma \max_{a'} Q^*(s', a')$ is the learning target
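To make the value-iteration idea concrete, here is a minimal sketch on a made-up two-state MDP (all transition probabilities, rewards, and sizes below are invented for illustration): the Bellman optimality backup is applied repeatedly until the Q-values stop changing.

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Bellman optimality backup:
    # Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
    Q_new = R + gamma * P @ Q.max(axis=1)
    delta = np.max(np.abs(Q_new - Q))
    Q = Q_new
    if delta < 1e-8:      # stop once the backup is (numerically) a fixed point
        break

print("Q* ~=", Q)                          # optimal Q-values
print("greedy policy:", Q.argmax(axis=1))  # optimal action in each state
```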

Policy

A policy is the agent’s behavior: it maps from state to action

Deterministic policy: $a = \pi(s)$

Stochastic policy: $\pi(a \mid s) = P[a_t = a \mid s_t = s]$
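A minimal sketch of the difference (the states, actions, and probabilities are invented for illustration): a deterministic policy returns exactly one action per state, while a stochastic policy defines a distribution over actions and samples from it.

```python
import random

# Hypothetical grid-world actions.
ACTIONS = ["N", "E", "S", "W"]

def deterministic_policy(state):
    """a = pi(s): maps each state to exactly one action."""
    lookup = {"start": "E", "corridor": "N", "near_goal": "N"}
    return lookup[state]

def stochastic_policy(state):
    """a ~ pi(a|s): maps each state to a distribution over actions, then samples."""
    probs = {"start": [0.1, 0.7, 0.1, 0.1],
             "corridor": [0.6, 0.2, 0.1, 0.1],
             "near_goal": [0.8, 0.1, 0.05, 0.05]}[state]
    return random.choices(ACTIONS, weights=probs, k=1)[0]

print(deterministic_policy("start"))   # always "E"
print(stochastic_policy("start"))      # usually "E", sometimes another direction
```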

Policy Networks

Represent the policy by a network with weights $u$

Objective is to maximize total discounted reward by SGD

$a \sim \pi(a \mid s, u)$ (stochastic policy), or $a = \pi(s, u)$ (deterministic policy)

Policy Gradient

The gradient of a stochastic policy $\pi(a \mid s, u)$ is given by

$\frac{\partial L}{\partial u} = \mathbb{E}\left[ \frac{\partial \log \pi(a \mid s, u)}{\partial u} \, Q^{\pi}(s, a) \right]$

The gradient of a deterministic policy $a = \pi(s, u)$ is given by

$\frac{\partial L}{\partial u} = \mathbb{E}\left[ \frac{\partial Q^{\pi}(s, a)}{\partial a} \frac{\partial a}{\partial u} \right]$
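A minimal REINFORCE-style sketch of the stochastic policy gradient on a made-up two-armed bandit (the rewards, learning rate, and softmax parameterization are assumptions for illustration, not part of the slides): the score function $\partial \log \pi(a \mid s, u) / \partial u$ is scaled by the sampled reward and used for SGD ascent on the objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-armed bandit: expected rewards of the two actions.
TRUE_REWARD = np.array([1.0, 0.2])

u = np.zeros(2)          # policy weights (logits of a softmax policy)
lr = 0.1

def policy(u):
    """pi(a; u): softmax over action logits."""
    e = np.exp(u - u.max())
    return e / e.sum()

for step in range(2000):
    probs = policy(u)
    a = rng.choice(2, p=probs)
    r = TRUE_REWARD[a] + 0.1 * rng.standard_normal()   # noisy sampled reward

    # Score-function (REINFORCE) estimator:
    # grad_u log pi(a; u) = one_hot(a) - probs   (for a softmax policy)
    grad_log_pi = np.eye(2)[a] - probs
    u += lr * r * grad_log_pi                           # SGD ascent on expected reward

print("learned action probabilities:", policy(u))       # strongly prefers action 0
```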

Model

(Figure: agent-environment loop with observation $o_t$, action $a_t$, and reward $r_t$)

A model predicts what the environment will do next

P predicts the next state

R predicts the next immediate reward
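A minimal sketch of a learned tabular model (state names and transitions are invented for illustration): empirical estimates of P and R are built from observed transitions and can then be queried in place of the real environment.

```python
from collections import defaultdict

# Tabular model learned from experience: P estimates next-state probabilities,
# R estimates the expected immediate reward, both indexed by (state, action).
transition_counts = defaultdict(lambda: defaultdict(int))
reward_sums = defaultdict(float)
visit_counts = defaultdict(int)

def observe(s, a, r, s_next):
    """Update the model from one real transition (s, a, r, s')."""
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visit_counts[(s, a)] += 1

def P(s, a):
    """Estimated distribution over next states for (s, a)."""
    counts = transition_counts[(s, a)]
    total = sum(counts.values())
    return {s_next: c / total for s_next, c in counts.items()}

def R(s, a):
    """Estimated expected immediate reward for (s, a)."""
    return reward_sums[(s, a)] / visit_counts[(s, a)]

# Hypothetical experience:
observe("s0", "E", -1.0, "s1")
observe("s0", "E", -1.0, "s1")
observe("s0", "E", -1.0, "s2")
print(P("s0", "E"))   # {'s1': 0.666..., 's2': 0.333...}
print(R("s0", "E"))   # -1.0
```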

Reinforcement Learning Approach

Value-based RL

Estimate the optimal value function $Q^*(s, a)$, the maximum value achievable under any policy

Policy-based RL

Search directly for the optimal policy $\pi^*$, the policy achieving maximum future reward

Model-based RL

Build a model of the environment

Plan (e.g. by lookahead) using the model

Maze Example

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location


Maze Example: Value Function

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location

Numbers represent the value $Q(s)$ of each state $s$
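As a sanity check on what those numbers mean, here is a small sketch on a made-up maze layout (not the one in the figure): with a reward of -1 per time-step, the optimal value of a state is minus the number of steps the optimal policy needs to reach the goal, which a breadth-first search outward from the goal recovers directly.

```python
from collections import deque

# Hypothetical 4x4 maze: '.' = free cell, '#' = wall, 'G' = goal.
MAZE = ["....",
        ".##.",
        ".#G.",
        "...."]

def optimal_values(maze):
    """V*(s) = -(shortest number of steps from s to the goal), since r = -1 per step."""
    rows, cols = len(maze), len(maze[0])
    goal = next((r, c) for r in range(rows) for c in range(cols) if maze[r][c] == "G")
    values = {goal: 0}
    queue = deque([goal])
    while queue:                                              # BFS outward from the goal
        r, c = queue.popleft()
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:     # N, S, W, E
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] != "#" \
                    and (nr, nc) not in values:
                values[(nr, nc)] = values[(r, c)] - 1
                queue.append((nr, nc))
    return values

V = optimal_values(MAZE)
print(V[(0, 0)])   # -6: six steps from the top-left corner to the goal
```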

Maze Example: Policy

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location


Arrows represent policy π(s) for each state s

Maze Example: Model

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location

Grid layout represents transition model P

Categorizing RL Agents

Value-Based

No Policy (implicit)

Value Function

Policy-Based

Policy

No Value Function

Actor-Critic

Policy

Value Function

Model-Free

Policy and/or Value Function

No Model

Model-Based

Policy and/or Value Function

Model


RL Agent Taxonomy

Problems within RL


Outline

Machine Learning

Supervised Learning vs. Reinforcement Learning

Reinforcement Learning vs. Deep Learning

Introduction to Reinforcement Learning

Agent and Environment

Action, State, and Reward

Markov Decision Process

Reinforcement Learning

Value-Based

Policy-Based

Model-Based

Problems within RL

Learning and Planning

Exploration and Exploitation

Learning and Planning

In sequential decision making, there are two fundamental problems:

Reinforcement learning

The environment is initially unknown

The agent interacts with the environment

The agent improves its policy

Planning

A model of the environment is known

The agent performs computations with its model (w/o any external interaction)

The agent improves its policy (planning is a.k.a. deliberation, reasoning, introspection, pondering, thought, search)


Atari Example: Reinforcement Learning

Rules of the game are unknown

Learn directly from interactive game-play

Pick actions on the joystick, see pixels and scores

Atari Example: Planning


Rules of the game are known

Query emulator based on the perfect model inside agent’s brain

If I take action a from state s:

what would the next state be?

what would the score be?

Plan ahead to find the optimal policy, e.g. by tree search
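A minimal sketch of planning by lookahead with a known model (the tiny deterministic "environment" below stands in for the emulator and is entirely invented): a depth-limited tree search queries the model for "what would the next state and score be if I took action a from state s" and returns the best first action.

```python
# Hypothetical deterministic model of a tiny environment:
# model(s, a) -> (next_state, reward). Stands in for a game emulator.
def model(state, action):
    transitions = {
        ("s0", "left"):  ("s0", 0.0),
        ("s0", "right"): ("s1", 1.0),
        ("s1", "left"):  ("s0", 0.0),
        ("s1", "right"): ("s2", 5.0),
        ("s2", "left"):  ("s1", 0.0),
        ("s2", "right"): ("s2", 0.0),
    }
    return transitions[(state, action)]

ACTIONS = ["left", "right"]
GAMMA = 0.9

def plan(state, depth):
    """Depth-limited lookahead: return (best total discounted reward, best first action)."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in ACTIONS:
        next_state, reward = model(state, a)      # query the model, no real interaction
        future, _ = plan(next_state, depth - 1)
        value = reward + GAMMA * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

print(plan("s0", depth=3))   # picks "right" first: 1.0 + 0.9 * 5.0 = 5.5
```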

Exploration and Exploitation

Reinforcement learning is like trial-and-error learning: the agent should discover a good policy from its experience without losing too much reward along the way

Exploration finds more information about the environment

Exploitation exploits known information to maximize reward

When to try?

It is usually important to explore as well as exploit
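The simplest way to balance the two is ε-greedy action selection, sketched below on invented Q-value estimates: exploit the currently best-looking action most of the time, but explore a random action with probability ε.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    actions = list(q_values)
    if random.random() < epsilon:
        return random.choice(actions)                      # exploration: try anything
    return max(actions, key=lambda a: q_values[a])         # exploitation: best known action

# Hypothetical current Q-value estimates for one state.
Q = {"N": -3.0, "E": -1.5, "S": -4.0, "W": -2.2}
print(epsilon_greedy(Q))   # usually "E", occasionally a random direction
```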

Outline

Machine Learning

Supervised Learning vs. Reinforcement Learning

Reinforcement Learning vs. Deep Learning

Introduction to Reinforcement Learning

Agent and Environment

Action, State, and Reward

Markov Decision Process

Reinforcement Learning

Policy-Based

Value-Based

Model-Based

Problems within RL

Learning and Planning

Exploration and Exploitation

RL for Unsupervised Model


RL for Unsupervised Model: Modularizing Unsupervised Sense Embeddings (MUSE)

Word2Vec Polysemy Issue

Words are polysemous

An apple a day keeps the doctor away.

Smartphone companies including apple, …

If words are polysemous, are their embeddings polysemous?

No

What’s the problem?

(Figure: word2vec embedding space showing clusters such as tree/trees and rock/rocks, with a single vector per word)

Smartphone companies including blackberry, and sony will be invited.

Modular Framework

Two key mechanisms

Sense selection given a text context

Sense representation to embed statistical characteristics of sense identity

(Figure: modular framework for the word "apple": a sense selection module maps the word in context to a sense identity (apple-1, apple-2, ...), a sense representation module learns a sense embedding for each identity, and the two modules are linked by reinforcement learning)

Sense Selection Module

Input: a text context $\bar{C}_t = \{C_{t-m}, \ldots, C_t = w_i, \ldots, C_{t+m}\}$

Output: the fitness for each sense $z_{i1}, \ldots, z_{i3}$

Model architecture: Continuous Bag-of-Words (CBOW) for efficiency

Sense selection

Policy-based

Value-based (Q-value)
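A rough numpy sketch of a CBOW-style, value-based sense selection scorer (the matrix names P and Q_i follow the figure that follows, but the dimensions, vocabulary, and exact scoring form are assumptions for illustration, not the paper's formulation): context-word embeddings from P are averaged and dotted with each candidate sense's row of Q_i to obtain the sense fitness.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = {"smartphone": 0, "companies": 1, "including": 2, "apple": 3,
         "blackberry": 4, "and": 5, "sony": 6}
DIM, NUM_SENSES = 8, 3                                # assumed sizes, illustration only

P = rng.normal(size=(len(VOCAB), DIM))                # context-word embeddings (matrix P)
Q = rng.normal(size=(len(VOCAB), NUM_SENSES, DIM))    # per-word sense scoring vectors (matrix Q_i)

def sense_fitness(target, context_words):
    """q(z_ik | context): CBOW-style fitness of each sense of the target word."""
    context_vec = P[[VOCAB[w] for w in context_words]].mean(axis=0)   # average context embedding
    return Q[VOCAB[target]] @ context_vec                             # one score per sense

scores = sense_fitness("apple", ["companies", "including", "blackberry", "and"])
chosen_sense = int(scores.argmax())   # value-based selection: pick the highest-fitness sense
print(scores, chosen_sense)
```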

(Figure: sense selection for the target word $C_t = w_i$ ("apple"): the context words $C_{t-1}, C_{t+1}, \ldots$ ("companies", "including", "blackberry", "and") are embedded by matrix $P$, and matrix $Q_i$ produces the sense fitness scores $q(z_{i1} \mid \bar{C}_t)$, $q(z_{i2} \mid \bar{C}_t)$, $q(z_{i3} \mid \bar{C}_t)$)

Sense Representation Module

Input: a sense collocation $z_{ik}, z_{jl}$

Output: collocation likelihood estimation

Model architecture: skip-gram

(Figure: sense representation learning: given the selected sense $z_{i1}$, matrices $U$ and $V$ estimate collocation likelihoods such as $P(z_{j2} \mid z_{i1})$ and $P(z_{uv} \mid z_{i1})$)
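A rough sketch of the skip-gram-style collocation likelihood with negative sampling (the matrix names U and V follow the figure above, but the dimensions and the sigmoid scoring are standard skip-gram assumptions, not necessarily the paper's exact objective): the likelihood that one sense collocates with another is scored from their "input" and "output" sense embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)

NUM_SENSES_TOTAL, DIM = 20, 8                  # assumed sizes, illustration only
U = rng.normal(size=(NUM_SENSES_TOTAL, DIM))   # "input" sense embeddings (matrix U)
V = rng.normal(size=(NUM_SENSES_TOTAL, DIM))   # "output" sense embeddings (matrix V)

def collocation_likelihood(z_target, z_collocated):
    """P(z_jl | z_ik) scored with a sigmoid of the two sense embeddings (skip-gram style)."""
    return 1.0 / (1.0 + np.exp(-U[z_target] @ V[z_collocated]))

def negative_sampling_loss(z_target, z_positive, num_negative=5):
    """Skip-gram objective: push up the observed collocation, push down sampled negatives."""
    loss = -np.log(collocation_likelihood(z_target, z_positive))
    for z_neg in rng.integers(0, NUM_SENSES_TOTAL, size=num_negative):
        loss -= np.log(1.0 - collocation_likelihood(z_target, z_neg))
    return loss

print(negative_sampling_loss(z_target=3, z_positive=7))
```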

A Summary of MUSE

Corpus: { Smartphone companies including apple blackberry, and sony will be invited. }

(Figure: MUSE training pipeline: (1) the sense selection module picks a sense for the target word $C_t = w_i$ ("apple") from its context; (2) it also picks a sense for a collocated word $C_{t'} = w_j$ ("blackberry"), and a sense collocation is sampled; (3) the sense representation module estimates the collocation likelihood, e.g. $P(z_{j2} \mid z_{i1})$, with negative sampling over matrices $U$ and $V$, and this likelihood is fed back as the reward signal for sense selection)

The first purely sense-level embedding learning with efficient sense selection.
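Putting the pieces together, a heavily simplified sketch of the training loop suggested by the summary figure (every quantity below is a stand-in with assumed behavior, not the released MUSE code): a sense is selected for the target word, the collocation likelihood from the sense representation module is used as the reward signal, and the sense-selection values are updated toward senses that earn higher reward.

```python
import numpy as np

rng = np.random.default_rng(2)
NUM_SENSES = 3

# Stand-ins for the two modules (assumed interfaces, illustration only).
q_table = np.zeros(NUM_SENSES)                 # sense-selection fitness for one target word
sense_vecs = rng.normal(size=(NUM_SENSES, 8))  # sense embeddings for that word
context_vec = rng.normal(size=8)               # embedding of a collocated sense

def collocation_likelihood(sense_id):
    """Reward signal: how well the chosen sense explains the observed collocation."""
    return 1.0 / (1.0 + np.exp(-sense_vecs[sense_id] @ context_vec))

lr = 0.1
for step in range(500):
    # epsilon-greedy sense selection over the current fitness estimates
    sense = int(q_table.argmax()) if rng.random() > 0.1 else int(rng.integers(NUM_SENSES))
    reward = collocation_likelihood(sense)
    q_table[sense] += lr * (reward - q_table[sense])   # Q-learning-style update of sense selection
    # (in MUSE the representation module is trained jointly; omitted here for brevity)

print(q_table)   # each explored sense's entry converges toward its collocation likelihood
```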

Qualitative Analysis

Context: … braves finish the season in tie with the los angeles dodgers …
k-NN: scoreless otl shootout 6-6 hingis 3-3 7-7 0-0

Context: … his later years proudly wore tie with the chinese characters for …
k-NN: pants trousers shirt juventus blazer socks anfield

Context: … of the mulberry or the blackberry and minos sent him to …
k-NN: cranberries maple vaccinium apricot apple

Context: … of the large number of blackberry users in the us federal …
k-NN: smartphones sap microsoft ipv6 smartphone

Demonstration

Concluding Remarks

RL is a general-purpose framework for decision making through interaction between an agent and its environment

RL is for an agent with the capacity to act

Each action influences the agent’s future state

Success is measured by a scalar reward signal

Goal: select actions to maximize future reward

An RL agent may include one or more of these components

Value function: how good is each state and/or action

Policy: agent’s behavior function

Model: agent’s representation of the environment

(Figure: agent-environment loop with action, state, and reward)
