Deep Reinforcement Learning
Applied Deep Learning
March 28th, 2020 http://adl.miulab.tw
Outline
◉ Machine Learning
○ Supervised Learning vs. Reinforcement Learning
○ Reinforcement Learning vs. Deep Learning
◉ Introduction to Reinforcement Learning
○ Agent and Environment
○ Action, State, and Reward
◉ Markov Decision Process
◉ Reinforcement Learning Approach
○ Value-Based
○ Policy-Based
○ Model-Based
Machine Learning
(Diagram: machine learning comprises unsupervised learning, supervised learning, and reinforcement learning)
Supervised vs. Reinforcement
◉ Supervised Learning
○ Training based on supervisor/label/annotation
○ Feedback is instantaneous
○ Time does not matter
◉ Reinforcement Learning
○ Training only based on reward signal
○ Feedback is delayed
○ Time matters
○ Agent actions affect subsequent data
Supervised vs. Reinforcement
◉ Supervised: learning from a teacher. For the input "Hello", the correct response "Hi" is given; for "Bye bye", the correct response "Good bye" is given.
◉ Reinforcement: learning from a critic. The agent carries out a whole conversation and only receives overall feedback at the end (e.g., "Bad").
Reinforcement Learning
◉ RL is a general-purpose framework for decision making
○ RL is for an agent with the capacity to act
○ Each action influences the agent's future state
○ Success is measured by a scalar reward signal
○ Goal: select actions to maximize future reward
Deep Learning
◉ DL is a general-purpose framework for representation learning
○ Given an objective
○ Learn the representation required to achieve the objective
○ Directly from raw inputs
○ Use minimal domain knowledge
(Diagram: a deep neural network mapping an input vector x = (x1, …, xN) to an output vector y = (y1, …, yM))
Deep Reinforcement Learning
◉ AI is an agent that can solve human-level tasks
○ RL defines the objective
○ DL gives the mechanism
○ RL + DL = general intelligence
Deep RL AI Examples
◉ Play games: Atari, poker, Go, …
◉ Explore worlds: 3D worlds, …
◉ Control physical systems: manipulate, …
◉ Interact with users: recommend, optimize, personalize, …
Reinforcement Learning
Introduction to RL
Outline
◉ Machine Learning
○ Supervised Learning vs. Reinforcement Learning
○ Reinforcement Learning vs. Deep Learning
◉ Introduction to Reinforcement Learning
○ Agent and Environment
○ Action, State, and Reward
◉ Markov Decision Process
◉ Reinforcement Learning Approach
○ Value-Based
○ Policy-Based
○ Model-Based
Reinforcement Learning
◉ RL is a general-purpose framework for decision making
○ RL is for an agent with the capacity to act
○ Each action influences the agent's future state
○ Success is measured by a scalar reward signal
Big three: action, state, reward
Agent and Environment
(Diagram: the agent receives observation o_t and reward r_t from the environment and sends back action a_t, e.g., MoveLeft or MoveRight)
Agent and Environment
◉ At time step t
○ The agent
■ Executes action a_t
■ Receives observation o_t
■ Receives scalar reward r_t
○ The environment
■ Receives action a_t
■ Emits observation o_{t+1}
■ Emits scalar reward r_{t+1}
○ t increments at the environment step (a minimal loop sketch follows)
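Below is a minimal sketch of this agent-environment loop in Python. The Environment and Agent classes are hypothetical stand-ins for illustration, not a specific library API.

```python
import random

class Environment:
    """Toy environment: the state is a counter; reaching 10 ends the episode."""
    def reset(self):
        self.state = 0
        return self.state                      # initial observation o_1

    def step(self, action):
        self.state += action                   # the action influences the future state
        reward = 1.0 if self.state >= 10 else 0.0
        done = self.state >= 10
        return self.state, reward, done        # o_{t+1}, r_{t+1}, terminal flag

class Agent:
    def act(self, observation):
        return random.choice([1, 2])           # pick an action a_t (random placeholder)

env, agent = Environment(), Agent()
obs, done, total_reward = env.reset(), False, 0.0
while not done:                                # one episode of interaction
    action = agent.act(obs)                    # the agent executes a_t
    obs, reward, done = env.step(action)       # the environment emits o_{t+1}, r_{t+1}
    total_reward += reward
print("total reward:", total_reward)
```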
State
◉ Experience is the sequence of observations, actions, and rewards
◉ State is the information used to determine what happens next
○ What happens next depends on the history of experience
• The agent selects actions
• The environment selects observations/rewards
◉ The state is a function of the history of experience: s_t = f(o_1, r_1, a_1, …, o_t, r_t)
Environment State
◉ The environment state s_t^e is the environment's private representation
○ whatever data the environment uses to pick the next observation/reward
○ may not be visible to the agent
○ may contain irrelevant information
Agent State
◉ The agent state s_t^a is the agent's internal representation
○ whatever data the agent uses to pick the next action → the information used by RL algorithms
○ can be any function of experience
Information State
◉ An information state (a.k.a. Markov state) contains all useful information from history
◉ The future is independent of the past given the present
○ Once the state is known, the history may be thrown away
○ The state is a sufficient statistic of the future
A state is Markov iff P[s_{t+1} | s_t] = P[s_{t+1} | s_1, …, s_t]
Fully Observable Environment
◉ Full observability: the agent directly observes the environment state
information state = agent state = environment state
This is a Markov decision process (MDP).
Partially Observable Environment
◉ Partial observability: the agent indirectly observes the environment
agent state ≠ environment state
◉ The agent must construct its own state representation s_t^a, for example:
○ Complete history: s_t^a = (o_1, r_1, a_1, …, o_t, r_t)
○ Beliefs of environment state: s_t^a = (P[s_t^e = s^1], …, P[s_t^e = s^n])
○ Hidden state (from an RNN): s_t^a = σ(W_s s_{t-1}^a + W_o o_t)
This is a partially observable Markov decision process (POMDP).
Reward
◉ Reinforcement learning is based on the reward hypothesis
◉ A reward r_t is a scalar feedback signal
○ Indicates how well the agent is doing at step t
Reward hypothesis: all agent goals can be described by the maximization of expected cumulative reward.
Sequential Decision Making
◉ Goal: select actions to maximize total future reward
○ Actions may have long-term consequences
○ Reward may be delayed
○ It may be better to sacrifice immediate reward to gain more long-term reward
Scenario of Reinforcement Learning
(Diagram: the agent observes the state of the environment, takes an action that changes the environment, and receives a reward, e.g., "Don't do that")
Scenario of Reinforcement Learning
(Diagram: the same loop, now with a positive reward, e.g., "Thank you.")
Agent learns to take actions maximizing expected reward.
Machine Learning ≈ Looking for a Function
The observation is the function input, the action is the function output, and the reward is used to pick the best function. That function is the actor/policy:
Action = π(Observation)
Learning to Play Go
(Diagram: in Go, the observation is the board position, the action is the next move, and the environment is the opponent)
Learning to Play Go
Agent learns to take actions maximizing expected reward.
If win, reward = 1; if loss, reward = -1; reward = 0 in most other cases.
Learning to Play Go
◉ Supervised: learning from a teacher. For each board position, the teacher provides the next move (e.g., "5-5", "3-3").
◉ Reinforcement Learning: learning from experience. From the first move through many moves until a win, feedback comes only from the final outcome (two agents play against each other).
AlphaGo uses supervised learning + reinforcement learning.
Learning a Chatbot
◉ Machine obtains feedback from the user
(Example: answering "Bye bye ☺" to "How are you?" yields reward -10; answering "Hi ☺" to "Hello" yields reward 3)
Chatbot learns to maximize the expected reward.
Learning a Chatbot
◉ Let two agents talk to each other (sometimes generate good dialogue, sometimes bad)
Dialogue A: "How old are you?" / "See you." / "See you." / "See you."
Dialogue B: "How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?"
Learning a Chatbot
◉ By this approach, we can generate a lot of dialogues.
◉ Use pre-defined rules to evaluate the goodness of a dialogue
(Dialogues 1–8 are generated and then scored by the rules)
Machine learns from the evaluations as rewards.
Learning to Play Video Game
◉ Space Invaders: the game terminates when all aliens are killed or your spaceship is destroyed
(Screenshot: the agent can fire; the score is the reward for killing aliens; shields protect the spaceship)
Play yourself: http://www.2600online.com/spaceinvaders.html
How about a machine: https://gym.openai.com/evaluations/eval_Eduozx4HRyqgTCVk9ltw
Learning to Play Video Game
Start with observation s_1. Take action a_1: "right" and obtain reward r_1 = 0, leading to observation s_2. Take action a_2: "fire" (kill an alien) and obtain reward r_2 = 5, leading to observation s_3.
Usually there is some randomness in the environment.
Learning to Play Video Game
Starting from observation s_1, after many turns the agent takes action a_T, obtains reward r_T, and the game is over (spaceship destroyed). This is an episode.
Learn to maximize the expected cumulative reward per episode (an episode-loop sketch follows).
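As a sketch of "one episode, one cumulative reward", here is how the loop might look with the OpenAI Gym interface linked above. This assumes the pre-2021 Gym API, where step returns four values, and that the Atari environment is installed; the random policy is just a placeholder.

```python
import gym

env = gym.make("SpaceInvaders-v0")               # assumes gym[atari] is installed
obs = env.reset()                                 # start with observation s_1
episode_return, done, t = 0.0, False, 0

while not done:                                   # play until game over
    action = env.action_space.sample()            # placeholder for a learned policy
    obs, reward, done, info = env.step(action)    # s_{t+1}, r_{t+1}
    episode_return += reward                      # cumulative reward of this episode
    t += 1

print(f"episode length: {t}, cumulative reward: {episode_return}")
```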
More Applications
◉ Flying Helicopter
○ https://www.youtube.com/watch?v=0JL04JJjocc
◉ Driving
○ https://www.youtube.com/watch?v=0xo1Ldx3L5Q
◉ Robot
○ https://www.youtube.com/watch?v=370cT-OAzzM
◉ Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI
○ http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
◉ Text Generation
○ https://www.youtube.com/watch?v=pbQ4qe8EwLo
Fully Observable Environment
Markov Decision Process
Outline
◉ Machine Learning
○ Supervised Learning vs. Reinforcement Learning
○ Reinforcement Learning vs. Deep Learning
◉ Introduction to Reinforcement Learning
○ Agent and Environment
○ Action, State, and Reward
◉ Markov Decision Process
◉ Reinforcement Learning Approach
○ Value-Based
○ Policy-Based
○ Model-Based
Markov Process
◉ A Markov process is a memoryless random process
○ i.e., a sequence of random states S1, S2, ... with the Markov property
(Diagram: the Student Markov chain)
Sample episodes starting from S1 = C1 (a sampling sketch follows):
• C1 C2 C3 Pass Sleep
• C1 FB FB C1 C2 Sleep
• C1 C2 C3 Pub C2 C3 Pass Sleep
• C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
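A small sketch of how such episodes can be sampled from a transition matrix. The probabilities below follow David Silver's Student Markov chain example and should be read as illustrative.

```python
import random

# Transition probabilities of the Student Markov chain (illustrative values
# following David Silver's example); "Sleep" is the terminal state.
P = {
    "C1":   {"C2": 0.5, "FB": 0.5},
    "C2":   {"C3": 0.8, "Sleep": 0.2},
    "C3":   {"Pass": 0.6, "Pub": 0.4},
    "Pass": {"Sleep": 1.0},
    "Pub":  {"C1": 0.2, "C2": 0.4, "C3": 0.4},
    "FB":   {"FB": 0.9, "C1": 0.1},
}

def sample_episode(start="C1"):
    """Sample states until the terminal state is reached (Markov property:
    the next state depends only on the current state)."""
    state, episode = start, [start]
    while state != "Sleep":
        nxt, probs = zip(*P[state].items())
        state = random.choices(nxt, weights=probs)[0]
        episode.append(state)
    return episode

for _ in range(3):
    print(" ".join(sample_episode()))
```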
Markov Reward Process (MRP)
◉ A Markov reward process is a Markov chain with values
○ The return G_t is the total discounted reward from time-step t:
G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … = Σ_{k≥0} γ^k r_{t+k+1}
(Diagram: the Student MRP; a return-computation sketch follows)
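A quick sketch of computing the return for one sampled reward sequence; the rewards and discount factor below are made-up numbers.

```python
def discounted_return(rewards, gamma):
    """G_t = r_{t+1} + gamma * r_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):        # accumulate backwards through the episode
        g = r + gamma * g
    return g

# Illustrative rewards received after time-step t, with discount factor 0.9:
print(discounted_return([0, 0, 5, 10], gamma=0.9))   # 0 + 0.9*0 + 0.81*5 + 0.729*10 = 11.34
```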
Markov Decision Process (MDP)
◉ A Markov decision process is an MRP with decisions
○ It is an environment in which all states are Markov
(Diagram: the Student MDP)
Markov Decision Process (MDP)
◉ S : finite set of states/observations
◉ A : finite set of actions
◉ P : transition probability
◉ R : immediate reward
◉ γ : discount factor
◉ Goal: choose the policy π that maximizes the expected overall return E[Σ_t γ^t r_t] (a small sketch follows)
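A minimal sketch of representing the (S, A, P, R, γ) tuple and estimating a policy's expected return by simulation; the two-state MDP below is invented purely for illustration.

```python
import random

# Toy MDP: S = {"s0", "s1"} with "s1" terminal, A = {"stay", "go"}.
# P[s][a] lists (next_state, probability); R[s][a] is the immediate reward.
P = {"s0": {"stay": [("s0", 0.9), ("s1", 0.1)],
            "go":   [("s1", 0.8), ("s0", 0.2)]}}
R = {"s0": {"stay": 0.0, "go": 1.0}}
gamma = 0.95

def step(state, action):
    nxt, probs = zip(*P[state][action])
    return random.choices(nxt, weights=probs)[0], R[state][action]

def expected_return(policy, episodes=2000):
    """Monte-Carlo estimate of E[sum_t gamma^t r_t] under the given policy."""
    total = 0.0
    for _ in range(episodes):
        state, g, discount = "s0", 0.0, 1.0
        while state != "s1":
            action = policy(state)
            state, reward = step(state, action)
            g += discount * reward
            discount *= gamma
        total += g
    return total / episodes

print(expected_return(lambda s: "go"))    # policy that always chooses "go"
print(expected_return(lambda s: "stay"))  # policy that always chooses "stay"
```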
Reinforcement Learning
Outline
◉ Machine Learning
○ Supervised Learning vs. Reinforcement Learning
○ Reinforcement Learning vs. Deep Learning
◉ Introduction to Reinforcement Learning
○ Agent and Environment
○ Action, State, and Reward
◉ Markov Decision Process
◉ Reinforcement Learning Approach
○ Value-Based
○ Policy-Based
○ Model-Based
Major Components in an RL Agent
◉ An RL agent may include one or more of these components
○ Value function: how good is each state and/or action
○ Policy: agent's behavior function
○ Model: agent's representation of the environment
Reinforcement Learning Approach
◉ Value-based RL
○ Estimate the optimal value function
◉ Policy-based RL
○ Search directly for optimal policy
◉ Model-based RL
○ Build a model of the environment
○ Plan (e.g., by lookahead) using the model
π* is the policy achieving maximum future reward
V*/Q* is the maximum value achievable under any policy
(A tabular Q-learning sketch of the value-based flavor follows.)
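To make the value-based flavor concrete, here is a minimal tabular Q-learning sketch on an invented chain environment. The environment, interface, and hyperparameters are assumptions for illustration, not part of the slides; the learned values define an implicit greedy policy.

```python
import random
from collections import defaultdict

class ChainEnv:
    """Toy chain of 5 cells; reach cell 4 for +1 reward, -0.01 per step otherwise."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):                       # action: -1 (left) or +1 (right)
        self.pos = min(4, max(0, self.pos + action))
        done = self.pos == 4
        return self.pos, (1.0 if done else -0.01), done

actions = [-1, 1]
Q = defaultdict(float)                            # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.99, 0.1
env = ChainEnv()

for _ in range(500):                              # learn Q* from experience
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        s2, r, done = env.step(a)
        best_next = max(Q[(s2, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # TD update
        s = s2

greedy_policy = {s: max(actions, key=lambda a_: Q[(s, a_)]) for s in range(5)}
print(greedy_policy)                              # the implicit policy from the values
```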
Maze Example
◉ Rewards: -1 per time-step
◉ Actions: N, E, S, W
◉ States: agent’s location
Maze Example: Value Function
◉ Rewards: -1 per time-step
◉ Actions: N, E, S, W
◉ States: agent’s location
Numbers represent the value Vπ(s) of each state s
Maze Example: Model
◉ Rewards: -1 per time-step
◉ Actions: N, E, S, W
◉ States: agent’s location
Grid layout represents transition model P
Numbers represent immediate reward R from each state s (same for all a)
Maze Example: Policy
◉ Rewards: -1 per time-step
◉ Actions: N, E, S, W
◉ States: agent’s location
Arrows represent the policy π(s) for each state s (a value-iteration sketch for such a maze follows)
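A sketch of how such per-state values and the greedy policy could be computed with value iteration for a maze with -1 reward per time-step; the small grid layout below is invented, not the maze from the slides.

```python
# Value iteration on a tiny grid maze: -1 reward per time-step, 'G' is the terminal goal.
grid = ["....",
        ".#.G",
        "...."]                                  # '#' is a wall (illustrative layout)
rows, cols = len(grid), len(grid[0])
actions = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

states = [(r, c) for r in range(rows) for c in range(cols) if grid[r][c] != "#"]
goal = next(s for s in states if grid[s[0]][s[1]] == "G")

def move(state, delta):
    r, c = state[0] + delta[0], state[1] + delta[1]
    if 0 <= r < rows and 0 <= c < cols and grid[r][c] != "#":
        return (r, c)
    return state                                 # bumping into a wall keeps you in place

V = {s: 0.0 for s in states}
for _ in range(100):                             # value-iteration sweeps
    for s in states:
        if s == goal:
            continue                             # terminal state keeps value 0
        V[s] = max(-1.0 + V[move(s, d)] for d in actions.values())

# Greedy policy: in each state, pick the action with the best one-step lookahead.
policy = {s: max(actions, key=lambda a: -1.0 + V[move(s, actions[a])])
          for s in states if s != goal}
print(V[(0, 0)], policy[(0, 0)])                 # value and action at the top-left cell
```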
Categorizing RL Agents
◉ Value-Based
○ No Policy (implicit)
○ Value Function
◉ Policy-Based
○ Policy
○ No Value Function
◉ Actor-Critic
○ Policy
○ Value Function
◉ Model-Free
○ Policy and/or Value Function
○ No Model
◉ Model-Based
○ Policy and/or Value Function
○ Model
RL Agent Taxonomy
(Diagram: taxonomy of RL agents spanning Value, Policy, and Model; value-based methods learn a critic, policy-based methods learn an actor, actor-critic methods learn both, and agents are either model-free or model-based)
Concluding Remarks
◉ RL is a general-purpose framework for decision making via interactions between an agent and an environment
○ RL is for an agent with the capacity to act
○ Each action influences the agent's future state
○ Success is measured by a scalar reward signal
○ Goal: select actions to maximize future reward
◉ An RL agent may include one or more of these components
○ Value function: how good is each state and/or action
○ Policy: agent's behavior function
○ Model: agent's representation of the environment
References
◉ Course materials by David Silver:
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
◉ ICLR 2015 Tutorial:
http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf
◉ ICML 2016 Tutorial: http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf