Reinforcement Learning
An introduction to the mathematics of artificial intelligence

Najeeb Khan (BIGLAB)
[email protected] · najeebkhan.ca
June 8, 2017
Outline
1. Introduction
2. Reinforcement Learning
3. Deep Q-Learning
4. Imitating Optimal Control
I have forty minutes!
[Figure: a clock face]
You may not like proofs!
Image Source: Sidney Harris
Offered with a pinch of salt, and citations
Image Source: REUTERS/Sergei Karpukhin
Artificial Intelligence
The study of intelligent systems that take actions under uncertainty in order to achieve certain goals.
Image Source: Karen Zack, twitter.com/teenybiscuit
Image Source: RoboRace: http://roborace.com
How can we build such systems?
- Knowledge Engineering: Project Cyc (Lenat et al., 1990)
- Machine Learning
Imitation Learning
Image Source: BBC Pictures
[Diagram: the environment emits states X_i; a teacher provides labels Y_i; the learner's policy produces predictions Ŷ_i = π(X_i); learning is driven by the error Y_i − Ŷ_i.]
The samples are assumed independent and identically distributed, X_i ∼ p(X).
A typical function-approximation problem.
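The supervised view above can be sketched in a few lines: draw i.i.d. states, ask a teacher for labels, and fit a policy by least squares. The linear teacher and its weights here are invented for illustration; they are not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. samples X_i ~ p(X) and teacher labels Y_i (hypothetical linear teacher).
X = rng.normal(size=(500, 3))
true_w = np.array([0.5, -1.0, 2.0])
Y = X @ true_w + 0.01 * rng.normal(size=500)

# The learner's policy pi(X_i) = X_i @ w; minimize sum_i |Y_i - Yhat_i|^2.
w, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ w
```

With enough i.i.d. data the learned policy matches the teacher almost exactly; the next section is about what goes wrong when these assumptions fail.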
Problems with Imitation Learning
Image Source: Boston Dynamics
Image Source: Sutton and Barto (Sutton and Barto, 1998)
- We don't have the teacher signal Y_i: only a delayed reward is available.
- The samples are not independent: the agent changes the environment through its actions. Training data: X_i ∼ p(X); real world: X_i ∼ p_π(X).
Reinforcement Learning
[Diagram: at each time step the agent receives state x_t and reward r_t from the environment and responds with action u_t.]
Reinforcement Learning
[Diagram: RL at the intersection of mathematics (decision theory), engineering (optimal control), and computer science (machine learning).]
Markov Decision Processes
- A stochastic process
- A decision at every step
- Markovity: the decision depends only on the current state, irrespective of the past
An MDP consists of:
- X: a set of states
- U: a set of actions
- P_x^u: state transition probabilities
- γ ∈ [0, 1]: a discount factor
- r: X × U → R: a reward function
For now, let's assume X and U are finite sets. At each time step, take action u_t and transition to a new state x_{t+1}.
The goal of RL
Total payoff:

R = r(x_0, u_0) + γ r(x_1, u_1) + γ^2 r(x_2, u_2) + ···

or, with state-only rewards,

R = r(x_0) + γ r(x_1) + γ^2 r(x_2) + ···

Choose actions u_t wisely to maximize the expected value of the total payoff.
Policy
A policy π : X → U is a mapping from states to actions.
How good is a given policy? The value function:

V^π(x) = E[r(x_0) + γ r(x_1) + γ^2 r(x_2) + ··· | x_0 = x, π]    (1)

Bellman equations:

V^π(x) = r(x) + γ Σ_{x'∈X} P_x^{π(x)}(x') V^π(x')    (2)

Optimal value function:

V*(x) = r(x) + max_{u∈U} γ Σ_{x'∈X} P_x^u(x') V*(x')    (3)

Optimal policy:

π*(x) = arg max_{u∈U} Σ_{x'∈X} P_x^u(x') V*(x')    (4)
Learning Policies

V*(x) = V^{π*}(x) ≥ V^π(x) for any policy π.

Two classical algorithms:
- Value iteration
- Policy iteration
Value Iteration
Result: optimal value function V*(x)
V(x) := 0 for all x ∈ X;
while not converged do
    V(x) := r(x) + max_{u∈U} γ Σ_{x'∈X} P_x^u(x') V(x')    for all x ∈ X
end
It can be shown that V(x) → V*(x), and V*(x) can then be used to find the optimal policy π* using Equation 4.
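The loop above is a repeated Bellman backup. A minimal sketch on a made-up 2-state, 2-action MDP (the transition table and rewards are invented for illustration, not from the talk):

```python
import numpy as np

# P[x, u, x'] : state transition probabilities; r[x] : state-only reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
r = np.array([0.0, 1.0])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman backup: V(x) := r(x) + max_u gamma * sum_x' P(x'|x,u) V(x')
    V_new = r + gamma * (P @ V).max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Greedy policy from V* via Equation 4.
pi = (P @ V).argmax(axis=1)
```

Both states prefer the action that steers toward the rewarding state 1, as expected.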
Policy Iteration
Result: optimal policy and value function π*, V*(x)
π(x) := random() for all x ∈ X;
while not converged do
    V(x) := V^π(x)    for all x ∈ X
    π(x) := arg max_{u∈U} Σ_{x'∈X} P_x^u(x') V(x')    for all x ∈ X
end
It can be shown that V(x) → V*(x) and π(x) → π*(x).
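Policy iteration alternates exact policy evaluation with greedy improvement. A sketch on the same kind of toy 2-state MDP (the numbers are again a made-up example); the evaluation step solves the linear Bellman system V = r + γ P_π V directly:

```python
import numpy as np

P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
r = np.array([0.0, 1.0])
gamma = 0.9
n = len(r)

rng = np.random.default_rng(0)
pi = rng.integers(0, 2, size=n)          # random initial policy
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = r exactly.
    P_pi = P[np.arange(n), pi]           # (n, n) transitions under pi
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
    # Policy improvement: greedy with respect to V.
    pi_new = (P @ V).argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
```

On a finite MDP this terminates after a few sweeps at the same optimal policy value iteration finds.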
Learning the Markov Decision Process
The states X, actions U, and discount factor γ are usually known.
The transition probabilities can be estimated by maximum likelihood:

P_x^u(x') = (number of times we decided u in state x and transitioned to x') / (number of times we decided u in state x)
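The maximum-likelihood estimate is just a ratio of counts over logged (x, u, x') triples. A sketch with a fabricated trajectory log:

```python
from collections import Counter

# Logged (x, u, x') transitions (fabricated data for illustration).
transitions = [(0, 1, 1), (0, 1, 1), (0, 1, 0), (0, 0, 0), (1, 1, 1)]

counts = Counter()   # how often (x, u) led to x'
totals = Counter()   # how often u was taken in x
for x, u, x_next in transitions:
    counts[(x, u, x_next)] += 1
    totals[(x, u)] += 1

def P_hat(x, u, x_next):
    # (# times decided u in x and moved to x') / (# times decided u in x)
    if totals[(x, u)] == 0:
        return 0.0   # unseen (x, u) pair: no estimate (could default to uniform)
    return counts[(x, u, x_next)] / totals[(x, u)]
```

Unvisited state-action pairs have no data; in practice they are often assigned a uniform distribution over successor states.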
Continuous State MDPs
Data for learning the MDP:
- A physics-based simulator: (x_t, u_t) → system dynamics → x_{t+1}
- Run a random policy and collect data from the real system, then fit a linear model

x_{t+1} = A x_t + B u_t

arg min_{A,B} Σ_{i=1}^{m} Σ_{t=0}^{T−1} |x^i_{t+1} − (A x^i_t + B u^i_t)|^2
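The least-squares problem above has a closed-form solution: stack [x_t; u_t] into a design matrix and regress x_{t+1} on it. The true A and B here are invented only to generate synthetic rollout data:

```python
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 0.9]])   # hypothetical dynamics
B_true = np.array([[0.0], [0.5]])

# Run a random policy and collect (x_t, u_t, x_{t+1}) from the "real system".
X, U, X_next = [], [], []
x = np.zeros(2)
for t in range(200):
    u = rng.normal(size=1)                    # random policy
    x_next = A_true @ x + B_true @ u
    X.append(x); U.append(u); X_next.append(x_next)
    x = x_next

# Solve min_{A,B} sum_t |x_{t+1} - (A x_t + B u_t)|^2 via least squares.
Z = np.hstack([np.array(X), np.array(U)])     # rows are [x_t; u_t]
W, *_ = np.linalg.lstsq(Z, np.array(X_next), rcond=None)
A_hat, B_hat = W[:2].T, W[2:].T
```

Because the synthetic system is noiseless, the fit recovers A and B essentially exactly; with a real system, noise makes this a genuine regression.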
Continuous State MDPs
Fitted value iteration:

V(x) := r(x) + max_{u∈U} γ ∫_{x'} P_x^u(x') V(x') dx'    (5)

- Get a sample estimate of the right-hand side of Equation 5; call it y.
- Use a supervised learning algorithm to compute V(x):

V(x) = θ^T φ(x)    (6)

θ = arg min_θ Σ_{i=1}^{m} |θ^T φ(x^i) − y^i|^2    (7)

Convergence is not guaranteed!
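The supervised step (Equations 6–7) is a linear regression of Bellman-backup targets onto features. A sketch of just that step, with a hypothetical quadratic feature map φ and synthetic targets y standing in for the sampled backups:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # hypothetical feature map for a scalar state: [1, x, x^2]
    return np.array([1.0, x, x * x])

# Sampled states and Bellman-backup targets y_i (synthetic stand-ins here).
xs = rng.uniform(-1, 1, size=100)
y = 2.0 - 0.5 * xs + 1.5 * xs**2

# theta = argmin_theta sum_i |theta^T phi(x_i) - y_i|^2  (Equation 7)
Phi = np.stack([phi(x) for x in xs])    # (100, 3) design matrix
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

In real fitted value iteration the targets y change each sweep as V changes, which is exactly why convergence is not guaranteed.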
Q-Learning
Model-free RL.
Result: Q-function Q(x, u)
Q(x, u) := random for all x ∈ X, u ∈ U;
Set the learning rate α;
while not converged do
    Q(x, u) := Q(x, u) + α (r + γ max_{u'∈U} Q(x', u') − Q(x, u))
end
It has been proved that Q-learning converges to the true Q-function (Melo, 2001).
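A tabular sketch of the update above, run on the same kind of made-up 2-state MDP used earlier (the dynamics, ε-greedy exploration schedule, and constants are all illustrative choices, not from the talk). Note no transition model is used in the update, only sampled transitions:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([                       # environment (hidden from the learner)
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
r = np.array([0.0, 1.0])             # state-only reward
gamma, alpha, eps = 0.9, 0.1, 0.2

Q = np.zeros((2, 2))
x = 0
for _ in range(50000):
    # epsilon-greedy behavior policy
    u = int(rng.integers(2)) if rng.random() < eps else int(Q[x].argmax())
    x_next = int(rng.choice(2, p=P[x, u]))
    # Q(x,u) := Q(x,u) + alpha * (r + gamma * max_u' Q(x',u') - Q(x,u))
    Q[x, u] += alpha * (r[x] + gamma * Q[x_next].max() - Q[x, u])
    x = x_next
```

With a constant learning rate the table keeps fluctuating slightly, but the ordering of the Q-values (and hence the greedy policy) stabilizes.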
Deep Q-Network
Atari Breakout
- Reward? The game score.
- State? The position and direction of the ball, the position of the paddle, etc.
Each game would need its own hand-crafted state; we want a one-for-all state. Why not just use the entire screen of pixels?
Convolutional Neural Networks
- Assumes local connectivity
- Generates activation maps using the convolution operation instead of dot products
1D convolution:

(w ∗ x)_n = Σ_{m∈kern} w_m x_{n−m}

Activation map:

activ(n) = ReLU((w ∗ x)_n)    (8)

where ReLU(z) = max(0, z).
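Equation 8 in two lines: convolve a small kernel over the input, then clip negatives to zero. The input signal and kernel values are arbitrary example numbers:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])   # a 1-D "image"
w = np.array([1.0, -1.0])                           # a small difference kernel

# (w * x)_n = sum_m w_m x_{n-m}; "valid" keeps only full-overlap positions.
conv = np.convolve(x, w, mode="valid")
activ = np.maximum(conv, 0.0)                       # ReLU(z) = max(0, z)
```

The activation map fires only where the signal is rising, i.e. the kernel acts as a one-sided edge detector.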
Convolutional Neural Networks
[Diagram: raw pixels → ReLU(w ∗ x) → activation maps → max pooling → features; repeat, and finally map to Q(x, u).]
Deep Q-Network
(Mnih et al., 2013) https://www.youtube.com/watch?v=V1eYniJ0Rnk
Optimal Control
The optimal control problem is given by

min_{u_1,…,u_T} Σ_{t=1}^{T} c(x_t, u_t)    s.t. x_t = f(x_{t−1}, u_{t−1})    (9)

Shooting method:

min_{u_1,…,u_T} c(x_1, u_1) + c(f(x_1, u_1), u_2) + ··· + c(f(f(…)…), u_T)    (10)

Backpropagate the gradients df/dx_t, df/du_t, dc/dx_t, dc/du_t and optimize (Linear Quadratic Regulator).
Known dynamics → optimal control.
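A toy shooting-method sketch: substitute the dynamics into the cost as in Equation 10, then descend on the action sequence. The scalar dynamics, quadratic cost, and finite-difference gradients (standing in for backpropagation/LQR) are all simplifications invented for illustration:

```python
import numpy as np

a, b, x0, T = 1.0, 1.0, 1.0, 5          # toy scalar dynamics x' = a*x + b*u

def total_cost(u):
    # Roll the dynamics forward and accumulate c(x_t, u_t) = x_t^2 + u_t^2.
    x, J = x0, 0.0
    for t in range(T):
        J += x * x + u[t] * u[t]
        x = a * x + b * u[t]             # x_{t+1} = f(x_t, u_t)
    return J + x * x                     # terminal state cost

u = np.zeros(T)
for _ in range(500):
    g = np.zeros(T)
    for t in range(T):                   # finite-difference estimate of dJ/du_t
        du = np.zeros(T); du[t] = 1e-6
        g[t] = (total_cost(u + du) - total_cost(u - du)) / 2e-6
    u -= 0.05 * g                        # gradient step on the action sequence
```

The optimized action sequence pushes the state toward zero early, cutting the cost well below that of doing nothing.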
Imitating Optimal Control
Controlling a low-cost manipulator (Deisenroth et al., 2011):
- Run a base policy π_0(u_t|x_t) to collect data D = {(x, u, x')_i}
- Learn a Gaussian process dynamics model p(x'|x, u) to maximize Σ_i log p(x'_i|x_i, u_i)
- Backpropagate through p(x'|x, u) into the policy to optimize π_0(u_t|x_t)
- Run π_0(u_t|x_t), appending the visited tuples (x, u, x') to D
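The loop above can be caricatured in a few lines. This is emphatically not the method of Deisenroth et al.: I replace the Gaussian process dynamics model with scalar least squares, and the policy optimization with a simple model-derived feedback gain u = −k·x; only the collect/fit/improve/append structure is kept:

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 1.1, 0.8            # unknown scalar dynamics x' = a*x + b*u

def rollout(k, T=30):
    # Run policy u = -k*x (plus exploration noise) and log (x, u, x') tuples.
    x, data = 1.0, []
    for _ in range(T):
        u = -k * x + 0.1 * rng.normal()
        x_next = a_true * x + b_true * u
        data.append((x, u, x_next))
        x = x_next
    return data

D = rollout(k=0.0)                   # base policy pi_0
for _ in range(5):
    # Fit the dynamics model x' ~ a_hat*x + b_hat*u from all data so far.
    Z = np.array([(x, u) for x, u, _ in D])
    y = np.array([xn for _, _, xn in D])
    (a_hat, b_hat), *_ = np.linalg.lstsq(Z, y, rcond=None)
    # Improve the policy under the model: gain that cancels the dynamics.
    k = a_hat / b_hat
    D += rollout(k)                  # run the improved policy, append to D
```

The fitted gain makes the closed loop a − b·k ≈ 0, so the learned controller drives the state to zero in one step under the model.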
Imitating Optimal Control
(Deisenroth et al., 2011) https://www.youtube.com/watch?v=gdT6dwUOYC0
Want to learn more?
Figure – Drafts available on authors’ websites
Google DeepMind https://deepmind.com/research/publications/
Acknowledgment
Dr. Ian Stavness
Thank You najeebkhan.ca/docs/rl2017.pdf
References
Deisenroth, M. P., Rasmussen, C. E., and Fox, D. (2011). Learning to control a low-cost manipulator using data-efficient reinforcement learning.
Lenat, D. B., Guha, R. V., Pittman, K., Pratt, D., and Shepherd, M. (1990). Cyc: Toward programs with common sense. Communications of the ACM, 33(8):30–49.
Melo, F. S. (2001). Convergence of Q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep., pages 1–4.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.