Reinforcement Learning: An introduction to the mathematics of artificial intelligence

Najeeb Khan [email protected] najeebkhan.ca

June 8, 2017

Outline

1. Introduction
2. Reinforcement Learning
3. Deep Q-Learning
4. Imitating Optimal Control


I have forty minutes!


You may not like proofs!

Image Source: Sidney Harris


Offering a pinch of salt and citations

Image Source: REUTERS/Sergei Karpukhin



Artificial Intelligence

The study of intelligent systems that take actions under uncertainty in order to achieve certain goals.

Image Source: Karen Zack, twitter.com/teenybiscuit
Image Source: RoboRace: http://roborace.com

How can we build such systems?
- Knowledge Engineering
  - Project Cyc (Lenat et al., 1990)
- Machine Learning

Imitation Learning

Image Source: BBC Pictures

[Diagram: the environment produces inputs X_i; a teacher provides target outputs Y_i; the learner's policy produces predictions Ŷ_i = π(X_i); the error Y_i − Ŷ_i is used to train the policy.]

- The samples are assumed independent and identically distributed, X_i ∼ p(X).
- This is a typical function approximation problem.
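Since the slides cast imitation learning as supervised function approximation, a minimal sketch (not from the talk; the linear "teacher" and the data are made up) is ordinary least-squares regression from states to teacher actions:

```python
import numpy as np

# Hypothetical teacher demonstrations: states X_i and teacher actions Y_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))             # states drawn i.i.d. from p(X)
Y = X @ np.array([0.5, -1.0, 0.3, 2.0])    # the teacher's (unknown) policy, linear here

# Fit a policy pi(x) = w . x by least squares, i.e. minimize |Y_i - pi(X_i)|^2.
w, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ w                              # predictions Y^_i on states drawn from p(X)
print("mean squared imitation error:", np.mean((Y - Y_hat) ** 2))
```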


Problems with Imitation Learning

Image Source: Boston Dynamics
Image Source: Sutton and Barto (Sutton and Barto, 1998)

- We don't have the teacher signal Y_i
  - Only a delayed reward is available
- The samples are not independent
- The agent changes the environment through its actions
  - Training data: X_i ∼ p(X); real world: X_i ∼ p_π(X)

Reinforcement Learning

[Diagram: the agent observes the state x_t and a reward r_t from the environment and responds with an action u_t.]

Reinforcement Learning

RL sits at the intersection of several fields:
- Mathematics: Decision Theory
- Engineering: Optimal Control
- Computer Science: Machine Learning

Markov Decision Processes

- A stochastic process
- A decision is made at every step
- Markovity: the decision depends only on the current state, irrespective of the past

An MDP is specified by:
- X: a set of states
- U: a set of actions
- P_x^u: state transition probabilities
- γ ∈ [0, 1]: a discount factor
- r: X × U → R: a reward function

For now, let's assume X and U are finite sets. At each time step, the agent takes an action u_t and transitions to a new state x_t.
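As a small illustration of these ingredients (the sizes, transition model, and rewards are made up), a finite MDP can be held in a few arrays:

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9                                   # discount factor in [0, 1]

# P[u, x, x'] = probability of landing in x' after taking action u in state x
# (made-up uniform transitions, so each row sums to one).
P = np.full((n_actions, n_states, n_states), 1.0 / n_states)

# r[x] = reward for being in state x (state-only rewards, as on the next slide).
r = np.array([0.0, 0.0, 1.0])

rng = np.random.default_rng(0)

def step(x, u):
    """At each time step: take action u, observe the reward, transition to a new state."""
    x_next = rng.choice(n_states, p=P[u, x])
    return x_next, r[x]
```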

The goal of RL

Total payoff:

    R = r(x_0, u_0) + γ r(x_1, u_1) + γ² r(x_2, u_2) + ···

With state-only rewards:

    R = r(x_0) + γ r(x_1) + γ² r(x_2) + ···

Choose the actions u_t wisely to maximize the expected value of the total payoff.
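A tiny worked example of the discounted payoff (the reward sequence is made up):

```python
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 1.0]                      # r(x_0), r(x_1), r(x_2), r(x_3)
R = sum(gamma ** t * r_t for t, r_t in enumerate(rewards))
print(R)                                            # 0.9**2 + 0.9**3 = 1.539
```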

Policy

- A policy π : X → U is a mapping from states to actions.
- How good is a given policy? The value function:

    V^π(x) = E[r(x_0) + γ r(x_1) + γ² r(x_2) + ··· | x_0 = x, π]    (1)

- Bellman equations:

    V^π(x) = r(x) + γ Σ_{x'∈X} P_x^{π(x)}(x') V^π(x')    (2)

- Optimal value function:

    V*(x) = r(x) + max_π γ Σ_{x'∈X} P_x^{π(x)}(x') V^π(x')    (3)

- Optimal policy:

    π*(x) = arg max_{u∈U} Σ_{x'∈X} P_x^u(x') V*(x')    (4)

Learning Policies

    V*(x) = V^{π*}(x) ≥ V^π(x)

Two classic algorithms:
- Value iteration
- Policy iteration

Value Iteration

Result: optimal value function V*(x)
    V(x) := 0   ∀ x ∈ X
    while not converged do
        V(x) := r(x) + max_{u∈U} γ Σ_{x'∈X} P_x^u(x') V(x')   ∀ x ∈ X
    end

- It can be shown that V(x) → V*(x).
- V*(x) can be used to find the optimal policy π* using Equation 4.
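A minimal sketch of this loop for a finite MDP, assuming the hypothetical P[u, x, x'] and r[x] array layout from the earlier illustration:

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """P[u, x, x'] transition probabilities, r[x] rewards (hypothetical layout)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                        # V(x) := 0 for all x
    while True:                                   # "while not converged"
        # Bellman backup: V(x) := r(x) + max_u gamma * sum_x' P_x^u(x') V(x')
        V_new = r + gamma * np.max(P @ V, axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Usage, given P, r, gamma as above:
# V_star = value_iteration(P, r, gamma)
# pi_star = np.argmax(P @ V_star, axis=0)         # greedy policy via Equation 4
```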

Policy Iteration

Result: optimal policy and value function π*, V*(x)
    π(x) := random()   ∀ x ∈ X
    while not converged do
        V(x) := V^π(x)   ∀ x ∈ X
        π(x) := arg max_{u∈U} Σ_{x'∈X} P_x^u(x') V(x')   ∀ x ∈ X
    end

- It can be shown that V(x) → V*(x) and π(x) → π*(x).
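A matching sketch (same hypothetical array layout), where the policy-evaluation step solves the linear Bellman equations (2) exactly:

```python
import numpy as np

def policy_iteration(P, r, gamma, seed=0):
    """Same hypothetical P[u, x, x'] and r[x] layout as in the value-iteration sketch."""
    n_actions, n_states, _ = P.shape
    pi = np.random.default_rng(seed).integers(n_actions, size=n_states)  # pi(x) := random()
    while True:                                    # "while not converged"
        # Policy evaluation: solve the linear Bellman equations (2) for V^pi.
        P_pi = P[pi, np.arange(n_states)]          # P_pi[x, x'] = P_x^{pi(x)}(x')
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r)
        # Policy improvement: pi(x) := argmax_u sum_x' P_x^u(x') V(x')
        pi_new = np.argmax(P @ V, axis=0)
        if np.array_equal(pi_new, pi):
            return pi_new, V
        pi = pi_new
```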

Learning the Markov Decision Process

- The states X, the actions U, and the discount factor γ are usually known.
- The transition probabilities can be estimated with the maximum likelihood estimate:

    P_x^u(x') = (number of times we took action u in state x and transitioned to x') / (number of times we took action u in state x)
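A counting sketch of this estimate (the observed transitions are made up):

```python
from collections import defaultdict

# Hypothetical observed transitions as (x, u, x') triples.
transitions = [(0, 1, 2), (0, 1, 2), (0, 1, 0), (2, 0, 1)]

counts = defaultdict(lambda: defaultdict(int))
for x, u, x_next in transitions:
    counts[(x, u)][x_next] += 1

def P_hat(x, u, x_next):
    """Maximum likelihood estimate of P_x^u(x')."""
    total = sum(counts[(x, u)].values())       # times we took u in state x
    return counts[(x, u)][x_next] / total      # ... and ended up in x'

print(P_hat(0, 1, 2))                          # 2 / 3
```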

Continuous State MDPs

Data for learning the MDP:
- Use a physics-based simulator: (x_t, u_t) → system dynamics → x_{t+1}
- Or run a random policy, collect data from the real system, and fit a linear model:

    x_{t+1} = A x_t + B u_t

    arg min_{A,B} Σ_{i=1}^m Σ_{t=0}^{T−1} |x_{t+1}^i − (A x_t^i + B u_t^i)|²
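A least-squares sketch of this fit, assuming hypothetical per-trajectory arrays `states` and `actions`:

```python
import numpy as np

def fit_linear_dynamics(states, actions):
    """Least-squares fit of x_{t+1} ~ A x_t + B u_t.

    `states` is a list of (T+1, n_x) arrays and `actions` a list of (T, n_u) arrays,
    one pair per trajectory collected under the random policy (hypothetical layout).
    """
    # Stack every (x_t, u_t) -> x_{t+1} pair across trajectories and time steps.
    XU = np.concatenate([np.hstack([s[:-1], a]) for s, a in zip(states, actions)])
    X_next = np.concatenate([s[1:] for s in states])
    # Solves argmin_{A,B} sum_{i,t} |x^i_{t+1} - (A x^i_t + B u^i_t)|^2.
    W, *_ = np.linalg.lstsq(XU, X_next, rcond=None)
    n_x = states[0].shape[1]
    A, B = W[:n_x].T, W[n_x:].T
    return A, B
```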

Continuous State MDPs: Fitted Value Iteration

    V(x) := r(x) + max_{u∈U} γ ∫_{x'} P_x^u(x') V(x') dx'    (5)

- Get a sample estimate of the right-hand side of Equation 5; call it y.
- Use a supervised learning algorithm to fit V(x):

    V(x) = θᵀ φ(x)    (6)

    θ = arg min_θ Σ_{i=1}^m |θᵀ φ(x^i) − y^i|²    (7)

- Convergence is not guaranteed!
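One possible realization of fitted value iteration, assuming a hypothetical sampler `simulate(x, u)` for next states and a feature map `phi`; the linear fit implements Equations 6-7:

```python
import numpy as np

def fitted_value_iteration(sample_states, r, gamma, actions, simulate, phi, n_iters=50):
    """A sketch of fitted value iteration.

    `simulate(x, u)` returns a list of sampled next states x' ~ P_x^u and
    `phi(x)` returns a feature vector; both are hypothetical stand-ins.
    """
    Phi = np.array([phi(x) for x in sample_states])
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = []
        for x in sample_states:
            # Sample estimate of r(x) + max_u gamma * E[V(x')]  (Equation 5).
            q = [r(x) + gamma * np.mean([phi(xp) @ theta for xp in simulate(x, u)])
                 for u in actions]
            y.append(max(q))
        # Supervised step (Equations 6-7): fit V(x) = theta^T phi(x) to the targets y.
        theta, *_ = np.linalg.lstsq(Phi, np.array(y), rcond=None)
    return theta
```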


Q-Learning

Model-free RL.

Result: Q-function Q(x, u)
    Q(x, u) := random   ∀ x ∈ X, u ∈ U
    Set the learning rate α
    while not converged do
        Q(x, u) := Q(x, u) + α (r + γ max_{u'∈U} Q(x', u') − Q(x, u))
    end

- It has been proved that Q-learning converges to the true Q-function (Melo, 2001).
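A tabular sketch of this update, assuming a hypothetical environment with `reset()` and `step(u)` methods; the ε-greedy exploration rule is a common choice and is not part of the slide's pseudocode:

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.99, alpha=0.1,
               epsilon=0.1, n_episodes=1000, seed=0):
    """Tabular Q-learning; `env` is a hypothetical object with
    reset() -> x and step(u) -> (x', r, done)."""
    rng = np.random.default_rng(seed)
    Q = rng.random((n_states, n_actions))             # Q(x, u) := random
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (a common choice, not on the slide).
            u = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[x]))
            x_next, reward, done = env.step(u)
            # Q(x,u) := Q(x,u) + alpha * (r + gamma * max_u' Q(x',u') - Q(x,u))
            Q[x, u] += alpha * (reward + gamma * np.max(Q[x_next]) - Q[x, u])
            x = x_next
    return Q
```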

Deep Q-Network

Atari Breakout:
- Reward? The game score.
- State? The position and direction of the ball, the position of the paddle, etc.
- Each game will have a different state; we want a one-for-all state representation.
- Why not just use the entire screen of pixels?

Convolutional Neural Networks

- Assume local connectivity.
- Generate activation maps using the convolution operation instead of dot products.
- 1D convolution:

    (w ∗ x)_n = Σ_{m ∈ kern} w_m x_{n−m}

- Activation map:

    activ(n) = ReLU((w ∗ x)_n)    (8)

  where ReLU(z) = max(0, z).
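A NumPy sketch of Equation 8 (the signal and kernel values are made up):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)                      # ReLU(z) = max(0, z)

def activation_map(w, x):
    # (w * x)_n = sum_m w_m x_{n-m}; np.convolve computes exactly this sum.
    return relu(np.convolve(x, w, mode="valid"))

x = np.array([0., 1., 2., 3., 2., 1., 0.])       # a toy 1D "image"
w = np.array([1., -1.])                          # a small kernel (values made up)
print(activation_map(w, x))                      # [1. 1. 1. 0. 0. 0.]
```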

[Pipeline: raw pixels → convolution + ReLU → activation maps → max pooling → features → repeat, then map the final features to Q(x, u).]

Deep Q-Network

(Mnih et al., 2013) https://www.youtube.com/watch?v=V1eYniJ0Rnk

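The full DQN of Mnih et al. (2013) combines a convolutional Q-network over stacked screen frames with experience replay; a heavily reduced sketch of just the replay buffer and the TD targets, with a made-up linear stand-in for the network, is:

```python
import random
from collections import deque
import numpy as np

# Stand-in for the convolutional network Q(x, . ; params); the real DQN uses
# several convolutional and fully connected layers over preprocessed frames.
def q_values(params, x):
    return params @ x                                  # -> one value per action

replay_buffer = deque(maxlen=100_000)                  # experience replay memory

def sample_batch(batch_size=32):
    return random.sample(list(replay_buffer), batch_size)

def td_targets(params, batch, gamma=0.99):
    # y = r                            for terminal transitions
    # y = r + gamma * max_u' Q(x', u') otherwise; the network is regressed onto y.
    return np.array([r if done else r + gamma * np.max(q_values(params, x_next))
                     for (x, u, r, x_next, done) in batch])
```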


Optimal Control

The optimal control problem is given by

    min_{u_1, u_2, ···, u_T} Σ_{t=1}^T c(x_t, u_t)   subject to   x_t = f(x_{t−1}, u_{t−1})    (9)

Shooting method: substitute the dynamics into the cost and optimize over the actions directly,

    min_{u_1, u_2, ···, u_T} c(x_1, u_1) + c(f(x_1, u_1), u_2) + ··· + c(f(f(...)...), u_T)    (10)

Backpropagate gradients and optimize (Linear Quadratic Regulator):

    df/dx_t, df/du_t, dc/dx_t, dc/du_t

Known dynamics → optimal control.
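A toy sketch of the shooting idea with made-up scalar dynamics and cost, using finite-difference gradients on the action sequence instead of the analytic LQR derivatives listed above:

```python
import numpy as np

def f(x, u):                 # made-up scalar dynamics
    return 0.9 * x + u

def c(x, u):                 # made-up quadratic cost: reach x = 1 with small controls
    return (x - 1.0) ** 2 + 0.1 * u ** 2

def total_cost(us, x0=0.0):
    x, J = x0, 0.0
    for u in us:             # roll the dynamics forward, as in Equation 10
        x = f(x, u)
        J += c(x, u)
    return J

us = np.zeros(5)             # the decision variables u_1 ... u_T
for _ in range(500):
    grad = np.array([(total_cost(us + 1e-5 * e) - total_cost(us)) / 1e-5
                     for e in np.eye(len(us))])
    us -= 0.05 * grad        # plain gradient descent on the action sequence
print(total_cost(us))
```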

Imitating Optimal Control

Controlling a low-cost manipulator (Deisenroth et al., 2011):
1. Run a base policy π_0(u_t | x_t) to collect data D = {(x, u, x')_i}.
2. Learn a Gaussian Process dynamics model p(x' | x, u) by maximizing Σ_i log p(x'_i | x_i, u_i).
3. Backpropagate through p(x' | x, u) into the policy to optimize π_0(u_t | x_t).
4. Run π_0(u_t | x_t), appending the visited tuples (x, u, x') to D, and repeat.
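A schematic version of this loop, with stand-ins for the Gaussian Process model and the policy optimizer (only the data-collection structure is kept; the dynamics, policy, and dimensions are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def base_policy(x, params):                       # pi_0(u_t | x_t), deterministic stand-in
    return params @ x

def rollout(params, horizon=20):
    """Run the policy on a made-up 'real system' and return (x, u, x') tuples."""
    x, data = rng.normal(size=3), []
    for _ in range(horizon):
        u = base_policy(x, params)
        x_next = 0.9 * x + 0.1 * u + 0.01 * rng.normal(size=3)
        data.append((x, u, x_next))
        x = x_next
    return data

def learn_dynamics(D):
    return D          # placeholder for fitting p(x' | x, u), e.g. a Gaussian Process

def improve_policy(params, model):
    return params     # placeholder for backpropagating through the model into pi_0

params = rng.normal(size=(3, 3))
D = rollout(params)                               # run the base policy, collect D
for _ in range(5):
    model = learn_dynamics(D)                     # learn the dynamics model from D
    params = improve_policy(params, model)        # optimize the policy through the model
    D += rollout(params)                          # append newly visited tuples to D
```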

Imitating Optimal Control

(Deisenroth et al., 2011) https://www.youtube.com/watch?v=gdT6dwUOYC0


Want to learn more?

Figure – Drafts available on authors’ websites

Google DeepMind https://deepmind.com/research/publications/


Acknowledgment

Dr. Ian Stavness


Thank You najeebkhan.ca/docs/rl2017.pdf

References

Deisenroth, M. P., Rasmussen, C. E., and Fox, D. (2011). Learning to control a low-cost manipulator using data-efficient reinforcement learning.

Lenat, D. B., Guha, R. V., Pittman, K., Pratt, D., and Shepherd, M. (1990). Cyc: toward programs with common sense. Communications of the ACM, 33(8):30-49.

Melo, F. S. (2001). Convergence of Q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep., pages 1-4.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.
