Deep Neural Networks: Introduction, Architectures and Implementations

Najeeb Khan [email protected] usask.ca/~najeeb.khan

November 29, 2016

Outline

1  Introduction: Hard Problems, Datasets, Neural Networks
2  Architectures: Unsupervised Learning, Convolutional Neural Networks, Recurrent Neural Networks, Overfitting
3  Implementations: TensorFlow, Keras, Hardware


This Presentation

© Image Source: Ralph A. Clevenger


Machine Learning Tribes

The five tribes of machine learning: Analogizers, Bayesians, Symbolists, Evolutionaries, and Connectionists.

Graphic inspired by the first few chapters of The Master Algorithm (Domingos, 2015)


Traditional Machine Learning

Image Source: http://scikit-learn.org


Hard Problems: Animal or Food?

Image Source: Karen Zack, twitter.com/teenybiscuit


Hard Problems: Autonomous Formula One?

Image Source: RoboRace: http://roborace.com


Hard Problems: Should I?

Image Source: OpenReview: ICLR 2017


Datasets

YouTube-8M Dataset
  - 8 million videos, 56 years in duration
  - Available for download: research.google.com/youtube8m

Yahoo News Feed Dataset
  - User-news item interaction
  - 20M users, 1.5 TB of text data

Plant Pictures Dataset
  - 2 TB of drone images
  - 800 GB of time-lapse images
  - 1 GB of sensor data


Datasets

MNIST database of handwritten digits (LeCun et al., 1998)
  - Training set: 60,000 examples
  - Test set: 10,000 examples
  - State-of-the-art classification error rate: 0.21% (Wan et al., 2013)
  - A minimal loading sketch follows

Image Source: (Deng, 2012)

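As a quick illustration, here is a minimal sketch of loading MNIST in Python, assuming the Keras library (with its built-in dataset downloader) is available; the shapes shown are the ones this loader documents.

# Minimal sketch: load MNIST, assuming Keras is installed.
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)   # (60000, 28, 28) grayscale training images
print(x_test.shape)    # (10000, 28, 28) grayscale test images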

Datasets

Canadian Institute for Advanced Research-10 Dataset (CIFAR-10) (Krizhevsky and Hinton, 2009)
  - 60,000 32x32 color images in 10 classes, 6,000 images per class
  - State-of-the-art classification error rate: 3.47% (Graham, 2014)


Artificial Neuron

Inputs x1, x2, x3 are weighted by w1, w2, w3 and summed, then passed through a sigmoid activation:

    v = w x,    a = ϕ(v) = 1 / (1 + e^{-v})

import numpy as np

x = np.array([[1], [2], [3]])   # input column vector
w = np.random.rand(1, 3)        # random weight row vector
v = w.dot(x)                    # weighted sum v = w x
a = 1 / (1 + np.exp(-v))        # sigmoid activation


Neural Networks: Representation

A two-layer network maps inputs x1, x2, x3 through a hidden layer of Σϕ units to outputs ŷ1, ŷ2 (biases are ignored for simplicity):

    v1 = W21 x,    a1 = ϕ(v1) = 1 / (1 + e^{-v1})
    v2 = W32 a1,   a2 = ϕ(v2) = 1 / (1 + e^{-v2})
    ŷ = a2

import numpy as np

x = np.array([[1], [2], [3]])
W21 = np.random.rand(5, 3)                      # input-to-hidden weights
W32 = np.random.rand(2, 5)                      # hidden-to-output weights
v1 = W21.dot(x); a1 = 1 / (1 + np.exp(-v1))     # hidden layer
v2 = W32.dot(a1); a2 = 1 / (1 + np.exp(-v2))    # output layer


Neural Networks: Prediction

Logistic Regression: a single neuron with sigmoid output models the class probability

    p(y = 1 | x) = 1 / (1 + e^{-w x})
    p(y = 0 | x) = 1 - p(y = 1 | x)
    y* = arg max_i p(y = i | x)

Softmax Regression: one summing unit per class, with the scores normalized into probabilities

    p(y = i | x) = e^{v_i} / Σ_{j=1}^{J} e^{v_j}
    y* = arg max_i p(y = i | x)

A numpy sketch of softmax prediction follows.

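For concreteness, here is a minimal numpy sketch of softmax prediction over J = 3 classes; the weight matrix W and the input are made-up example values, not taken from the slides.

import numpy as np

x = np.array([[1.0], [2.0], [3.0]])    # input column vector
W = np.random.rand(3, 3)               # one weight row per class (assumed example)
v = W.dot(x)                           # class scores v_i
p = np.exp(v) / np.sum(np.exp(v))      # softmax: p(y = i | x)
y_star = np.argmax(p)                  # predicted class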

Neural Networks: Learning

We will discuss learning the weights of a logistic regression classifier.

Given a training example (x, t), the neuron computes a = ϕ(w x) and predicts y = 1 when a > 0.5.

Find weights w so that, having seen x, the uncertainty in predicting t is as small as possible:

    max p(y = t | x) = Ber(y = t | ϕ(w x))
    p(y = t | x) = ϕ(w x)^t (1 - ϕ(w x))^{1-t}
    J = -t log(ϕ(w x)) - (1 - t) log(1 - ϕ(w x))
    w* = arg min_w J

A numpy sketch of one gradient step on this cross-entropy objective follows.

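A minimal numpy sketch of one gradient step on the objective above, assuming a single training example and a made-up learning rate; the gradient ∇J = (a - t) x follows from differentiating the cross-entropy loss for the sigmoid unit.

import numpy as np

x = np.array([1.0, 2.0, 3.0])        # training input
t = 1.0                              # target label
w = np.random.rand(3)                # initial weights
eta = 0.1                            # learning rate (assumed)

a = 1 / (1 + np.exp(-w.dot(x)))      # prediction a = phi(w x)
J = -t * np.log(a) - (1 - t) * np.log(1 - a)   # cross-entropy loss
grad = (a - t) * x                   # dJ/dw for sigmoid + cross-entropy
w = w - eta * grad                   # gradient descent update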

Optimization: Direct Method

First order derivative test:

    ∂J/∂w_k = 0  for k = 1 ... K,   i.e.,   ∇_w J = 0

Closed form solution
  - Not available for most cases
  - Expensive for large/sparse problems
  - Model specific, e.g., w* = (X^T X)^{-1} X^T y for linear regression

A numpy sketch of the linear regression case follows.

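As an illustration of the one model above with a closed form, here is a minimal numpy sketch of the linear regression normal equations on synthetic data; the data, true weights, and noise level are made up for the example.

import numpy as np

# Synthetic data: y = X w_true + noise (made-up example)
X = np.random.rand(100, 3)
w_true = np.array([1.0, -2.0, 0.5])
y = X.dot(w_true) + 0.01 * np.random.randn(100)

# Closed-form solution w* = (X^T X)^{-1} X^T y
w_star = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
print(w_star)   # close to w_true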

Optimization: Iterative Method

Gradient Descent
    Δw_n = η ∇J
    w_{n+1} = w_n - Δw_n

Stochastic Gradient Descent
  - ∇J is computed using a single training example

Batch Gradient Descent
  - ∇J is computed using the whole training set

Mini-batch Gradient Descent
  - ∇J is based on subsets of the training set

(Margin notes: single-layer logistic and softmax regression losses are convex (Rennie, 2005), but a single layer can't tell food and pets apart!)

Figure: one gradient descent step starting at w = -4.5 with ∇J = -2.25, η = 0.5, n = 0.

A numpy sketch of the mini-batch variant follows.
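A minimal numpy sketch of mini-batch gradient descent, reusing the logistic regression gradient from the learning slide; the data, labels, learning rate, and batch size are made-up values for illustration.

import numpy as np

X = np.random.rand(1000, 3)                  # made-up training inputs
t = (X.sum(axis=1) > 1.5).astype(float)      # made-up binary targets
w = np.zeros(3)
eta, batch_size = 0.1, 32                    # assumed hyper-parameters

for epoch in range(10):
    idx = np.random.permutation(len(X))      # shuffle, then walk over mini-batches
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        a = 1 / (1 + np.exp(-X[batch].dot(w)))             # predictions for the mini-batch
        grad = X[batch].T.dot(a - t[batch]) / len(batch)   # average gradient over the batch
        w -= eta * grad                                    # gradient descent update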

NN without a Hidden Layer

Linear boundary (demo created with playground.tensorflow.org)

Under-fitting a non-linear boundary (demo created with playground.tensorflow.org)

Multi-layer Neural Networks

Universal Approximation Theorem (Cybenko, 1989)
  - A hidden-layer neural network can represent any decision boundary given enough neurons and proper weights

Demo created with playground.tensorflow.org

How do we get proper weights? We don't have ∇_{w_h} J with respect to the hidden layer weights!

Backpropagation Algorithm
  - Output of the hidden layer: ϕ(W21 x)
  - Output of the final layer: ϕ(W32 ϕ(W21 x))
  - Cost function: J(t, ϕ(W32 ϕ(W21 x)))
  - Backpropagation uses the chain rule to find the derivative of J with respect to any weight w in the network (a numpy sketch follows)

The objective function J is no longer guaranteed to be convex or concave.

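A minimal numpy sketch of backpropagation for the two-layer sigmoid network above, assuming a squared-error cost and made-up sizes; it illustrates the chain-rule bookkeeping rather than an optimized implementation.

import numpy as np

def phi(v):                        # sigmoid activation
    return 1 / (1 + np.exp(-v))

x = np.random.rand(3, 1)           # input (made up)
t = np.array([[1.0], [0.0]])       # target (made up)
W21 = np.random.rand(5, 3)         # input-to-hidden weights
W32 = np.random.rand(2, 5)         # hidden-to-output weights
eta = 0.5                          # learning rate (assumed)

# Forward pass
a1 = phi(W21.dot(x))
a2 = phi(W32.dot(a1))

# Backward pass (chain rule), using the squared-error cost J = 0.5 * ||a2 - t||^2
delta2 = (a2 - t) * a2 * (1 - a2)              # dJ/dv2 at the output layer
delta1 = W32.T.dot(delta2) * a1 * (1 - a1)     # dJ/dv1, propagated back through W32
grad_W32 = delta2.dot(a1.T)
grad_W21 = delta1.dot(x.T)

# Gradient descent updates
W32 -= eta * grad_W32
W21 -= eta * grad_W21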

Gradient Descent: Low Learning Rate

Figure: gradient descent from w = -3.3 with ∇J = -18.54518, n = 0, η = 0.005

Gradient Descent: High Learning Rate

Figure: gradient descent from w = -3.3 with ∇J = -18.54518, n = 0, η = 0.1

Stochastic Gradient Descent Variants

Momentum
    Δw_n = η ∇J + m Δw_{n-1}
    w_{n+1} = w_n - Δw_n

Adaptive learning rates (Senior et al., 2013)
  - Exponential adaptation: η_0 × 10^{-n/τ}
  - AdaGrad
  - AdaDelta
  - RMSProp
  - Adam

A numpy sketch of the momentum update follows.

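A minimal numpy sketch of the momentum update above, with a made-up quadratic objective so the gradient has a closed form; the learning rate η and momentum coefficient m are assumed values.

import numpy as np

w = -4.5                    # initial weight (example value from the earlier slide)
eta, m = 0.1, 0.9           # learning rate and momentum coefficient (assumed)
delta_w = 0.0               # running update term

for n in range(100):
    grad = 2 * (w - 1.0)                  # gradient of the made-up objective J(w) = (w - 1)^2
    delta_w = eta * grad + m * delta_w    # momentum update: dw_n = eta * grad + m * dw_{n-1}
    w = w - delta_w
print(w)                    # converges toward the minimum at w = 1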

Stochastic Gradient Descent Variants

Image Source: Alec Radford http://imgur.com/a/Hqolp


Impact of ML Research

“... we have implemented similar gradient update rules adapted to our clusters and they successfully improved our baselines. The resulted models have been or will be launched online and one fifth of the world’s population would benefit from these improved models.” — A Reviewer of Deep learning with Elastic Averaging SGD



Deep Learning

To learn complicated functions one may need deep architectures, e.g., neural networks with a large number of hidden layers.

Successive layers learn increasingly abstract features: edge detectors, then shape detectors, then abstract features, then recognition.

Adapted from (Dean, 2016)


Deep Learning Architectures

Deep learning spans several architectures and settings: convolutional nets, unsupervised learning, recursive/recurrent nets, and reinforcement learning.


Unsupervised Learning

A good first reference is (Bengio, 2009).

Problems with deep architectures
  - We don't have enough labeled data
  - Gradient vanishing problem: the gradient with respect to the early weights,

        ∇_{W21} J(t, ϕ(W54 ϕ(W43 ϕ(W32 ϕ(W21 x))))),

    passes through many layers and shrinks, so the update Δw_n = η ∇J, w_{n+1} = w_n - Δw_n barely changes the early layers

Unsupervised learning
  - Utilizes unlabeled data
  - Uses layerwise pre-training


Unsupervised Pre-training

There is more information available in the form of data than in the form of labels.

Learn a good initialization of the weights using the unlabeled data, then fine-tune the weights with labeled data.

Two popular models used for pre-initialization of weights are the autoencoder and the restricted Boltzmann machine (RBM). A Keras sketch of a small autoencoder follows.

Figure: an autoencoder (input layer x1..x6, hidden layer, output layer reconstructing x̂1..x̂6) and an RBM (visible layer x1..x6, hidden layer h1..h4).

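As an illustration, here is a minimal sketch of a one-hidden-layer autoencoder trained on unlabeled vectors, assuming Keras 2 is available; the layer sizes, optimizer, and data are assumptions for the example, not the presenter's setup.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(1000, 6)       # unlabeled 6-dimensional data (made up)

autoencoder = Sequential([
    Dense(3, activation='sigmoid', input_dim=6),   # encoder: compress to 3 hidden features
    Dense(6, activation='sigmoid'),                # decoder: reconstruct the input
])
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X, X, epochs=10, batch_size=32)    # the target is the input itself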

Layer-wise Pre-training

Training a deep network: input layer (x1 ... x8), three hidden layers, and an output layer (ŷ1, ŷ2), with weights W21, W32, W43, W54.

1  Train an autoencoder on the raw input pixels to learn hidden features α (reconstructing the input).
2  Train an autoencoder on the features α to learn hidden features β (reconstructing α).
3  Train an autoencoder on the features β to learn hidden features γ (reconstructing β).
4  Train the output layer on the features γ to predict the labels ŷ1, ŷ2.
5  Fine-tune the whole network with labeled samples.


Variants of Autoencoders (Bengio et al., 2013)

  - Regularized Autoencoders
  - Sparse Autoencoders
  - Stacked Denoising Autoencoders
  - Contractive Autoencoders


Convolutional Neural Networks (Karpathy, 2016)

  - Developed for solving problems in computer vision
  - Inspired by the animal visual cortex
  - Parameter sharing makes them easier to train
  - Input size 64x64x3; let's say we train a hidden layer with 512 neurons
  - How many parameters do we need to train in a fully connected layer? 64 x 64 x 3 x 512 ≈ 6 million
  - If we use 10 kernels of size 5x5x3, we have 750 parameters to learn


Convolutional Neural Networks

A typical pipeline: Input Layer → Conv Layer → Pooling Layer → Fully Connected Layer → Softmax / SVM / ...


Convolutional Neural Networks

  - Assumes local connectivity
  - Generates activation maps using the convolution operation instead of dot products

1D convolution:

    (w ∗ x)_n = Σ_m w_m x_{n-m}

Activation map:

    activ(n) = ReLU((w ∗ x)_n),   where ReLU(z) = max(0, z)     (1)

A numpy sketch follows.

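A minimal numpy sketch of the 1D convolution and ReLU above; the signal and kernel values are made up for the example.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])   # input signal (made up)
w = np.array([1.0, 0.0, -1.0])                      # kernel (made up)

conv = np.convolve(x, w, mode='valid')              # (w * x)_n = sum_m w_m x_{n-m}
activ = np.maximum(0, conv)                         # ReLU activation map
print(activ)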

Convolutional Neural Networks

Raw pixels are turned into activation maps via ReLU((w ∗ x)), pooling keeps max(activ) to produce features, and these conv/pool stages are repeated to build deeper features.

GoogLeNet

Figure: GoogLeNet network with all the bells and whistles (Szegedy et al., 2015): stacked Inception modules of 1x1, 3x3, and 5x5 convolutions with max/average pooling, depth concatenation, and auxiliary softmax classifiers.


Residual Neural Networks

A residual block feeds the input x through weight layers (with a ReLU in between) to compute F(x), then adds the identity shortcut so the block outputs F(x) + x, followed by a ReLU. A Keras sketch of one residual block follows.

Figure: Residual Neural Networks (He et al., 2015). The figure contrasts a 34-layer plain network with a 34-layer residual network built from stacked 3x3 convolutions, and shows training-error curves for plain nets (20 to 56 layers) and ResNets (20 to 110 layers).

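A minimal sketch of a single residual block using Keras 2's functional API, assuming Keras is available; the filter count and input shape are made-up example values, not the architecture from the paper.

from keras.layers import Input, Conv2D, Activation, Add
from keras.models import Model

inputs = Input(shape=(32, 32, 64))                  # example feature map (assumed shape)
x = Conv2D(64, (3, 3), padding='same')(inputs)      # first weight layer
x = Activation('relu')(x)
x = Conv2D(64, (3, 3), padding='same')(x)           # second weight layer: F(x)
x = Add()([x, inputs])                              # identity shortcut: F(x) + x
outputs = Activation('relu')(x)

block = Model(inputs, outputs)
block.summary()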

Recurrent Neural Networks (Lipton et al., 2015)

  - Modeling sequential and variable-length data, e.g., what is happening in a video?
  - Extend feedforward neural networks with recurrent edges
  - Training is performed using Back-propagation Through Time (BPTT)


RNN Unfolding

The recurrent network is unrolled in time: inputs x_{t0}, x_{t1}, x_{t2}, x_{t3} produce outputs y_{t0}, y_{t1}, y_{t2}, y_{t3}, with the hidden state carried forward from each step to the next. A numpy sketch of an unrolled forward pass follows.

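A minimal numpy sketch of the unrolled forward pass of a simple (Elman-style) RNN; the tanh nonlinearity, dimensions, and random weights are assumptions for illustration.

import numpy as np

T, input_dim, hidden_dim, output_dim = 4, 3, 5, 2
xs = [np.random.rand(input_dim, 1) for _ in range(T)]   # inputs x_t0 .. x_t3

W_xh = np.random.rand(hidden_dim, input_dim)    # input-to-hidden weights
W_hh = np.random.rand(hidden_dim, hidden_dim)   # recurrent hidden-to-hidden weights
W_hy = np.random.rand(output_dim, hidden_dim)   # hidden-to-output weights

h = np.zeros((hidden_dim, 1))                   # initial hidden state
ys = []
for x in xs:                                    # unroll over time
    h = np.tanh(W_xh.dot(x) + W_hh.dot(h))      # the hidden state carries information forward
    ys.append(W_hy.dot(h))                      # output y_t at each step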

Long Short Term Memory (LSTM)

Plain RNNs cannot model long-term dependencies; LSTMs address this with gated memory cells.

Source: Christopher Olah, http://colah.github.io


Overfitting

Demo created with playground.tensorflow.org


Overfitting: Do we have enough data?

  - Consider we are classifying binary images of size 10 × 10
  - We have a dataset containing 1 million labeled images
  - Number of possible images? 2^100
  - Fraction of possible images for which we have labels?

        10^6 / 2^100 ≈ 10^6 / 10^30 ∼ 10^{-24}

    i.e., roughly 0.000,000,000,000,000,000,000,1% of the possible images are labeled


Overfitting

Overcoming over-fitting

Get more data
  - Use crowd-sourcing, e.g., Amazon MTurk
  - Use data augmentation

Don't train too much
  - Stop when the validation error starts ascending

Use some form of regularization
  - Penalize weights, e.g., add |W|^2 to J
  - Damage neurons in innovative ways, e.g., DropOut (Srivastava et al., 2014), DropConnect (Wan et al., 2013), ShakeOut (Kang et al., 2016), etc. (a DropOut sketch follows)
  - Induce noise into your model

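A minimal numpy sketch of (inverted) DropOut applied to a hidden activation vector at training time; the keep probability and activations are made-up example values, and this illustrates the masking idea rather than any particular library's implementation.

import numpy as np

a1 = np.random.rand(5, 1)        # hidden-layer activations (example)
keep_prob = 0.8                  # probability of keeping a neuron (assumed)

mask = (np.random.rand(*a1.shape) < keep_prob)   # randomly "damage" neurons
a1_dropped = a1 * mask / keep_prob               # scale so the expected activation is unchanged

# At test time no neurons are dropped and no scaling is needed.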


Implementations

Image Source: http://imgur.com/ZfkhOt4


TensorFlow

TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms (Abadi et al., 2015).

Represents computations as graphs
  - Nodes in the graph are called ops
  - Edges are tensors

Represents data as tensors
  - A tensor is an n-dimensional array with a rank, shape and type
  - For example [batch, height, width, channels]

Executes graphs in the context of sessions
  - A session places the graph ops onto devices such as CPUs/GPUs etc.

Maintains state with variables
  - The parameters of a statistical model are typically represented as a set of variables

Uses feeds and fetches to get data into and out of arbitrary operations


TensorFlow Example I

import tensorflow as tf

# Define two constants
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.], [2.]])

# Define a matmul operation
product = tf.matmul(matrix1, matrix2)

# Launch the default graph
sess = tf.Session()

# Run the matmul operation
result = sess.run(product)
print(result)
# [[ 12.]]

# Close the Session
sess.close()


TensorFlow Example II

state = tf.Variable(0.0, name="counter")
inc = tf.placeholder(tf.float32)
new_value = tf.add(state, inc)
update = tf.assign(state, new_value)
init_op = tf.initialize_all_variables()

# Launch the graph
with tf.Session() as sess:
    with tf.device("/gpu:0"):
        # Run the init op
        sess.run(init_op)
        # Run the op that updates state
        for _ in range(3):
            sess.run([update], feed_dict={inc: 0.5})
            print(sess.run(state))
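Example II updates state through an explicit assign op; in practice the update ops usually come from an optimizer. The following sketch is not from the slides: it assumes the same TensorFlow 1.x API as above and uses synthetic data only so that it runs on its own, showing how placeholders, variables, a loss op and a session combine into a training loop.

import numpy as np
import tensorflow as tf

# Synthetic data for y = 2x + 1 (illustration only)
X_data = np.random.rand(100, 1).astype(np.float32)
y_data = (2.0 * X_data + 1.0).astype(np.float32)

x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])

# Model parameters are variables, so the optimizer can update them
w = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))

y_hat = tf.matmul(x, w) + b
loss = tf.reduce_mean(tf.square(y_hat - y))

# minimize() adds the gradient and parameter-update ops to the graph
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for _ in range(200):
        sess.run(train_op, feed_dict={x: X_data, y: y_data})
    print(sess.run([w, b]))   # should approach [[2.0]] and [1.0]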


Keras Example Feedforward Network

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(64, input_dim=20, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(10, init='uniform'))
model.add(Activation('softmax'))
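A side note on the layer definitions above (an equivalent sketch, not shown on the slides, assuming the same Keras 1.x Sequential API): each Activation layer can also be folded into its Dense layer through the activation argument, which is a common, more compact spelling of the same network.

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(64, input_dim=20, init='uniform', activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform', activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(10, init='uniform', activation='softmax'))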


Keras Example Feedforward Network

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

model.fit(X_train, y_train, nb_epoch=20, batch_size=16)
score = model.evaluate(X_test, y_test, batch_size=16)
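The fit and evaluate calls above assume that X_train, y_train, X_test and y_test already exist. The sketch below is not from the slides: it generates random data with matching shapes (20 inputs, 10 one-hot classes) purely so the snippet can run, assumes the model built and compiled on the previous slides, and uses the same Keras 1.x API.

import numpy as np
from keras.utils.np_utils import to_categorical

# Random data with 20 features and 10 one-hot encoded classes (illustration only)
X_train = np.random.random((1000, 20))
y_train = to_categorical(np.random.randint(10, size=1000), nb_classes=10)
X_test = np.random.random((100, 20))
y_test = to_categorical(np.random.randint(10, size=100), nb_classes=10)

# After model.fit(...) and model.evaluate(...) as above, the trained model
# can be used for inference
probs = model.predict(X_test, batch_size=16)             # class probabilities
classes = model.predict_classes(X_test, batch_size=16)   # hard class labels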


Hardware
NVIDIA GTX TITAN X
CUDA Cores: 3072
Performance: 11 teraflops
Core Clock: 1000 MHz
Price: ≈ $1000

source: http://www.geforce.com


Hardware
NVIDIA DGX-1
CUDA Cores: 8 x 3584
Performance: 170 teraflops
Price: ≈ $130,000

source: http://www.geforce.com


Hardware

Department of CS: Skorpio
University of Saskatchewan HPC: Plato, Zeno, Meton


Want to learn more?

The Deep Learning textbook by Goodfellow, Bengio, and Courville is available free of cost at: deeplearningbook.org


Questions?

Acknowledgment

Dr. Kevin Stanley

Dr. Ian Stavness

Dr. Jung Lee

Dr. Jawad Shah


Citation

@unpublished{najeeb2016dnn,
  Author      = {Khan, Najeeb},
  Institution = {University of Saskatchewan},
  Year        = {2016},
  Title       = {Deep Neural Networks: Introduction, Architectures and Implementations},
  URL         = {usask.ca/~najeeb.khan/docs/dnn2016.pdf}
}


References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.

Dean, J. (2016). Large-scale deep learning for intelligent computer systems. Web Search and Data Mining.

Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142.

Domingos, P. (2015). The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books.

Graham, B. (2014). Fractional max-pooling. arXiv preprint arXiv:1412.6071.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.


References (cont.)

Kang, G., Li, J., and Tao, D. (2016). Shakeout: A new regularized deep neural network training scheme. In Thirtieth AAAI Conference on Artificial Intelligence.

Karpathy, A. (2016). CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.stanford.edu/ [Accessed: October 20, 2016].

Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.

LeCun, Y., Cortes, C., and Burges, C. J. (1998). The MNIST database of handwritten digits.

Lipton, Z. C., Berkowitz, J., and Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019.

Rennie, J. D. (2005). Regularized logistic regression is strictly convex. Unpublished manuscript. URL: people.csail.mit.edu/jrennie/writing/convexLR.pdf.

Senior, A., Heigold, G., Yang, K., et al. (2013). An empirical study of learning rates in deep neural networks for speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6724–6728. IEEE.

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9.

Wan, L., Zeiler, M., Zhang, S., Cun, Y. L., and Fergus, R. (2013). Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066.
