Deep Neural Networks: Introduction, Architectures and Implementations

Najeeb Khan [email protected] usask.ca/~najeeb.khan

November 29, 2016

Outline

1  Introduction: Hard Problems, Datasets, Neural Networks
2  Architectures: Unsupervised Learning, Convolutional Neural Networks, Recurrent Neural Networks, Overfitting
3  Implementations: TensorFlow, Keras, Hardware


This Presentation

© Image Source: Ralph A. Clevenger


Machine Learning Tribes

The five tribes of machine learning: Analogizers, Bayesians, Symbolists, Evolutionaries, and Connectionists.

Graphic inspired by the first few chapters of The Master Algorithm (Domingos, 2015)


Traditional Machine Learning

Image Source: http://scikit-learn.org


Hard Problems: Animal or Food?

Image Source: Karen Zack, twitter.com/teenybiscuit


Hard Problems: Autonomous Formula One?

Image Source: RoboRace: http://roborace.com


Hard Problems: Should I?

Image Source: OpenReview: ICLR 2017


Datasets

YouTube-8M Dataset
  - 8 million videos, 56 years in duration
  - Available for download: research.google.com/youtube8m

Yahoo News Feed Dataset
  - User-news item interaction
  - 20M users, 1.5 TB of text data

Plant Pictures Dataset
  - 2 TB of drone images
  - 800 GB of time-lapse images
  - 1 GB of sensor data


Datasets

MNIST database of handwritten digits (LeCun et al., 1998)
  - Training set: 60,000 examples
  - Test set: 10,000 examples
  - State-of-the-art classification error rate: 0.21% (Wan et al., 2013)
  - A minimal loading sketch follows

Image Source: (Deng, 2012)

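As a quick illustration, here is a minimal sketch of loading MNIST in Python, assuming the Keras library (with its built-in dataset downloader) is available; the shapes shown are the ones this loader documents.

# Minimal sketch: load MNIST, assuming Keras is installed.
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)   # (60000, 28, 28) grayscale training images
print(x_test.shape)    # (10000, 28, 28) grayscale test images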

Datasets

Canadian Institute for Advanced Research-10 Dataset (CIFAR-10) (Krizhevsky and Hinton, 2009)
  - 60,000 32x32 color images in 10 classes, 6,000 images per class
  - State-of-the-art classification error rate: 3.47% (Graham, 2014)


Artificial Neuron

Inputs x1, x2, x3 are weighted by w1, w2, w3 and summed, then passed through a sigmoid activation:

    v = w x,    a = ϕ(v) = 1 / (1 + e^{-v})

import numpy as np

x = np.array([[1], [2], [3]])   # input column vector
w = np.random.rand(1, 3)        # random weight row vector
v = w.dot(x)                    # weighted sum v = w x
a = 1 / (1 + np.exp(-v))        # sigmoid activation


Neural Networks: Representation

A two-layer network maps inputs x1, x2, x3 through a hidden layer of Σϕ units to outputs ŷ1, ŷ2 (biases are ignored for simplicity):

    v1 = W21 x,    a1 = ϕ(v1) = 1 / (1 + e^{-v1})
    v2 = W32 a1,   a2 = ϕ(v2) = 1 / (1 + e^{-v2})
    ŷ = a2

import numpy as np

x = np.array([[1], [2], [3]])
W21 = np.random.rand(5, 3)                      # input-to-hidden weights
W32 = np.random.rand(2, 5)                      # hidden-to-output weights
v1 = W21.dot(x); a1 = 1 / (1 + np.exp(-v1))     # hidden layer
v2 = W32.dot(a1); a2 = 1 / (1 + np.exp(-v2))    # output layer


Neural Networks: Prediction

Logistic Regression: a single neuron with sigmoid output models the class probability

    p(y = 1 | x) = 1 / (1 + e^{-w x})
    p(y = 0 | x) = 1 - p(y = 1 | x)
    y* = arg max_i p(y = i | x)

Softmax Regression: one summing unit per class, with the scores normalized into probabilities

    p(y = i | x) = e^{v_i} / Σ_{j=1}^{J} e^{v_j}
    y* = arg max_i p(y = i | x)

A numpy sketch of softmax prediction follows.

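For concreteness, here is a minimal numpy sketch of softmax prediction over J = 3 classes; the weight matrix W and the input are made-up example values, not taken from the slides.

import numpy as np

x = np.array([[1.0], [2.0], [3.0]])    # input column vector
W = np.random.rand(3, 3)               # one weight row per class (assumed example)
v = W.dot(x)                           # class scores v_i
p = np.exp(v) / np.sum(np.exp(v))      # softmax: p(y = i | x)
y_star = np.argmax(p)                  # predicted class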

Neural Networks: Learning

We will discuss learning the weights of a logistic regression classifier.

Given a training example (x, t), the neuron computes a = ϕ(w x) and predicts y = 1 when a > 0.5.

Find weights w so that, having seen x, the uncertainty in predicting t is as small as possible:

    max p(y = t | x) = Ber(y = t | ϕ(w x))
    p(y = t | x) = ϕ(w x)^t (1 - ϕ(w x))^{1-t}
    J = -t log(ϕ(w x)) - (1 - t) log(1 - ϕ(w x))
    w* = arg min_w J

A numpy sketch of one gradient step on this cross-entropy objective follows.

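A minimal numpy sketch of one gradient step on the objective above, assuming a single training example and a made-up learning rate; the gradient ∇J = (a - t) x follows from differentiating the cross-entropy loss for the sigmoid unit.

import numpy as np

x = np.array([1.0, 2.0, 3.0])        # training input
t = 1.0                              # target label
w = np.random.rand(3)                # initial weights
eta = 0.1                            # learning rate (assumed)

a = 1 / (1 + np.exp(-w.dot(x)))      # prediction a = phi(w x)
J = -t * np.log(a) - (1 - t) * np.log(1 - a)   # cross-entropy loss
grad = (a - t) * x                   # dJ/dw for sigmoid + cross-entropy
w = w - eta * grad                   # gradient descent update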

Optimization: Direct Method

First order derivative test:

    ∂J/∂w_k = 0  for k = 1 ... K,   i.e.,   ∇_w J = 0

Closed form solution
  - Not available for most cases
  - Expensive for large/sparse problems
  - Model specific, e.g., w* = (X^T X)^{-1} X^T y for linear regression

A numpy sketch of the linear regression case follows.

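As an illustration of the one model above with a closed form, here is a minimal numpy sketch of the linear regression normal equations on synthetic data; the data, true weights, and noise level are made up for the example.

import numpy as np

# Synthetic data: y = X w_true + noise (made-up example)
X = np.random.rand(100, 3)
w_true = np.array([1.0, -2.0, 0.5])
y = X.dot(w_true) + 0.01 * np.random.randn(100)

# Closed-form solution w* = (X^T X)^{-1} X^T y
w_star = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
print(w_star)   # close to w_true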

Optimization: Iterative Method

Gradient Descent
    Δw_n = η ∇J
    w_{n+1} = w_n - Δw_n

Stochastic Gradient Descent
  - ∇J is computed using a single training example

Batch Gradient Descent
  - ∇J is computed using the whole training set

Mini-batch Gradient Descent
  - ∇J is based on subsets of the training set

(Margin notes: single-layer logistic and softmax regression losses are convex (Rennie, 2005), but a single layer can't tell food and pets apart!)

Figure: one gradient descent step starting at w = -4.5 with ∇J = -2.25, η = 0.5, n = 0.

A numpy sketch of the mini-batch variant follows.
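A minimal numpy sketch of mini-batch gradient descent, reusing the logistic regression gradient from the learning slide; the data, labels, learning rate, and batch size are made-up values for illustration.

import numpy as np

X = np.random.rand(1000, 3)                  # made-up training inputs
t = (X.sum(axis=1) > 1.5).astype(float)      # made-up binary targets
w = np.zeros(3)
eta, batch_size = 0.1, 32                    # assumed hyper-parameters

for epoch in range(10):
    idx = np.random.permutation(len(X))      # shuffle, then walk over mini-batches
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        a = 1 / (1 + np.exp(-X[batch].dot(w)))             # predictions for the mini-batch
        grad = X[batch].T.dot(a - t[batch]) / len(batch)   # average gradient over the batch
        w -= eta * grad                                    # gradient descent update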

NN without a Hidden Layer

Linear boundary (demo created with playground.tensorflow.org)

Under-fitting a non-linear boundary (demo created with playground.tensorflow.org)

Multi-layer Neural Networks

Universal Approximation Theorem (Cybenko, 1989)
  - A hidden-layer neural network can represent any decision boundary given enough neurons and proper weights

Demo created with playground.tensorflow.org

How do we get proper weights? We don't have ∇_{w_h} J with respect to the hidden layer weights!

Backpropagation Algorithm
  - Output of the hidden layer: ϕ(W21 x)
  - Output of the final layer: ϕ(W32 ϕ(W21 x))
  - Cost function: J(t, ϕ(W32 ϕ(W21 x)))
  - Backpropagation uses the chain rule to find the derivative of J with respect to any weight w in the network (a numpy sketch follows)

The objective function J is no longer guaranteed to be convex or concave.

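A minimal numpy sketch of backpropagation for the two-layer sigmoid network above, assuming a squared-error cost and made-up sizes; it illustrates the chain-rule bookkeeping rather than an optimized implementation.

import numpy as np

def phi(v):                        # sigmoid activation
    return 1 / (1 + np.exp(-v))

x = np.random.rand(3, 1)           # input (made up)
t = np.array([[1.0], [0.0]])       # target (made up)
W21 = np.random.rand(5, 3)         # input-to-hidden weights
W32 = np.random.rand(2, 5)         # hidden-to-output weights
eta = 0.5                          # learning rate (assumed)

# Forward pass
a1 = phi(W21.dot(x))
a2 = phi(W32.dot(a1))

# Backward pass (chain rule), using the squared-error cost J = 0.5 * ||a2 - t||^2
delta2 = (a2 - t) * a2 * (1 - a2)              # dJ/dv2 at the output layer
delta1 = W32.T.dot(delta2) * a1 * (1 - a1)     # dJ/dv1, propagated back through W32
grad_W32 = delta2.dot(a1.T)
grad_W21 = delta1.dot(x.T)

# Gradient descent updates
W32 -= eta * grad_W32
W21 -= eta * grad_W21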

Gradient Descent: Low Learning Rate

Figure: gradient descent from w = -3.3 with ∇J = -18.54518, n = 0, η = 0.005

Gradient Descent: High Learning Rate

Figure: gradient descent from w = -3.3 with ∇J = -18.54518, n = 0, η = 0.1

Stochastic Gradient Descent Variants

Momentum
    Δw_n = η ∇J + m Δw_{n-1}
    w_{n+1} = w_n - Δw_n

Adaptive learning rates (Senior et al., 2013)
  - Exponential adaptation: η_0 × 10^{-n/τ}
  - AdaGrad
  - AdaDelta
  - RMSProp
  - Adam

A numpy sketch of the momentum update follows.

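A minimal numpy sketch of the momentum update above, with a made-up quadratic objective so the gradient has a closed form; the learning rate η and momentum coefficient m are assumed values.

import numpy as np

w = -4.5                    # initial weight (example value from the earlier slide)
eta, m = 0.1, 0.9           # learning rate and momentum coefficient (assumed)
delta_w = 0.0               # running update term

for n in range(100):
    grad = 2 * (w - 1.0)                  # gradient of the made-up objective J(w) = (w - 1)^2
    delta_w = eta * grad + m * delta_w    # momentum update: dw_n = eta * grad + m * dw_{n-1}
    w = w - delta_w
print(w)                    # converges toward the minimum at w = 1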

Stochastic Gradient Descent Variants

Image Source: Alec Radford http://imgur.com/a/Hqolp


Impact of ML Research

“... we have implemented similar gradient update rules adapted to our clusters and they successfully improved our baselines. The resulted models have been or will be launched online and one fifth of the world’s population would benefit from these improved models.” — A Reviewer of Deep learning with Elastic Averaging SGD



Deep Learning

To learn complicated functions one may need deep architectures, e.g., neural networks with a large number of hidden layers.

Successive layers learn increasingly abstract features: edge detectors, then shape detectors, then abstract features, then recognition.

Adapted from (Dean, 2016)


Deep Learning Architectures

Deep learning spans several architectures and settings: convolutional nets, unsupervised learning, recursive/recurrent nets, and reinforcement learning.


Unsupervised Learning

A good first reference is (Bengio, 2009).

Problems with deep architectures
  - We don't have enough labeled data
  - Gradient vanishing problem: the gradient with respect to the early weights,

        ∇_{W21} J(t, ϕ(W54 ϕ(W43 ϕ(W32 ϕ(W21 x))))),

    passes through many layers and shrinks, so the update Δw_n = η ∇J, w_{n+1} = w_n - Δw_n barely changes the early layers

Unsupervised learning
  - Utilizes unlabeled data
  - Uses layerwise pre-training


Unsupervised Pre-training

There is more information available in the form of data than in the form of labels.

Learn a good initialization of the weights using the unlabeled data, then fine-tune the weights with labeled data.

Two popular models used for pre-initialization of weights are the autoencoder and the restricted Boltzmann machine (RBM). A Keras sketch of a small autoencoder follows.

Figure: an autoencoder (input layer x1..x6, hidden layer, output layer reconstructing x̂1..x̂6) and an RBM (visible layer x1..x6, hidden layer h1..h4).

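As an illustration, here is a minimal sketch of a one-hidden-layer autoencoder trained on unlabeled vectors, assuming Keras 2 is available; the layer sizes, optimizer, and data are assumptions for the example, not the presenter's setup.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(1000, 6)       # unlabeled 6-dimensional data (made up)

autoencoder = Sequential([
    Dense(3, activation='sigmoid', input_dim=6),   # encoder: compress to 3 hidden features
    Dense(6, activation='sigmoid'),                # decoder: reconstruct the input
])
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X, X, epochs=10, batch_size=32)    # the target is the input itself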

Layer-wise Pre-training

Training a deep network: input layer (x1 ... x8), three hidden layers, and an output layer (ŷ1, ŷ2), with weights W21, W32, W43, W54.

1  Train an autoencoder on the raw input pixels to learn hidden features α (reconstructing the input).
2  Train an autoencoder on the features α to learn hidden features β (reconstructing α).
3  Train an autoencoder on the features β to learn hidden features γ (reconstructing β).
4  Train the output layer on the features γ to predict the labels ŷ1, ŷ2.
5  Fine-tune the whole network with labeled samples.


Variants of Autoencoders (Bengio et al., 2013)

  - Regularized Autoencoders
  - Sparse Autoencoders
  - Stacked Denoising Autoencoders
  - Contractive Autoencoders


Convolutional Neural Networks (Karpathy, 2016)

  - Developed for solving problems in computer vision
  - Inspired by the animal visual cortex
  - Parameter sharing makes them easier to train
  - Input size 64x64x3; let's say we train a hidden layer with 512 neurons
  - How many parameters do we need to train in a fully connected layer? 64 x 64 x 3 x 512 ≈ 6 million
  - If we use 10 kernels of size 5x5x3, we have 750 parameters to learn


Convolutional Neural Networks

A typical pipeline: Input Layer → Conv Layer → Pooling Layer → Fully Connected Layer → Softmax / SVM / ...


Convolutional Neural Networks

  - Assumes local connectivity
  - Generates activation maps using the convolution operation instead of dot products

1D convolution:

    (w ∗ x)_n = Σ_m w_m x_{n-m}

Activation map:

    activ(n) = ReLU((w ∗ x)_n),   where ReLU(z) = max(0, z)     (1)

A numpy sketch follows.

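A minimal numpy sketch of the 1D convolution and ReLU above; the signal and kernel values are made up for the example.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])   # input signal (made up)
w = np.array([1.0, 0.0, -1.0])                      # kernel (made up)

conv = np.convolve(x, w, mode='valid')              # (w * x)_n = sum_m w_m x_{n-m}
activ = np.maximum(0, conv)                         # ReLU activation map
print(activ)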

Convolutional Neural Networks

Raw pixels are turned into activation maps via ReLU((w ∗ x)), pooling keeps max(activ) to produce features, and these conv/pool stages are repeated to build deeper features.

GoogLeNet

Figure: GoogLeNet network with all the bells and whistles (Szegedy et al., 2015): stacked Inception modules of 1x1, 3x3, and 5x5 convolutions with max/average pooling, depth concatenation, and auxiliary softmax classifiers.


Residual Neural Networks

A residual block feeds the input x through weight layers (with a ReLU in between) to compute F(x), then adds the identity shortcut so the block outputs F(x) + x, followed by a ReLU. A Keras sketch of one residual block follows.

Figure: Residual Neural Networks (He et al., 2015). The figure contrasts a 34-layer plain network with a 34-layer residual network built from stacked 3x3 convolutions, and shows training-error curves for plain nets (20 to 56 layers) and ResNets (20 to 110 layers).

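A minimal sketch of a single residual block using Keras 2's functional API, assuming Keras is available; the filter count and input shape are made-up example values, not the architecture from the paper.

from keras.layers import Input, Conv2D, Activation, Add
from keras.models import Model

inputs = Input(shape=(32, 32, 64))                  # example feature map (assumed shape)
x = Conv2D(64, (3, 3), padding='same')(inputs)      # first weight layer
x = Activation('relu')(x)
x = Conv2D(64, (3, 3), padding='same')(x)           # second weight layer: F(x)
x = Add()([x, inputs])                              # identity shortcut: F(x) + x
outputs = Activation('relu')(x)

block = Model(inputs, outputs)
block.summary()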

Recurrent Neural Networks (Lipton et al., 2015)

  - Modeling sequential and variable-length data, e.g., what is happening in a video?
  - Extend feedforward neural networks with recurrent edges
  - Training is performed using Back-propagation Through Time (BPTT)


RNN Unfolding

The recurrent network is unrolled in time: inputs x_{t0}, x_{t1}, x_{t2}, x_{t3} produce outputs y_{t0}, y_{t1}, y_{t2}, y_{t3}, with the hidden state carried forward from each step to the next. A numpy sketch of an unrolled forward pass follows.

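A minimal numpy sketch of the unrolled forward pass of a simple (Elman-style) RNN; the tanh nonlinearity, dimensions, and random weights are assumptions for illustration.

import numpy as np

T, input_dim, hidden_dim, output_dim = 4, 3, 5, 2
xs = [np.random.rand(input_dim, 1) for _ in range(T)]   # inputs x_t0 .. x_t3

W_xh = np.random.rand(hidden_dim, input_dim)    # input-to-hidden weights
W_hh = np.random.rand(hidden_dim, hidden_dim)   # recurrent hidden-to-hidden weights
W_hy = np.random.rand(output_dim, hidden_dim)   # hidden-to-output weights

h = np.zeros((hidden_dim, 1))                   # initial hidden state
ys = []
for x in xs:                                    # unroll over time
    h = np.tanh(W_xh.dot(x) + W_hh.dot(h))      # the hidden state carries information forward
    ys.append(W_hy.dot(h))                      # output y_t at each step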

Long Short Term Memory (LSTM)

Plain RNNs cannot model long-term dependencies; LSTMs address this with gated memory cells.

Source: Christopher Olah, http://colah.github.io


Overfitting

Demo created with playground.tensorflow.org


Overfitting: Do we have enough data?

  - Consider we are classifying binary images of size 10 × 10
  - We have a dataset containing 1 million labeled images
  - Number of possible images? 2^100
  - Fraction of possible images for which we have labels?

        10^6 / 2^100 ≈ 10^6 / 10^30 ∼ 10^{-24}

    i.e., roughly 0.000,000,000,000,000,000,000,1% of the possible images are labeled


Overfitting

Overcoming over-fitting

Get more data
  - Use crowd-sourcing, e.g., Amazon MTurk
  - Use data augmentation

Don't train too much
  - Stop when the validation error starts ascending

Use some form of regularization
  - Penalize weights, e.g., add |W|^2 to J
  - Damage neurons in innovative ways, e.g., DropOut (Srivastava et al., 2014), DropConnect (Wan et al., 2013), ShakeOut (Kang et al., 2016), etc. (a DropOut sketch follows)
  - Induce noise into your model

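A minimal numpy sketch of (inverted) DropOut applied to a hidden activation vector at training time; the keep probability and activations are made-up example values, and this illustrates the masking idea rather than any particular library's implementation.

import numpy as np

a1 = np.random.rand(5, 1)        # hidden-layer activations (example)
keep_prob = 0.8                  # probability of keeping a neuron (assumed)

mask = (np.random.rand(*a1.shape) < keep_prob)   # randomly "damage" neurons
a1_dropped = a1 * mask / keep_prob               # scale so the expected activation is unchanged

# At test time no neurons are dropped and no scaling is needed.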


Implementations

Image Source: http://imgur.com/ZfkhOt4


TensorFlow

TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms (Abadi et al., 2015).

Represents computations as graphs
  - Nodes in the graph are called ops
  - Edges are tensors

Represents data as tensors
  - A tensor is an n-dimensional array with a rank, shape and type
  - For example [batch, height, width, channels]

Executes graphs in the context of sessions
  - A session places the graph ops onto devices such as CPUs/GPUs etc.

Maintains state with variables
  - The parameters of a statistical model are typically represented as a set of variables

Uses feeds and fetches to get data into and out of arbitrary operations


TensorFlow Example I

import tensorflow as tf

# Define two constants
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.], [2.]])

# Define a matmul operation
product = tf.matmul(matrix1, matrix2)

# Launch the default graph
sess = tf.Session()

# Run the matmul operation
result = sess.run(product)
print(result)
# [[ 12.]]

# Close the Session
sess.close()


TensorFlow Example II

state = tf.Variable(0.0, name="counter")
inc = tf.placeholder(tf.float32)
new_value = tf.add(state, inc)
update = tf.assign(state, new_value)
init_op = tf.initialize_all_variables()

# Launch the graph
with tf.Session() as sess:
    with tf.device("/gpu:0"):
        # Run the init op
        sess.run(init_op)
        # Run the op that updates state
        for _ in range(3):
            sess.run([update], feed_dict={inc: 0.5})
            print(sess.run(state))
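Example II updates state through an explicit assign op; in practice the update ops usually come from an optimizer. The following sketch is not from the slides: it assumes the same TensorFlow 1.x API as above and uses synthetic data only so that it runs on its own, showing how placeholders, variables, a loss op and a session combine into a training loop.

import numpy as np
import tensorflow as tf

# Synthetic data for y = 2x + 1 (illustration only)
X_data = np.random.rand(100, 1).astype(np.float32)
y_data = (2.0 * X_data + 1.0).astype(np.float32)

x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])

# Model parameters are variables, so the optimizer can update them
w = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))

y_hat = tf.matmul(x, w) + b
loss = tf.reduce_mean(tf.square(y_hat - y))

# minimize() adds the gradient and parameter-update ops to the graph
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for _ in range(200):
        sess.run(train_op, feed_dict={x: X_data, y: y_data})
    print(sess.run([w, b]))   # should approach [[2.0]] and [1.0]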


Keras Example Feedforward Network

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(64, input_dim=20, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(10, init='uniform'))
model.add(Activation('softmax'))
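A side note on the layer definitions above (an equivalent sketch, not shown on the slides, assuming the same Keras 1.x Sequential API): each Activation layer can also be folded into its Dense layer through the activation argument, which is a common, more compact spelling of the same network.

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(64, input_dim=20, init='uniform', activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform', activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(10, init='uniform', activation='softmax'))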


Keras Example Feedforward Network

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

model.fit(X_train, y_train, nb_epoch=20, batch_size=16)
score = model.evaluate(X_test, y_test, batch_size=16)
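The fit and evaluate calls above assume that X_train, y_train, X_test and y_test already exist. The sketch below is not from the slides: it generates random data with matching shapes (20 inputs, 10 one-hot classes) purely so the snippet can run, assumes the model built and compiled on the previous slides, and uses the same Keras 1.x API.

import numpy as np
from keras.utils.np_utils import to_categorical

# Random data with 20 features and 10 one-hot encoded classes (illustration only)
X_train = np.random.random((1000, 20))
y_train = to_categorical(np.random.randint(10, size=1000), nb_classes=10)
X_test = np.random.random((100, 20))
y_test = to_categorical(np.random.randint(10, size=100), nb_classes=10)

# After model.fit(...) and model.evaluate(...) as above, the trained model
# can be used for inference
probs = model.predict(X_test, batch_size=16)             # class probabilities
classes = model.predict_classes(X_test, batch_size=16)   # hard class labels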


Hardware
NVIDIA GTX TITAN X
CUDA Cores: 3072
Performance: 11 teraflops
Core Clock: 1000 MHz
Price: ≈ $1000

source: http://www.geforce.com


Hardware
NVIDIA DGX-1
CUDA Cores: 8 x 3584
Performance: 170 teraflops
Price: ≈ $130,000

source: http://www.geforce.com


Hardware

Department of CS: Skorpio
University of Saskatchewan HPC: Plato, Zeno, Meton


Want to learn more?

The Deep Learning textbook by Goodfellow, Bengio, and Courville is available free of cost at: deeplearningbook.org


Questions?

Acknowledgment

Dr. Kevin Stanley

Dr. Ian Stavness

Dr. Jung Lee

Dr. Jawad Shah


Citation

@unpublished{najeeb2016dnn,
  Author      = {Khan, Najeeb},
  Institution = {University of Saskatchewan},
  Year        = {2016},
  Title       = {Deep Neural Networks: Introduction, Architectures and Implementations},
  URL         = {usask.ca/~najeeb.khan/docs/dnn2016.pdf}
}


References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.

Dean, J. (2016). Large-scale deep learning for intelligent computer systems. Web Search and Data Mining.

Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142.

Domingos, P. (2015). The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books.

Graham, B. (2014). Fractional max-pooling. arXiv preprint arXiv:1412.6071.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.


References (cont.)

Kang, G., Li, J., and Tao, D. (2016). Shakeout: A new regularized deep neural network training scheme. In Thirtieth AAAI Conference on Artificial Intelligence.

Karpathy, A. (2016). CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.stanford.edu/ [Accessed: October 20, 2016].

Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.

LeCun, Y., Cortes, C., and Burges, C. J. (1998). The MNIST database of handwritten digits.

Lipton, Z. C., Berkowitz, J., and Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019.

Rennie, J. D. (2005). Regularized logistic regression is strictly convex. Unpublished manuscript. URL: people.csail.mit.edu/jrennie/writing/convexLR.pdf.

Senior, A., Heigold, G., Yang, K., et al. (2013). An empirical study of learning rates in deep neural networks for speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6724–6728. IEEE.

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9.

Wan, L., Zeiler, M., Zhang, S., Cun, Y. L., and Fergus, R. (2013). Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066.
