Sequence Modeling with Neural Networks - MIT 6.S191

Harini Suresh

[Title figure: an RNN unrolled across time, with inputs x0, x1, x2, hidden states s0, s1, s2, ..., and outputs y0, y1, y2]

What is a sequence?

● a sentence: "This morning I took the dog for a walk."
● medical signals
● speech waveforms

Successes of deep models

● machine translation
● question answering

Left: https://research.googleblog.com/2016/09/aneural-network-for-machine.html
Right: https://rajpurkar.github.io/SQuAD-explorer/

a sequence modeling problem: predict the next word

"This morning I took the dog for a walk."

given these words, predict what comes next?

idea: use a fixed window

"This morning I took the dog for a walk."
given these 2 words ("for", "a"), predict the next word

[ 1 0 0 0 0  0 1 0 0 0 ]  →  prediction
     for        a

a one-hot feature vector indicates what each word is
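To make the fixed-window idea concrete, here is a minimal sketch (mine, not from the slides): each of the 2 words in the window becomes a one-hot vector over a toy vocabulary, the vectors are concatenated, and an untrained linear layer plus softmax produces a distribution over the next word. The vocabulary, weight shapes, and names (one_hot, W_out, ...) are illustrative assumptions.

import numpy as np

# toy vocabulary; in practice this would come from a corpus
vocab = ["this", "morning", "i", "took", "the", "dog", "for", "a", "walk"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

def one_hot(word):
    """Return a length-V one-hot vector indicating which word this is."""
    vec = np.zeros(V)
    vec[word_to_idx[word]] = 1.0
    return vec

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# fixed window of 2 words: concatenate their one-hot vectors
window = ["for", "a"]
x = np.concatenate([one_hot(w) for w in window])      # shape (2 * V,)

# a linear model on top of the window (weights are random placeholders)
rng = np.random.default_rng(0)
W_out = rng.normal(scale=0.1, size=(V, 2 * V))
b_out = np.zeros(V)

probs = softmax(W_out @ x + b_out)                    # distribution over the next word
print(vocab[int(np.argmax(probs))])                   # untrained, so this guess is arbitrary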

problem: we can't model long-term dependencies

"In France, I had a great time and I learnt some of the _____ language."

We need information from the far past and future to accurately guess the correct word.

idea: use entire sequence, as a set of counts ("bag of words")

"This morning I took the dog for a"

[ 0 1 0 0 1 0 0 … 0 0 1 1 0 0 0 1 ]  →  prediction
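A small sketch of the bag-of-words idea (a toy illustration, not code from the lecture): the whole sequence is reduced to a vector of per-word counts, so any reordering of the words produces exactly the same vector.

import numpy as np

vocab = ["this", "morning", "i", "took", "the", "dog", "for", "a", "walk"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def bag_of_words(words):
    """Count how many times each vocabulary word appears; order is lost."""
    counts = np.zeros(len(vocab))
    for w in words:
        counts[word_to_idx[w]] += 1
    return counts

sentence = "this morning i took the dog for a".split()
print(bag_of_words(sentence))
# the same words in any other order give exactly the same vector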

problem: counts don't preserve order

"The food was good, not bad at all."
vs
"The food was bad, not good at all."

idea: use a really big fixed window

"This morning I took the dog for a walk."
given these 7 words, predict the next word

[ 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 ... ]  →  prediction
  (the one-hot vectors for "this", "morning", "I", "took", "the", "dog", "for", concatenated)

problem: no parameter sharing

  this            morning
[ 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 ... ]

                this              morning
[ 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 ... ]

each of these inputs has a separate parameter, so things we learn about the sequence won't transfer if they appear at different points in the sequence.

to model sequences, we need:

1. to deal with variable-length sequences
2. to maintain sequence order
3. to keep track of long-term dependencies
4. to share parameters across the sequence

let's turn to recurrent neural networks.

example network:

[Figure: a feedforward network with an input layer, a hidden layer, and an output layer]

let's take a look at this one hidden unit

RNNs remember their previous state:

[Figure, t = 0: the input x0 ("it") and the previous state s0 are combined through the weight matrices W and U to produce the new state s1]

[Figure, t = 1: the input x1 ("was") and the previous state s1 are combined through W and U to produce s2]

"unfolding" the RNN across time:

[Figure: the RNN unrolled over timesteps, with inputs x0, x1, x2, ..., states s0, s1, s2, ..., and the weight matrices W and U repeated at every timestep]

notice that we use the same parameters, W and U

sn can contain information from all past timesteps
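The slides draw this recurrence rather than writing it out. Below is a minimal NumPy sketch of the forward pass under the common formulation s_t = tanh(W s_{t-1} + U x_t); putting W on the previous state and U on the input is my assumption (the figure only shows that W and U are shared across timesteps), chosen to match the later vanishing-gradient slide where W appears in the repeated product. All names and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
input_size, state_size = 8, 16

# the same parameters are used at every timestep
W = rng.normal(scale=0.1, size=(state_size, state_size))  # previous state -> state
U = rng.normal(scale=0.1, size=(state_size, input_size))  # input -> state

def rnn_forward(xs, s0=None):
    """Unfold the RNN over a sequence of input vectors xs, returning all states."""
    s = np.zeros(state_size) if s0 is None else s0
    states = []
    for x in xs:
        s = np.tanh(W @ s + U @ x)   # s_t depends on s_{t-1}, so it can carry
        states.append(s)             # information from all past timesteps
    return states

xs = [rng.normal(size=input_size) for _ in range(5)]   # a length-5 toy sequence
states = rnn_forward(xs)
print(len(states), states[-1].shape)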

how do we train an RNN? backpropagation! (through time)

remember: backpropagation
1. take the derivative (gradient) of the loss with respect to each parameter
2. shift parameters in the opposite direction in order to minimize loss
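Step 2 is the usual gradient-descent update. Written out (a standard formula, not copied from the slides), for parameters θ and a learning rate η:

\theta \leftarrow \theta - \eta \, \frac{\partial J}{\partial \theta}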

we have a loss at each timestep (since we're making a prediction at each timestep):

[Figure: the unrolled RNN with an output yt (through V) and a loss Jt at every timestep]

we sum the losses across time:

loss at time t = Jt(θ)     (θ = our parameters, like weights)

total loss = J(θ) = Σt Jt(θ)

what are our gradients? we sum gradients across time for each parameter P:

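The equation on this slide is an image; reconstructed in standard notation it reads:

\frac{\partial J}{\partial P} \;=\; \sum_{t} \frac{\partial J_t}{\partial P}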

let's try it out for W with the chain rule:

[Figure: the unrolled RNN with losses J0, J1, J2 at outputs y0, y1, y2]

so let's take a single timestep t and expand ∂Jt/∂W with the chain rule

but wait… s1 also depends on W, so we can't just treat it as a constant!

how does s2 depend on W?

[Figure: s2 depends on W directly, and also indirectly through s1 and s0, which were themselves computed using W]

backpropagation through time:

contributions of W in previous timesteps to the error at timestep t
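The backpropagation-through-time formula itself appears only as an image in the deck. Reconstructed in standard notation (loss Jt, output yt, states sk, shared weight matrix W), it sums one contribution per earlier timestep k:

\frac{\partial J_t}{\partial W}
  \;=\; \sum_{k=0}^{t}
  \frac{\partial J_t}{\partial y_t}\,
  \frac{\partial y_t}{\partial s_t}\,
  \frac{\partial s_t}{\partial s_k}\,
  \frac{\partial s_k}{\partial W},
\qquad
\frac{\partial s_t}{\partial s_k} \;=\; \prod_{j=k}^{t-1} \frac{\partial s_{j+1}}{\partial s_j}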

why are RNNs hard to train?


problem: vanishing gradient

[Figure: the unrolled RNN from x0, s0 up to xn, sn; the term ∂st/∂sk in the gradient is itself a chain of factors linking sk to st, and at k = 0 this chain stretches across the whole sequence]

as the gap between timesteps gets bigger, this product gets longer and longer!

problem: vanishing gradient

what are each of these terms?

● W: sampled from a standard normal distribution, so mostly < 1
● f: tanh or sigmoid, so f' < 1

we're multiplying a lot of small numbers together.
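A quick numeric illustration of the effect (my own toy example with made-up sizes): each factor ∂s_{j+1}/∂s_j looks like diag(f'(·)) times a weight matrix, and with tanh and small normally initialized weights the norm of the product collapses as the gap between timesteps grows.

import numpy as np

rng = np.random.default_rng(0)
n = 16
W = rng.normal(scale=0.1, size=(n, n))           # entries sampled from a normal distribution, mostly well below 1

s = rng.normal(size=n)
jacobian_product = np.eye(n)

for gap in range(1, 21):
    s = np.tanh(W @ s)                           # toy recurrence (no input term)
    step_jacobian = np.diag(1.0 - s ** 2) @ W    # d s_{j+1} / d s_j for tanh: diag(f') times W
    jacobian_product = step_jacobian @ jacobian_product
    if gap % 5 == 0:
        print(f"gap {gap:2d}: ||d s_t / d s_k|| ≈ {np.linalg.norm(jacobian_product):.2e}")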

so what? errors due to timesteps further back in the sequence have increasingly smaller gradients, so parameters become biased to capture shorter-term dependencies.

"In France, I had a great time and I learnt some of the _____ language."

our parameters are not trained to capture long-term dependencies, so the word we predict will mostly depend on the previous few words, not on much earlier ones

solution #1: activation functions

[Figure: plots of the ReLU, tanh, and sigmoid derivatives]

using ReLU prevents f' from shrinking the gradients, since its derivative is 1 whenever the unit is active (the tanh and sigmoid derivatives are always < 1)

solution #2: initialization

● weights initialized to the identity matrix
● biases initialized to zeros

this prevents W from shrinking the gradients
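A minimal sketch (names and sizes are mine) of how solutions #1 and #2 look in code: a simple RNN step that uses ReLU as the activation, identity-initialized recurrent weights, and zero biases, so that early in training the state passes through largely unchanged.

import numpy as np

rng = np.random.default_rng(0)
input_size, state_size = 8, 16

W = np.eye(state_size)                                   # solution #2: identity-initialized recurrent weights
U = rng.normal(scale=0.01, size=(state_size, input_size))
b = np.zeros(state_size)                                 # biases initialized to zeros

def relu(z):
    return np.maximum(0.0, z)                            # solution #1: ReLU, derivative is 1 where active

def step(s_prev, x):
    return relu(W @ s_prev + U @ x + b)

s = np.zeros(state_size)
for x in [rng.normal(size=input_size) for _ in range(5)]:
    s = step(s, x)
print(s.shape)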

a different type of solution: more complex cells

solution #3: gated cells

rather than each node being just a simple RNN cell, make each node a more complex unit with gates controlling what information is passed through. Long short-term memory (LSTM) cells are able to keep track of information throughout many timesteps.

simple RNN cell  vs  gated cell (LSTM, GRU, etc.)

solution #3: more on LSTMs

[Figure: an LSTM cell carries a cell state from cj to cj+1 through a forget gate, an input gate, and an output gate, which act on the incoming state st and input xt to produce st+1]

● forget irrelevant parts of the previous state
● selectively update cell state values
● output certain parts of the cell state

why do LSTMs help?

1. the forget gate allows information to pass through unchanged
2. the cell state is separate from what's outputted
3. the cell state cj depends on cj-1 through addition! → derivatives don't expand into a long product!
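To make the gating concrete, here is a minimal NumPy sketch of a single LSTM step in the standard formulation (the slides show this as a diagram; the variable names, shapes, and the choice to act on the concatenation of the previous state and the input are my assumptions): the forget gate scales the previous cell state, the input gate decides what to add, the cell state is updated by addition, and the output gate decides what part of it becomes the new state s.

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix and bias per gate, each acting on [s_prev, x] concatenated
def make_params():
    return (rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size)),
            np.zeros(hidden_size))

(Wf, bf), (Wi, bi), (Wc, bc), (Wo, bo) = (make_params() for _ in range(4))

def lstm_step(s_prev, c_prev, x):
    z = np.concatenate([s_prev, x])
    f = sigmoid(Wf @ z + bf)          # forget gate: forget irrelevant parts of the previous cell state
    i = sigmoid(Wi @ z + bi)          # input gate: decide which values to update
    c_tilde = np.tanh(Wc @ z + bc)    # candidate values to add
    c = f * c_prev + i * c_tilde      # cell state updated by addition
    o = sigmoid(Wo @ z + bo)          # output gate: output certain parts of the cell state
    s = o * np.tanh(c)
    return s, c

s, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in [rng.normal(size=input_size) for _ in range(5)]:
    s, c = lstm_step(s, c, x)
print(s.shape, c.shape)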

possible task: classification (e.g. sentiment)

[Figure: a many-to-one RNN reads the input words ("don't", "fly", ..., "luggage") through the shared weights W and U; only the final state sn is mapped through V to the output y, here "negative"]

y is a probability distribution over possible classes (like positive, negative, neutral), aka a softmax
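A minimal sketch of this many-to-one setup (toy, untrained weights; the word vectors are stand-ins): run the recurrence over the sequence, keep only the final state, and map it through V and a softmax to class probabilities.

import numpy as np

rng = np.random.default_rng(0)
input_size, state_size, n_classes = 8, 16, 3   # classes: positive / negative / neutral

W = rng.normal(scale=0.1, size=(state_size, state_size))
U = rng.normal(scale=0.1, size=(state_size, input_size))
V = rng.normal(scale=0.1, size=(n_classes, state_size))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(xs):
    """Return class probabilities for a sequence of word vectors xs."""
    s = np.zeros(state_size)
    for x in xs:                      # read the whole sequence...
        s = np.tanh(W @ s + U @ x)
    return softmax(V @ s)             # ...and classify from the final state only

sentence = [rng.normal(size=input_size) for _ in range(6)]   # stand-in word vectors
print(classify(sentence))             # untrained, so the probabilities are arbitrary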

possible task: music generation

[Video: music generated by an RNN. Music by Francesco Marchesani, Computer Science Engineer, PoliMi]

[Figure: at each timestep the RNN outputs a note (E, D, F#, ...); each output note is fed back in as the next input]

yi is actually a probability distribution over possible next notes, aka a softmax
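A minimal sketch of the generation loop the figure implies (toy, untrained weights; the note set and the sampling strategy are my assumptions): at each step the RNN outputs a softmax over possible next notes, one note is sampled, and it is fed back in as the next input.

import numpy as np

rng = np.random.default_rng(0)
notes = ["C", "D", "E", "F", "F#", "G", "A", "B"]
n_notes, state_size = len(notes), 16

W = rng.normal(scale=0.1, size=(state_size, state_size))
U = rng.normal(scale=0.1, size=(state_size, n_notes))
V = rng.normal(scale=0.1, size=(n_notes, state_size))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(first_note="E", length=8):
    s = np.zeros(state_size)
    x = np.eye(n_notes)[notes.index(first_note)]   # one-hot encoding of the seed note
    melody = [first_note]
    for _ in range(length - 1):
        s = np.tanh(W @ s + U @ x)
        probs = softmax(V @ s)                     # distribution over possible next notes
        nxt = rng.choice(n_notes, p=probs)         # sample the next note...
        melody.append(notes[nxt])
        x = np.eye(n_notes)[nxt]                   # ...and feed it back in as the next input
    return melody

print(generate())    # with untrained weights the melody is random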

possible task: machine translation

[Figure: an encoder RNN (weights W, U) reads "the dog eats" into states s0, s1, s2; a decoder RNN with its own weights (K, L, J) and states c0, c1, c2, c3 then emits "le chien mange", where each decoding step sees the final encoder state s2 together with the previously emitted word]

problem: a single encoding is limiting

[Figure: the same encoder-decoder; every decoding step is conditioned only on s2 and the previous output word]

all the decoder knows about the input sentence is in one fixed-length vector, s2
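A minimal encoder-decoder sketch that mirrors the figure (toy vocabularies and untrained weights; the roles I assign to the decoder matrices K, L, and J are guesses, and for simplicity I initialize the decoder with the final encoder state instead of feeding s2 in at every step as the figure does). The point it illustrates is exactly the bottleneck named on this slide: everything the decoder knows about the source sentence must squeeze through one fixed-length vector.

import numpy as np

rng = np.random.default_rng(0)
src_vocab = ["the", "dog", "eats"]
tgt_vocab = ["<start>", "le", "chien", "mange", "<end>"]
state_size = 16

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# encoder parameters (W: state->state, U: input->state); decoder parameters named K, L, J as in the figure
W = rng.normal(scale=0.1, size=(state_size, state_size))
U = rng.normal(scale=0.1, size=(state_size, len(src_vocab)))
K = rng.normal(scale=0.1, size=(state_size, state_size))       # previous decoder state -> decoder state
L = rng.normal(scale=0.1, size=(state_size, len(tgt_vocab)))   # previous output word -> decoder state
J = rng.normal(scale=0.1, size=(len(tgt_vocab), state_size))   # decoder state -> output distribution

def encode(words):
    s = np.zeros(state_size)
    for w in words:
        s = np.tanh(W @ s + U @ one_hot(src_vocab.index(w), len(src_vocab)))
    return s                                      # one fixed-length vector summarizes the sentence

def decode(s_final, max_len=5):
    c, word, out = s_final, "<start>", []
    for _ in range(max_len):
        c = np.tanh(K @ c + L @ one_hot(tgt_vocab.index(word), len(tgt_vocab)))
        word = tgt_vocab[int(np.argmax(softmax(J @ c)))]
        if word == "<end>":
            break
        out.append(word)
    return out

print(decode(encode(["the", "dog", "eats"])))     # untrained, so the output is arbitrary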

solution: attend over all encoder states

[Figure: at each decoding step, the decoder state s* attends over all of the encoder states s0, s1, s2 rather than using only the last one]
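A minimal sketch of the attention idea (the scoring function is my choice, a plain dot product; the slides don't specify one): at each decoding step, score every encoder state against the current decoder state, turn the scores into weights with a softmax, and feed the decoder the resulting weighted average (a context vector) instead of only the last encoder state.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Weight every encoder state by its relevance to the current decoder state."""
    scores = np.array([decoder_state @ s for s in encoder_states])  # dot-product scores
    weights = softmax(scores)                                       # attention distribution over timesteps
    context = sum(w * s for w, s in zip(weights, encoder_states))   # weighted average of all encoder states
    return context, weights

rng = np.random.default_rng(0)
state_size = 16
encoder_states = [rng.normal(size=state_size) for _ in range(3)]    # s0, s1, s2 from the encoder
decoder_state = rng.normal(size=state_size)                         # current decoder state s*

context, weights = attend(decoder_state, encoder_states)
print(weights, context.shape)   # the context vector would be fed into the next decoding step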

now we can model sequences!

● why recurrent neural networks?
● training them with backpropagation through time
● solving the vanishing gradient problem with activation functions, initialization, and gated cells (like LSTMs)
● building models for classification, music generation and machine translation
● using attention mechanisms

and there's lots more to do!

● extending our models to timeseries + waveforms
● complex language models to generate long text or books
● language models to generate code
● controlling cars + robots
● predicting stock market trends
● summarizing books + articles
● handwriting generation
● multilingual translation models
● … many more!