Sequence Modeling with Neural Networks
Harini Suresh
MIT 6.S191 | Intro to Deep Learning | IAP 2018

[Figure: an RNN unrolled over time, with inputs x0, x1, x2, hidden states s0, s1, s2, and outputs y0, y1, y2]
What is a sequence?
● sentence: "This morning I took the dog for a walk."
● medical signals
● speech waveform
Successes of deep models
● Machine translation (https://research.googleblog.com/2016/09/a-neural-network-for-machine.html)
● Question answering (https://rajpurkar.github.io/SQuAD-explorer/)
a sequence modeling problem: predict the next word
a sequence modeling problem
"This morning I took the dog for a walk."
given these words, predict what comes next?
idea: use a fixed window
"This morning I took the dog for a walk."
given these 2 words ("for", "a"), predict the next word
a one-hot feature vector indicates what each word is:
[ 1 0 0 0 0 0 1 0 0 0 ]   ("for", "a")  →  prediction
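A minimal sketch of the fixed-window idea; the vocabulary, indices, and prediction weights here are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# toy vocabulary; indices are arbitrary, for illustration only
vocab = ["this", "morning", "i", "took", "the", "dog", "for", "a", "walk"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# fixed window of 2 words: concatenate their one-hot vectors
window = ["for", "a"]
x = np.concatenate([one_hot(w) for w in window])   # shape: (2 * |vocab|,)

# a hypothetical linear layer then scores every candidate next word
W = np.random.randn(len(vocab), x.size) * 0.01
scores = W @ x                                     # one score per vocabulary word
```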
problem: we can't model long-term dependencies
"In France, I had a great time and I learnt some of the _____ language."
we need information from the far past and future to accurately guess the correct word.
idea: use entire sequence, as a set of counts ("bag of words")
"This morning I took the dog for a walk."
[ 0 1 0 0 1 0 0 … 0 0 1 1 0 0 0 1 ]  →  prediction
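The same sentence as a bag of words, sketched under the same toy-vocabulary assumption:

```python
from collections import Counter

vocab = ["this", "morning", "i", "took", "the", "dog", "for", "a", "walk"]
sentence = "this morning i took the dog for a walk".split()
counts = Counter(sentence)

# one count per vocabulary word: all ordering information is gone
bow = [counts[w] for w in vocab]
```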
problem: counts don't preserve order
"The food was good, not bad at all." vs "The food was bad, not good at all."
idea: use a really big fixed window
"This morning I took the dog for a walk."
given these 7 words, predict the next word
[ 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 ... ]   (this, morning, I, took, the, dog, ...)  →  prediction
problem: no parameter sharing
"this morning" at the start of the window:
[ 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 ... ]
"this morning" later in the window:
[ 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 ... ]
each of these inputs has a separate parameter: things we learn about the sequence won't transfer if they appear at different points in the sequence.
to model sequences, we need:
1. to deal with variable-length sequences
2. to maintain sequence order
3. to keep track of long-term dependencies
4. to share parameters across the sequence

let's turn to recurrent neural networks.
example network:
[Figure: a standard feedforward network with input, hidden, and output layers; let's take a look at one hidden unit]
RNNs remember their previous state:
t = 0:  the input x0 ("it"), together with the previous state s0, produces the new state s1 (using weights U and W)
t = 1:  the input x1 ("was"), together with s1, produces s2
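The slides show this update as a diagram; written out in the usual form, and assuming (as the vanishing-gradient discussion later suggests) that U acts on the input and W on the previous state, the update is:

s_t = f(U x_t + W s_{t-1}),    y_t = softmax(V s_t)

where f is the cell's nonlinearity (tanh or sigmoid, as noted later in the deck) and V produces the output at each step.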
"unfolding" the RNN across time:
[Figure: the RNN unrolled over timesteps, with inputs x0, x1, x2, ..., states s0, s1, s2, ..., and the weights U and W repeated at every step]
notice that we use the same parameters, W and U, at every timestep
sn can contain information from all past timesteps
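A minimal numpy sketch of the unrolled forward pass, assuming tanh for f; the dimensions and random values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3

# the same U and W are reused at every timestep (parameter sharing)
U = rng.normal(scale=0.1, size=(hidden_size, input_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

xs = [rng.normal(size=input_size) for _ in range(5)]   # x0 ... x4
s = np.zeros(hidden_size)                              # s0

states = []
for x in xs:
    s = np.tanh(U @ x + W @ s)   # s_{t+1} = f(U x_t + W s_t)
    states.append(s)
# the final state can carry information from all past timesteps
```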
how do we train an RNN?
backpropagation! (through time)
remember: backpropagation
1. take the derivative (gradient) of the loss with respect to each parameter
2. shift parameters in the opposite direction in order to minimize loss
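In symbols, with a hypothetical learning rate η, the update for each parameter P is:

P ← P − η · ∂J/∂P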
we have a loss at each timestep (since we're making a prediction at each timestep):
[Figure: the unrolled RNN with outputs y0, y1, y2 produced from states s0, s1, s2 via weights V, and a loss J0, J1, J2 at each timestep]
we sum the losses across time:
loss at time t:  J_t(θ), where θ are our parameters (like the weights)
total loss:  J(θ) = Σ_t J_t(θ)
what are our gradients? we sum gradients across time for each parameter P:
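The sum the slide refers to (shown there as an image) is, for each parameter P:

∂J/∂P = Σ_t ∂J_t/∂P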
let's try it out for W with the chain rule:
[Figure: the unrolled RNN with losses J0, J1, J2 at each timestep]
so let's take a single timestep t and expand its loss with the chain rule
but wait… s1 also depends on W, so we can't just treat it as a constant!
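Reconstructed in standard notation (the slides show these steps as images), the single-timestep expansion for, say, t = 2 is:

∂J_2/∂W = (∂J_2/∂y_2)(∂y_2/∂s_2)(∂s_2/∂W)

but since s_2 = f(U x_2 + W s_1) and s_1 itself depends on W, the last factor cannot be evaluated by treating s_1 as a constant.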
how does s2 depend on W?
[Figure: s2 is computed from s1 using W, and s1 from s0 using W, so s2 depends on W both directly and through all of the earlier states]
backpropagation through time:
sum the contributions of W in previous timesteps to the error at timestep t
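In the usual notation (a reconstruction; the slide shows the formula as an image), the backpropagation-through-time gradient at timestep t sums over all earlier timesteps k:

∂J_t/∂W = Σ_{k=0}^{t} (∂J_t/∂y_t)(∂y_t/∂s_t)(∂s_t/∂s_k)(∂s_k/∂W)

where ∂s_k/∂W is the immediate derivative (treating s_{k-1} as constant) and ∂s_t/∂s_k = Π_{j=k+1}^{t} ∂s_j/∂s_{j-1} is the contribution of timestep k to the error at timestep t.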
why are RNNs hard to train?
problem: vanishing gradient
[Figure: the unrolled RNN from x0 through xn; computing the gradient at a late timestep requires chaining ∂s_j/∂s_{j-1} terms all the way back, e.g. to k = 0]
as the gap between timesteps gets bigger, this product gets longer and longer!
problem: vanishing gradient
what are each of these terms?
● W: sampled from a standard normal distribution, so mostly < 1
● f: tanh or sigmoid, so f' < 1
we're multiplying a lot of small numbers together.
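A tiny numeric illustration of why this matters; the 0.9 and 0.5 factors are arbitrary stand-ins for those small terms:

```python
# multiplying many factors that are each < 1 drives the product toward 0
product = 1.0
for factor in [0.9] * 20:
    product *= factor
print(product)        # ~0.12 after 20 steps

product = 1.0
for factor in [0.5] * 20:
    product *= factor
print(product)        # ~9.5e-07: the signal from 20 steps back is essentially gone
```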
so what? errors due to further-back timesteps have increasingly smaller gradients.
so what? parameters become biased to capture shorter-term dependencies.
"In France, I had a great time and I learnt some of the _____ language."
our parameters are not trained to capture long-term dependencies, so the predicted word will mostly depend on the few previous words, not on words from much earlier.
solution #1: activation functions
using ReLU prevents f' from shrinking the gradients, since its derivative is 1 for positive inputs
[Figure: plots of the ReLU, tanh, and sigmoid derivatives]
solution #2: initialization
weights initialized to the identity matrix, biases initialized to zeros
this prevents W from shrinking the gradients
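A sketch of what solutions #1 and #2 look like when setting up a simple recurrent cell; the dimensions are illustrative:

```python
import numpy as np

hidden_size, input_size = 3, 4

# solution #2: recurrent weights initialized to the identity, biases to zero
W = np.eye(hidden_size)
b = np.zeros(hidden_size)
U = np.random.randn(hidden_size, input_size) * 0.01

# solution #1: use ReLU instead of tanh/sigmoid, so f' is 1 for positive inputs
def relu(z):
    return np.maximum(z, 0.0)

def step(s_prev, x):
    return relu(U @ x + W @ s_prev + b)
```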
a different type of solution: more complex cells
solution #3: gated cells
rather than each node being just a simple RNN cell, make each node a more complex unit with gates controlling what information is passed through. long short-term memory (LSTM) cells are able to keep track of information throughout many timesteps.
simple RNN cell vs gated cell (LSTM, GRU, etc.)
solution #3: more on LSTMs
[Figure: an LSTM node carries a cell state from cj to cj+1]
1. forget irrelevant parts of the previous state
2. selectively update cell state values
3. output certain parts of the cell state
[Figure: inside the LSTM cell, st and xt drive a forget gate (elementwise multiply on the cell state) and an input gate (addition into the cell state) to turn cj into cj+1; an output gate then produces the next state st+1]
why do LSTMs help?
1. the forget gate allows information to pass through unchanged
2. the cell state is separate from what's outputted
3. the cell state cj+1 depends on cj through addition! → derivatives don't expand into a long product!
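A minimal numpy sketch of one LSTM step; the gate names match the slide, but the exact equations are the standard LSTM formulation, assumed here rather than copied from the deck:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, s_prev, c_prev, params):
    """One LSTM step: returns the new output state s and cell state c."""
    Wf, Wi, Wg, Wo, bf, bi, bg, bo = params
    z = np.concatenate([s_prev, x])

    f = sigmoid(Wf @ z + bf)     # forget gate: drop irrelevant parts of c_prev
    i = sigmoid(Wi @ z + bi)     # input gate: choose which values to update
    g = np.tanh(Wg @ z + bg)     # candidate values
    c = f * c_prev + i * g       # cell state is updated through addition
    o = sigmoid(Wo @ z + bo)     # output gate: expose parts of the cell state
    s = o * np.tanh(c)
    return s, c

# illustrative sizes and random parameters
h, d = 3, 4
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=(h, h + d)) for _ in range(4)] + \
         [np.zeros(h) for _ in range(4)]
s, c = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h), params)
```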
possible task: classification (e.g. sentiment)
[Figure: an RNN reads the words "don't", "fly", ..., "luggage" and produces a single prediction y ("negative") from the final state sn]
y is a probability distribution over possible classes (like positive, negative, neutral), aka a softmax
possible task: music generation
[Figure: an RNN generating music]
Music by: Francesco Marchesani, Computer Science Engineer, PoliMi
possible task: music generation
[Figure: the RNN predicts the next note at each timestep, e.g. outputting E, D, F#, feeding each generated note back in as the next input]
yi is actually a probability distribution over possible next notes, aka a softmax
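A sketch of how a prediction becomes the next note: since yi is a softmax distribution, generation just samples from it and feeds the result back in; the note list and scores below are made up for illustration:

```python
import numpy as np

notes = ["C", "D", "E", "F", "G", "A", "B", "F#"]
scores = np.random.randn(len(notes))              # stand-in for V @ s_t

probs = np.exp(scores) / np.sum(np.exp(scores))   # softmax over possible next notes
next_note = np.random.choice(notes, p=probs)      # sample, then feed back as the next input
```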
possible task: machine translation
[Figure: an encoder RNN reads "the dog eats" into states s0, s1, s2; a decoder RNN then generates "le chien mange" one word at a time, with each decoder step receiving the encoder's final state s2 together with the previously generated word]
problem: a single encoding is limiting
all the decoder knows about the input sentence is contained in one fixed-length vector, s2
solution: attend over all encoder states
[Figure: instead of passing only the final state s2, each decoder step receives a weighted combination s* of all encoder states s0, s1, s2, recomputed for every output word]
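A minimal sketch of the attention step: score every encoder state against the current decoder state, softmax the scores, and take the weighted sum s*. Dot-product scoring is an assumption here; the slides don't specify the scoring function:

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Return s*, a weighted combination of all encoder states."""
    scores = np.array([decoder_state @ s for s in encoder_states])  # one score per encoder state
    weights = np.exp(scores) / np.sum(np.exp(scores))               # softmax over the scores
    return np.sum(weights[:, None] * np.array(encoder_states), axis=0)

# illustrative: 3 encoder states and a decoder state of the same size
enc = [np.random.randn(4) for _ in range(3)]   # s0, s1, s2
s_star = attend(np.random.randn(4), enc)
```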
now we can model sequences!
● why recurrent neural networks?
● training them with backpropagation through time
● solving the vanishing gradient problem with activation functions, initialization, and gated cells (like LSTMs)
● building models for classification, music generation and machine translation
● using attention mechanisms
and there's lots more to do!
● extending our models to timeseries + waveforms
● complex language models to generate long text or books
● language models to generate code
● controlling cars + robots
● predicting stock market trends
● summarizing books + articles
● handwriting generation
● multilingual translation models
● … many more!