Variational Inference - University of Colorado Boulder

Variational Inference
Machine Learning: Jordan Boyd-Graber, University of Colorado Boulder
Lecture 19


Variational Inference

• Inferring hidden variables
• Unlike MCMC:
  ◦ Deterministic
  ◦ Easy to gauge convergence
  ◦ Requires dozens of iterations
• Doesn't require conjugacy
• Slightly hairier math


Setup

• x = x_{1:n}: observations
• z = z_{1:m}: hidden variables
• α: fixed parameters
• Want the posterior distribution

p(z | x, α) = p(z, x | α) / ∫_z p(z, x | α)    (1)

Motivation

• Can't compute the posterior for many interesting models

GMM (finite)
1. For each component k = 1 … K, draw µ_k ∼ N(0, τ²)
2. For each observation i = 1 … n:
   2.1 Draw z_i ∼ Mult(π)
   2.2 Draw x_i ∼ N(µ_{z_i}, σ₀²)

• Posterior is intractable for large n, and we might want to add priors

p(µ_{1:K}, z_{1:n} | x_{1:n}) = [ ∏_{k=1}^{K} p(µ_k) ∏_{i=1}^{n} p(z_i) p(x_i | z_i, µ_{1:K}) ] / [ ∫_{µ_{1:K}} ∑_{z_{1:n}} ∏_{k=1}^{K} p(µ_k) ∏_{i=1}^{n} p(z_i) p(x_i | z_i, µ_{1:K}) ]    (2)

The denominator requires considering all means (the integral over µ_{1:K}) and all assignments (the sum over the K^n settings of z_{1:n}); that is what makes the posterior intractable.
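To make the generative story concrete, here is a minimal sampling sketch in Python; the function name sample_gmm and the default values of K, τ, σ₀, and π are illustrative assumptions, not part of the slides:

    import numpy as np

    def sample_gmm(n, K=3, tau=5.0, sigma0=1.0, pi=None, rng=None):
        """Sample from the finite GMM on the slide:
        mu_k ~ N(0, tau^2), z_i ~ Mult(pi), x_i ~ N(mu_{z_i}, sigma0^2)."""
        rng = np.random.default_rng() if rng is None else rng
        pi = np.full(K, 1.0 / K) if pi is None else np.asarray(pi)
        mu = rng.normal(0.0, tau, size=K)      # component means
        z = rng.choice(K, size=n, p=pi)        # cluster assignments
        x = rng.normal(mu[z], sigma0)          # observations
        return mu, z, x

    mu, z, x = sample_gmm(n=500)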

Main Idea

• We create a variational distribution over the latent variables

q(z_{1:m} | ν)    (3)

• Find the settings of ν so that q is close to the posterior
• If q == p, then this is vanilla EM


What does it mean for distributions to be close?

• We measure the closeness of distributions using Kullback-Leibler divergence

KL(q || p) ≡ E_q[ log ( q(Z) / p(Z | x) ) ]    (4)

• Characterizing KL divergence
  ◦ If q and p are high, we're happy
  ◦ If q is high but p isn't, we pay a price
  ◦ If q is low, we don't care
  ◦ If KL = 0, then the distributions are equal

This behavior is often called "mode splitting": we want a good solution, not every solution.
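A small, self-contained Python illustration of these properties for discrete distributions; the function name and the toy numbers are assumptions for illustration:

    import numpy as np

    def kl_divergence(q, p):
        """KL(q || p) = sum_z q(z) * log(q(z) / p(z)) for discrete distributions.
        Terms where q(z) == 0 contribute nothing ("if q is low, we don't care")."""
        q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
        mask = q > 0
        return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

    # A q that puts all its mass on one mode of p pays no penalty for ignoring the other mode.
    p = np.array([0.5, 0.5])
    q = np.array([1.0, 0.0])
    print(kl_divergence(q, p))   # log 2 ≈ 0.693
    print(kl_divergence(p, p))   # 0.0: equal distributions have zero divergence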

Jensen's Inequality: Concave Functions and Expectations

When f is concave,

f(E[X]) ≥ E[f(X)]

For the concave logarithm, evaluated at a point t·x₁ + (1 − t)·x₂ between x₁ and x₂:

log(t·x₁ + (1 − t)·x₂) ≥ t·log(x₁) + (1 − t)·log(x₂)

If you haven't seen this before, spend fifteen minutes to convince yourself that it's true.
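A quick numerical sanity check of the inequality for the concave log; the particular values of x₁, x₂, and t are arbitrary:

    import numpy as np

    x1, x2, t = 1.0, 9.0, 0.3
    lhs = np.log(t * x1 + (1 - t) * x2)          # log of the average
    rhs = t * np.log(x1) + (1 - t) * np.log(x2)  # average of the logs
    assert lhs >= rhs
    print(lhs, rhs)   # ≈ 1.89 vs ≈ 1.54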

Evidence Lower Bound (ELBO)

• Apply Jensen's inequality to the log probability of the data:

log p(x) = log ∫_z p(x, z)
         = log ∫_z p(x, z) · q(z) / q(z)         (multiply by a term that is equal to one)
         = log E_q[ p(x, z) / q(z) ]             (fold q(z) into the integral to create an expectation)
         ≥ E_q[log p(x, z)] − E_q[log q(z)]      (apply Jensen's inequality and split the log of the quotient into a difference)

• Fun side effect: the second term, −E_q[log q(z)], is the entropy of q
• Maximizing the ELBO gives as tight a bound as possible on the log probability of the data
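A toy numerical check, under assumed joint probabilities for a single x and three values of z, that the ELBO really lower-bounds log p(x):

    import numpy as np

    p_xz = np.array([0.4, 0.1, 0.05])    # joint p(x, z) for one fixed x and three values of z
    q_z = np.array([0.6, 0.3, 0.1])      # an arbitrary variational distribution over z

    log_px = np.log(p_xz.sum())                                      # log p(x) = log sum_z p(x, z)
    elbo = np.sum(q_z * np.log(p_xz)) - np.sum(q_z * np.log(q_z))    # E_q[log p(x,z)] - E_q[log q(z)]
    assert elbo <= log_px
    print(elbo, log_px)   # ≈ -0.64 and ≈ -0.60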

Relation to KL Divergence

• Conditional probability definition

p(z | x) = p(z, x) / p(x)    (5)

• Plug into the KL divergence

KL(q(z) || p(z | x)) = E_q[ log ( q(z) / p(z | x) ) ]
                     = E_q[log q(z)] − E_q[log p(z | x)]                   (break the quotient into a difference)
                     = E_q[log q(z)] − E_q[log p(z, x)] + log p(x)         (apply the definition of conditional probability)
                     = −( E_q[log p(z, x)] − E_q[log q(z)] ) + log p(x)    (reorganize terms)

• This is the negative of the ELBO (plus a constant); minimizing the KL divergence is the same as maximizing the ELBO
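Continuing the same toy numbers, a short sketch verifying the identity KL(q || p(z | x)) = log p(x) − ELBO:

    import numpy as np

    p_xz = np.array([0.4, 0.1, 0.05])
    q_z = np.array([0.6, 0.3, 0.1])

    p_z_given_x = p_xz / p_xz.sum()                                  # exact posterior for the toy model
    kl = np.sum(q_z * np.log(q_z / p_z_given_x))
    elbo = np.sum(q_z * np.log(p_xz)) - np.sum(q_z * np.log(q_z))
    log_px = np.log(p_xz.sum())
    print(kl, log_px - elbo)   # the two agree (up to floating point), both ≈ 0.044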

Mean field variational inference

• Assume that your variational distribution factorizes

q(z_1, …, z_m) = ∏_{j=1}^{m} q(z_j)    (6)

• You may want to group some hidden variables together
• Does not contain the true posterior, because the hidden variables are dependent


General Blueprint

• Choose q

• Derive ELBO

• Coordinate ascent on each q_i
• Repeat until convergence


Example: Latent Dirichlet Allocation

TOPIC 1: computer, technology, system, service, site, phone, internet, machine
TOPIC 2: sell, sale, store, product, business, advertising, market, consumer
TOPIC 3: play, film, movie, theater, production, star, director, stage

Example headlines, each drawing on one or more of these topics: "Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens"; "The three big Internet portals begin to distinguish among themselves as shopping malls"; "Stock Trades: A Better Deal For Investors Isn't Simple"; "Forget the Bootleg, Just Download the Movie Legally"; "The Shape of Cinema, Transformed At the Click of a Mouse"; "Multiplex Heralded As Linchpin To Growth"; "A Peaceful Crew Puts Muppets Where Its Mouth Is"

A single document mixes the topics: "Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services ..."

LDA Generative Model

[Plate diagram: α → θ_d → z_n → w_n ← β_k, with plates over the N word positions, the M documents, and the K topics]

• For each topic k ∈ {1, …, K}, a multinomial distribution β_k
• For each document d ∈ {1, …, M}, draw a multinomial distribution θ_d from a Dirichlet distribution with parameter α
• For each word position n ∈ {1, …, N}, select a hidden topic z_n from the multinomial distribution parameterized by θ
• Choose the observed word w_n from the distribution β_{z_n}

Statistical inference uncovers the unobserved variables given the data.
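As a hedged illustration of this generative story, a toy corpus sampler in Python. The slides treat each β_k as a fixed multinomial, so drawing β from a symmetric Dirichlet here (along with the corpus sizes and the name sample_lda_corpus) is purely an assumption used to get concrete values:

    import numpy as np

    def sample_lda_corpus(M=10, N=50, K=3, V=20, alpha=0.1, eta=0.1, seed=0):
        """Toy LDA generative process: theta_d ~ Dir(alpha), z_n ~ Mult(theta_d), w_n ~ Mult(beta_{z_n})."""
        rng = np.random.default_rng(seed)
        beta = rng.dirichlet(np.full(V, eta), size=K)       # K topic-word distributions (assumed prior)
        docs = []
        for _ in range(M):
            theta_d = rng.dirichlet(np.full(K, alpha))      # document's topic proportions
            z = rng.choice(K, size=N, p=theta_d)            # topic assignment per word position
            w = np.array([rng.choice(V, p=beta[k]) for k in z])   # observed words
            docs.append(w)
        return docs, beta

    docs, true_beta = sample_lda_corpus()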

Deriving Variational Inference for LDA

Joint distribution:

p(θ, z, w | α, β) = ∏_d p(θ_d | α) ∏_n p(z_{d,n} | θ_d) p(w_{d,n} | β, z_{d,n})    (7)

• p(θ_d | α) = [ Γ(∑_i α_i) / ∏_i Γ(α_i) ] ∏_k θ_{d,k}^{α_k − 1}    (Dirichlet)
• p(z_{d,n} | θ_d) = θ_{d, z_{d,n}}    (draw from multinomial)
• p(w_{d,n} | β, z_{d,n}) = β_{z_{d,n}, w_{d,n}}    (draw from multinomial)

Variational distribution:

q(θ, z) = q(θ | γ) q(z | φ)    (8)

ELBO:

L(γ, φ; α, β) = E_q[log p(θ | α)] + E_q[log p(z | θ)] + E_q[log p(w | z, β)] − E_q[log q(θ)] − E_q[log q(z)]    (9)

What is the variational distribution?

q(θ, z) = ∏_d q(θ_d | γ_d) ∏_n q(z_{d,n} | φ_{d,n})    (10)

• Variational document distribution over topics γ_d
  ◦ Vector of length K for each document
  ◦ Non-negative
  ◦ Doesn't sum to 1.0
• Variational token distribution over topic assignments φ_{d,n}
  ◦ Vector of length K for every token
  ◦ Non-negative, sums to 1.0

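A sketch of how γ and φ might be laid out and initialized; the random scheme and the sizes K, M, N are assumptions, only the shapes and constraints come from the slide:

    import numpy as np

    rng = np.random.default_rng(0)
    K, M, N = 3, 10, 50                                     # topics, documents, words per document

    gamma = rng.gamma(shape=2.0, scale=1.0, size=(M, K))    # per-document, non-negative, need not sum to 1
    phi = rng.random((M, N, K))
    phi /= phi.sum(axis=-1, keepdims=True)                  # per-token, each row sums to 1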

Expectation of log Dirichlet

• Most expectations are straightforward to compute
• The Dirichlet is harder; for θ ∼ Dir(α),

E[log θ_i] = Ψ(α_i) − Ψ(∑_j α_j)    (11)

where Ψ is the digamma function.
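A quick check of identity (11) against Monte Carlo samples, using scipy's digamma; the choice of α is arbitrary:

    import numpy as np
    from scipy.special import digamma

    alpha = np.array([0.5, 2.0, 3.5])
    analytic = digamma(alpha) - digamma(alpha.sum())

    rng = np.random.default_rng(0)
    samples = rng.dirichlet(alpha, size=100_000)
    monte_carlo = np.log(samples).mean(axis=0)

    print(analytic)
    print(monte_carlo)   # agrees to roughly two decimal places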

Expectation 1

E_q[log p(θ | α)] = E_q[ log { [ Γ(∑_i α_i) / ∏_i Γ(α_i) ] ∏_i θ_i^{α_i − 1} } ]    (12)

                  = E_q[ log ( Γ(∑_i α_i) / ∏_i Γ(α_i) ) + ∑_i log θ_i^{α_i − 1} ]
                    (log of a product becomes a sum of logs)

                  = log Γ(∑_i α_i) − ∑_i log Γ(α_i) + E_q[ ∑_i (α_i − 1) log θ_i ]
                    (log of an exponent becomes a product; the expectation of a constant is the constant)

                  = log Γ(∑_i α_i) − ∑_i log Γ(α_i) + ∑_i (α_i − 1) ( Ψ(γ_i) − Ψ(∑_j γ_j) )
                    (expectation of log Dirichlet)

Expectation 2

E_q[log p(z | θ)] = E_q[ log ∏_n ∏_i θ_i^{1[z_n = i]} ]    (13)

                  = E_q[ ∑_n ∑_i 1[z_n = i] log θ_i ]
                    (products to sums)

                  = ∑_n ∑_i E_q[ log θ_i^{1[z_n = i]} ]
                    (linearity of expectation)

                  = ∑_n ∑_i φ_{n,i} E_q[log θ_i]
                    (independence of the variational distribution; the exponent comes down as a product and E_q[1[z_n = i]] = φ_{n,i})

                  = ∑_n ∑_i φ_{n,i} ( Ψ(γ_i) − Ψ(∑_j γ_j) )
                    (expectation of log Dirichlet)

Expectation 3

E_q[log p(w | z, β)] = E_q[ log β_{z_{d,n}, w_{d,n}} ]    (18)

                     = E_q[ log ∏_{v=1}^{V} ∏_{i=1}^{K} β_{i,v}^{1[v = w_{d,n}, z_{d,n} = i]} ]

                     = ∑_{v=1}^{V} ∑_{i=1}^{K} E_q[ 1[v = w_{d,n}, z_{d,n} = i] ] log β_{i,v}

                     = ∑_{v=1}^{V} ∑_{i=1}^{K} φ_{n,i} w_{d,n}^{v} log β_{i,v}

where w_{d,n}^{v} = 1 if word n of document d is vocabulary item v and 0 otherwise.

Entropies

Entropy of the Dirichlet:

H_q[γ] = −log Γ(∑_j γ_j) + ∑_i log Γ(γ_i) − ∑_i (γ_i − 1) ( Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) )

Entropy of the Multinomial:

H_q[φ_{d,n}] = −∑_i φ_{d,n,i} log φ_{d,n,i}    (22)
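Minimal Python versions of the two entropy terms, matching the formulas above; the function names are illustrative:

    import numpy as np
    from scipy.special import gammaln, digamma

    def dirichlet_entropy(gamma_d):
        """Entropy of Dir(gamma_d), following the slide's formula."""
        gamma_d = np.asarray(gamma_d, dtype=float)
        total = gamma_d.sum()
        return (-gammaln(total) + gammaln(gamma_d).sum()
                - np.sum((gamma_d - 1.0) * (digamma(gamma_d) - digamma(total))))

    def multinomial_entropy(phi_dn):
        """Entropy of a single token's topic distribution phi_{d,n}."""
        phi_dn = np.asarray(phi_dn, dtype=float)
        mask = phi_dn > 0
        return -np.sum(phi_dn[mask] * np.log(phi_dn[mask]))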

Complete objective function

The full objective L(γ, φ; α, β) collects the three expectations above together with −E_q[log q(θ)] and −E_q[log q(z)] from equation (9). Note the entropy terms at the end (negative sign).

Deriving the algorithm

• Compute the partial derivative with respect to the variable of interest
• Set it equal to zero
• Solve for that variable


Update for φ

Derivative of the ELBO (v is the observed word at position n; λ is the Lagrange multiplier enforcing ∑_i φ_{n,i} = 1):

∂L/∂φ_{n,i} = Ψ(γ_i) − Ψ(∑_j γ_j) + log β_{i,v} − log φ_{n,i} − 1 + λ    (23)

Solution:

φ_{n,i} ∝ β_{i,v} exp( Ψ(γ_i) − Ψ(∑_j γ_j) )    (24)
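A sketch of the φ update for a single document, assuming β is a K × V topic-word matrix with positive entries for the document's words and words_d holds the document's word indices (names and shapes are assumptions):

    import numpy as np
    from scipy.special import digamma

    def update_phi(gamma_d, beta, words_d):
        """phi_{n,i} ∝ beta_{i, w_n} * exp(digamma(gamma_i) - digamma(sum_j gamma_j))."""
        elog_theta = digamma(gamma_d) - digamma(gamma_d.sum())   # E_q[log theta_i]
        log_phi = np.log(beta[:, words_d]).T + elog_theta        # shape (N_d, K)
        log_phi -= log_phi.max(axis=1, keepdims=True)            # stabilize before exponentiating
        phi = np.exp(log_phi)
        return phi / phi.sum(axis=1, keepdims=True)              # normalize each token's distribution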

Update for γ

Derivative of the ELBO:

∂L/∂γ_i = Ψ′(γ_i) ( α_i + ∑_n φ_{n,i} − γ_i ) − Ψ′(∑_j γ_j) ∑_j ( α_j + ∑_n φ_{n,j} − γ_j )

Solution:

γ_i = α_i + ∑_n φ_{n,i}    (25)
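The corresponding γ update is a one-liner under the same assumed shapes (alpha and phi_d are NumPy arrays, phi_d of shape (N_d, K)):

    def update_gamma(alpha, phi_d):
        """gamma_i = alpha_i + sum_n phi_{n,i}."""
        return alpha + phi_d.sum(axis=0)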

Update for β

Slightly more complicated (requires a Lagrange parameter), but the solution is obvious:

β_{i,j} ∝ ∑_d ∑_n φ_{d,n,i} w_{d,n}^{j}    (26)
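A sketch of the β update, accumulating expected topic-word counts over the corpus (again, names and shapes are assumptions; docs is a list of integer word-index arrays and phi a matching list of per-document (N_d, K) arrays):

    import numpy as np

    def update_beta(phi, docs, V):
        """beta_{i,j} ∝ sum_d sum_n phi_{d,n,i} * [w_{d,n} == j]."""
        K = phi[0].shape[1]
        beta = np.zeros((K, V))
        for phi_d, words_d in zip(phi, docs):
            for k in range(K):
                np.add.at(beta[k], words_d, phi_d[:, k])   # add each token's responsibility to its word
        return beta / beta.sum(axis=1, keepdims=True)      # normalize each topic over the vocabulary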

Overall Algorithm

1. Randomly initialize the variational parameters (can't be uniform)
2. For each iteration:
   2.1 For each document, update γ and φ
   2.2 For the corpus, update β
   2.3 Compute L for diagnostics
3. Return the expectation of the variational parameters as the solution for the latent variables

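Putting the pieces together, a minimal coordinate-ascent loop that reuses the hypothetical update_phi, update_gamma, and update_beta sketches above; the initialization scheme is an assumption and the ELBO diagnostic from step 2.3 is omitted for brevity:

    import numpy as np

    def lda_cavi(docs, V, K, alpha=0.1, iters=50, seed=0):
        """Toy coordinate-ascent variational inference for LDA."""
        rng = np.random.default_rng(seed)
        alpha = np.full(K, alpha)
        beta = rng.dirichlet(np.ones(V), size=K)                    # random, non-uniform initialization
        gamma = [alpha + len(d) / K + rng.random(K) for d in docs]
        for _ in range(iters):
            phi = []
            for d, words_d in enumerate(docs):                      # per-document updates of phi and gamma
                phi_d = update_phi(gamma[d], beta, np.asarray(words_d))
                gamma[d] = update_gamma(alpha, phi_d)
                phi.append(phi_d)
            beta = update_beta(phi, [np.asarray(d) for d in docs], V)   # corpus-level update
        return gamma, phi, beta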

Relationship with Gibbs Sampling

• Gibbs sampling: sample each variable from its conditional distribution given all other variables
• Variational inference: each factor is set proportional to the exponentiated expected log of that conditional
• Variational is easier to parallelize; Gibbs is faster per step
• Gibbs is typically easier to implement


Implementation Tips

• Match the derivation exactly at first
• Randomize initialization, but specify the seed
• Use simple languages first . . . then match the implementation
• Try to match variable names with the paper
• Write unit tests for each atomic update
• Monitor the variational bound (with asserts)
• Write out the state (checkpointing and debugging)
• Visualize the variational parameters
• Cache / memoize gamma / digamma functions

Next class

• Example on toy LDA problem

• Current research in variational inference
