Variational Inference - University of Colorado Boulder

Variational Inference
Machine Learning: Jordan Boyd-Graber, University of Colorado Boulder
Lecture 19


Variational Inference

• Inferring hidden variables
• Unlike MCMC:
  ◦ Deterministic
  ◦ Easy to gauge convergence
  ◦ Requires dozens of iterations
• Doesn't require conjugacy
• Slightly hairier math


Setup

• x = x_{1:n}: observations
• z = z_{1:m}: hidden variables
• α: fixed parameters
• Want the posterior distribution

p(z | x, α) = p(z, x | α) / ∫_z p(z, x | α)    (1)

Motivation

• Can't compute the posterior for many interesting models

GMM (finite)
1. For each component k = 1 … K, draw µ_k ∼ N(0, τ²)
2. For each observation i = 1 … n:
   2.1 Draw z_i ∼ Mult(π)
   2.2 Draw x_i ∼ N(µ_{z_i}, σ₀²)

• Posterior is intractable for large n, and we might want to add priors

p(µ_{1:K}, z_{1:n} | x_{1:n}) = [ ∏_{k=1}^{K} p(µ_k) ∏_{i=1}^{n} p(z_i) p(x_i | z_i, µ_{1:K}) ] / [ ∫_{µ_{1:K}} ∑_{z_{1:n}} ∏_{k=1}^{K} p(µ_k) ∏_{i=1}^{n} p(z_i) p(x_i | z_i, µ_{1:K}) ]    (2)

The denominator requires considering all means (the integral over µ_{1:K}) and all assignments (the sum over the K^n settings of z_{1:n}); that is what makes the posterior intractable.
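To make the generative story concrete, here is a minimal sampling sketch in Python; the function name sample_gmm and the default values of K, τ, σ₀, and π are illustrative assumptions, not part of the slides:

    import numpy as np

    def sample_gmm(n, K=3, tau=5.0, sigma0=1.0, pi=None, rng=None):
        """Sample from the finite GMM on the slide:
        mu_k ~ N(0, tau^2), z_i ~ Mult(pi), x_i ~ N(mu_{z_i}, sigma0^2)."""
        rng = np.random.default_rng() if rng is None else rng
        pi = np.full(K, 1.0 / K) if pi is None else np.asarray(pi)
        mu = rng.normal(0.0, tau, size=K)      # component means
        z = rng.choice(K, size=n, p=pi)        # cluster assignments
        x = rng.normal(mu[z], sigma0)          # observations
        return mu, z, x

    mu, z, x = sample_gmm(n=500)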

Main Idea

• We create a variational distribution over the latent variables

q(z_{1:m} | ν)    (3)

• Find the settings of ν so that q is close to the posterior
• If q == p, then this is vanilla EM


What does it mean for distributions to be close?

• We measure the closeness of distributions using Kullback-Leibler divergence

KL(q || p) ≡ E_q[ log ( q(Z) / p(Z | x) ) ]    (4)

• Characterizing KL divergence
  ◦ If q and p are high, we're happy
  ◦ If q is high but p isn't, we pay a price
  ◦ If q is low, we don't care
  ◦ If KL = 0, then the distributions are equal

This behavior is often called "mode splitting": we want a good solution, not every solution.
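A small, self-contained Python illustration of these properties for discrete distributions; the function name and the toy numbers are assumptions for illustration:

    import numpy as np

    def kl_divergence(q, p):
        """KL(q || p) = sum_z q(z) * log(q(z) / p(z)) for discrete distributions.
        Terms where q(z) == 0 contribute nothing ("if q is low, we don't care")."""
        q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
        mask = q > 0
        return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

    # A q that puts all its mass on one mode of p pays no penalty for ignoring the other mode.
    p = np.array([0.5, 0.5])
    q = np.array([1.0, 0.0])
    print(kl_divergence(q, p))   # log 2 ≈ 0.693
    print(kl_divergence(p, p))   # 0.0: equal distributions have zero divergence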

Jensen's Inequality: Concave Functions and Expectations

When f is concave,

f(E[X]) ≥ E[f(X)]

For the concave logarithm, evaluated at a point t·x₁ + (1 − t)·x₂ between x₁ and x₂:

log(t·x₁ + (1 − t)·x₂) ≥ t·log(x₁) + (1 − t)·log(x₂)

If you haven't seen this before, spend fifteen minutes to convince yourself that it's true.
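A quick numerical sanity check of the inequality for the concave log; the particular values of x₁, x₂, and t are arbitrary:

    import numpy as np

    x1, x2, t = 1.0, 9.0, 0.3
    lhs = np.log(t * x1 + (1 - t) * x2)          # log of the average
    rhs = t * np.log(x1) + (1 - t) * np.log(x2)  # average of the logs
    assert lhs >= rhs
    print(lhs, rhs)   # ≈ 1.89 vs ≈ 1.54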

Evidence Lower Bound (ELBO)

• Apply Jensen's inequality to the log probability of the data:

log p(x) = log ∫_z p(x, z)
         = log ∫_z p(x, z) · q(z) / q(z)         (multiply by a term that is equal to one)
         = log E_q[ p(x, z) / q(z) ]             (fold q(z) into the integral to create an expectation)
         ≥ E_q[log p(x, z)] − E_q[log q(z)]      (apply Jensen's inequality and split the log of the quotient into a difference)

• Fun side effect: the second term, −E_q[log q(z)], is the entropy of q
• Maximizing the ELBO gives as tight a bound as possible on the log probability of the data
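A toy numerical check, under assumed joint probabilities for a single x and three values of z, that the ELBO really lower-bounds log p(x):

    import numpy as np

    p_xz = np.array([0.4, 0.1, 0.05])    # joint p(x, z) for one fixed x and three values of z
    q_z = np.array([0.6, 0.3, 0.1])      # an arbitrary variational distribution over z

    log_px = np.log(p_xz.sum())                                      # log p(x) = log sum_z p(x, z)
    elbo = np.sum(q_z * np.log(p_xz)) - np.sum(q_z * np.log(q_z))    # E_q[log p(x,z)] - E_q[log q(z)]
    assert elbo <= log_px
    print(elbo, log_px)   # ≈ -0.64 and ≈ -0.60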

Relation to KL Divergence

• Conditional probability definition

p(z | x) = p(z, x) / p(x)    (5)

• Plug into the KL divergence

KL(q(z) || p(z | x)) = E_q[ log ( q(z) / p(z | x) ) ]
                     = E_q[log q(z)] − E_q[log p(z | x)]                   (break the quotient into a difference)
                     = E_q[log q(z)] − E_q[log p(z, x)] + log p(x)         (apply the definition of conditional probability)
                     = −( E_q[log p(z, x)] − E_q[log q(z)] ) + log p(x)    (reorganize terms)

• This is the negative of the ELBO (plus a constant); minimizing the KL divergence is the same as maximizing the ELBO
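Continuing the same toy numbers, a short sketch verifying the identity KL(q || p(z | x)) = log p(x) − ELBO:

    import numpy as np

    p_xz = np.array([0.4, 0.1, 0.05])
    q_z = np.array([0.6, 0.3, 0.1])

    p_z_given_x = p_xz / p_xz.sum()                                  # exact posterior for the toy model
    kl = np.sum(q_z * np.log(q_z / p_z_given_x))
    elbo = np.sum(q_z * np.log(p_xz)) - np.sum(q_z * np.log(q_z))
    log_px = np.log(p_xz.sum())
    print(kl, log_px - elbo)   # the two agree (up to floating point), both ≈ 0.044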

Mean field variational inference

• Assume that your variational distribution factorizes

q(z_1, …, z_m) = ∏_{j=1}^{m} q(z_j)    (6)

• You may want to group some hidden variables together
• Does not contain the true posterior, because the hidden variables are dependent


General Blueprint

• Choose q

• Derive ELBO

• Coordinate ascent on each q_i
• Repeat until convergence


Example: Latent Dirichlet Allocation

TOPIC 1: computer, technology, system, service, site, phone, internet, machine
TOPIC 2: sell, sale, store, product, business, advertising, market, consumer
TOPIC 3: play, film, movie, theater, production, star, director, stage

Example headlines, each drawing on one or more of these topics: "Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens"; "The three big Internet portals begin to distinguish among themselves as shopping malls"; "Stock Trades: A Better Deal For Investors Isn't Simple"; "Forget the Bootleg, Just Download the Movie Legally"; "The Shape of Cinema, Transformed At the Click of a Mouse"; "Multiplex Heralded As Linchpin To Growth"; "A Peaceful Crew Puts Muppets Where Its Mouth Is"

A single document mixes the topics: "Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services ..."

LDA Generative Model

[Plate diagram: α → θ_d → z_n → w_n ← β_k, with plates over the N word positions, the M documents, and the K topics]

• For each topic k ∈ {1, …, K}, a multinomial distribution β_k
• For each document d ∈ {1, …, M}, draw a multinomial distribution θ_d from a Dirichlet distribution with parameter α
• For each word position n ∈ {1, …, N}, select a hidden topic z_n from the multinomial distribution parameterized by θ
• Choose the observed word w_n from the distribution β_{z_n}

Statistical inference uncovers the unobserved variables given the data.
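As a hedged illustration of this generative story, a toy corpus sampler in Python. The slides treat each β_k as a fixed multinomial, so drawing β from a symmetric Dirichlet here (along with the corpus sizes and the name sample_lda_corpus) is purely an assumption used to get concrete values:

    import numpy as np

    def sample_lda_corpus(M=10, N=50, K=3, V=20, alpha=0.1, eta=0.1, seed=0):
        """Toy LDA generative process: theta_d ~ Dir(alpha), z_n ~ Mult(theta_d), w_n ~ Mult(beta_{z_n})."""
        rng = np.random.default_rng(seed)
        beta = rng.dirichlet(np.full(V, eta), size=K)       # K topic-word distributions (assumed prior)
        docs = []
        for _ in range(M):
            theta_d = rng.dirichlet(np.full(K, alpha))      # document's topic proportions
            z = rng.choice(K, size=N, p=theta_d)            # topic assignment per word position
            w = np.array([rng.choice(V, p=beta[k]) for k in z])   # observed words
            docs.append(w)
        return docs, beta

    docs, true_beta = sample_lda_corpus()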

Deriving Variational Inference for LDA

Joint distribution:

p(θ, z, w | α, β) = ∏_d p(θ_d | α) ∏_n p(z_{d,n} | θ_d) p(w_{d,n} | β, z_{d,n})    (7)

• p(θ_d | α) = [ Γ(∑_i α_i) / ∏_i Γ(α_i) ] ∏_k θ_{d,k}^{α_k − 1}    (Dirichlet)
• p(z_{d,n} | θ_d) = θ_{d, z_{d,n}}    (draw from multinomial)
• p(w_{d,n} | β, z_{d,n}) = β_{z_{d,n}, w_{d,n}}    (draw from multinomial)

Variational distribution:

q(θ, z) = q(θ | γ) q(z | φ)    (8)

ELBO:

L(γ, φ; α, β) = E_q[log p(θ | α)] + E_q[log p(z | θ)] + E_q[log p(w | z, β)] − E_q[log q(θ)] − E_q[log q(z)]    (9)

What is the variational distribution?

q(θ, z) = ∏_d q(θ_d | γ_d) ∏_n q(z_{d,n} | φ_{d,n})    (10)

• Variational document distribution over topics γ_d
  ◦ Vector of length K for each document
  ◦ Non-negative
  ◦ Doesn't sum to 1.0
• Variational token distribution over topic assignments φ_{d,n}
  ◦ Vector of length K for every token
  ◦ Non-negative, sums to 1.0

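A sketch of how γ and φ might be laid out and initialized; the random scheme and the sizes K, M, N are assumptions, only the shapes and constraints come from the slide:

    import numpy as np

    rng = np.random.default_rng(0)
    K, M, N = 3, 10, 50                                     # topics, documents, words per document

    gamma = rng.gamma(shape=2.0, scale=1.0, size=(M, K))    # per-document, non-negative, need not sum to 1
    phi = rng.random((M, N, K))
    phi /= phi.sum(axis=-1, keepdims=True)                  # per-token, each row sums to 1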

Expectation of log Dirichlet

• Most expectations are straightforward to compute
• The Dirichlet is harder; for θ ∼ Dir(α),

E[log θ_i] = Ψ(α_i) − Ψ(∑_j α_j)    (11)

where Ψ is the digamma function.
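A quick check of identity (11) against Monte Carlo samples, using scipy's digamma; the choice of α is arbitrary:

    import numpy as np
    from scipy.special import digamma

    alpha = np.array([0.5, 2.0, 3.5])
    analytic = digamma(alpha) - digamma(alpha.sum())

    rng = np.random.default_rng(0)
    samples = rng.dirichlet(alpha, size=100_000)
    monte_carlo = np.log(samples).mean(axis=0)

    print(analytic)
    print(monte_carlo)   # agrees to roughly two decimal places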

Expectation 1

E_q[log p(θ | α)] = E_q[ log { [ Γ(∑_i α_i) / ∏_i Γ(α_i) ] ∏_i θ_i^{α_i − 1} } ]    (12)

                  = E_q[ log ( Γ(∑_i α_i) / ∏_i Γ(α_i) ) + ∑_i log θ_i^{α_i − 1} ]
                    (log of a product becomes a sum of logs)

                  = log Γ(∑_i α_i) − ∑_i log Γ(α_i) + E_q[ ∑_i (α_i − 1) log θ_i ]
                    (log of an exponent becomes a product; the expectation of a constant is the constant)

                  = log Γ(∑_i α_i) − ∑_i log Γ(α_i) + ∑_i (α_i − 1) ( Ψ(γ_i) − Ψ(∑_j γ_j) )
                    (expectation of log Dirichlet)

Expectation 2

E_q[log p(z | θ)] = E_q[ log ∏_n ∏_i θ_i^{1[z_n = i]} ]    (13)

                  = E_q[ ∑_n ∑_i 1[z_n = i] log θ_i ]
                    (products to sums)

                  = ∑_n ∑_i E_q[ log θ_i^{1[z_n = i]} ]
                    (linearity of expectation)

                  = ∑_n ∑_i φ_{n,i} E_q[log θ_i]
                    (independence of the variational distribution; the exponent comes down as a product and E_q[1[z_n = i]] = φ_{n,i})

                  = ∑_n ∑_i φ_{n,i} ( Ψ(γ_i) − Ψ(∑_j γ_j) )
                    (expectation of log Dirichlet)

Expectation 3

E_q[log p(w | z, β)] = E_q[ log β_{z_{d,n}, w_{d,n}} ]    (18)

                     = E_q[ log ∏_{v=1}^{V} ∏_{i=1}^{K} β_{i,v}^{1[v = w_{d,n}, z_{d,n} = i]} ]

                     = ∑_{v=1}^{V} ∑_{i=1}^{K} E_q[ 1[v = w_{d,n}, z_{d,n} = i] ] log β_{i,v}

                     = ∑_{v=1}^{V} ∑_{i=1}^{K} φ_{n,i} w_{d,n}^{v} log β_{i,v}

where w_{d,n}^{v} = 1 if word n of document d is vocabulary item v and 0 otherwise.

Entropies

Entropy of the Dirichlet:

H_q[γ] = −log Γ(∑_j γ_j) + ∑_i log Γ(γ_i) − ∑_i (γ_i − 1) ( Ψ(γ_i) − Ψ(∑_{j=1}^{k} γ_j) )

Entropy of the Multinomial:

H_q[φ_{d,n}] = −∑_i φ_{d,n,i} log φ_{d,n,i}    (22)
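Minimal Python versions of the two entropy terms, matching the formulas above; the function names are illustrative:

    import numpy as np
    from scipy.special import gammaln, digamma

    def dirichlet_entropy(gamma_d):
        """Entropy of Dir(gamma_d), following the slide's formula."""
        gamma_d = np.asarray(gamma_d, dtype=float)
        total = gamma_d.sum()
        return (-gammaln(total) + gammaln(gamma_d).sum()
                - np.sum((gamma_d - 1.0) * (digamma(gamma_d) - digamma(total))))

    def multinomial_entropy(phi_dn):
        """Entropy of a single token's topic distribution phi_{d,n}."""
        phi_dn = np.asarray(phi_dn, dtype=float)
        mask = phi_dn > 0
        return -np.sum(phi_dn[mask] * np.log(phi_dn[mask]))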

Complete objective function

The full objective L(γ, φ; α, β) collects the three expectations above together with −E_q[log q(θ)] and −E_q[log q(z)] from equation (9). Note the entropy terms at the end (negative sign).

Deriving the algorithm

• Compute the partial derivative with respect to the variable of interest
• Set it equal to zero
• Solve for that variable


Update for φ

Derivative of the ELBO (v is the observed word at position n; λ is the Lagrange multiplier enforcing ∑_i φ_{n,i} = 1):

∂L/∂φ_{n,i} = Ψ(γ_i) − Ψ(∑_j γ_j) + log β_{i,v} − log φ_{n,i} − 1 + λ    (23)

Solution:

φ_{n,i} ∝ β_{i,v} exp( Ψ(γ_i) − Ψ(∑_j γ_j) )    (24)
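A sketch of the φ update for a single document, assuming β is a K × V topic-word matrix with positive entries for the document's words and words_d holds the document's word indices (names and shapes are assumptions):

    import numpy as np
    from scipy.special import digamma

    def update_phi(gamma_d, beta, words_d):
        """phi_{n,i} ∝ beta_{i, w_n} * exp(digamma(gamma_i) - digamma(sum_j gamma_j))."""
        elog_theta = digamma(gamma_d) - digamma(gamma_d.sum())   # E_q[log theta_i]
        log_phi = np.log(beta[:, words_d]).T + elog_theta        # shape (N_d, K)
        log_phi -= log_phi.max(axis=1, keepdims=True)            # stabilize before exponentiating
        phi = np.exp(log_phi)
        return phi / phi.sum(axis=1, keepdims=True)              # normalize each token's distribution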

Update for γ

Derivative of the ELBO:

∂L/∂γ_i = Ψ′(γ_i) ( α_i + ∑_n φ_{n,i} − γ_i ) − Ψ′(∑_j γ_j) ∑_j ( α_j + ∑_n φ_{n,j} − γ_j )

Solution:

γ_i = α_i + ∑_n φ_{n,i}    (25)
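The corresponding γ update is a one-liner under the same assumed shapes (alpha and phi_d are NumPy arrays, phi_d of shape (N_d, K)):

    def update_gamma(alpha, phi_d):
        """gamma_i = alpha_i + sum_n phi_{n,i}."""
        return alpha + phi_d.sum(axis=0)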

Update for β

Slightly more complicated (requires a Lagrange parameter), but the solution is obvious:

β_{i,j} ∝ ∑_d ∑_n φ_{d,n,i} w_{d,n}^{j}    (26)
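A sketch of the β update, accumulating expected topic-word counts over the corpus (again, names and shapes are assumptions; docs is a list of integer word-index arrays and phi a matching list of per-document (N_d, K) arrays):

    import numpy as np

    def update_beta(phi, docs, V):
        """beta_{i,j} ∝ sum_d sum_n phi_{d,n,i} * [w_{d,n} == j]."""
        K = phi[0].shape[1]
        beta = np.zeros((K, V))
        for phi_d, words_d in zip(phi, docs):
            for k in range(K):
                np.add.at(beta[k], words_d, phi_d[:, k])   # add each token's responsibility to its word
        return beta / beta.sum(axis=1, keepdims=True)      # normalize each topic over the vocabulary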

Overall Algorithm

1. Randomly initialize the variational parameters (can't be uniform)
2. For each iteration:
   2.1 For each document, update γ and φ
   2.2 For the corpus, update β
   2.3 Compute L for diagnostics
3. Return the expectation of the variational parameters as the solution for the latent variables

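Putting the pieces together, a minimal coordinate-ascent loop that reuses the hypothetical update_phi, update_gamma, and update_beta sketches above; the initialization scheme is an assumption and the ELBO diagnostic from step 2.3 is omitted for brevity:

    import numpy as np

    def lda_cavi(docs, V, K, alpha=0.1, iters=50, seed=0):
        """Toy coordinate-ascent variational inference for LDA."""
        rng = np.random.default_rng(seed)
        alpha = np.full(K, alpha)
        beta = rng.dirichlet(np.ones(V), size=K)                    # random, non-uniform initialization
        gamma = [alpha + len(d) / K + rng.random(K) for d in docs]
        for _ in range(iters):
            phi = []
            for d, words_d in enumerate(docs):                      # per-document updates of phi and gamma
                phi_d = update_phi(gamma[d], beta, np.asarray(words_d))
                gamma[d] = update_gamma(alpha, phi_d)
                phi.append(phi_d)
            beta = update_beta(phi, [np.asarray(d) for d in docs], V)   # corpus-level update
        return gamma, phi, beta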

Relationship with Gibbs Sampling

• Gibbs sampling: sample each variable from its conditional distribution given all other variables
• Variational inference: each factor is set proportional to the exponentiated expected log of that conditional
• Variational is easier to parallelize; Gibbs is faster per step
• Gibbs is typically easier to implement


Implementation Tips

• Match the derivation exactly at first
• Randomize initialization, but specify the seed
• Use simple languages first . . . then match the implementation
• Try to match variable names with the paper
• Write unit tests for each atomic update
• Monitor the variational bound (with asserts)
• Write out the state (checkpointing and debugging)
• Visualize the variational parameters
• Cache / memoize gamma / digamma functions

Next class

• Example on toy LDA problem

• Current research in variational inference
