Understanding Maximum Entropy Discrimination Models

Sam Blasiak
George Mason University
September 22, 2011


Schedule

1. Presentation: 25 minutes
2. Exercise: 25 minutes
3. Discussion: 10 minutes


Introduction

Outline
1. Intro: Approaches to Classification
2. Support Vector Machines
3. Maximum Entropy Discrimination
4. Topic Models
5. Topic Models + Maximum Entropy Discrimination


Classification

Setting
- observations x_1, ..., x_N
  - x_n ∈ R^K
- associated classes y_1, ..., y_N
  - y_n ∈ {-1, 1}

Task
- learn a function that maps observations to classes

Generative Models for Classification

Assume different categories were generated from different theoretical distributions.

[Figure: theoretical class-conditional densities for class -1 and class 1]

Generative Models for Classification

Model the observations given the category.

[Figure: estimated class-conditional densities for class -1 and class 1, with the decision boundary]


Generative Models for Classification: Detail

Learn class-conditional densities
- find θ^(-) that maximizes p(x|θ^(-)) for all negative-class observations
- and find θ^(+) that maximizes p(x|θ^(+)) for all positive-class observations

Classify an unknown observation, x_{N+1}
- take the maximum-likelihood class under the posterior
  - p(y_{N+1} | x_{N+1}, θ) ∝ p(x_{N+1} | θ_y) p(y)
- equivalently, use a decision rule
  - log[ p(x_{N+1} | θ^(+)) / p(x_{N+1} | θ^(-)) ] + log[ p(y=1) / p(y=-1) ] ≷ 0
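To make the decision rule concrete, here is a minimal sketch (my own, not from the slides) that fits a 1-D Gaussian per class by maximum likelihood and classifies a new point by the sign of the log-likelihood ratio plus the log prior odds; the toy data and variable names are assumptions for illustration.

import numpy as np
from scipy.stats import norm

# toy 1-D observations for each class
x_neg = np.array([-1.2, -0.8, -1.0, -1.5])   # class -1
x_pos = np.array([ 0.9,  1.1,  1.3,  0.7])   # class +1

# maximum-likelihood Gaussian parameters theta^(-) and theta^(+): (mean, std)
theta_neg = (x_neg.mean(), x_neg.std())
theta_pos = (x_pos.mean(), x_pos.std())

# log prior odds, log p(y=1) - log p(y=-1), estimated from class frequencies
log_prior_odds = np.log(len(x_pos) / len(x_neg))

def classify(x_new):
    """Return +1 if the log-likelihood ratio plus log prior odds is >= 0, else -1."""
    L = (norm.logpdf(x_new, *theta_pos)
         - norm.logpdf(x_new, *theta_neg)) + log_prior_odds
    return 1 if L >= 0 else -1

print(classify(0.5))   # lands on the positive side for this toy data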

Generative Models for Classification

What if the models have high variance?

[Figure: high-variance class-conditional densities for class -1 and class 1]

Discriminative Methods

Find a boundary between positive and negative categories.


Discriminative Methods with Kernels

What if the classes are not separable?
- map the observations to an easier space: x_i → φ(x_i)
- Kernel ≈ ⟨φ(x_i), φ(x_j)⟩

Discriminative Classifier + Kernel gives good practical solutions
- tf-idf [8]
- the remote homology detection problem [7]
- cepstrum coefficients for speech recognition (sort of)
- Fisher/Eigen Faces
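As a small illustration of the kernel trick (my own sketch, not from the slides), the Gaussian (RBF) kernel computes ⟨φ(x_i), φ(x_j)⟩ for an implicit feature map φ without ever forming it; the bandwidth gamma is an assumed hyperparameter.

import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2): an inner product in an implicit feature space."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(rbf_kernel_matrix(X, gamma=0.5))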


Discriminative Methods with Kernels

Where do good kernels come from?


Maximum Entropy Discrimination (MED) [4]

What is MED?
- like pushing conditional class densities farther apart
- or constructing a kernel using hidden parameters from a generative model

Things to think about
- When is MED tractable?
- When will MED fail to find a good classifier?


Support Vector Machines (SVMs) [3]

Learn a hyperplane that separates two categories
- maximize the "margin"
- determined by a small number of "support vectors"


SVM detail

Given a training set
- x_{1:N}, x_n ∈ X
- y_{1:N}, y_n ∈ {-1, 1}

A separating hyperplane
- w^T x = b

SVM objective
- min_{w,b} (1/2) ||w||^2
  such that y_n (w^T x_n - b) ≥ 1   ∀n ∈ 1:N
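A minimal sketch of this objective in practice (my own example, not part of the slides): scikit-learn's SVC with a linear kernel and a large C approximates the hard-margin problem above. Note that scikit-learn parameterizes the hyperplane as w^T x + b = 0, whereas the slide writes w^T x = b.

import numpy as np
from sklearn.svm import SVC

X = np.array([[-1.0, -1.0], [-2.0, -1.5], [1.0, 1.0], [2.0, 1.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)        # very large C approximates the hard-margin objective
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, ", b =", b)
print("geometric margin =", 1.0 / np.linalg.norm(w))   # distance from hyperplane to closest point
print("support vectors:\n", clf.support_vectors_)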

SVM intuition

Decrease ||w|| to increase the margin.

[Figure: with w = [0.50, 0.50], the points x_1 = [1.0, 1.0] and x_0 = [-1.0, -1.0] satisfy w^T x_1 = 1.00 and w^T x_0 = -1.00, and lie at distance (w/||w||)^T x = ±1.41 from the hyperplane]

SVM intuition

Decrease ||w|| to increase the margin.

[Figure: with w = [0.25, 0.25], the points x_1 = [2.0, 2.0] and x_0 = [-2.0, -2.0] satisfy w^T x_1 = 1.00 and w^T x_0 = -1.00, and lie at distance (w/||w||)^T x = ±2.83 from the hyperplane]
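The numbers in the two figures can be checked directly; this tiny sketch (mine) computes the functional margin w^T x_1 and the geometric margin (w/||w||)^T x_1 for both configurations.

import numpy as np

for w, x1 in [(np.array([0.50, 0.50]), np.array([1.0, 1.0])),
              (np.array([0.25, 0.25]), np.array([2.0, 2.0]))]:
    print("functional margin w.x1 =", w @ x1)                      # 1.00 in both cases
    print("geometric margin       =", w @ x1 / np.linalg.norm(w))  # 1.41, then 2.83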


MED setting (same as generative)

A training set
- x_{1:N}, x_n ∈ X
- y_{1:N}, y_n ∈ {-1, 1}

The form of two class-conditional probability distributions
- p(x|θ^(+)), p(x|θ^(-))
- and a prior distribution over class frequencies, p(y)

Define a discrimination function
- define L(x_n, θ) ≡ log[ p(x_n|θ^(+)) / p(x_n|θ^(-)) ] + log[ p(y=1) / p(y=-1) ]
- class 1 if L(x_{N+1}, θ) ≥ 0
- class -1 otherwise

MED objective

Apply the max-margin principle to the decision rule.

[Figure: log class-conditional densities for class -1 and class 1, the decision boundary, the decision-rule margin, and the training examples]

Maximum Entropy Discrimination (MED) [4]

Recall the generative example.

[Figure: generating distributions for class -1 and class 1, the training examples, and the decision boundary]

MED objective

But over an expected margin.

[Figure: expected discrimination-function magnitude over the input space, with the training examples]


MED objective

Goal
- Find a distribution p(θ^(+), θ^(-)) that maintains a margin

Ensure the expected class of all training examples is correct by a margin, γ
- E_{p(θ^(+), θ^(-))}[ y_n L(x_n, θ) ] ≥ γ   ∀n ∈ 1:N

Penalize larger "distances" from a prior, p_0(θ^(+), θ^(-))
- KL( p(θ^(+), θ^(-)) || p_0(θ^(+), θ^(-)) )
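For reference (my addition, consistent with the Gibbs-inequality argument used for MedLDA later in these slides and with the derivation in [4]): with a fixed margin γ, the minimizer of this KL objective under the margin constraints takes a Gibbs form, with one non-negative Lagrange multiplier λ_n per training example,

\[
p(\theta^{(+)},\theta^{(-)}) \;=\; \frac{1}{Z(\lambda)}\; p_0(\theta^{(+)},\theta^{(-)})\,
\exp\Big\{ \sum_{n=1}^{N} \lambda_n \big( y_n L(x_n,\theta) - \gamma \big) \Big\},
\qquad \lambda_n \ge 0 .
\]

Evaluating this form is what makes tractability hinge on whether the normalizer Z(λ) can be computed, a point the MedLDA slides return to.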


SVM-MED comparison

SVM objective
- min_{w,b} (1/2) ||w||^2 such that y_n (w^T x_n - b) ≥ 1   ∀n ∈ 1:N
- Regularizer: size of w
- Margin: min distance from the separating hyperplane

MED objective
- min_{p(θ)} KL( p(θ) || p_0(θ) ) such that E_{p(θ)}[ y_n L(x_n, θ) ] ≥ γ   ∀n ∈ 1:N
- Regularizer: KL divergence from the prior
- Margin: expected category under the decision rule

Exercises

See worksheet.

Topic Models

- Topic models are generative models that specify how a document (or sequence) can be decomposed into a set of abstract topics.

[Figure: examples of how a document can be decomposed into topics. The "data" topic is probabilistically composed of words related to data; the "goal" topic is probabilistically composed of words related to goals.]

Latent Dirichlet Allocation (LDA)

- Latent Dirichlet Allocation (LDA) [2] is a well-known topic model describing counts of words in a set of documents.
- We call vectors of word counts "bag-of-words" (BOW) vectors.
- β vectors describe probabilities of a word given a topic.
- θ vectors describe probabilities of a topic given a document.

[Figure: LDA plate model]
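As an illustration (my own sketch, not from the slides), scikit-learn's LatentDirichletAllocation fits this model by variational inference: the learned components_ play the role of the topic-word parameters β, and the per-document output of fit_transform plays the role of the document-topic proportions θ.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["data models learn from data",
        "the team scored a late goal",
        "topic models decompose documents into topics"]

X = CountVectorizer().fit_transform(docs)           # bag-of-words count vectors
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                        # per-document topic proportions (theta)
beta = lda.components_                              # unnormalized topic-word weights (beta)
print(theta.shape, beta.shape)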

Using LDA to classify

Using a bag-of-words representation...
- [n_{w1}, n_{w2}, ..., n_{wM}]^T
- we can classify a document with a discriminative classifier

LDA assigns each word in the corpus to a topic (or an expected topic)
- counting topics gives a bag-of-topics vector
- bag-of-topics: [n_{t1}, n_{t2}, ..., n_{tK}]^T
- similar to classification after dimensionality reduction with PCA (LSA)
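Continuing that idea (my own sketch, with synthetic topic proportions and invented labels), the bag-of-topics vectors can be handed to any discriminative classifier, for example a linear SVM:

import numpy as np
from sklearn.svm import SVC

# synthetic document-topic proportions (each row sums to 1) and invented labels
theta = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.85, 0.15],
                  [0.10, 0.90]])
y = np.array([1, -1, 1, -1])

clf = SVC(kernel="linear").fit(theta, y)   # discriminative classifier in the topic space
print(clf.predict(theta))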

Using LDA to classify II

But the LDA model performs badly for classification!

[Figure: tSNE [9] embedding of the topic space to two dimensions, from [5], depicts overlap between documents of different classes]

MedLDA [10]

MedLDA combines MED and LDA
- BUT MedLDA is not fully faithful to the MED framework!

MED objective
- min_{p(θ)} KL( p(θ) || p_0(θ) ) such that E_{p(θ)}[ y_i L(x_i, θ) ] ≥ γ   ∀i ∈ 1:N

MedLDA objective
- min_{p(w),α,β} KL( p(w) || p_0(w) ) − Σ_i log p(x_i|α, β) such that E[ y_i w^T z̄_i ] ≥ 1   ∀i ∈ 1:N
- p(w) - a distribution over separating hyperplanes
- z̄ - the bag-of-topics vector normalized to sum to 1
- α - document-topic vector prior parameter
- β - topic-word vector prior parameter

MedLDA objective detail

MedLDA approximate objective
- min_{p(w),α,β} KL( p(w) || p_0(w) ) + KL( q(z; α̂, β̂) || p(x, z|α, β) )
  s.t. E[ y_i w^T z̄_i ] ≥ 1   ∀i ∈ 1:N
- the approximation is needed to solve the optimization problem
  - the form of q(z; α̂, β̂) is specified - a fully factorized distribution
- the approximation matches the MED objective
  - p(θ) = p(w) q(z; α̂, β̂)
  - L(x_i, θ) ≡ w^T z̄
  - the prior on w is specified: p_0(w) = N(0, I)
  - the MED "prior" on q(z; α̂, β̂) is the posterior distribution p(z|α, β)
    - p(z|α, β) = p(x, z|α, β) / Σ_z p(x, z|α, β)

MedLDA [10] variations

Four variations of MedLDA
1. either supervised or basic LDA
   - the sLDA version alters the distribution over bag-of-words vectors (documents): p(x, z|α, β) → p(y, x, z|w, α, β)
2. either regression or classification
   - the classification dual form is equivalent to the SVM
   - the regression dual form is equivalent to Support Vector Regression (SVR)

- the MedLDA paper covers 3 of the 4 cases
- it omits sLDA with classification
  - bounds on the softmax function (recall the sLDA GLM types) make inference inefficient


MedLDA training algorithm outline

Given:
- a training set consisting of
  - bag-of-words vectors, x_{1:N} ∈ X
  - labels, y_{1:N} ∈ {-1, 1}
- an LDA model with posterior p(z|x, α, β)
- a discrimination function that maps E[z̄] to {-1, 1} given E[w]

Repeat until the change in the objective function < ε:
1. compute E[z̄_i] ∀i under the LDA variational approximation
2. optimize the discriminative parameters, E[w], for the set of expected z̄_i
3. optimize the generative parameters, α, β, given E[w] and x_{1:N}
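A schematic sketch of this alternating loop (my own pseudocode; e_step_expected_topics, solve_svm_dual, and m_step_update are hypothetical placeholders standing in for the variational E-step, the SVM-style QP, and the M-step, not the authors' implementation):

# Hypothetical skeleton of the MedLDA training loop outlined above.
def train_medlda(X, y, alpha, beta, eps=1e-4, max_iters=100):
    prev_obj = float("inf")
    for _ in range(max_iters):
        # 1. compute E[z_bar_i] for every document under the LDA variational approximation
        z_bar = [e_step_expected_topics(x_i, alpha, beta) for x_i in X]
        # 2. optimize the discriminative parameters E[w]: an SVM-style QP over the expected z_bar_i
        E_w, mu, obj = solve_svm_dual(z_bar, y)
        # 3. optimize the generative parameters alpha, beta given E[w] (via the multipliers mu) and X
        alpha, beta = m_step_update(X, z_bar, mu, alpha, beta)
        if abs(prev_obj - obj) < eps:      # stop when the objective stops changing
            break
        prev_obj = obj
    return E_w, alpha, beta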

MedLDA Lagrangian

MedLDA objective
- KL( p(w) || p_0(w) ) + KL( q(z|α̂, β̂) || p(z|α, β) )
  s.t. y_i E[ w^T z̄_i ] ≥ 1   ∀i ∈ 1:N

MedLDA Lagrangian
- KL( p(w) || p_0(w) ) + KL( q(z|α̂, β̂) || p(z|α, β) ) − Σ_i μ_i ( y_i E[ w^T z̄_i ] − 1 )
- where the μ_i are Lagrange multipliers

MedLDA inference detail

MedLDA dual
- minimize with respect to p(w)
- the other parameters in the objective do not depend on the Lagrange multipliers, μ

MedLDA Lagrangian (terms involving p(w))
- E_{p(w)}[ log p(w)/p_0(w) ] − Σ_i μ_i y_i E[ w^T z̄_i ] + const
- move the constraint terms inside the expectation:
  E_{p(w)}[ log p(w) / ( p_0(w) exp( Σ_i μ_i y_i w^T E[z̄_i] ) ) ] + const

Use Gibbs' inequality to find the minimum
- Gibbs' inequality: KL( q || p ) ≥ 0, with equality when q = p
- p(w) = (1/Z) p_0(w) exp( Σ_i μ_i y_i w^T E[z̄_i] )

MedLDA dual cont.

Minimum with respect to the primal parameters
- p(w) = (1/Z) p_0(w) exp( Σ_i μ_i y_i w^T E[z̄_i] )

MedLDA dual
- substitute the minimizing value of p(w) back into the objective to obtain the dual
- the ability to solve the dual problem depends on whether the normalizing constant, Z, is computable!
- the dual form of the objective for p_0(w) = N(0, I) is equivalent to the SVM dual form
- more efficient to solve the MedLDA QP than to optimize the sLDA bound!
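To make the SVM-dual equivalence explicit (my addition, assuming the fixed margin of 1 and no slack variables, as on these slides): with p_0(w) = N(0, I), the normalizer is a Gaussian integral, Z = exp( (1/2) ||Σ_i μ_i y_i E[z̄_i]||^2 ), and substituting the Gibbs form of p(w) back into the objective leaves a quadratic program in the multipliers,

\[
\max_{\mu \ge 0} \;\; \sum_{i=1}^{N} \mu_i \;-\; \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N}
\mu_i \mu_j\, y_i y_j\, \mathbb{E}[\bar z_i]^{\top} \mathbb{E}[\bar z_j],
\]

which is the hard-margin SVM dual with the linear kernel K(i, j) = E[z̄_i]^T E[z̄_j] on expected topic proportions.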

More inference detail

Parameter updates are mostly the same as in basic LDA
- recall the difference between the sLDA updates and the LDA updates
  - the variational parameter φ_{d,n} approximates the expectation of topic z_{d,n}
- in MedLDA the φ_{d,n} update has a closed form and depends on the Lagrange multipliers from the dual solution
  - in MedLDA, f(μ) has an effect similar to ∂E[A(η^T z̄)] / ∂φ_{d,n} in sLDA

MED topic models (MedTM) [10]

MedTM
- a generalization from LDA to arbitrary topic models
- not completely faithful to the original MED
  - sometimes learns the discriminative portion of the objective without using a probability distribution over weights
  - i.e., w instead of p(w)

Examples
- Conditional Topic Random Field [12]
  - adds dependence between topics and a per-document input variable to MedLDA
- Upstream Scene Understanding Models [11]
  - the discriminative component is upstream in the model

MED/MedTM key points

- combines positive aspects from both generative and discriminative models
  - allows simultaneous learning of the discrimination function and the mapping to the latent space
  - avoids the pitfall of classifying with a generative model
- the optimization reduces to a QP, which is easier to solve than logistic regression
- flexible - MED can be used with different topic models

Modifications to LDA for classification

Two main approaches to improve classification performance:
- Upstream models
  - the distribution of topic priors is conditioned on the document category
  - this presentation won't focus on upstream models

[Figure: Example: Dirichlet Multinomial Regression [6]]

Modifications to LDA for classification

Two main approaches to improve classification performance:
- Downstream models
  - the distribution of the document category is conditioned on the topic assignments

[Figure: Example: Supervised LDA (sLDA) [1]]

A closer look at sLDA

sLDA
- a downstream model
- the probability of the response variable is conditioned on a vector of normalized topic counts
  - p(y_n | z̄, η, δ)
  - z̄ = [z̄_1, ..., z̄_K]^T, with z̄_k = z_k / Σ_k z_k
- can be used for regression or classification
- variational algorithm with a mean-field (fully factorized) approximation for inference

[Figure: sLDA plate model]

Generalized Linear Model (GLM)

sLDA response variables are modelled by a GLM
- GLM(y; z̄, η, δ) ≡ h(y, δ) exp{ (η^T z̄ · y − A(η^T z̄)) / δ }
  - h(y, δ) - base measure
  - y - sufficient statistic
  - A(η^T z̄) - log normalizer
  - (z̄ is the bag-of-topics vector normalized to sum to 1)

- Normal distribution
  - for regression, y ∈ R
  - h(y, δ) = (1/√(2πδ)) exp{ −y^2 / (2δ) }
  - A(η^T z̄) = (η^T z̄)^2 / 2

- Softmax
  - for classification, y ∈ {0, 1}
  - h(y, δ) = 1
  - A(η^T z̄) = log( 1 + exp(η^T z̄) )
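A tiny numeric sketch of the softmax case (my own, with made-up η and z̄, and dispersion δ = 1): the log normalizer A(η^T z̄) = log(1 + exp(η^T z̄)) is exactly what turns η^T z̄ into a proper distribution over y ∈ {0, 1}.

import numpy as np

eta = np.array([2.0, -1.0, 0.5])      # example GLM coefficients, one per topic (assumed values)
z_bar = np.array([0.5, 0.3, 0.2])     # normalized bag-of-topics vector (sums to 1)

lin = eta @ z_bar                     # eta^T z_bar
A = np.log1p(np.exp(lin))             # log normalizer for the softmax case
p_y1 = np.exp(1 * lin - A)            # p(y=1 | z_bar, eta), i.e. sigmoid(eta^T z_bar)
p_y0 = np.exp(0 * lin - A)            # p(y=0 | z_bar, eta)
print(p_y1, p_y0, p_y1 + p_y0)        # the two probabilities sum to 1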

sLDA inference

Parameter updates are mostly the same as in basic LDA
- sLDA variational algorithm
  - E_{q(z)}[z_{d,n}] approximates the expectation of topic z_{d,n} under the true posterior
  - solve for φ_{d,n} ≡ E_{q(z)}[z_{d,n}] directly
  - the φ_{d,n} update depends on ∂E[A(η^T z̄)] / ∂φ_{d,n} for a particular GLM
- for the softmax (classification) case
  - A(η^T z̄) ≡ log( 1 + exp(η^T z̄) )
  - two additional upper bounds are needed to approximate E[ log( 1 + exp(η^T z̄) ) ]!

Posterior Regularization

- only a single distribution over feature vectors - no weights
- variables that were latent in MED are observed in the training set
- different dual form

References I

[1] D.M. Blei and J. McAuliffe. Supervised topic models. Advances in Neural Information Processing Systems, 20:121-128, 2008.
[2] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.
[3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[4] T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. 1999.
[5] S. Lacoste-Julien, F. Sha, and M.I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems 21 (NIPS 08), 2008.
[6] D. Mimno and A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Proc. of the 24th Conference on Uncertainty in Artificial Intelligence, 2008.
[7] H. Rangwala and G. Karypis. Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21(23):4239, 2005.
[8] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523, 1988.

References II

[9] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[10] J. Zhu, A. Ahmed, and E.P. Xing. MedLDA: A General Framework of Maximum Margin Supervised Topic Models. 2009.
[11] J. Zhu, L.J. Li, F.F. Li, and E.P. Xing. Large Margin Learning of Upstream Scene Understanding Models.
[12] J. Zhu and E.P. Xing. Conditional Topic Random Fields. In International Conference on Machine Learning, 2010.