Bayesian Rose Trees

Yee Whye Teh, Charles Blundell and Katherine Heller
Gatsby Computational Neuroscience Unit, University College London

March 2010 Edinburgh

1 / 25

Representational Structures

- Clustering.
- Hierarchical representations or trees.
- Overlapping clusters.
- Low dimensional embeddings.
- Distributed representations or latent variable models.

2 / 25

Hierarchical Clustering

- Linkage algorithms.
- Maximum likelihood, MAP, maximum parsimony [Vinokourov and Girolami 2000, Segal and Koller 2002, Friedman 2003].
- Bayesian models [Williams 2000, Neal 2003, Teh et al. 2008].
- Bayesian hierarchical clustering (BHC) [Heller and Ghahramani 2005].

3 / 25

Non-binary Hierarchical Clusterings

[Figure: hierarchical clusterings of a small binary data set, with axes labelled "data item" and "feature" and a log-likelihood scale.]

4 / 25

Bayesian Rose Trees

- Allows non-binary trees when the data support them.
- Computationally efficient.
- Likelihood-based, probabilistic approach.
- The most likely tree should offer a simple explanation of the data.

5 / 25

Tree-Consistent Partitions

[Figure: a rose tree over leaves a, b, c, d, e, f, with internal nodes for the clusters abc and abcdef, and its tree-consistent partitions abcdef, abc ‖ d ‖ e ‖ f, and a ‖ b ‖ c ‖ d ‖ e ‖ f.]

An internal node means: data at its leaves are more similar. Each internal node denotes:
1. a cluster of its leaves;
2. its children, which further partition the cluster into smaller subclusters.

A Bayesian rose tree represents a set of partitions of the data (enumerated below in a short sketch):

    part(T) = {leaves(T)} ∪ { e_1 ‖ e_2 ‖ e_3 ‖ ··· : T_k ∈ ch(T), e_k ∈ part(T_k) }

[Heller and Ghahramani 2005]

6 / 25
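A minimal sketch (not code from the talk) of how part(T) can be enumerated, representing a rose tree as a nested Python tuple with strings as leaves; the names `leaves` and `partitions` are illustrative.

```python
from itertools import product

def leaves(tree):
    """Set of leaf labels under a (sub)tree; a leaf is anything that is not a tuple."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def partitions(tree):
    """Enumerate part(T): each partition is a frozenset of clusters (frozensets of leaves)."""
    if not isinstance(tree, tuple):
        return [frozenset([frozenset([tree])])]       # a leaf only yields its singleton cluster
    result = [frozenset([leaves(tree)])]              # all leaves of T as a single cluster...
    for combo in product(*(partitions(child) for child in tree)):
        result.append(frozenset().union(*combo))      # ...or any combination of child partitions
    return result

T = (('a', 'b', 'c'), 'd', 'e', 'f')
for P in partitions(T):
    print(sorted(''.join(sorted(block)) for block in P))
# abcdef  |  abc, d, e, f  |  a, b, c, d, e, f
```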

Likelihood of Clusters, Partitions and Trees

Cluster: abc ‖ d ‖ e ‖ f

A cluster is a set of data items. We use an exponential family distribution to model the cluster:

    p(D|θ) = exp( θᵀ Σ_{x∈D} s(x) − |D| A(θ) )

Using a conjugate prior for θ, we can marginalize out θ:

    q(D) = ∫ p(D|θ) p(θ) dθ

Example: a product of Beta-Bernoullis (a log-space sketch follows this slide):

    q(D) = Π_{i=1}^d p(D_i | α_i, β_i) = Π_{i=1}^d Beta(α_i + n_i^D, β_i + N^D − n_i^D) / Beta(α_i, β_i)

7 / 25
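As a concrete illustration (not code from the talk), the Beta-Bernoulli marginal q(D) of a cluster of binary feature vectors can be computed in log space; the symmetric hyperparameters α = β = 1 are an arbitrary choice for this sketch.

```python
import numpy as np
from scipy.special import betaln

def log_q(D, alpha=1.0, beta=1.0):
    """log q(D) for a cluster D given as an (N x d) binary array:
    per feature, Beta(alpha + n_i, beta + N - n_i) / Beta(alpha, beta), summed in log space."""
    D = np.asarray(D)
    N = D.shape[0]              # number of items in the cluster
    n = D.sum(axis=0)           # per-feature counts of ones
    return float(np.sum(betaln(alpha + n, beta + N - n) - betaln(alpha, beta)))
```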

Likelihood of Clusters, Partitions and Trees

Partition: abc ‖ d ‖ e ‖ f

A partition is a separation of the data set into clusters. We model each cluster independently, so the likelihood of a partition is:

    r(D_1 ‖ D_2 ‖ ...) = Π_j q(D_j)

Example: r(abc ‖ d ‖ e ‖ f) = q(abc) q(d) q(e) q(f)

8 / 25
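Reusing the hypothetical `log_q` above, the partition likelihood is just a product over clusters, i.e. a sum of log marginals:

```python
def log_r(partition, data):
    """log r(D_1 || D_2 || ...): sum of log q over clusters; each cluster is a list of row indices."""
    return sum(log_q(data[list(cluster)]) for cluster in partition)

# e.g. log_r([[0, 1, 2], [3], [4], [5]], data) corresponds to r(abc || d || e || f)
```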

Likelihood of Clusters, Partitions and Trees

Tree: {abcdef, abc ‖ d ‖ e ‖ f, a ‖ b ‖ c ‖ d ‖ e ‖ f}

A tree is treated as a mixture of partitions. The likelihood of a tree will be a convex combination of partition likelihoods:

    s(T) = Σ_{P ∈ part(T)} m_T(P) r(P)

Example:

    s(T) = m_T(abcdef) r(abcdef)
         + m_T(abc ‖ d ‖ e ‖ f) r(abc ‖ d ‖ e ‖ f)
         + m_T(a ‖ b ‖ c ‖ d ‖ e ‖ f) r(a ‖ b ‖ c ‖ d ‖ e ‖ f)

9 / 25

Likelihood of Clusters, Partitions and Trees

Tree: {abcdef, abc ‖ d ‖ e ‖ f, a ‖ b ‖ c ‖ d ‖ e ‖ f}

To make computations tractable, we define the tree likelihood recursively:

    s(T) = Σ_{P ∈ part(T)} m_T(P) r(P)
         = π_T q(leaves(T)) + (1 − π_T) Π_{T_i ∈ ch(T)} s(T_i)

(the first term treats all leaves of T as one cluster; the second combines the partitions given by the children)

10 / 25
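A recursive sketch of this computation, reusing `leaves` and `log_q` from the earlier sketches. It assumes the mixing proportion π_T = 1 − (1 − γ)^(|ch(T)| − 1) introduced on the "An End to Needless Cascades" slide, with γ = 0.5 as an illustrative value.

```python
import numpy as np

def log_s(tree, data, row, gamma=0.5):
    """log s(T) = log[ pi_T q(leaves(T)) + (1 - pi_T) prod_i s(T_i) ], computed in log space.
    `tree` is a nested tuple of leaf labels; `row` maps each leaf label to its row in `data`."""
    if not isinstance(tree, tuple):                     # a leaf: just the single item's marginal
        return log_q(data[[row[tree]]])
    n_children = len(tree)
    log_pi = np.log1p(-(1.0 - gamma) ** (n_children - 1))
    log_not_pi = (n_children - 1) * np.log1p(-gamma)
    rows = [row[x] for x in leaves(tree)]
    log_cluster = log_pi + log_q(data[rows])            # all leaves of T in one cluster
    log_split = log_not_pi + sum(log_s(c, data, row, gamma) for c in tree)  # children's partitions
    return float(np.logaddexp(log_cluster, log_split))

# e.g. log_s((('a', 'b', 'c'), 'd', 'e', 'f'), data,
#            {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5})
```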

Likelihood of Clusters, Partitions and Trees

Tree: {abcdef, abc ‖ d ‖ e ‖ f, a ‖ b ‖ c ‖ d ‖ e ‖ f}

Example (for the rose tree shown earlier):

    s(T_abc) = π_abc q(D_abc) + (1 − π_abc) s(T_a) s(T_b) s(T_c)
             = π_abc q(D_abc) + (1 − π_abc) q(x_a) q(x_b) q(x_c)

    s(T_abcdef) = π_abcdef q(D_abcdef) + (1 − π_abcdef) s(T_abc) q(x_d) q(x_e) q(x_f)

11 / 25

An End to Needless Cascades

Define mixing proportions with parameter 0 < γ < 1:

    π_T = 1 − (1 − γ)^(|ch(T)| − 1)

Suppose r(abc ‖ d) > r(ab ‖ c ‖ d) [and the other partitions splitting a, b, c].

Cascading binary tree:

    partition S       m(S, T)
    abcd              γ
    abc ‖ d           (1 − γ) γ
    ab ‖ c ‖ d        (1 − γ)^2 γ
    a ‖ b ‖ c ‖ d     (1 − γ)^3

Collapsed rose tree:

    partition S       m(S, T)
    abcd              γ
    abc ‖ d           (1 − γ) (1 − (1 − γ)^2)
    a ‖ b ‖ c ‖ d     (1 − γ)^3

(a small sketch computing these weights follows)

12 / 25
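A small sketch (reusing `leaves` and the nested-tuple representation from before, with illustrative γ = 0.5) that computes m_T(P) for every tree-consistent partition and reproduces the two tables above.

```python
from itertools import product

def mixing_weights(tree, gamma=0.5):
    """m_T(P) for every P in part(T), with pi_T = 1 - (1 - gamma)^(|ch(T)| - 1)."""
    if not isinstance(tree, tuple):
        return {frozenset([frozenset([tree])]): 1.0}   # a leaf: its singleton partition, weight 1
    pi = 1.0 - (1.0 - gamma) ** (len(tree) - 1)
    weights = {frozenset([leaves(tree)]): pi}          # the whole cluster gets weight pi_T
    for combo in product(*(mixing_weights(c, gamma).items() for c in tree)):
        P = frozenset().union(*(p for p, _ in combo))  # combine one partition per child
        w = 1.0 - pi
        for _, w_child in combo:
            w *= w_child
        weights[P] = weights.get(P, 0.0) + w
    return weights

cascade = ((('a', 'b'), 'c'), 'd')     # needlessly cascading binary tree
rose = (('a', 'b', 'c'), 'd')          # collapsed rose tree
for P, w in mixing_weights(rose).items():
    print(sorted(''.join(sorted(b)) for b in P), w)    # gamma = 0.5: 0.5, 0.375, 0.125
```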

Complexity of Maximising s(D|T)

There are too many rose trees T for an exhaustive search for the highest s(T). With L leaves there are:

    Binary trees: 2^O(L log L)
    Rose trees:   2^O(L log L + L)

[Figure: number of rose trees, number of binary trees, and their ratio, plotted against the number of leaves.]

13 / 25

Construction by Greedy Model Selection

1. Let T_i = {x_i} for all i.
2. For every ordered pair of trees (T_i, T_j) and possible joined/absorbed tree T_m, pick the T_m with the highest

       L(T_m) = log [ s(T_m) / ( s(T_i) s(T_j) ) ]

3. Join/absorb T_i and T_j into T_m.
4. Repeat steps 2 and 3 until one tree remains.
5. Repeatedly contract edges if this increases s(T).

[Figure: the 'join' move makes T_i and T_j the two children of T_m; the 'absorb' move places T_j alongside the children of T_i.]

A sketch of the greedy loop follows this slide.

14 / 25
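A rough sketch of steps 1-4, reusing `log_s` and `log_q` from the earlier sketches. It is illustrative rather than efficient: only the 'join' move is implemented, the absorb move and the edge-contraction pass (step 5) are omitted, and subtree likelihoods are recomputed rather than cached.

```python
import numpy as np
from itertools import combinations

def build_rose_tree(data, gamma=0.5):
    """Greedily merge trees, picking the pair whose joined tree T_m maximises
    L(T_m) = log s(T_m) - log s(T_i) - log s(T_j)."""
    data = np.asarray(data)
    row = {i: i for i in range(len(data))}          # leaf labels are row indices
    forest = list(range(len(data)))                 # step 1: one singleton tree per data item
    score = {t: log_s(t, data, row, gamma) for t in forest}
    while len(forest) > 1:                          # steps 2-4
        best = None
        for t_i, t_j in combinations(forest, 2):
            t_m = (t_i, t_j)                        # 'join': T_i and T_j become children of T_m
            s_m = log_s(t_m, data, row, gamma)
            gain = s_m - score[t_i] - score[t_j]    # L(T_m)
            if best is None or gain > best[0]:
                best = (gain, t_i, t_j, t_m, s_m)
        _, t_i, t_j, t_m, s_m = best
        forest = [t for t in forest if t not in (t_i, t_j)] + [t_m]
        score[t_m] = s_m
    return forest[0]
```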

Bayesian Hierarchical Clustering

Relationship between BRT and BHC:

- BHC produces binary trees; BRT can produce non-binary trees.
- BRT and one version of BHC interpret trees as mixtures over partitions.
- In the other version, BHC is interpreted as approximate inference in a DP mixture:
  - It uses a different π_T, related to the DP clustering prior.
  - BHC includes many partitions in its model, as this encourages a tighter bound on the marginal probability under the DP mixture.
  - Unfortunately this leads to overly complicated models with many more partitions than necessary.
  - We found that this tends to produce trees with inferior likelihoods.

[Heller and Ghahramani 2005]

15 / 25

Results (anecdotal)

    Model         log likelihood    partitions
    BHC (DP)      −1418             468,980,051
    BHC (fixed)   −1266             908,188,506
    BRT           −1258             1,441

[Figure: the hierarchies produced by BHC (DP), BHC (fixed) and BRT over a data set of everyday objects (tools, vehicles, vegetables, fruit and animals).]

16 / 25

Results (anecdotal)

[Figure: close-ups of the vehicle, vegetable, fruit and animal subtrees produced by BHC (DP), BHC (fixed) and BRT.]

17 / 25

Results (quantitative)

Log likelihood:

    Data set      BHC (DP)               BHC (fixed)            BRT
    toy           −230.278 ± 0.000       −169.395 ± 0.000       −167.301 ± 0.000
    spambase      −2301.59 ± 141.996     −1920.19 ± 123.075     −1905.626 ± 122.5
    digits024     −4106.458 ± 77.272     −3708.448 ± 66.958     −3692.174 ± 65.499
    digits        −4369.455 ± 81.491     −3900.203 ± 72.093     −3881.593 ± 71.553
    newsgroups    −11439.463 ± 24.428    −10710.756 ± 26.767    −10687.927 ± 28.157

[Figure: purity and hierarchical F2-measure of BHC (DP), BHC (fixed) and BRT on the toy, spambase, digits024, digits and newsgroups data sets.]

18 / 25

Mixtures of Gaussian Process Experts

Mixtures of GPs are a simple way to construct nonparametric density regression models, a type of dependent Dirichlet process mixture. MCMC inference can be very time consuming.

    model    lik      #parts        #nodes
    GP       −1698    1             1
    BHC      −691     47 × 10^9     132
    BRT      −509     32 × 10^5     53

[Figure: the regression data set and the cluster structure found by each model.]

[MacEachern 1999, Rasmussen and Ghahramani 2002]

19 / 25

Mixtures of Gaussian Process Experts

    model    lik     #parts
    GP       −300    1
    BHC      −116    53 × 10^14
    BRT      26      4

[Figure: a second regression data set and the cluster structure found by each model.]

20 / 25

Discussion

A hierarchical clustering model that:

- allows arbitrary branching structure;
- uses this flexibility to find simpler models that better explain the data;
- finds good trees in O(L^2 log L) time (the same as BHC).

There are other models of non-binary trees, so far unexplored for hierarchical clustering, such as Λ-coalescents and Gibbs fragmentation trees. [Pitman 1999, McCullagh et al. 2008]

21 / 25

Thanks

22 / 25

References I

Friedman, N. (2003). Pcluster: Probabilistic agglomerative clustering of gene expression profiles. Technical Report 2003-80, Hebrew University.

Heller, K. A. and Ghahramani, Z. (2005). Bayesian hierarchical clustering. In Proceedings of the International Conference on Machine Learning, volume 22.

MacEachern, S. (1999). Dependent nonparametric processes. In Proceedings of the Section on Bayesian Statistical Science. American Statistical Association.

McCullagh, P., Pitman, J., and Winkel, M. (2008). Gibbs fragmentation trees. Bernoulli, 14(4):988–1002.

Neal, R. M. (2003). Density modeling and clustering using Dirichlet diffusion trees. In Bayesian Statistics, volume 7, pages 619–629.

23 / 25

References II

Pitman, J. (1999). Coalescents with multiple collisions. Annals of Probability, 27:1870–1902.

Rasmussen, C. E. and Ghahramani, Z. (2002). Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems, volume 14.

Segal, E. and Koller, D. (2002). Probabilistic hierarchical clustering for biological data. In RECOMB '02: Proceedings of the Sixth Annual International Conference on Computational Biology, pages 273–280, New York, NY, USA. ACM.

Teh, Y. W., Daume III, H., and Roy, D. M. (2008). Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems, volume 20.

Vinokourov, A. and Girolami, M. (2000). A probabilistic hierarchical clustering method for organizing collections of text documents. In Proceedings of the International Conference on Pattern Recognition, volume 2, page 2182.

24 / 25

References III

Williams, C. K. I. (2000). A MCMC approach to hierarchical mixture modelling. In Advances in Neural Information Processing Systems, volume 12.

25 / 25