The Geometry of Distributions II Information Distances: Properties And Algorithms

Suresh Venkatasubramanian University of Utah

Lecture Plan

Information geometry

Algorithms for information distances

Spatially-aware information distances

Review: The Rogues’ Club

Bregman divergence
For convex φ : Rd → R,
Dφ(p, q) = φ(p) − φ(q) − ⟨∇φ(q), p − q⟩

f-divergence
For convex f : R → R with f(1) = 0,
Df(p, q) = ∑i qi f(pi / qi)

α-divergence
For |α| < 1,
Dα(p, q) = (4 / (1 − α²)) [1 − ∫ p^{(1−α)/2} q^{(1+α)/2}]


Bregman Divergences

Properties: Asymmetry

Dφ(p, q) = φ(p) − φ(q) − ⟨∇φ(q), p − q⟩

In general, Dφ(p, q) ≠ Dφ(q, p). However, let

φ*(u) = sup_p ⟨u, p⟩ − φ(p)

(the convex conjugate of φ). Then

Dφ(p, q) = Dφ*(q*, p*), where x* = ∇φ(x).

[Figure: Dφ(p, q) and Dφ(q, p) in the primal space; the dual points p*, q* and Dφ*(q*, p*) in the dual space.]
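A concrete numeric check (a minimal Python sketch, not part of the slides; the negative-entropy generator is a standard illustrative choice): taking φ(x) = ∑ xi log xi makes Dφ the KL divergence on the simplex, whose asymmetry is visible directly.

```python
import math

# Bregman divergence D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>
# for phi(x) = sum_i x_i log x_i (negative entropy). On the simplex this
# recovers the KL divergence. Sketch for illustration only.

def phi(x):
    return sum(xi * math.log(xi) for xi in x)

def grad_phi(x):
    return [math.log(xi) + 1.0 for xi in x]

def bregman(p, q):
    g = grad_phi(q)
    return phi(p) - phi(q) - sum(gi * (pi - qi) for gi, pi, qi in zip(g, p, q))

p = [0.7, 0.2, 0.1]
q = [0.3, 0.3, 0.4]

# Asymmetry: D(p, q) != D(q, p) in general.
print(bregman(p, q), bregman(q, p))
```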

Properties II: Generalized Cosine Rule

Generalized cosine rule:
Dφ(p, q) + Dφ(q, r) = Dφ(p, r) + ⟨∇φ(r) − ∇φ(q), p − q⟩

Compare with the standard cosine rule:
‖p − q‖²/2 + ‖q − r‖²/2 = ‖p − r‖²/2 + ⟨r − q, p − q⟩

In general, Dφ(p, q) behaves like the square of a distance (an “energy”, not a “distance”): √Dφ(p, q) is not a metric in general, and neither is Dφ(p, q) itself.
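The generalized cosine rule can be verified numerically. The sketch below again uses the negative-entropy generator (an assumption for illustration, not the only choice):

```python
import math

# Numerical check of the generalized cosine rule
#   D(p,q) + D(q,r) = D(p,r) + <grad(r) - grad(q), p - q>
# for the negative-entropy Bregman divergence. Sketch only.

def phi(x):
    return sum(xi * math.log(xi) for xi in x)

def grad(x):
    return [math.log(xi) + 1.0 for xi in x]

def D(p, q):
    return phi(p) - phi(q) - sum(g * (a - b) for g, a, b in zip(grad(q), p, q))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
r = [0.1, 0.1, 0.8]

lhs = D(p, q) + D(q, r)
rhs = D(p, r) + sum((gr - gq) * (a - b)
                    for gr, gq, a, b in zip(grad(r), grad(q), p, q))
assert abs(lhs - rhs) < 1e-12
```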

Bregman bisectors are halfplanes

The bisector {r : Dφ(r, p) = Dφ(r, q)} simplifies to

⟨r, ∇φ(q) − ∇φ(p)⟩ = f(p, q),

which is linear in r. Most combinatorial geometry carries over (Voronoi, Delaunay, arrangements, duality) [BNN10].

Euclidean Approximations

Dφ(p, q) = φ(p) − φ(q) − ⟨∇φ(q), p − q⟩
         = s⊤ ∇²φ(r) s,  where s = p − q and r ∈ [p, q].

As q → p, this becomes the Mahalanobis distance, for positive definite A:

‖p − q‖_A = √((p − q)⊤ A (p − q))
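A quick numerical illustration of the local quadratic behavior (assuming the negative-entropy/KL instance; the 1/2 below is the usual Taylor-remainder constant):

```python
import math

# As q -> p, the negative-entropy Bregman divergence (KL) is approximated by
# a quadratic form in the Hessian: D(p, q) ~ (1/2) s^T diag(1/p_i) s for
# s = p - q. Illustrative sketch.

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
s = [0.01, -0.005, -0.005]   # perturbation summing to zero (stays on simplex)
q = [pi - si for pi, si in zip(p, s)]

quad = 0.5 * sum(si * si / pi for si, pi in zip(s, p))  # (1/2) s^T H(p) s
print(kl(p, q), quad)   # the two agree to leading order
```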

Euclidean Approximations

Over a bounded region R, let

λ*max = max_{x∈R} λmax(∇²φ(x)),  λ*min = min_{x∈R} λmin(∇²φ(x)),  µ = λ*max / λ*min.

µ-similarity
A distance measure is µ-similar if d(a, b) + d(b, c) ≥ (1/µ) d(a, c).

Lemma
For a given convex function φ, there exists a bounded region R(µ) in which Dφ is µ-similar.

The proof follows by exploiting the connection to the Mahalanobis distance and examining the variation in the eigenvalues of ∇²φ. Consequence: Euclidean clustering algorithm ⇒ µ-similar Bregman clustering algorithm. [ABS10, AB10, ABS11, CM08, MR09]
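The lemma can be probed empirically. The sketch below restricts KL to the region of the simplex where every coordinate is at least δ (a hypothetical choice of region), where the Hessian diag(1/xi) has eigenvalues in [1, 1/δ]; it uses a deliberately conservative µ = 2/δ rather than the tight eigenvalue ratio.

```python
import math, random

# Empirical check of the relaxed triangle inequality for KL on the region of
# the simplex where all coordinates are >= delta. The Hessian of negative
# entropy is diag(1/x_i), so its eigenvalues lie in [1, 1/delta] there; we use
# the conservative mu = 2/delta (the extra factor 2 comes from
# ||a-b||^2 + ||b-c||^2 >= ||a-c||^2 / 2). A numerical sketch, not a proof.

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def random_point(d, delta, rng):
    # rejection-sample a simplex point with all coordinates >= delta
    while True:
        w = [rng.random() for _ in range(d)]
        s = sum(w)
        p = [x / s for x in w]
        if min(p) >= delta:
            return p

rng = random.Random(0)
delta = 0.1
mu = 2.0 / delta
for _ in range(500):
    a, b, c = (random_point(3, delta, rng) for _ in range(3))
    assert kl(a, b) + kl(b, c) >= kl(a, c) / mu - 1e-12
```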

Dimensionality Reduction

Dimensionality Reduction For Distributions[APV10, KPV10]

Problem
Given n points on the simplex ∆d, find a mapping to ∆k, k ≪ d, such that the “distance structure” of the points is preserved.

The Hellinger Distance

d²H(p, q) = ∑i (√pi − √qi)²

The Hellinger distance has a natural geometric interpretation. It is a local approximation of Fisher information, and is closely related to the Jensen-Shannon and Bhattacharyya distances.

Hellinger Is Chordal Distance on the Sphere

(p1, p2, . . . , pd) → (√p1, √p2, . . . , √pd)

Hellinger distance on the simplex ↔ Euclidean distance on the sphere.
Dimensionality reduction for the simplex ↔ dimensionality reduction on the “sphere”.
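The square-root embedding is easy to check directly (a small Python sketch, not from the slides):

```python
import math

# The square-root map sends the simplex to the unit sphere, and Hellinger
# distance becomes ordinary Euclidean (chordal) distance there.

def hellinger(p, q):
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q)))

def sqrt_map(p):
    return [math.sqrt(a) for a in p]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

p = [0.6, 0.3, 0.1]
q = [0.2, 0.2, 0.6]
assert abs(sum(x ** 2 for x in sqrt_map(p)) - 1.0) < 1e-12  # image on sphere
assert abs(hellinger(p, q) - euclidean(sqrt_map(p), sqrt_map(q))) < 1e-12
```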

The Plan

Preserving Distances Approximately

Let f : ∆d → ∆k. Given a set X = {p1, . . . , pn} ⊂ ∆d and ε > 0, require

dH(f(pi), f(pj)) ≤ dH(pi, pj) ≤ (1 + ε) dH(f(pi), f(pj))  for all i, j.

This measures “distortion”: the ratio of distances. The bounds are worst-case.

Preserving Distances Approximately

Problem
Given n points on the simplex ∆d, find a mapping f : ∆d → ∆k such that k is “small” and all distances are preserved approximately, to within a factor 1 + ε.

The Plan


Projections on the sphere

Theorem ([JL84])
Given n points in Rd and ε > 0, there exists a mapping f : Rd → Rk, with k = O(log n / ε²), such that all distances are approximately preserved.

Theorem
Given n points on Sd and ε > 0, there exists a mapping f : Sd → Sk, with k = O(log n / ε²), such that all distances are approximately preserved.

Algorithm

1. Generate a random projection matrix A ∈ Mk×d and apply it to p: p̂ = Ap.

2. Normalize the resulting vector: f(p) := p̂ / ‖p̂‖.
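A sketch of the two steps in Python (the Gaussian entries and the 1/√k scaling are standard choices for the projection matrix, not specified on the slide):

```python
import math, random

# Project-then-normalize map for points on the sphere: a random Gaussian
# matrix A (scaled by 1/sqrt(k)) followed by renormalization. Pure-Python
# illustration; in practice one would use a linear algebra library.

def project_and_normalize(p, k, seed=0):
    rng = random.Random(seed)
    d = len(p)
    # random Gaussian projection matrix, k x d
    A = [[rng.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(d)]
         for _ in range(k)]
    p_hat = [sum(aij * pj for aij, pj in zip(row, p)) for row in A]
    norm = math.sqrt(sum(x * x for x in p_hat))
    return [x / norm for x in p_hat]

# map a simplex point to the sphere first (square-root map), then project
p = [math.sqrt(x) for x in [0.6, 0.3, 0.1]]
f_p = project_and_normalize(p, k=2)
print(sum(x * x for x in f_p))   # ~1.0: the image lies on the sphere
```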

Proof Idea I

Projection to Euclidean space preserves distances approximately: d(p̂i, p̂j) ≈ε d(pi, pj). Norms are also preserved, since ‖p‖ = d(p, 0). Normalizing the projected vectors, however, introduces error.

[Figure: normalization from the origin can shrink or stretch the gap between projected points.]

Proof Idea II

Close-by points are not distorted too much. Projection also preserves norms (vectors aren’t too long). “Simulate” the addition of extra points to ensure that long distances are preserved.

The Plan


From The Sphere To The Simplex

(p1, p2, . . . , pd) → (√p1, √p2, . . . , √pd)

The mapping from the simplex to the sphere is invertible only for points in the positive orthant! We need a projection that preserves all distances approximately, and takes points in the positive orthant to points in the (lower-dimensional) positive orthant.

A Restricted Result

Fix the center c = (1/d, 1/d, . . . , 1/d)⊤, and define

P = {x ∈ Sd+ | ⟨x, c⟩ ≥ √((d − 1)/d)}

R = {r ∈ Sd | ⟨r, c⟩ = 1/√d}

A Restricted Result

Theorem If all points lie in P, and projection vectors are chosen randomly from R, then all projected points lie in positive orthant and distances are preserved.

Proof Idea

Lemma
If r is chosen at random from R and u is any unit vector, then u · r is “subgaussian”: if T ∼ N(0, 1/d), then E[(u · r)^{2m}] ≤ E[T^{2m}].

Coupled with a result due to Achlioptas [Ach03]:

Lemma If r is as above, then picking r1 , . . . , rk randomly and projecting onto these preserves distances approximately.

Challenges

Bregman Near Neighbors [Cay08, ZOPT09]

Need ball packing bounds as well as relaxed triangle inequality for approximation.

Jensen-Shannon Dimensionality Reduction

dJS(p, q) = (1/2) ∑i [ pi log(2pi / (pi + qi)) + qi log(2qi / (pi + qi)) ]

This is known to be the square of a metric (by reproducing kernel arguments) [BF]: there exists a positive definite kernel K such that

dJS(p, q) = K(p, p) + K(q, q) − 2K(p, q)

Can we do dimensionality reduction? (See the next lecture.)
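A numerical sketch (not a proof) computing dJS and spot-checking the triangle inequality for its square root:

```python
import math, random

# Jensen-Shannon divergence, and an empirical check that its square root
# satisfies the triangle inequality (JS is known to be a squared metric).

def js(p, q):
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rand_simplex(d, rng):
    w = [rng.random() for _ in range(d)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(1)
for _ in range(1000):
    p, q, r = (rand_simplex(4, rng) for _ in range(3))
    assert math.sqrt(js(p, r)) <= (math.sqrt(js(p, q))
                                   + math.sqrt(js(q, r)) + 1e-12)
```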

Spatial Sensitivity


Stay Tuned!

References I

Marcel R. Ackermann and Johannes Blömer. Bregman clustering for separable instances. In Haim Kaplan, editor, SWAT, volume 6139 of Lecture Notes in Computer Science, pages 212–223. Springer, 2010.

Marcel R. Ackermann, Johannes Blömer, and Christian Sohler. Clustering for metric and nonmetric distance measures. ACM Transactions on Algorithms, 6(4), 2010.

Marcel R. Ackermann, Johannes Blömer, and Christoph Scholz. Hardness and non-approximability of Bregman clustering problems. Electronic Colloquium on Computational Complexity (ECCC), 18:15, 2011.

References II

Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.

Arvind Agarwal, Jeff M. Phillips, and Suresh Venkatasubramanian. Universal multi-dimensional scaling. In Bharat Rao, Balaji Krishnapuram, Andrew Tomkins, and Qiang Yang, editors, KDD, pages 1149–1158. ACM, 2010.

Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete and Computational Geometry, 44(2):281–307, 2010.

References III

Lawrence Cayton. Fast nearest neighbor retrieval for Bregman divergences. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, ICML, volume 307 of ACM International Conference Proceeding Series, pages 112–119. ACM, 2008.

Kamalika Chaudhuri and Andrew McGregor. Finding metric structure in information theoretic clustering. In Rocco A. Servedio and Tong Zhang, editors, COLT, pages 391–402. Omnipress, 2008.

William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability (New Haven, Conn., 1982), volume 26 of Contemporary Mathematics, pages 189–206. American Mathematical Society, 1984.

References IV

Rasmus J. Kyng, Jeff M. Phillips, and Suresh Venkatasubramanian. Johnson-Lindenstrauss dimensionality reduction on the simplex. In 20th Fall Workshop on Computational Geometry, October 2010.

Bodo Manthey and Heiko Röglin. Worst-case and smoothed analysis of k-means clustering with Bregman divergences. In Yingfei Dong, Ding-Zhu Du, and Oscar H. Ibarra, editors, ISAAC, volume 5878 of Lecture Notes in Computer Science, pages 1024–1033. Springer, 2009.

References V

Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, and Anthony K. H. Tung. Similarity search on bregman divergence: Towards non-metric indexing. PVLDB, 2(1):13–24, 2009.