The Geometry of Distributions II
Information Distances: Properties and Algorithms
Suresh Venkatasubramanian, University of Utah
Lecture Plan
Information geometry
Algorithms for information distances
Spatially-aware information distances
Review: The Rogues' Club

Bregman divergence. For convex φ : R^d → R,
  D_φ(p, q) = φ(p) − φ(q) − ⟨∇φ(q), p − q⟩

f-divergence. For convex f : R → R with f(1) = 0,
  D_f(p, q) = ∑_i q_i f(p_i / q_i)

α-divergence. For |α| < 1,
  D_α(p, q) = (4 / (1 − α²)) [1 − ∫ p^{(1−α)/2} q^{(1+α)/2}]
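As a quick illustration (not part of the original slides), the following Python sketch instantiates the three families for discrete distributions: the negative-entropy generator for the Bregman divergence, f(t) = t log t for the f-divergence, and a sum in place of the integral in the α-divergence. All function names and these choices of generator are ours.

import numpy as np

def bregman_neg_entropy(p, q):
    # phi(x) = sum_i x_i log x_i; on the simplex this Bregman divergence is KL(p || q)
    phi = lambda x: np.sum(x * np.log(x))
    grad = lambda x: np.log(x) + 1.0
    return phi(p) - phi(q) - np.dot(grad(q), p - q)

def f_divergence(p, q, f):
    # D_f(p, q) = sum_i q_i f(p_i / q_i)
    return np.sum(q * f(p / q))

def alpha_divergence(p, q, alpha):
    # discrete analogue: the integral is replaced by a sum (our assumption)
    assert abs(alpha) < 1
    s = np.sum(p ** ((1 - alpha) / 2) * q ** ((1 + alpha) / 2))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

print(bregman_neg_entropy(p, q))                     # KL(p || q)
print(f_divergence(p, q, lambda t: t * np.log(t)))   # also KL(p || q)
print(alpha_divergence(p, q, alpha=0.0))             # = 2 * squared Hellinger distance

With these choices the first two both reduce to KL(p ‖ q), and α = 0 gives twice the squared Hellinger distance.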
Bregman Divergences
Properties: Asymmetry

  D_φ(p, q) = φ(p) − φ(q) − ⟨∇φ(q), p − q⟩

In general, D_φ(p, q) ≠ D_φ(q, p). However, let

  φ*(u) = sup_p ⟨u, p⟩ − φ(p).

Then D_φ(p, q) = D_φ*(q*, p*), where x* = ∇φ(x).

[Figure: p, q with D(p, q) and D(q, p); dual points p*, q* with D*(q*, p*).]
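To make the duality concrete, here is an illustrative numerical sketch (not from the lecture) for the negative-entropy generator φ(x) = ∑ x_i log x_i on the positive orthant, whose conjugate is φ*(u) = ∑ exp(u_i − 1); the helper names are ours.

import numpy as np

# Asymmetry of D_phi, and the duality D_phi(p, q) = D_phi*(q*, p*), x* = grad phi(x).

def phi(x):
    return np.sum(x * np.log(x))

def grad_phi(x):
    return np.log(x) + 1.0

def phi_star(u):
    return np.sum(np.exp(u - 1.0))

def grad_phi_star(u):
    return np.exp(u - 1.0)

def bregman(F, gradF, a, b):
    return F(a) - F(b) - np.dot(gradF(b), a - b)

rng = np.random.default_rng(0)
p = rng.random(5) + 0.1
q = rng.random(5) + 0.1

d_pq = bregman(phi, grad_phi, p, q)     # generalized KL(p, q)
d_qp = bregman(phi, grad_phi, q, p)     # generally differs from d_pq
d_dual = bregman(phi_star, grad_phi_star, grad_phi(q), grad_phi(p))

print(d_pq, d_qp)                 # asymmetry: the two values differ
print(np.isclose(d_pq, d_dual))   # duality: D_phi(p, q) == D_phi*(q*, p*)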
Properties II: Generalized Cosine Rule

Generalized cosine rule:
  D_φ(p, q) + D_φ(q, r) = D_φ(p, r) + ⟨∇φ(r) − ∇φ(q), p − q⟩

Compare with the standard cosine rule:
  ‖p − q‖²/2 + ‖q − r‖²/2 = ‖p − r‖²/2 + ⟨r − q, p − q⟩

In general, D_φ(p, q) behaves like the square of a distance (an "energy", not a "distance"). √D_φ(p, q) is not a metric in general, and D_φ(p, q) is not a metric either.
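A small numerical check of the identity (illustrative only; the generator φ(x) = ∑ exp(x_i) and the helper names are our choice):

import numpy as np

# The generalized cosine rule holds exactly for any smooth convex generator.

def phi(x):
    return np.sum(np.exp(x))

def grad_phi(x):
    return np.exp(x)

def D(a, b):
    return phi(a) - phi(b) - np.dot(grad_phi(b), a - b)

rng = np.random.default_rng(1)
p, q, r = rng.standard_normal((3, 4))

lhs = D(p, q) + D(q, r)
rhs = D(p, r) + np.dot(grad_phi(r) - grad_phi(q), p - q)
print(np.isclose(lhs, rhs))   # True: the identity is exact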
Bregman bisectors are hyperplanes

The bisector {r : D_φ(r, p) = D_φ(r, q)} is the solution set of the linear equation
  ⟨r, ∇φ(q) − ∇φ(p)⟩ = f(p, q),
where f(p, q) depends only on p and q.

[Figure: points p, q and a point r on the bisector, with D(r, p) = D(r, q).]

Most combinatorial geometry carries over (Voronoi, Delaunay, arrangements, duality) [BNN10].
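The following sketch (ours, not the lecture's) verifies numerically that D_φ(r, p) − D_φ(r, q) is an affine function of r for the negative-entropy generator, which is exactly why the bisector is a hyperplane:

import numpy as np

# D_phi(r, p) - D_phi(r, q) == <r, grad phi(q) - grad phi(p)> - f(p, q),
# so the bisector is the hyperplane where this affine function vanishes.

def phi(x):
    return np.sum(x * np.log(x))

def grad_phi(x):
    return np.log(x) + 1.0

def D(a, b):
    return phi(a) - phi(b) - np.dot(grad_phi(b), a - b)

rng = np.random.default_rng(2)
p = rng.random(3) + 0.1
q = rng.random(3) + 0.1

normal = grad_phi(q) - grad_phi(p)    # hyperplane normal
f_pq = phi(p) - phi(q) - np.dot(grad_phi(p), p) + np.dot(grad_phi(q), q)

for _ in range(3):
    r = rng.random(3) + 0.1
    print(np.isclose(D(r, p) - D(r, q), np.dot(r, normal) - f_pq))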
Euclidean Approximations

  D_φ(p, q) = φ(p) − φ(q) − ⟨∇φ(q), p − q⟩
            = ½ s^T ∇²φ(r) s,   where s = p − q and r ∈ [p, q]

[Figure: p, q, and D(p, q).]

As q → p, this becomes (half) the squared Mahalanobis distance with A = ∇²φ(p), where for positive definite A
  ‖p − q‖_A = √((p − q)^T A (p − q)).
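A sketch of the local approximation (illustrative; the generator and names are our choice): for the negative-entropy generator the Hessian is diag(1/x_i), and for q near p the divergence is close to half the squared Mahalanobis distance in that Hessian.

import numpy as np

# D_phi(p, q) ~ 1/2 (p - q)^T Hess(phi)(p) (p - q) for q near p.

def phi(x):
    return np.sum(x * np.log(x))

def grad_phi(x):
    return np.log(x) + 1.0

def hess_phi(x):
    return np.diag(1.0 / x)

def D(a, b):
    return phi(a) - phi(b) - np.dot(grad_phi(b), a - b)

rng = np.random.default_rng(3)
p = rng.random(4) + 0.5

for eps in (1e-1, 1e-2, 1e-3):
    q = p + eps * rng.uniform(-1.0, 1.0, 4)
    s = p - q
    mahalanobis_sq = 0.5 * s @ hess_phi(p) @ s
    print(eps, D(p, q), mahalanobis_sq)   # the two values agree ever more closely as eps shrinks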
Euclidean Approximations

Over a bounded region R, let
  λ*_max = max_{x ∈ R} λ_max(∇²φ(x)),   λ*_min = min_{x ∈ R} λ_min(∇²φ(x)),   µ = λ*_max / λ*_min.

µ-similarity. A distance measure d is µ-similar if
  d(a, b) + d(b, c) ≥ (1/µ) d(a, c).

Lemma. For a given convex function φ, there exists a bounded region R(µ) in which D_φ is µ-similar.

The proof follows by exploiting the connection to the Mahalanobis distance and examining the variation in the eigenvalues of ∇²φ over R. Euclidean clustering algorithm ⇒ µ-similar Bregman clustering algorithm. [ABS10, AB10, ABS11, CM08, MR09]
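An illustrative check of µ-similarity on a bounded region (our construction, not the lecture's): for the negative-entropy generator on the box [lo, hi]^d the Hessian is diag(1/x_i), so the eigenvalue ratio is µ = hi/lo, and the ratio (D(a,b) + D(b,c))/D(a,c) over random triples stays bounded below by a constant depending only on µ (we test against 1/(2µ); the exact constant depends on the precise statement of the lemma).

import numpy as np

# Approximate triangle inequality for a Bregman divergence on a bounded box.

def phi(x):
    return np.sum(x * np.log(x))

def grad_phi(x):
    return np.log(x) + 1.0

def D(a, b):
    return phi(a) - phi(b) - np.dot(grad_phi(b), a - b)

lo, hi, d = 0.2, 1.0, 3
mu = hi / lo                       # eigenvalue ratio of diag(1/x) over [lo, hi]^d

rng = np.random.default_rng(4)
worst = np.inf
for _ in range(5000):
    a, b, c = rng.uniform(lo, hi, size=(3, d))
    worst = min(worst, (D(a, b) + D(b, c)) / D(a, c))

print(mu, worst, worst >= 1.0 / (2.0 * mu))   # the worst ratio stays bounded away from zero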
Dimensionality Reduction
Dimensionality Reduction for Distributions [APV10, KPV10]
Problem. Given n points on the simplex ∆_d, find a mapping to ∆_k with k ≪ d such that the "distance structure" of the points is preserved.
The Hellinger Distance
  d²_H(p, q) = ∑_i (√p_i − √q_i)²

Hellinger distance: natural geometric interpretation.
Local approximation of the Fisher information metric; closely related to the Jensen-Shannon and Bhattacharyya distances.
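A minimal sketch of the definition (the function name is ours):

import numpy as np

# Hellinger distance between two discrete distributions, straight from the definition above.

def hellinger(p, q):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(hellinger(p, q))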
Hellinger Is Chordal Distance on Sphere
  (p_1, p_2, …, p_d) → (√p_1, √p_2, …, √p_d)

Hellinger distance on the simplex ↔ Euclidean distance on the sphere.
Dimensionality reduction for the simplex ↔ dimensionality reduction on the "sphere".
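And a check of the chordal-distance correspondence (again an illustrative sketch; names are ours): the square-root map sends a point of the simplex to a unit vector, and the Hellinger distance equals the Euclidean distance between the images.

import numpy as np

# The square-root map embeds the simplex in the unit sphere, where Hellinger = chordal distance.

def sqrt_map(p):
    return np.sqrt(np.asarray(p, dtype=float))   # unit norm, since the p_i sum to 1

def hellinger(p, q):
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])

x, y = sqrt_map(p), sqrt_map(q)
print(np.isclose(np.linalg.norm(x), 1.0))                    # image lies on the unit sphere
print(np.isclose(hellinger(p, q), np.linalg.norm(x - y)))    # Hellinger = Euclidean (chordal) distance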