The Duality of Strong Convexity and Strong Smoothness - CS - Huji

The Duality of Strong Convexity and Strong Smoothness Applications to Machine Learning Shai Shalev-Shwartz Toyota Technological Institute at Chicago

Talk at AFOSR workshop, January, 2009

January 2009

Shai Shalev-Shwartz (TTI-C)

Duality for ML

Jan’09

1 / 29

Outline

Lemma f is strongly convex w.r.t. k · k ⇐⇒ f ? is strongly smooth w.r.t. k · k? Applications: Rademacher Bounds (⇒ Generalization Bounds)


Duality for ML

Jan’09

2 / 29

Outline

Lemma f is strongly convex w.r.t. k · k ⇐⇒ f ? is strongly smooth w.r.t. k · k? Applications: Rademacher Bounds (⇒ Generalization Bounds) Low regret online algorithms (⇒ runtime of SGD/SMD)


Duality for ML

Jan’09

2 / 29

Outline

Lemma f is strongly convex w.r.t. k · k ⇐⇒ f ? is strongly smooth w.r.t. k · k? Applications: Rademacher Bounds (⇒ Generalization Bounds) Low regret online algorithms (⇒ runtime of SGD/SMD) Boosting


Duality for ML

Jan’09

2 / 29

Outline

Lemma f is strongly convex w.r.t. k · k ⇐⇒ f ? is strongly smooth w.r.t. k · k? Applications: Rademacher Bounds (⇒ Generalization Bounds) Low regret online algorithms (⇒ runtime of SGD/SMD) Boosting Sparsity and `1 norm


Duality for ML

Jan’09

2 / 29

Outline

Lemma f is strongly convex w.r.t. k · k ⇐⇒ f ? is strongly smooth w.r.t. k · k? Applications: Rademacher Bounds (⇒ Generalization Bounds) Low regret online algorithms (⇒ runtime of SGD/SMD) Boosting Sparsity and `1 norm Concentration inequalities


Duality for ML

Jan’09

2 / 29

Outline

Lemma f is strongly convex w.r.t. k · k ⇐⇒ f ? is strongly smooth w.r.t. k · k? Applications: Rademacher Bounds (⇒ Generalization Bounds) Low regret online algorithms (⇒ runtime of SGD/SMD) Boosting Sparsity and `1 norm Concentration inequalities Matrix regularization (Multi-task, group Lasso, dynamic bounds)


Duality for ML

Jan’09

2 / 29

Motivating Problem – Generalization Bounds Linear predictor is a mapping x 7→ φ(hw, xi) E.g. x 7→ hw, xi or x 7→ sgn(hw, xi)

Loss of w on (x, y) is assessed by `(hw, xi, y) Goal: minimize expected loss L(w) = E(x,y) [`(hw, xi, y)] P ˆ Instead, minimize empirical loss L(w) = 1 n `(hw, xi i, yi ) i=1

n

Bartlett and Mendelson [2002]: If ` Lipschitz and bounded, w.p. at least 1 − δ 2 ˆ ∀w ∈ S, L(w) ≤ L(w) + Rn (S) + n

r

log(1/δ) 2n

where " def

Rn (S) = E

iid

∼ {±1}n


Duality for ML

sup

n X

# i hu, xi i

u∈S i=1 Jan’09

3 / 29

Background – Fenchel Conjugate Two equivalent representations of a convex function Set of Tangents

Set of Points


Duality for ML

Jan’09

4 / 29

Background – Fenchel Conjugate Two equivalent representations of a convex function Set of Tangents

Set of Points


Duality for ML

Jan’09

4 / 29

Background – Fenchel Conjugate Two equivalent representations of a convex function Tangent (θ, −f ? (θ))

Point (w, f (w))

slo

f (w)

pe

w


=

θ

−f ? (θ)

Duality for ML

Jan’09

5 / 29

Background – Fenchel Conjugate Two equivalent representations of a convex function Tangent (θ, −f ? (θ))

Point (w, f (w))

slo

f (w)

pe

=

θ

w

−f ? (θ)

?

f (θ) = max hw, θi − f (w) w


Duality for ML

Jan’09

5 / 29

Background – Fenchel Conjugate The definition immediately implies Fenchel-Young inequality: ∀u,

f ? (θ) = max hw, θi − f (w) w

≥ hu, θi − f (u)


Duality for ML

Jan’09

6 / 29



≥ hu, θi − f (u) If f is closed and convex then f ?? = f


Duality for ML

Jan’09

6 / 29



≥ hu, θi − f (u) If f is closed and convex then f ?? = f By the way, this implies Jensen’s inequality: f (E[w]) = max hθ, E[w]i − f ? (θ) θ

= max E [hθ, wi − f ? (θ)] θ

≤ E[ max hθ, wi − f ? (θ) ] = E[f (w)] θ


Duality for ML

Jan’09

6 / 29

Background – Fenchel Conjugate Examples: f (w)

f ? (θ)

1 2 2 kwk

1 2 2 kθk?

kwk

Indicator of unit k · k? ball P θi log ie

P

i wi log(wi )

Indicator of prob. simplex

maxi θi

c g(w) for c > 0

c g ? (θ/c)

inf x g1 (w) + g2 (w − x)

g1? (θ) + g2? (θ)


Duality for ML

Jan’09

7 / 29

Background – Fenchel Conjugate Examples:

⇒ ⇒

f (w)

f ? (θ)

1 2 2 kwk

1 2 2 kθk?

kwk


P

i wi log(wi )


maxi θi

c g(w) for c > 0

c g ? (θ/c)

inf x g1 (w) + g2 (w − x)

g1? (θ) + g2? (θ)

(used for boosting) Shai Shalev-Shwartz (TTI-C)

Duality for ML

Jan’09

7 / 29

Background – Fenchel Conjugate Examples: f (w)

f ? (θ)

1 2 2 kwk

1 2 2 kθk?

kwk


P

i wi log(wi )

⇒


maxi θi

c g(w) for c > 0

c g ? (θ/c)

inf x g1 (w) + g2 (w − x)

g1? (θ) + g2? (θ)

(infimal convolution theorem) Shai Shalev-Shwartz (TTI-C)

Duality for ML

Jan’09

7 / 29

f is strongly convex ⇐⇒ f ? is strongly smooth The following properties are equivalent: f (w) is σ-strongly convex w.r.t. k · k

f ? (w) is

1 σ -strongly


smooth w.r.t. k · k?

Duality for ML

Jan’09

8 / 29

f is strongly convex ⇐⇒ f ? is strongly smooth The following properties are equivalent: f (w) is σ-strongly convex w.r.t. k · k, that is ∀w, u, θ ∈ ∂f (u), f (w) − f (u) − hθ, w − ui ≥ f ? (w) is

1 σ -strongly

σ ku − wk2 . 2

smooth w.r.t. k · k?

f (w) ≥

σ ku 2

− wk2

f (u) u

w


Duality for ML

Jan’09

8 / 29

f is strongly convex ⇐⇒ f ? is strongly smooth The following properties are equivalent: f (w) is σ-strongly convex w.r.t. k · k, that is ∀w, u, θ ∈ ∂f (u), f (w) − f (u) − hθ, w − ui ≥ f ? (w) is

1 σ -strongly

σ ku − wk2 . 2

smooth w.r.t. k · k? , that is

∀w, u, f ? (w) − f ? (u) − h∇f ? (u), w − ui ≤

1 ku − wk2? . 2σ

f ? (w)

f (w) ≥

σ ku 2

− wk2

≤

1 ku 2σ

− wk2?

f ? (u)

f (u) u

w


u Duality for ML

w Jan’09

8 / 29

f is strongly convex ⇐⇒ f ? is strongly smooth

Examples: f (w)

f ? (θ)

w.r.t. norm

σ

1 2 2 kwk2

1 2 2 kθk2

k · k2

1


Duality for ML

Jan’09

9 / 29


Examples: f (w)

f ? (θ)

w.r.t. norm

σ

1 2 2 kwk2

1 2 2 kθk2

k · k2

1

1 2 2 kwkq

1 2 2 kθkp

k · kq

(q − 1)

(where q ∈ (1, 2] and


1 q

+

1 p

= 1)

Duality for ML

Jan’09

9 / 29


Examples: f (w)

f ? (θ)

w.r.t. norm

σ

1 2 2 kwk2

1 2 2 kθk2

k · k2

1

1 2 2 kwkq

1 2 2 kθkp

k · kq

(q − 1)

k · k1

1

P

i wi log(wi )


log

θi ie

P

Duality for ML

Jan’09

9 / 29

Importance Theorem (1) Let f be σ strongly convex w.r.t. k · k Assume f ? (0) = 0 (for simplicity) v1 , . . . , vn be arbitrary sequence of vectors P Denote wt = ∇f ? ( j