Rademacher Bounds (â Generalization Bounds). Low regret online algorithms (â runtime of SGD/SMD). Shai Shalev-Shwart
The Duality of Strong Convexity and Strong Smoothness Applications to Machine Learning Shai Shalev-Shwartz Toyota Technological Institute at Chicago
Talk at AFOSR workshop, January, 2009
January 2009
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
1 / 29
Outline
Lemma f is strongly convex w.r.t. k · k ⇐⇒ f ? is strongly smooth w.r.t. k · k? Applications: Rademacher Bounds (⇒ Generalization Bounds)
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
2 / 29
Outline
Lemma f is strongly convex w.r.t. k · k ⇐⇒ f ? is strongly smooth w.r.t. k · k? Applications: Rademacher Bounds (⇒ Generalization Bounds) Low regret online algorithms (⇒ runtime of SGD/SMD)
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
2 / 29
Outline
Lemma f is strongly convex w.r.t. k · k ⇐⇒ f ? is strongly smooth w.r.t. k · k? Applications: Rademacher Bounds (⇒ Generalization Bounds) Low regret online algorithms (⇒ runtime of SGD/SMD) Boosting
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
2 / 29
Outline
Lemma f is strongly convex w.r.t. k · k ⇐⇒ f ? is strongly smooth w.r.t. k · k? Applications: Rademacher Bounds (⇒ Generalization Bounds) Low regret online algorithms (⇒ runtime of SGD/SMD) Boosting Sparsity and `1 norm
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
2 / 29
Outline
Lemma f is strongly convex w.r.t. k · k ⇐⇒ f ? is strongly smooth w.r.t. k · k? Applications: Rademacher Bounds (⇒ Generalization Bounds) Low regret online algorithms (⇒ runtime of SGD/SMD) Boosting Sparsity and `1 norm Concentration inequalities
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
2 / 29
Outline
Lemma f is strongly convex w.r.t. k · k ⇐⇒ f ? is strongly smooth w.r.t. k · k? Applications: Rademacher Bounds (⇒ Generalization Bounds) Low regret online algorithms (⇒ runtime of SGD/SMD) Boosting Sparsity and `1 norm Concentration inequalities Matrix regularization (Multi-task, group Lasso, dynamic bounds)
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
2 / 29
Motivating Problem – Generalization Bounds Linear predictor is a mapping x 7→ φ(hw, xi) E.g. x 7→ hw, xi or x 7→ sgn(hw, xi)
Loss of w on (x, y) is assessed by `(hw, xi, y) Goal: minimize expected loss L(w) = E(x,y) [`(hw, xi, y)] P ˆ Instead, minimize empirical loss L(w) = 1 n `(hw, xi i, yi ) i=1
n
Bartlett and Mendelson [2002]: If ` Lipschitz and bounded, w.p. at least 1 − δ 2 ˆ ∀w ∈ S, L(w) ≤ L(w) + Rn (S) + n
r
log(1/δ) 2n
where " def
Rn (S) = E
iid
∼ {±1}n
Shai Shalev-Shwartz (TTI-C)
Duality for ML
sup
n X
# i hu, xi i
u∈S i=1 Jan’09
3 / 29
Background – Fenchel Conjugate Two equivalent representations of a convex function Set of Tangents
Set of Points
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
4 / 29
Background – Fenchel Conjugate Two equivalent representations of a convex function Set of Tangents
Set of Points
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
4 / 29
Background – Fenchel Conjugate Two equivalent representations of a convex function Tangent (θ, −f ? (θ))
Point (w, f (w))
slo
f (w)
pe
w
Shai Shalev-Shwartz (TTI-C)
=
θ
−f ? (θ)
Duality for ML
Jan’09
5 / 29
Background – Fenchel Conjugate Two equivalent representations of a convex function Tangent (θ, −f ? (θ))
Point (w, f (w))
slo
f (w)
pe
=
θ
w
−f ? (θ)
?
f (θ) = max hw, θi − f (w) w
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
5 / 29
Background – Fenchel Conjugate The definition immediately implies Fenchel-Young inequality: ∀u,
f ? (θ) = max hw, θi − f (w) w
≥ hu, θi − f (u)
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
6 / 29
Background – Fenchel Conjugate The definition immediately implies Fenchel-Young inequality: ∀u,
f ? (θ) = max hw, θi − f (w) w
≥ hu, θi − f (u) If f is closed and convex then f ?? = f
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
6 / 29
Background – Fenchel Conjugate The definition immediately implies Fenchel-Young inequality: ∀u,
f ? (θ) = max hw, θi − f (w) w
≥ hu, θi − f (u) If f is closed and convex then f ?? = f By the way, this implies Jensen’s inequality: f (E[w]) = max hθ, E[w]i − f ? (θ) θ
= max E [hθ, wi − f ? (θ)] θ
≤ E[ max hθ, wi − f ? (θ) ] = E[f (w)] θ
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
6 / 29
Background – Fenchel Conjugate Examples: f (w)
f ? (θ)
1 2 2 kwk
1 2 2 kθk?
kwk
Indicator of unit k · k? ball P θi log ie
P
i wi log(wi )
Indicator of prob. simplex
maxi θi
c g(w) for c > 0
c g ? (θ/c)
inf x g1 (w) + g2 (w − x)
g1? (θ) + g2? (θ)
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
7 / 29
Background – Fenchel Conjugate Examples:
⇒ ⇒
f (w)
f ? (θ)
1 2 2 kwk
1 2 2 kθk?
kwk
Indicator of unit k · k? ball P θi log ie
P
i wi log(wi )
Indicator of prob. simplex
maxi θi
c g(w) for c > 0
c g ? (θ/c)
inf x g1 (w) + g2 (w − x)
g1? (θ) + g2? (θ)
(used for boosting) Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
7 / 29
Background – Fenchel Conjugate Examples: f (w)
f ? (θ)
1 2 2 kwk
1 2 2 kθk?
kwk
Indicator of unit k · k? ball P θi log ie
P
i wi log(wi )
⇒
Indicator of prob. simplex
maxi θi
c g(w) for c > 0
c g ? (θ/c)
inf x g1 (w) + g2 (w − x)
g1? (θ) + g2? (θ)
(infimal convolution theorem) Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
7 / 29
f is strongly convex ⇐⇒ f ? is strongly smooth The following properties are equivalent: f (w) is σ-strongly convex w.r.t. k · k
f ? (w) is
1 σ -strongly
Shai Shalev-Shwartz (TTI-C)
smooth w.r.t. k · k?
Duality for ML
Jan’09
8 / 29
f is strongly convex ⇐⇒ f ? is strongly smooth The following properties are equivalent: f (w) is σ-strongly convex w.r.t. k · k, that is ∀w, u, θ ∈ ∂f (u), f (w) − f (u) − hθ, w − ui ≥ f ? (w) is
1 σ -strongly
σ ku − wk2 . 2
smooth w.r.t. k · k?
f (w) ≥
σ ku 2
− wk2
f (u) u
w
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
8 / 29
f is strongly convex ⇐⇒ f ? is strongly smooth The following properties are equivalent: f (w) is σ-strongly convex w.r.t. k · k, that is ∀w, u, θ ∈ ∂f (u), f (w) − f (u) − hθ, w − ui ≥ f ? (w) is
1 σ -strongly
σ ku − wk2 . 2
smooth w.r.t. k · k? , that is
∀w, u, f ? (w) − f ? (u) − h∇f ? (u), w − ui ≤
1 ku − wk2? . 2σ
f ? (w)
f (w) ≥
σ ku 2
− wk2
≤
1 ku 2σ
− wk2?
f ? (u)
f (u) u
w
Shai Shalev-Shwartz (TTI-C)
u Duality for ML
w Jan’09
8 / 29
f is strongly convex ⇐⇒ f ? is strongly smooth
Examples: f (w)
f ? (θ)
w.r.t. norm
σ
1 2 2 kwk2
1 2 2 kθk2
k · k2
1
Shai Shalev-Shwartz (TTI-C)
Duality for ML
Jan’09
9 / 29
f is strongly convex ⇐⇒ f ? is strongly smooth
Examples: f (w)
f ? (θ)
w.r.t. norm
σ
1 2 2 kwk2
1 2 2 kθk2
k · k2
1
1 2 2 kwkq
1 2 2 kθkp
k · kq
(q − 1)
(where q ∈ (1, 2] and
Shai Shalev-Shwartz (TTI-C)
1 q
+
1 p
= 1)
Duality for ML
Jan’09
9 / 29
f is strongly convex ⇐⇒ f ? is strongly smooth
Examples: f (w)
f ? (θ)
w.r.t. norm
σ
1 2 2 kwk2
1 2 2 kθk2
k · k2
1
1 2 2 kwkq
1 2 2 kθkp
k · kq
(q − 1)
k · k1
1
P
i wi log(wi )
Shai Shalev-Shwartz (TTI-C)
log
θi ie
P
Duality for ML
Jan’09
9 / 29
Importance Theorem (1) Let f be σ strongly convex w.r.t. k · k Assume f ? (0) = 0 (for simplicity) v1 , . . . , vn be arbitrary sequence of vectors P Denote wt = ∇f ? ( j