A Tutorial on Bayesian Optimization for Machine Learning

Ryan P. Adams
School of Engineering and Applied Sciences, Harvard University
http://hips.seas.harvard.edu
Machine Learning Meta-Challenges

‣ Increasing Model Complexity
  More flexible models have more parameters.
‣ More Sophisticated Fitting Procedures
  Non-convex optimization has many knobs to turn.
‣ Less Accessible to Non-Experts
  Harder to apply complicated techniques.
‣ Results Are Less Reproducible
  Too many important implementation details are missing.
Example: Deep Neural Networks

‣ Resurgent interest in large neural networks.
‣ When well-tuned, very successful for visual object identification, speech recognition, computational biology, ...
‣ Big investments by Google, Facebook, Microsoft, etc.
‣ Many choices: number of layers, weight regularization, layer size, which nonlinearity, batch size, learning rate schedule, stopping conditions.
Example: Online Latent Dirichlet Allocation

‣ Hoffman, Blei, and Bach (2010): approximate inference for large-scale text analysis with latent Dirichlet allocation.
‣ When well-tuned, gives good empirical results.
‣ Many choices: Dirichlet parameters, number of topics, active vocabulary size, learning rate schedule, batch size.

Figure by David Blei
Example: Classification of DNA Sequences

‣ Objective: predict which DNA sequences will bind with which proteins.
‣ Miller, Kumar, Packer, Goodman, and Koller (2012): latent structural support vector machine.
‣ Choices: margin parameter, entropy parameter, convergence criterion.
Search for Good Hyperparameters?

‣ Define an objective function.
  Most often, we care about generalization performance.
  Use cross-validation to measure parameter quality.
‣ How do people currently search? Black magic:
  grid search, random search, grad student descent.
‣ Painful!
  Requires many training cycles.
  Possibly noisy.
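These manual strategies are easy to script; as a point of reference, random search (one of the baselines above) can be sketched in a few lines. The objective here is a toy stand-in for an expensive cross-validation loss, and all names are illustrative.

```python
import numpy as np

def random_search(objective, bounds, n_trials, seed=0):
    """Evaluate the objective at uniformly random points and keep the best."""
    rng = np.random.RandomState(seed)
    best_x, best_y = None, np.inf
    for _ in range(n_trials):
        # Draw one value per hyperparameter, uniformly within its bounds.
        x = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        y = objective(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y

# Toy objective standing in for an expensive cross-validation loss.
best_x, best_y = random_search(lambda x: float(np.sum((x - 0.3) ** 2)),
                               bounds=[(0.0, 1.0), (0.0, 1.0)],
                               n_trials=50)
```

Every trial is independent, which is why random search parallelizes trivially but ignores everything learned from previous evaluations.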
Can We Do Better? Bayesian Optimization

‣ Build a probabilistic model for the objective.
  Include hierarchical structure about units, etc.
‣ Compute the posterior predictive distribution.
  Integrate out all the possible true functions.
  We use Gaussian process regression.
‣ Optimize a cheap proxy function instead.
  The model is much cheaper than the true objective.

The main insight: make the proxy function exploit uncertainty to balance exploration against exploitation.
The Bayesian Optimization Idea

[Sequence of figures: a GP posterior over a 1-D objective on [0, 1], with the current best observation marked. Where should we evaluate next in order to improve the most? One idea: "expected improvement".]
Historical Overview

‣ Closely related to statistical ideas of optimal design of experiments, dating back to Kirstine Smith in 1918.
‣ As response surface methods, they date back to Box and Wilson in 1951.
‣ As Bayesian optimization, studied first by Kushner in 1964 and then Mockus in 1978.
‣ Methodologically, it touches on several important machine learning areas: active learning, contextual bandits, Bayesian nonparametrics.
‣ Started receiving serious attention in ML in 2007, for example:
  Brochu, de Freitas & Ghosh, NIPS 2007 [preference learning]
  Krause, Singh & Guestrin, JMLR 2008 [optimal sensor placement]
  Srinivas, Krause, Kakade & Seeger, ICML 2010 [regret bounds]
  Brochu, Hoffman & de Freitas, UAI 2011 [portfolios]
‣ Interest exploded when it was realized that Bayesian optimization provides an excellent tool for finding good ML hyperparameters.
Today's Topics

‣ Review of Gaussian process priors
‣ Bayesian optimization basics
‣ Managing covariances and kernel parameters
‣ Accounting for the cost of evaluation
‣ Parallelizing training
‣ Sharing information across related problems
‣ Better models for nonstationary functions
‣ Random projections for high-dimensional problems
‣ Accounting for constraints
‣ Leveraging partially-completed training runs
Gaussian Processes as Function Models

Nonparametric prior on functions, specified in terms of a positive definite kernel.

[Figures: sample functions drawn from GP priors in one and two dimensions.]

Kernels are available for many structured objects:
  strings, e.g., "the quick brown fox jumps over the lazy dog" (Haussler 1999)
  permutations (Kondor 2008)
  proteins (Borgwardt 2007)
Gaussian Processes

‣ A Gaussian process (GP) is a distribution on functions.
‣ Allows tractable Bayesian modeling of functions without specifying a particular finite basis.
‣ Input space (where we're optimizing): X
‣ Model scalar functions: f : X → R
‣ Positive definite covariance function: C : X × X → R
‣ Mean function: m : X → R
Gaussian Processes

Any finite set of N points in X, {x_n} for n = 1..N, induces a homologous N-dimensional Gaussian distribution on R^N, taken to be the distribution on {y_n = f(x_n)}.

[Figures: samples from the induced finite-dimensional Gaussian on a dense grid of points.]
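This finite-dimensional view is exactly how one draws "functions" from a GP in practice: pick a grid, build the covariance matrix, and sample from the induced multivariate Gaussian. A minimal sketch; the kernel choice and names are illustrative.

```python
import numpy as np

def se_kernel(X1, X2, ell=0.3, amp=1.0):
    """Squared-exponential covariance between two sets of 1-D points."""
    d = X1[:, None] - X2[None, :]
    return amp * np.exp(-0.5 * (d / ell) ** 2)

# Any finite grid of points induces a multivariate Gaussian, so sampling a
# "function" from the GP prior is just sampling from that Gaussian.
x = np.linspace(0.0, 1.0, 100)
K = se_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability
rng = np.random.RandomState(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```

Each row of `samples` is one draw from the prior, evaluated on the grid; a denser grid just means a bigger Gaussian.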
Gaussian Processes

‣ Due to the Gaussian form, closed-form solutions exist for many useful questions about finite data.
‣ Marginal likelihood:
  ln p(y | X, θ) = -(1/2) ln |K_θ| - (1/2) yᵀ K_θ⁻¹ y - (N/2) ln 2π
‣ Predictive distribution at test points {x_m}, m = 1..M:
  y_test ~ N(m, Σ)
  m = k_θᵀ K_θ⁻¹ y
  Σ = K*_θ - k_θᵀ K_θ⁻¹ k_θ
  We compute these matrices from the covariance:
  [K_θ]_{n,n'} = C(x_n, x_{n'}; θ)
  [k_θ]_{n,m} = C(x_n, x_m; θ)
  [K*_θ]_{m,m'} = C(x_m, x_{m'}; θ)
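These closed-form predictive equations translate directly into a few lines of linear algebra. A sketch with a squared-exponential kernel; this is illustrative, not Spearmint's implementation, and the function names are mine.

```python
import numpy as np

def se_kernel(X1, X2, ell=0.3):
    """Squared-exponential covariance between two sets of 1-D points."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(X, y, Xstar, noise=1e-6, ell=0.3):
    """Closed-form GP predictive mean and covariance at test points Xstar."""
    K = se_kernel(X, X, ell) + noise * np.eye(len(X))  # K_theta (train-train)
    k = se_kernel(X, Xstar, ell)                       # k_theta (train-test)
    mean = k.T @ np.linalg.solve(K, y)                 # m = k^T K^{-1} y
    cov = se_kernel(Xstar, Xstar, ell) - k.T @ np.linalg.solve(K, k)
    return mean, cov

X = np.array([0.1, 0.4, 0.9])   # observed inputs
y = np.sin(2 * np.pi * X)       # observed values
mean, cov = gp_posterior(X, y, np.array([0.1, 0.5]))
```

Note how the posterior variance collapses at an observed input and grows away from the data; this is the uncertainty that the acquisition functions later in the tutorial exploit.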
Examples of GP Covariances

Squared-Exponential:
  C(x, x') = exp{ -(1/2) Σ_{d=1}^D ((x_d - x'_d) / ℓ_d)² }

Matérn:
  C(r) = (2^{1-ν} / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ)

"Neural Network":
  C(x, x') = (2/π) sin⁻¹( 2 xᵀ Σ x' / √((1 + 2 xᵀ Σ x)(1 + 2 x'ᵀ Σ x')) )

Periodic:
  C(x, x') = exp{ -(2/ℓ²) sin²((x - x') / 2) }
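For concreteness, here is how two of these covariances might be written in code; `sq_exp` and `matern52` are illustrative names, with the Matérn given in its closed form for ν = 5/2 (no Bessel function needed at that value).

```python
import numpy as np

def sq_exp(x, xp, ell):
    """Squared-exponential (ARD) covariance: one length scale per dimension."""
    return float(np.exp(-0.5 * np.sum(((x - xp) / ell) ** 2)))

def matern52(x, xp, ell):
    """Matérn covariance in its closed form for nu = 5/2."""
    r = np.sqrt(np.sum(((x - xp) / ell) ** 2))
    s = np.sqrt(5.0) * r
    return float((1.0 + s + s ** 2 / 3.0) * np.exp(-s))

x = np.array([0.2, 0.5])
xp = np.array([0.6, 0.1])
ell = np.array([0.3, 0.3])
near = matern52(x, x, ell)   # identical inputs: full covariance
far = matern52(x, xp, ell)   # covariance decays with distance
```

Both kernels equal 1 at zero distance and decay monotonically; the Matérn decays with a polynomial-times-exponential profile, giving rougher sample paths than the squared-exponential.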
GPs Provide Closed-Form Predictions

[Figure: GP posterior mean and uncertainty bands conditioned on a handful of observations.]
Using Uncertainty in Optimization

‣ Find the minimum: x* = arg min_{x ∈ X} f(x)
‣ We can evaluate the objective pointwise, but do not have an easy functional form or gradients.
‣ After performing some evaluations, the GP gives us easy closed-form marginal means and variances.
‣ Exploration: seek places with high variance.
‣ Exploitation: seek places with low mean.
‣ The acquisition function balances these for our proxy optimization to determine the next evaluation.
Closed-Form Acquisition Functions

‣ The GP posterior gives a predictive mean function μ(x) and a predictive marginal variance function σ²(x). Define
  γ(x) = (f(x_best) - μ(x)) / σ(x)
‣ Probability of Improvement (Kushner 1964):
  a_PI(x) = Φ(γ(x))
‣ Expected Improvement (Mockus 1978):
  a_EI(x) = σ(x) (γ(x) Φ(γ(x)) + N(γ(x); 0, 1))
‣ GP Upper Confidence Bound (Srinivas et al. 2010):
  a_LCB(x) = μ(x) - κ σ(x)
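All three acquisition functions are one-liners given the predictive mean and standard deviation. A sketch (function names are mine, with Φ built from the standard library's `math.erf`):

```python
import numpy as np
from math import erf

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def phi(z):
    """Standard normal density N(z; 0, 1)."""
    return np.exp(-0.5 * z * z) / np.sqrt(2.0 * np.pi)

def gamma(mu, sigma, f_best):
    """Standardized improvement: gamma(x) = (f(x_best) - mu(x)) / sigma(x)."""
    return (f_best - mu) / sigma

def a_PI(mu, sigma, f_best):
    """Probability of improvement (Kushner 1964)."""
    return Phi(gamma(mu, sigma, f_best))

def a_EI(mu, sigma, f_best):
    """Expected improvement (Mockus 1978)."""
    g = gamma(mu, sigma, f_best)
    return sigma * (g * Phi(g) + phi(g))

def a_LCB(mu, sigma, kappa=2.0):
    """GP lower confidence bound (Srinivas et al. 2010); minimize this."""
    return mu - kappa * sigma
```

The next evaluation point is then found by maximizing the chosen acquisition (or minimizing the LCB) over x, which is cheap because only the GP is queried.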
[Figures: the same GP posterior with, respectively, the Probability of Improvement, the Expected Improvement, the GP Upper (Lower) Confidence Bound, and the Distribution Over the Minimum (Entropy Search) plotted beneath it.]
Illustrating Bayesian Optimization

[Animation: successive slides showing Bayesian optimization choosing evaluations on a 1-D objective.]
Why Doesn't Everyone Use This?

These ideas have been around for decades. Why isn't Bayesian optimization in broader use?

‣ Fragility and poor default choices.
  Getting the function model wrong can be catastrophic.
‣ There hasn't been standard software available.
  It's a bit tricky to build such a system from scratch.
‣ Experiments are run sequentially.
  We want to take advantage of cluster computing.
‣ Limited scalability in dimensions and evaluations.
  We want to solve big problems.
Fragility and Poor Default Choices

Ironic problem: Bayesian optimization has its own hyperparameters!

‣ Covariance function selection
  This turns out to be crucial to good performance.
  The default choice for regression is way too smooth.
  Instead: use an adaptive Matérn 5/2 kernel.
‣ Gaussian process hyperparameters
  The typical empirical Bayes approach can fail horribly.
  Instead: use Markov chain Monte Carlo integration.
  Slice sampling means no additional parameters!
Covariance Function Choice

C(r) = (2^{1-ν} / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ)

[Figure: GP samples drawn with Matérn covariances for ν = 1/2, 1, 3/2, and 5/2, showing increasing smoothness with ν.]
Choosing Covariance Functions

[Figure: structured SVM for protein motif finding (Miller et al. 2012); classification error vs. function evaluations for Matérn 5/2 ARD, squared-exponential ISO, squared-exponential ARD, and Matérn 3/2 ARD kernels. Snoek, Larochelle & RPA, NIPS 2012]
MCMC for GP Hyperparameters

‣ Covariance hyperparameters are often optimized rather than marginalized, typically in the name of convenience and efficiency.
‣ Slice sampling of hyperparameters (e.g., Murray and Adams 2010) is comparably fast and easy, but accounts for uncertainty in length scale, mean, and amplitude.
‣ Integrated acquisition function:
  â(x) = ∫ a(x; θ) p(θ | {x_n, y_n}) dθ
       ≈ (1/K) Σ_{k=1}^K a(x; θ^(k)),   θ^(k) ~ p(θ | {x_n, y_n})
‣ For a theoretical discussion of the implications of inferring hyperparameters with BayesOpt, see recent work by Wang and de Freitas (http://arxiv.org/abs/1406.7758).

Snoek, Larochelle & RPA, NIPS 2012
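The Monte Carlo approximation is just an average of acquisitions over hyperparameter draws. A schematic version, with a hand-picked list of "draws" standing in for slice sampling:

```python
import numpy as np

def integrated_acquisition(x, theta_samples, acquisition):
    """a_hat(x) = (1/K) * sum_k a(x; theta^(k))."""
    return float(np.mean([acquisition(x, theta) for theta in theta_samples]))

# Toy acquisition whose shape depends on a length-scale hyperparameter; in a
# real system, theta_samples would be slice-sampling draws of the GP's
# length scales, amplitude, and noise (Murray & Adams 2010).
def toy_acq(x, ell):
    return np.exp(-0.5 * (x / ell) ** 2)

theta_samples = [0.1, 0.3, 1.0]  # pretend posterior draws of a length scale
a_hat = integrated_acquisition(0.5, theta_samples, toy_acq)
```

Because the average is over plausible hyperparameter settings rather than a single point estimate, the integrated acquisition is hedged against picking the wrong length scale.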
Integrating Out GP Hyperparameters

[Figure: posterior samples with three different length scales; the length-scale-specific expected improvement for each; and the integrated expected improvement.]
MCMC for Hyperparameters

[Figure: logistic regression for handwritten digit recognition; classification error vs. function evaluations for GP EI MCMC, GP EI conventional, and the Tree Parzen Algorithm. Snoek, Larochelle & RPA, NIPS 2012]
Cost-Sensitive Bayesian Optimization

‣ Practical problems often have varying costs, e.g., durations, hardware requirements, materiel.
‣ We often have a budget or a deadline and would ideally like to account for these limited resources.
‣ It may be sensible to first characterize the function with cheap evaluations before trying expensive ones.
‣ These costs are probably unknown, but they can themselves be modeled with Gaussian processes.
‣ One idea: expected improvement per second.

Snoek, Larochelle & RPA, NIPS 2012
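One way to sketch the idea: score each candidate by its EI divided by its predicted duration, with the duration modeled on a log scale. This is illustrative; in the real system the duration is itself modeled with a GP, and the names below are mine.

```python
import numpy as np

def ei_per_second(ei_value, predicted_log_duration):
    """Expected improvement divided by the predicted duration in seconds."""
    return ei_value / np.exp(predicted_log_duration)

# A cheap configuration with modest EI can beat a slow one with higher EI.
fast = ei_per_second(ei_value=0.2, predicted_log_duration=np.log(10.0))   # ~10 s
slow = ei_per_second(ei_value=0.5, predicted_log_duration=np.log(100.0))  # ~100 s
```

Dividing by predicted cost steers the search toward cheap, informative evaluations early on, deferring expensive configurations until they are clearly worth it.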
Expected Improvement per Second

[Animated illustration omitted.]

[Figure: structured SVM for protein motif finding (Miller et al. 2012); minimum function value vs. time (hours) for GP EI MCMC and GP EI per Second. Snoek, Larochelle & RPA, NIPS 2012]
Expected Improvement per Second

CIFAR-10: deep convolutional neural network (Krizhevsky). Achieves 9.5% test error vs. 11% with hand tuning.

[Figure: classification error vs. time (hours) for GP EI MCMC, GP EI per Second, and the state of the art. Snoek, Larochelle & RPA, NIPS 2012]

Because the durations are themselves modeled, we can easily compute the predicted expected inverse duration and use it to compute the expected improvement per second as a function of x.
Parallelizing Bayesian Optimization

‣ Classical Bayesian optimization is sequential.
  Do an experiment. Wait until it finishes. Repeat.
‣ Compute clusters let us do many things at once.
  How do we efficiently pick multiple experiments?
‣ Fantasize outcomes from the Gaussian process.
  The probabilistic model gives us coherent predictions.
  Use the predictions to imagine possible futures.
  Evaluate points that are good under the average future.
‣ With N completed evaluations {x_n, y_n} and J pending evaluations at locations {x_j}, ideally we would choose a new point based on the expected acquisition function under all possible outcomes of the pending evaluations:
  â(x; {x_n, y_n}, θ, {x_j}) = ∫_{R^J} a(x; {x_n, y_n}, θ, {x_j, y_j}) p({y_j} | {x_j}, {x_n, y_n}) dy_1 ⋯ dy_J
  This is simply the expectation of a(x) under a J-dimensional Gaussian distribution whose mean and covariance can easily be computed, so, as in the covariance hyperparameter case, it is straightforward to use samples from this distribution to compute the expected acquisition.
‣ Alternative: shrink the variance and integrate out the mean, as in Desautels, Krause & Burdick, JMLR 2014.

Snoek, Larochelle & RPA, NIPS 2012
Parallelizing Bayesian Optimization

[Figure: two pending experiments, with three joint fantasies; three expected improvement functions, one for each fantasy; integrating out the fantasies yields an overall expected improvement.]
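A schematic of the fantasy machinery: sample outcomes for the pending points, evaluate the acquisition under each completed "fantasy" data set, and average. For brevity, this sketch samples from the pending points' marginal distributions rather than their joint GP posterior, and the acquisition is a toy; all names are mine.

```python
import numpy as np

def fantasized_acquisition(x, pending_mu, pending_sigma, acquisition,
                           n_fantasies=10, seed=0):
    """Average the acquisition over sampled outcomes of pending experiments."""
    rng = np.random.RandomState(seed)
    values = []
    for _ in range(n_fantasies):
        # One fantasy outcome per pending experiment, from its marginal.
        y_fantasy = rng.normal(pending_mu, pending_sigma)
        values.append(acquisition(x, y_fantasy))
    return float(np.mean(values))

def toy_acq(x, y_fantasy):
    # Improvement over the fantasized best value (current best is 0.5 here).
    best = min(0.5, float(np.min(y_fantasy)))
    return max(0.0, best - x)

a_hat = fantasized_acquisition(0.1,
                               pending_mu=np.array([0.4, 0.6]),
                               pending_sigma=np.array([0.05, 0.05]),
                               acquisition=toy_acq)
```

Averaging over fantasies discourages proposing a point that duplicates a pending experiment, because under most fantasies that point no longer offers improvement.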
Parallelized Bayesian Optimization

[Figure: online latent Dirichlet allocation on 250K documents (Hoffman, Blei & Bach 2010); out-of-sample perplexity vs. time (days) for GP EI MCMC with 1, 3, 5, and 10 parallel experiments. Snoek, Larochelle & RPA, NIPS 2012]
Multi-Task Bayesian Optimization

‣ We usually don't just have one problem to solve.
  Probably, we have a set of related objectives.
‣ Examples:
  Object classification for different image types.
  Speech recognition for different languages/speakers.
  Different folds of the same cross-validation setup.
‣ How can we share information between problems?
  Use a prior on vector-valued functions.
  Reframe acquisitions to leverage output covariance.
  Make decisions that look for bits about the minimum.

Swersky, Snoek & RPA, NIPS 2013
Parallel-Task Bayesian Optimization

Convolutional neural networks with identical architectures on three datasets:
  Google Street View House Numbers: 73K 32x32 digits
  STL-10: 5K 96x96 natural images
  CIFAR-10: 50K 32x32 natural images

Swersky, Snoek & RPA, NIPS 2013
Correlated Objective Functions

[Figure: three dependent functions; independent predictions vs. dependent predictions. Swersky, Snoek & RPA, NIPS 2013]
Parallel-Task Bayesian Optimization

Visual object recognition: STL-10 and CIFAR-10. Achieves 70.7% test accuracy -- best previous was 64.5%.

[Figure: minimum error rate vs. function evaluations for STL-10 from scratch and STL-10 from CIFAR-10. Swersky, Snoek & RPA, NIPS 2013]
Parallel-Task Bayesian Optimization

Recognizing house numbers, leveraging object recognition. Achieves 4.77% test error -- the best non-dropout result.

[Figure: minimum error rate vs. function evaluations for SVHN from scratch, SVHN from CIFAR-10, and SVHN from a subset of SVHN. Swersky, Snoek & RPA, NIPS 2013]
Auxiliary-Task Bayesian Optimization

Using a smaller problem to go fast on a bigger problem.

[Figures: digit recognition (classification error) and online latent Dirichlet allocation (perplexity); minimum function value vs. time for MTBO and STBO, and time taken by MTBO vs. time taken by STBO to reach matching values. Swersky, Snoek & RPA, NIPS 2013]
Bayesian Optimization with Unknown Constraints

‣ In general, you need to specify a range for each parameter you wish to optimize.
‣ However, there may be regimes in which things break that are not easily pre-specified.
‣ Alternatively, you may have expensive constraints that you want to make optimal decisions about:
  min_x f(x)  s.t.  ∀k  g_k(x) ≥ 0
  where f is the objective function and the g_k are constraint functions.

Gelbart, Snoek & RPA, UAI 2014
Noisy Constraints

‣ Realistically, lots of problems we care about have noisy constraints, just like they have noisy objectives.
‣ This is a problem because we can never be completely sure that we have a valid solution.
‣ This can be addressed via probabilistic constraints:
  min_x E[f(x)]  s.t.  ∀k  Pr[g_k(x) ≥ 0] ≥ 1 - δ_k
  where f is the objective function, the g_k are constraint functions, and 1 - δ_k is a confidence threshold.

Gelbart, Snoek & RPA, UAI 2014
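A common way to act on such probabilistic constraints is to weight the acquisition by the probability that every constraint holds under its Gaussian posterior. This is a sketch under that assumption, not the exact method of the paper, and the names are mine.

```python
import numpy as np
from math import erf

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def constrained_acquisition(ei_value, g_means, g_sigmas):
    """Weight EI by the probability that every constraint is satisfied."""
    p_valid = 1.0
    for mu, sigma in zip(g_means, g_sigmas):
        # Pr[g_k(x) >= 0] for a Gaussian posterior on the constraint value.
        p_valid *= Phi(mu / sigma)
    return ei_value * p_valid

# A point that probably violates a constraint is discounted heavily.
safe = constrained_acquisition(0.3, g_means=[2.0], g_sigmas=[1.0])
risky = constrained_acquisition(0.3, g_means=[-2.0], g_sigmas=[1.0])
```

The discount factor goes to 1 where the constraints are clearly satisfied and to 0 where they are clearly violated, so the search never wastes expensive evaluations deep in the infeasible region.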
Decoupled Constraints

‣ Just like in the multi-task case, it may be possible to evaluate the constraints separately from the objective.
‣ The constraints may have vastly different costs.
‣ We call these decoupled constraints.
‣ We use entropy search to determine whether to evaluate the constraint or the objective.

Gelbart, Snoek & RPA, UAI 2014
LDA with Sparsity Constraints

Require that online LDA result in low-entropy topics.

[Figure: online LDA; minimum function value vs. function evaluations for the constrained method, the unconstrained method, and Gramacy and Lee (2010).]
Neural Network with Memory Constraints

Decoupled: find a good neural network for MNIST, subject to stringent memory limitations.

[Figure: neural network; minimum function value vs. function evaluations for the constrained and unconstrained methods.]
Bayesian Optimization with Random Projections

‣ Bayesian optimization relies on having a good prior on functions -- challenging in high dimensions!
‣ Realistically, many problems have low effective dimensionality, i.e., the objective is primarily determined by a manifold in the input space.
‣ Wang, Zoghi, Hutter, Matheson & de Freitas (IJCAI 2013) suggest a way to deal with many dimensions.
‣ Approach: choose a random embedding of the input dimensions and then optimize in the subspace.
‣ Straightforward to implement and offers some progress for large and difficult BayesOpt problems.
Input Warping for Nonstationary Objectives

‣ Good priors are key to Bayesian optimization.
‣ Almost all widely-used kernels are stationary, i.e., covariance depends only on the distance.
‣ For machine learning hyperparameters, this is a big deal; learning rate changes may have big effects in one regime and small ones in another.
‣ Real problems are highly nonstationary, but it is difficult to strike a balance between flexibility and tractability.
‣ One approach: extend the kernel to include a parametric bijection of each input, and integrate it out with MCMC.
‣ The CDFs of beta and Kumaraswamy distributions are suitable bijections on the unit hypercube and allow interesting nonlinear effects.

[Figure: example warpings on [0, 1]: (a) linear, (b) exponential, (c) logarithmic, (d) sigmoidal. Snoek, Swersky, Zemel & RPA, ICML 2014]
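The Kumaraswamy CDF mentioned above has a simple closed form, F(x; a, b) = 1 - (1 - x^a)^b, which makes the warping cheap to apply. A sketch showing how different shape parameters produce some of the warping families in the figure (variable names are mine):

```python
import numpy as np

def kumaraswamy_cdf(x, a, b):
    """Kumaraswamy CDF: a closed-form bijection on [0, 1]."""
    return 1.0 - (1.0 - x ** a) ** b

x = np.linspace(0.0, 1.0, 5)
identity_w = kumaraswamy_cdf(x, 1.0, 1.0)  # a = b = 1: no warping
log_like_w = kumaraswamy_cdf(x, 0.5, 1.0)  # stretches small inputs (log-like)
exp_like_w = kumaraswamy_cdf(x, 2.0, 1.0)  # compresses small inputs (exp-like)
```

A stationary kernel applied to the warped inputs then behaves nonstationarily in the original space; placing a prior on (a, b) and integrating with MCMC lets the data choose the warping.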
Input Warping for Nonstationary Objectives

[Figures: an empirical comparison of Bayesian optimization following the standard protocol (GP EI MCMC) and with input warping (Warped GP EI MCMC) on the continuous-valued hyperparameter benchmarks proposed in Eggensperger et al. (2013), also comparing against SMAC (Hutter et al., 2011) and the Tree Parzen Estimator (Bergstra et al.). Panels: (a) logistic regression, (b) online LDA, (c) structured SVM, (d) CIFAR-10 subset. Each algorithm was run ten times for the given number of evaluations; the average validation loss and standard deviation are reported, with the lowest loss shown in bold. On some benchmarks the warped method converges to a solution in far fewer evaluations than the protocol allows. The learned warpings are non-trivial: in some cases, like learning rates, they agree with intuition, while for others, like dropout, they yield surprising results.]
Freeze-Thaw Bayesian Optimization

‣ For ML BayesOpt, the objective itself is often the result of an incremental training procedure, e.g., SGD.
‣ Humans often abort jobs that aren't promising.
‣ Freeze-thaw Bayesian optimization tries to do the same thing.
‣ It tries to project when a training procedure will give a good value, and only continues promising ones.
‣ Requires a prior on training curves as a function of input hyperparameters; introduces a new kernel for this.

[Fig 1: example functions from the exponential decay kernel. (a) Example functions from the basis set; (b) samples; (c) training curve samples.]

Swersky, Snoek & RPA, Under Review
Freeze-Thaw Bayesian Optimization

[Figure: error vs. epochs for many training runs, with runs frozen and thawed at different iterations. Swersky, Snoek & RPA, Under Review]
Freeze-Thaw Bayesian Optimization

Acquisition sketch: for each candidate in the basket, fantasize its next observation from the training-curve model; compute P_min over the basket using Monte Carlo simulation; measure the information gain H(P_min); and select x_k, where k = argmax_k a(k), as the next model to run. Given a basket, the task becomes one of choosing whether to favor picking new models or running old ones further.

[Fig 3: results of the empirical comparison on machine learning hyperparameter optimization problems, showing the best value observed over all training epochs evaluated by each method: (a) logistic regression (classification error vs. epochs), (b) online LDA (perplexity vs. epochs), (c) probabilistic matrix factorization (RMSE vs. epochs), each comparing Freeze-Thaw with Warped GP EI MCMC. Swersky, Snoek & RPA, Under Review]
New Directions for Bayesian Optimization

Finding new organic materials. Optimizing robot control systems. Improving turbine blade design. Designing DNA for protein affinities.

Cheetah Gait - Sangbae Kim's Lab @ MIT
[Videos: the human-tuned gait vs. the Bayesian-optimized gait.]
"I just want to use Bayesian optimization."

‣ The tool my group maintains for performing Bayesian optimization is called Spearmint and is available at https://github.com/HIPS/Spearmint
‣ Spearmint is available for non-commercial use, and has most of the fancy research bits I've described today.
‣ For turn-key, super-easy Bayesian optimization, we're launching a startup at whetlab.com. It provides a straightforward REST API and clients in Python, Matlab, R, etc.
‣ Our intention is to make Whetlab cheap/free for ML researchers.
Shameless Plug: Whetlab Python Client

import whetlab

# Define parameters to optimize.
parameters = {
    'learn_rate':   {'type': 'float', 'min': 0, 'max': 15, 'size': 1},
    'weight_decay': {'type': 'float', 'min': -5, 'max': 10, 'size': 1},
    # . . .
    'layer1_size':  {'type': 'integer', 'min': 1, 'max': 1000, 'size': 1}}

name = 'My Fancy Deep Network'
description = 'Optimize the hyperparams of my network.'
outcome = {'name': 'Cross-Validation Accuracy', 'type': 'float'}
scientist = whetlab.Experiment(name=name, description=description,
                               parameters=parameters, outcome=outcome)

# Loop until you're happy.
# You can have multiple of these in parallel, too.
for ii in xrange(1000):
    # Ask the Whetlab scientist for a new experiment to try.
    job = scientist.suggest()

    # Perform the experiment with your own code.
    # . . . do training and cross-validation here . . .

    # Inform the Whetlab scientist about the outcome.
    scientist.update(job, outcome)
Another Shameless Plug: Kayak

‣ Simple open-source Python autodiff library.
‣ Meant to be a lightweight alternative to, e.g., Theano.
‣ Available at https://github.com/HIPS/Kayak
‣ Not yet all that featureful.

import kayak
import numpy.random as npr

# Create a batcher object.
batcher = kayak.Batcher(batch_size, inputs.shape[0])

# Specify inputs and targets.
X = kayak.Inputs(inputs, batcher)
T = kayak.Targets(targets, batcher)

# First-layer weights and biases, with random initializations.
W1 = kayak.Parameter(0.1 * npr.randn(inputs.shape[1], layer1_sz))
B1 = kayak.Parameter(0.1 * npr.randn(1, layer1_sz))

# First hidden layer: ReLU + Dropout
H1 = kayak.Dropout(kayak.HardReLU(kayak.ElemAdd(kayak.MatMult(X, W1), B1)),
                   layer1_dropout)

# . . . various other bits in your architecture . . .

# Specify your loss function.
loss = kayak.MatSum(kayak.LogMultinomialLoss(Y, T))

# Gradients are so easy!
grad_W1 = loss.grad(W1)
grad_B1 = loss.grad(B1)
Thanks

Jasper Snoek, Kevin Swersky, Hugo Larochelle, Michael Gelbart

Coming soon at whetlab.com
Code: https://github.com/HIPS/Spearmint