A Tutorial on Bayesian Optimization for Machine Learning

Ryan P. Adams
School of Engineering and Applied Sciences, Harvard University

http://hips.seas.harvard.edu

Machine Learning Meta-Challenges

‣ Increasing Model Complexity
  More flexible models have more parameters.

‣ More Sophisticated Fitting Procedures
  Non-convex optimization has many knobs to turn.

‣ Less Accessible to Non-Experts
  Harder to apply complicated techniques.

‣ Results Are Less Reproducible
  Too many important implementation details are missing.


Example: Deep Neural Networks

‣ Resurgent interest in large neural networks.

‣ When well-tuned, very successful for visual object identification, speech recognition, computational biology, ...

‣ Big investments by Google, Facebook, Microsoft, etc.

‣ Many choices: number of layers, weight regularization, layer size, which nonlinearity, batch size, learning rate schedule, stopping conditions.

Example: Online Latent Dirichlet Allocation

‣ Hoffman, Blei, and Bach (2010): Approximate inference for large-scale text analysis with latent Dirichlet allocation.

‣ When well-tuned, gives good empirical results.

‣ Many choices: Dirichlet parameters, number of topics, vocabulary size, learning rate schedule, batch size.

Figure by David Blei

Example: Classification of DNA Sequences

‣ Objective: predict which DNA sequences will bind with which proteins.

‣ Miller, Kumar, Packer, Goodman, and Koller (2012): Latent Structural Support Vector Machine.

‣ Choices: margin parameter, entropy parameter, convergence criterion.

Search for Good Hyperparameters?

‣ Define an objective function.
  Most often, we care about generalization performance.
  Use cross-validation to measure parameter quality.

‣ How do people currently search? Black magic.
  Grid search
  Random search
  Grad student descent

‣ Painful!
  Requires many training cycles.
  Possibly noisy.

Can We Do Better? Bayesian Optimization

‣ Build a probabilistic model for the objective.
  Include hierarchical structure about units, etc.

‣ Compute the posterior predictive distribution.
  Integrate out all the possible true functions.
  We use Gaussian process regression.

‣ Optimize a cheap proxy function instead.
  The model is much cheaper than the true objective.

The main insight: make the proxy function exploit uncertainty to balance exploration against exploitation.

The Bayesian Optimization Idea

[Sequence of figures: a GP posterior over the objective on [0, 1], with the current best value marked. As evaluations accumulate, the posterior tightens around the observations. Where should we evaluate next in order to improve the most? One idea: "expected improvement".]

Historical Overview

‣ Closely related to statistical ideas of optimal design of experiments, dating back to Kirstine Smith in 1918.

‣ As response surface methods, they date back to Box and Wilson in 1951.

‣ As Bayesian optimization, studied first by Kushner in 1964 and then Mockus in 1978.

‣ Methodologically, it touches on several important machine learning areas: active learning, contextual bandits, Bayesian nonparametrics.

‣ Started receiving serious attention in ML in 2007, for example:
  Brochu, de Freitas & Ghosh, NIPS 2007 [preference learning]
  Krause, Singh & Guestrin, JMLR 2008 [optimal sensor placement]
  Srinivas, Krause, Kakade & Seeger, ICML 2010 [regret bounds]
  Brochu, Hoffman & de Freitas, UAI 2011 [portfolios]

‣ Interest exploded when it was realized that Bayesian optimization provides an excellent tool for finding good ML hyperparameters.

Today’s Topics

‣ Review of Gaussian process priors
‣ Bayesian optimization basics
‣ Managing covariances and kernel parameters
‣ Accounting for the cost of evaluation
‣ Parallelizing training
‣ Sharing information across related problems
‣ Better models for nonstationary functions
‣ Random projections for high-dimensional problems
‣ Accounting for constraints
‣ Leveraging partially-completed training runs

Gaussian Processes as Function Models

Nonparametric prior on functions, specified in terms of a positive definite kernel.

[Figures: samples from GP priors on one- and two-dimensional input spaces.]

Kernels exist for many structured objects:
  strings, e.g., "the quick brown fox jumps over the lazy dog" (Haussler 1999)
  permutations (Kondor 2008)
  proteins (Borgwardt 2007)

Gaussian Processes

‣ A Gaussian process (GP) is a distribution on functions.

‣ Allows tractable Bayesian modeling of functions without specifying a particular finite basis.

‣ Input space (where we're optimizing): $\mathcal{X}$

‣ Model scalar functions: $f : \mathcal{X} \to \mathbb{R}$

‣ Positive definite covariance function: $C : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$

‣ Mean function: $m : \mathcal{X} \to \mathbb{R}$

Gaussian Processes

Any finite set of $N$ points in $\mathcal{X}$, $\{x_n\}_{n=1}^N$, induces an $N$-dimensional Gaussian distribution on $\mathbb{R}^N$, taken to be the distribution on $\{y_n = f(x_n)\}_{n=1}^N$.

[Figures: the induced finite-dimensional Gaussian shown first as marginal error bars, then as joint samples.]

Gaussian Processes

‣ Due to the Gaussian form, there are closed-form solutions for many useful questions about finite data.

‣ Marginal likelihood:
$\ln p(y \mid X, \theta) = -\frac{1}{2} y^\top K_\theta^{-1} y - \frac{1}{2} \ln |K_\theta| - \frac{N}{2} \ln 2\pi$

‣ Predictive distribution at test points $\{x_m\}_{m=1}^M$:
$y^{\text{test}} \sim \mathcal{N}(m, \Sigma)$, where
$m = k_\theta^\top K_\theta^{-1} y \qquad \Sigma = \kappa_\theta - k_\theta^\top K_\theta^{-1} k_\theta$

We compute these matrices from the covariance:
$[K_\theta]_{n,n'} = C(x_n, x_{n'}; \theta) \qquad [k_\theta]_{n,m} = C(x_n, x_m; \theta) \qquad [\kappa_\theta]_{m,m'} = C(x_m, x_{m'}; \theta)$
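A minimal NumPy sketch of these formulas, assuming a squared-exponential covariance; the hyperparameter names `ell` and `amp` and the jitter term are illustrative, not part of the slides:

import numpy as np

def sq_exp_kernel(A, B, ell=1.0, amp=1.0):
    # Squared-exponential covariance between rows of A (N x D) and B (M x D).
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return amp * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X, y, Xstar, noise=1e-6, **kern_args):
    # Closed-form GP predictive mean and covariance at test points Xstar.
    K = sq_exp_kernel(X, X, **kern_args) + noise * np.eye(len(X))
    k = sq_exp_kernel(X, Xstar, **kern_args)           # k_theta
    kappa = sq_exp_kernel(Xstar, Xstar, **kern_args)   # kappa_theta
    L = np.linalg.cholesky(K)                          # stable solves with K
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = np.dot(k.T, alpha)                          # m = k^T K^{-1} y
    v = np.linalg.solve(L, k)
    cov = kappa - np.dot(v.T, v)                       # Sigma = kappa - k^T K^{-1} k
    return mean, cov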

Examples of GP Covariances

Squared-Exponential:
$C(x, x') = \exp\left\{ -\frac{1}{2} \sum_{d=1}^{D} \left( \frac{x_d - x'_d}{\ell_d} \right)^2 \right\}$

Matérn:
$C(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{\ell} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, r}{\ell} \right)$

"Neural Network":
$C(x, x') = \frac{2}{\pi} \sin^{-1}\left\{ \frac{2\, x^\top \Sigma x'}{\sqrt{(1 + 2\, x^\top \Sigma x)(1 + 2\, x'^\top \Sigma x')}} \right\}$

Periodic:
$C(x, x') = \exp\left\{ -\frac{2 \sin^2\!\left( \frac{1}{2}(x - x') \right)}{\ell^2} \right\}$
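As a concrete example, here is a sketch of the Matérn ν = 5/2 case, which has a simple closed form; the length scale `ell` and amplitude `amp` are illustrative defaults:

def matern52(A, B, ell=1.0, amp=1.0):
    # Matern 5/2: amp * (1 + s + s^2/3) * exp(-s), with s = sqrt(5) * r / ell.
    r = np.sqrt(np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1))
    s = np.sqrt(5.0) * r / ell
    return amp * (1.0 + s + s ** 2 / 3.0) * np.exp(-s)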

GPs Provide Closed-Form Predictions

[Figure: GP posterior mean and uncertainty bands conditioned on a handful of observations.]

Using Uncertainty in Optimization

‣ Find the minimum: $x^\star = \arg\min_{x \in \mathcal{X}} f(x)$

‣ We can evaluate the objective pointwise, but do not have an easy functional form or gradients.

‣ After performing some evaluations, the GP gives us easy closed-form marginal means and variances.

‣ Exploration: seek places with high variance.

‣ Exploitation: seek places with low mean.

‣ The acquisition function balances these for our proxy optimization to determine the next evaluation.

Closed-Form Acquisition Functions

‣ The GP posterior gives a predictive mean function $\mu(x)$ and a predictive marginal variance function $\sigma^2(x)$. Define
$\gamma(x) = \frac{f(x_{\text{best}}) - \mu(x)}{\sigma(x)}$

‣ Probability of Improvement (Kushner 1964):
$a_{\text{PI}}(x) = \Phi(\gamma(x))$

‣ Expected Improvement (Mockus 1978):
$a_{\text{EI}}(x) = \sigma(x) \left( \gamma(x)\, \Phi(\gamma(x)) + \mathcal{N}(\gamma(x) \,;\, 0, 1) \right)$

‣ GP Upper (Lower) Confidence Bound (Srinivas et al. 2010):
$a_{\text{LCB}}(x) = \mu(x) - \kappa\, \sigma(x)$
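All three acquisitions are cheap to evaluate from the GP predictions. A sketch for minimization, where `kappa` is an illustrative exploration constant:

from scipy.stats import norm

def acquisitions(mu, sigma, f_best, kappa=2.0):
    # PI, EI, and LCB from the GP predictive mean/std at candidate points.
    gamma = (f_best - mu) / sigma
    pi = norm.cdf(gamma)                                      # prob. of improvement
    ei = sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))  # expected improvement
    lcb = mu - kappa * sigma                                  # lower confidence bound
    return pi, ei, lcb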

Probability of Improvement

[Figure: the PI acquisition function plotted beneath the GP posterior.]

Expected Improvement

[Figure: the EI acquisition function plotted beneath the GP posterior.]

GP Upper (Lower) Confidence Bound

[Figure: the LCB acquisition function plotted beneath the GP posterior.]

Distribution Over Minimum (Entropy Search)

[Figure: the induced distribution over the location of the minimum.]

Illustrating Bayesian Optimization

[Animation: repeated fit-acquire-evaluate steps of the Bayesian optimization loop.]
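Putting the pieces together, here is a minimal sequential loop in the spirit of these slides, reusing the gp_posterior and acquisitions sketches above; it maximizes EI on a dense 1-D grid as a stand-in for a real proxy optimizer, and `objective` is a placeholder for your function:

def bayes_opt(objective, n_init=3, n_iters=20, grid_size=1000):
    grid = np.linspace(0.0, 1.0, grid_size)[:, None]
    X = np.random.rand(n_init, 1)                  # initial random probes
    y = np.array([objective(x[0]) for x in X])
    for _ in range(n_iters):
        mu, cov = gp_posterior(X, y, grid)
        sigma = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
        _, ei, _ = acquisitions(mu, sigma, y.min())
        x_next = grid[np.argmax(ei)]               # optimize the cheap proxy
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next[0]))
    return X[np.argmin(y)], y.min()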

Why Doesn’t Everyone Use This?

These ideas have been around for decades. Why isn't Bayesian optimization in broader use?

‣ Fragility and poor default choices.
  Getting the function model wrong can be catastrophic.

‣ There hasn't been standard software available.
  It's a bit tricky to build such a system from scratch.

‣ Experiments are run sequentially.
  We want to take advantage of cluster computing.

‣ Limited scalability in dimensions and evaluations.
  We want to solve big problems.

Fragility and Poor Default Choices

Ironic problem: Bayesian optimization has its own hyperparameters!

‣ Covariance function selection
  This turns out to be crucial to good performance.
  The default choice for regression is way too smooth.
  Instead: use an adaptive Matérn 5/2 kernel.

‣ Gaussian process hyperparameters
  The typical empirical Bayes approach can fail horribly.
  Instead: use Markov chain Monte Carlo integration.
  Slice sampling means no additional parameters!

Covariance Function Choice

$C(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{\ell} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, r}{\ell} \right)$

[Figure: GP samples under the Matérn covariance for $\nu = 1/2$, $\nu = 1$, $\nu = 3/2$, and $\nu = 5/2$; larger $\nu$ yields smoother functions.]

Choosing Covariance Functions

Structured SVM for protein motif finding, Miller et al. (2012).

[Figure: classification error vs. function evaluations for Matérn 5/2 ARD, squared-exponential ISO, squared-exponential ARD, and Matérn 3/2 ARD covariances.]

Snoek, Larochelle & RPA, NIPS 2012

MCMC for GP Hyperparameters

‣ Covariance hyperparameters are often optimized rather than marginalized, typically in the name of convenience and efficiency.

‣ Slice sampling of hyperparameters (e.g., Murray and Adams 2010) is comparably fast and easy, but accounts for uncertainty in length scale, mean, and amplitude.

‣ Integrated acquisition function:
$\hat{a}(x) = \int a(x \,;\, \theta)\; p(\theta \mid \{x_n, y_n\}_{n=1}^N)\; d\theta \approx \frac{1}{K} \sum_{k=1}^{K} a(x \,;\, \theta^{(k)}), \qquad \theta^{(k)} \sim p(\theta \mid \{x_n, y_n\}_{n=1}^N)$

For a theoretical discussion of the implications of inferring hyperparameters with BayesOpt, see recent work by Wang and de Freitas (http://arxiv.org/abs/1406.7758).

Snoek, Larochelle & RPA, NIPS 2012
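A sketch of the Monte Carlo average, assuming the hyperparameter samples (e.g., from slice sampling) are given as keyword dictionaries compatible with the gp_posterior sketch from earlier:

def integrated_ei(X, y, grid, hyper_samples):
    # Average the acquisition over K posterior samples theta^(k).
    total = np.zeros(len(grid))
    for theta in hyper_samples:
        mu, cov = gp_posterior(X, y, grid, **theta)
        sigma = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
        _, ei, _ = acquisitions(mu, sigma, y.min())
        total += ei
    return total / len(hyper_samples)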

Integrating Out GP Hyperparameters

[Figure: posterior samples with three different length scales; the length-scale-specific expected improvement for each; and the integrated expected improvement.]

MCMC for Hyperparameters

Logistic regression for handwritten digit recognition.

[Figure: classification error vs. function evaluations for GP EI MCMC, GP EI Conventional, and the Tree Parzen Algorithm.]

Snoek, Larochelle & RPA, NIPS 2012

Cost-Sensitive Bayesian Optimization

‣ Practical problems often have varying costs, e.g., durations, hardware requirements, materiel.

‣ We often have a budget or a deadline, and would ideally like to account for these limited resources.

‣ It may be sensible to first characterize the function with cheap evaluations before trying expensive ones.

‣ These costs are probably unknown, but they can themselves be modeled with Gaussian processes.

‣ One idea: expected improvement per second.

Snoek, Larochelle & RPA, NIPS 2012
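A sketch of the idea: fit a second GP to the log of the observed durations and divide expected improvement by the predicted duration, preferring points that are both informative and cheap. Names and details are illustrative, reusing the earlier sketches:

def ei_per_second(X, y, times, grid):
    mu, cov = gp_posterior(X, y, grid)
    sigma = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
    _, ei, _ = acquisitions(mu, sigma, y.min())
    mu_t, _ = gp_posterior(X, np.log(times), grid)   # GP over log durations
    return ei / np.exp(mu_t)                         # improvement per second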

Expected Improvement per Second

[Sequence of figures illustrating how the duration model reshapes the acquisition function.]

Expected Improvement per Second

Structured SVM for protein motif finding, Miller et al. (2012).

[Figure: minimum function value vs. time (hours) for GP EI MCMC and GP EI per Second.]

Snoek, Larochelle & RPA, NIPS 2012

Expected Improvement per Second

CIFAR-10: deep convolutional neural network (Krizhevsky). Achieves 9.5% test error vs. 11% with hand tuning.

[Figure: classification error vs. time (hours) for GP EI MCMC, GP EI per Second, and the previous state of the art.]

We can easily compute the predicted expected inverse duration and use it to compute the expected improvement per second as a function of $x$.

Snoek, Larochelle & RPA, NIPS 2012

Parallelizing Bayesian Optimization

‣ Classical Bayesian optimization is sequential.
  Do an experiment. Wait until it finishes. Repeat.

‣ Compute clusters let us do many things at once.
  How do we efficiently pick multiple experiments?

‣ Fantasize outcomes from the Gaussian process!
  The probabilistic model gives us coherent predictions.
  Use the predictions to imagine possible futures.
  Evaluate points that are good under the average future.

Consider the situation in which $N$ evaluations have completed, yielding data $\{x_n, y_n\}_{n=1}^N$, and in which $J$ evaluations are pending at locations $\{x_j\}_{j=1}^J$. Ideally, we would choose a new point based on the expected acquisition function under all possible outcomes of these pending evaluations:
$\hat{a}(x \,;\, \{x_n, y_n\}, \theta, \{x_j\}) = \int_{\mathbb{R}^J} a(x \,;\, \{x_n, y_n\}, \theta, \{x_j, y_j\})\; p(\{y_j\}_{j=1}^J \mid \{x_j\}_{j=1}^J, \{x_n, y_n\}_{n=1}^N)\; dy_1 \cdots dy_J$
This is simply the expectation of $a(x)$ under a $J$-dimensional Gaussian distribution whose mean and covariance can easily be computed, so, as in the covariance hyperparameter case, it is straightforward to use samples from this distribution to compute the expected acquisition (see the sketch below).

‣ Alternative: shrink the variance and integrate out the mean, as in Desautels, Krause & Burdick, JMLR 2014.

Snoek, Larochelle & RPA, NIPS 2012

Parallelizing Bayesian Optimization

[Figure: two pending experiments with three joint fantasies; three expected improvement functions, one for each fantasy; integrating out the fantasies yields an overall expected improvement.]

Parallelized Bayesian Optimization

Online latent Dirichlet allocation on 250K documents, Hoffman, Blei & Bach (2010).

[Figure: out-of-sample perplexity vs. time (days) for GP EI MCMC with 1, 3, 5, and 10 parallel workers.]

Snoek, Larochelle & RPA, NIPS 2012

Multi-Task Bayesian Optimization

‣ We usually don't have just one problem to solve.
  Probably, we have a set of related objectives.

‣ Examples:
  Object classification for different image types.
  Speech recognition for different languages/speakers.
  Different folds of the same cross-validation setup.

‣ How can we share information between problems?
  Use a prior on vector-valued functions.
  Reframe acquisitions to leverage output covariance.
  Make decisions that look for bits about the minimum.

Swersky, Snoek & RPA, NIPS 2013

Parallel-Task Bayesian Optimization

Convolutional neural networks with identical architectures, trained on three datasets:
  Google Street View House Numbers: 73K 32x32 digits
  STL-10: 5K 96x96 natural images
  CIFAR-10: 50K 32x32 natural images

Swersky, Snoek & RPA, NIPS 2013

Correlated Objective Functions

[Figure: three dependent functions (1), (2), (3); independent predictions vs. dependent predictions.]

Swersky, Snoek & RPA, NIPS 2013

Parallel-Task Bayesian Optimization

Visual object recognition: STL-10 and CIFAR-10. Achieves 70.7% test accuracy; the best previous result was 64.5%.

[Figure: minimum error rate vs. function evaluations for STL-10 from scratch and STL-10 from CIFAR-10.]

Swersky, Snoek & RPA, NIPS 2013

Parallel-Task Bayesian Optimization

Recognizing house numbers, leveraging object recognition. Achieves 4.77% test error, the best non-dropout result.

[Figure: minimum error rate vs. function evaluations for SVHN from scratch, SVHN from CIFAR-10, and SVHN from a subset of SVHN.]

Swersky, Snoek & RPA, NIPS 2013

Auxiliary-Task Bayesian Optimization

Using a smaller problem to go fast on a bigger problem.

[Figures: digit recognition (classification error vs. time in minutes) and online latent Dirichlet allocation (perplexity vs. time in days), comparing multi-task (MTBO) against single-task (STBO) Bayesian optimization; companion plots show the time MTBO takes to reach the value STBO attains in a given budget.]

Swersky, Snoek & RPA, NIPS 2013

Bayesian Optimization with Unknown Constraints

‣ In general, you need to specify a range for each parameter you wish to optimize.

‣ However, there may be regimes in which things break that are not easily pre-specified.

‣ Alternatively, you may have expensive constraints that you want to make optimal decisions about:
$\min_{x} f(x) \quad \text{s.t.} \quad \forall k \;\; g_k(x) \geq 0$
with objective function $f$ and constraint functions $g_k$.

Gelbart, Snoek & RPA, UAI 2014

Noisy Constraints

‣ Realistically, lots of problems we care about have noisy constraints, just like they have noisy objectives.

‣ This is a problem because we can never be completely sure that we have a valid solution.

‣ This can be addressed via probabilistic constraints:
$\min_{x} \mathbb{E}[f(x)] \quad \text{s.t.} \quad \forall k \;\; \Pr[g_k(x) \geq 0] \geq 1 - \delta_k$
with objective function $f$, constraint functions $g_k$, and confidence thresholds $1 - \delta_k$.

Gelbart, Snoek & RPA, UAI 2014
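One common way to act on probabilistic constraints (a sketch of the general weighted-EI heuristic, not necessarily the paper's exact acquisition) is to weight expected improvement by the modeled probability of feasibility, with one GP per constraint, reusing the earlier sketches:

from scipy.stats import norm

def constrained_ei(X, y, constraint_obs, grid):
    mu, cov = gp_posterior(X, y, grid)
    sigma = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
    _, ei, _ = acquisitions(mu, sigma, y.min())
    p_feasible = np.ones(len(grid))
    for g in constraint_obs:                    # observed values of each g_k
        mu_g, cov_g = gp_posterior(X, g, grid)
        sd_g = np.sqrt(np.clip(np.diag(cov_g), 1e-12, None))
        p_feasible *= norm.cdf(mu_g / sd_g)     # Pr[g_k(x) >= 0]
    return ei * p_feasible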

Decoupled Constraints

‣ Just like in the multi-task case, it may be possible to evaluate the constraints separately from the objective.

‣ The constraints may have vastly different costs.

‣ We call these decoupled constraints.

‣ We use entropy search to determine whether to evaluate the constraint or the objective.

Gelbart, Snoek & RPA, UAI 2014

LDA with Sparsity Constraints

Require that online LDA result in low-entropy topics.

[Figure (a), Online LDA: minimum function value vs. function evaluations for the constrained method, the unconstrained method, and Gramacy and Lee (2010).]

Neural Network with Memory Constraints

Decoupled: find a good neural network for MNIST, subject to stringent memory limitations.

[Figure (b), Neural Network: minimum function value vs. function evaluations for the constrained and unconstrained methods.]

Bayesian Optimization with Random Projections

‣ Bayesian optimization relies on having a good prior on functions, which is challenging in high dimensions!

‣ Realistically, many problems have low effective dimensionality, i.e., the objective is primarily determined by a manifold in the input space.

‣ Wang, Zoghi, Hutter, Matheson & de Freitas (IJCAI 2013) suggest a way to deal with many dimensions.

‣ Approach: choose a random embedding of the input dimensions and then optimize in the subspace.

‣ Straightforward to implement, and offers some progress for large and difficult BayesOpt problems.
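A sketch of the random-embedding trick: fix a random matrix once, optimize in the low-dimensional space, and map candidates back up. The dimensions and the objective are placeholders:

import numpy as np

D, d = 1000, 5                           # ambient and effective dimensionality
R = np.random.randn(D, d)                # fixed random embedding

def embedded_objective(z, high_dim_objective, lo=-1.0, hi=1.0):
    x = np.clip(np.dot(R, z), lo, hi)    # map to full space, clip to the box
    return high_dim_objective(x)

# Any low-dimensional Bayesian optimizer can now be run over z.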

Input Warping for Nonstationary Objectives

‣ Good priors are key to Bayesian optimization.

‣ Almost all widely-used kernels are stationary, i.e., covariance depends only on the distance.

‣ For machine learning hyperparameters, this is a big deal; learning rate changes may have big effects in one regime and small ones in another.

‣ Real problems are highly nonstationary, but it is difficult to strike a balance between flexibility and tractability.

‣ One approach: extend the kernel to include a parametric bijection of each input, and integrate it out with MCMC.

‣ The CDFs of beta and Kumaraswamy distributions are suitable bijections on the unit hypercube and allow interesting nonlinear effects.

[Figure: example warpings, (a) linear, (b) exponential, (c) logarithmic, (d) sigmoidal.]

Snoek, Swersky, Zemel & RPA, ICML 2014
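A sketch of the warping idea using the Kumaraswamy CDF; in the paper's approach the shape parameters would be integrated out with MCMC, but here they are fixed for illustration, reusing the sq_exp_kernel sketch from earlier:

def kumaraswamy_warp(x, a=2.0, b=0.5):
    # CDF of Kumaraswamy(a, b): a parametric bijection on [0, 1].
    return 1.0 - (1.0 - x ** a) ** b

def warped_kernel(Xa, Xb, a=2.0, b=0.5, **kern_args):
    # Stationary kernel applied to warped inputs => nonstationary model.
    return sq_exp_kernel(kumaraswamy_warp(Xa, a, b),
                         kumaraswamy_warp(Xb, a, b), **kern_args)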

Input Warping for Nonstationary Objectives

[Figure 3: an empirical comparison of Bayesian optimization following the standard convention (GP EI MCMC, in the original space and in log space) and the input-warping strategy (Warped GP EI MCMC) on four challenging hyperparameter optimization problems: (a) logistic regression, (b) online LDA, (c) structured SVM, and (d) a CIFAR-10 subset. The benchmarks follow Eggensperger et al. (2013), which also reports results for SMAC (Hutter et al., 2011) and the Tree Parzen Estimator (TPE) (Bergstra et al., 2011). Each algorithm was run ten times for the given number of evaluations; the average validation loss and standard deviation are reported. On some benchmarks, input warping converges to a solution in far fewer evaluations than the protocol allows. The learned warpings are non-trivial: for learning rates they agree with intuition, while for others, like dropout, they yield surprising alternatives.]

Freeze-Thaw Bayesian Optimization

‣ For ML BayesOpt, the objective itself is often the result of an incremental training procedure, e.g., SGD.

‣ Humans often abort jobs that aren't promising.

‣ Freeze-thaw Bayesian optimization tries to do the same thing.

‣ It tries to project when a training procedure will give a good value, and only continues promising ones.

‣ Requires a prior on training curves as a function of input hyperparameters; introduces a new kernel for this.

[Fig 1: example functions from the exponential decay kernel. (a) Example functions from the basis set with α = 1.0; (b) samples; (c) training curve samples.]

Swersky, Snoek & RPA, Under Review
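The training-curve prior can be built from a mixture of exponential decays; with a Gamma(α, β) prior over decay rates this integrates to a simple closed form. A sketch, with illustrative parameter values:

def freeze_thaw_kernel(t, tprime, alpha=1.0, beta=0.5):
    # k(t, t') = beta^alpha / (t + t' + beta)^alpha, from integrating
    # exp(-lambda*t) * exp(-lambda*t') against a Gamma(alpha, beta) prior.
    return beta ** alpha / (np.add.outer(t, tprime) + beta) ** alpha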

Freeze-Thaw Bayesian Optimization

[Figure: error vs. epochs for many partially-trained runs, with runs started, paused, and resumed across iterations.]

Swersky, Snoek & RPA, Under Review

Freeze-Thaw Bayesian Optimization

[Algorithm fragment: for each candidate in the basket, predict the eventual value of its training curve, compute P_min over the basket using Monte Carlo simulation, measure the information gain H(P_min), and select x_k with k = argmax_k a(k) as the next model to run. Given a basket, the task becomes one of choosing whether to pick new models or to continue running old ones.]

[Fig 3: results of the empirical comparison on three machine learning hyperparameter optimization problems, reporting the best value observed over all training epochs: (a) logistic regression (classification error vs. epochs), (b) online LDA (perplexity vs. epochs), and (c) probabilistic matrix factorization (RMSE vs. epochs), each comparing Freeze-Thaw against Warped GP EI MCMC.]

Swersky, Snoek & RPA, Under Review

New Directions for Bayesian Optimization

‣ Finding new organic materials
‣ Optimizing robot control systems
‣ Improving turbine blade design
‣ Designing DNA for protein affinities

Cheetah Gait - Sangbae Kim's Lab @ MIT

[Videos: the human-tuned gait and the Bayesian-optimized gait.]

“I just want to use Bayesian optimization.”

‣ The tool my group maintains for performing Bayesian optimization is called Spearmint and is available at https://github.com/HIPS/Spearmint

‣ Spearmint is available for non-commercial use, and has most of the fancy research bits I've described today.

‣ For turn-key, super-easy Bayesian optimization, we're launching a startup at whetlab.com. It provides a straightforward REST API and clients in Python, Matlab, R, etc.

‣ Our intention is to make Whetlab cheap/free for ML researchers.

Shameless Plug: Whetlab Python Client

import whetlab

# Define parameters to optimize.
parameters = {
    'learn_rate':   {'type': 'float',   'min': 0,  'max': 15,   'size': 1},
    'weight_decay': {'type': 'float',   'min': -5, 'max': 10,   'size': 1},
    # . . . . . . . . .
    'layer1_size':  {'type': 'integer', 'min': 1,  'max': 1000, 'size': 1},
}

name = 'My Fancy Deep Network'
description = 'Optimize the hyperparams of my network.'
outcome = {'name': 'Cross-Validation Accuracy', 'type': 'float'}
scientist = whetlab.Experiment(name=name, description=description,
                               parameters=parameters, outcome=outcome)

# Loop until you're happy.
# You can have multiple of these in parallel, too.
for ii in xrange(1000):
    # Ask the Whetlab scientist for a new experiment to try.
    job = scientist.suggest()

    # Perform the experiment with your own code.
    # . . . do training and cross-validation here . . .
    # (store the measured cross-validation accuracy in, say, `accuracy`)

    # Inform the Whetlab scientist about the measured outcome.
    scientist.update(job, accuracy)

Another Shameless Plug: Kayak

‣ Simple open source Python autodiff library.
‣ Meant to be a lightweight alternative to, e.g., Theano.
‣ Available at https://github.com/HIPS/Kayak
‣ Not yet all that featureful.

import kayak
import numpy.random as npr

# Create a batcher object.
batcher = kayak.Batcher(batch_size, inputs.shape[0])

# Specify inputs and targets.
X = kayak.Inputs(inputs, batcher)
T = kayak.Targets(targets, batcher)

# First-layer weights and biases, with random initializations.
W1 = kayak.Parameter(0.1 * npr.randn(inputs.shape[1], layer1_sz))
B1 = kayak.Parameter(0.1 * npr.randn(1, layer1_sz))

# First hidden layer: ReLU + Dropout
H1 = kayak.Dropout(kayak.HardReLU(kayak.ElemAdd(kayak.MatMult(X, W1), B1)),
                   layer1_dropout)

# . . . various other bits in your architecture . . .

# Specify your loss function.
loss = kayak.MatSum(kayak.LogMultinomialLoss(Y, T))

# Gradients are so easy!
grad_W1 = loss.grad(W1)
grad_B1 = loss.grad(B1)

Thanks

Coming soon at whetlab.com

Jasper Snoek · Kevin Swersky · Hugo Larochelle · Michael Gelbart

Code: https://github.com/HIPS/Spearmint