Gradient Boosted Regression Trees


Peter Prettenhofer (@pprett)

Gilles Louppe (@glouppe)

DataRobot

Université de Liège, Belgium

Motivation

Outline

1 Basics

2 Gradient Boosting

3 Gradient Boosting in scikit-learn

4 Case Study: California housing

About us

Peter
• @pprett
• Python & ML ∼ 6 years
• sklearn dev since 2010

Gilles
• @glouppe
• PhD student (Liège, Belgium)
• sklearn dev since 2011
• Chief tree hugger


Machine Learning 101

• Data comes as...
  • A set of examples {(x_i, y_i) | 0 ≤ i < n_samples}, with
  • Feature vector x ∈ R^n_features, and
  • Response y ∈ R (regression) or y ∈ {−1, 1} (classification)

• Goal is to...
  • Find a function ŷ = f(x)
  • Such that the error L(y, ŷ) on new (unseen) x is minimal

Classification and Regression Trees [Breiman et al., 1984]

[Figure: regression tree for the California housing data; the root node splits on MedInc.]
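Not from the slides: a minimal sketch of fitting a single CART regression tree to the California housing data, in the spirit of the figure above. The depth and the use of fetch_california_housing are assumptions, not the presenters' code.

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor

# California housing: 8 numeric features (MedInc, HouseAge, ...), target = median house value
cal = fetch_california_housing()
X, y = cal.data, cal.target

# a shallow CART: greedy, axis-aligned splits such as "MedInc <= threshold"
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, y)

# each leaf predicts the mean target value of the training samples it contains
print(tree.predict(X[:2]))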

Gradient Boosting in scikit-learn

>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> X, y = make_hastie_10_2(n_samples=10000)
>>> est = GradientBoostingClassifier(n_estimators=200, max_depth=3)
>>> est.fit(X, y)
...
>>> # get predictions
>>> pred = est.predict(X)
>>> est.predict_proba(X)[0]  # class probabilities
array([ 0.67,  0.33])
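Not part of the slides: a minimal sketch of evaluating the same classifier on a held-out split, so the reported error corresponds to unseen data as in the learning goal above; the split size and random_state are arbitrary.

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in 0.15-era releases

X, y = make_hastie_10_2(n_samples=10000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

est = GradientBoostingClassifier(n_estimators=200, max_depth=3)
est.fit(X_train, y_train)

# mean accuracy on data the model has never seen during training
print(est.score(X_test, y_test))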

Implementation

• Written in pure Python/Numpy (easy to extend).
• Builds on top of sklearn.tree.DecisionTreeRegressor (Cython).
• Custom node splitter that uses pre-sorting (better for shallow trees).

Example

from sklearn.ensemble import GradientBoostingRegressor

est = GradientBoostingRegressor(n_estimators=2000, max_depth=1).fit(X, y)
for pred in est.staged_predict(X):
    plt.plot(X[:, 0], pred, color='r', alpha=0.1)

[Figure: 1-D toy regression problem, plotting y against x; legend: ground truth, RT max_depth=1, RT max_depth=3, GBRT max_depth=1; annotations mark the high bias / low variance and low bias / high variance regimes.]
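The exact toy data behind this figure is not recoverable from the extracted slides; the following sketch uses an assumed 1-D ground truth merely to reproduce the same kind of comparison between a single stump, a deeper tree, and boosted stumps.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

# assumed toy ground truth (the slide's actual function is unknown)
rng = np.random.RandomState(1)
X = np.sort(rng.uniform(0, 10, size=200))[:, np.newaxis]
y = 3 * np.sin(X.ravel()) + rng.normal(scale=0.5, size=200)

stump = DecisionTreeRegressor(max_depth=1).fit(X, y)   # high bias, low variance
deep = DecisionTreeRegressor(max_depth=3).fit(X, y)    # lower bias, higher variance
gbrt = GradientBoostingRegressor(n_estimators=1000, max_depth=1).fit(X, y)

plt.plot(X[:, 0], y, 'k.', alpha=0.3, label='data')
plt.plot(X[:, 0], stump.predict(X), label='RT max_depth=1')
plt.plot(X[:, 0], deep.predict(X), label='RT max_depth=3')
plt.plot(X[:, 0], gbrt.predict(X), label='GBRT max_depth=1')
plt.legend()
plt.show()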

Model complexity & Overfitting

test_score = np.empty(len(est.estimators_))
for i, pred in enumerate(est.staged_predict(X_test)):
    test_score[i] = est.loss_(y_test, pred)
plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')
plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')

[Figure: train and test error vs. n_estimators (0 to 1000); the test curve reaches its lowest error and then the train-test gap widens.]
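Not on the slides: a hedged sketch of turning the staged test error from the snippet above into a choice of n_estimators (assumes est, X_test, and test_score are the variables defined there).

from itertools import islice
import numpy as np

# stage with the lowest held-out error
best_n = int(np.argmin(test_score)) + 1
print('best n_estimators:', best_n)

# predictions after exactly best_n stages, without refitting the model
pred_best = next(islice(est.staged_predict(X_test), best_n - 1, None))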

Regularization

GBRT provides a number of knobs to control overfitting:
• Tree structure
• Shrinkage
• Stochastic Gradient Boosting

Regularization: Tree structure

• The max_depth of the trees controls the degree of feature interactions.
• Use min_samples_leaf to ensure a sufficient number of samples per leaf.
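Not from the slides: a hedged sketch of tuning these tree-structure knobs with a grid search; the parameter ranges are illustrative, and X_train, y_train are assumed to be the training data.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 0.15-era releases

# illustrative ranges, not taken from the talk
param_grid = {'max_depth': [2, 3, 4, 6],
              'min_samples_leaf': [1, 3, 9]}
gs = GridSearchCV(GradientBoostingRegressor(n_estimators=500),
                  param_grid, cv=3)
gs.fit(X_train, y_train)
print(gs.best_params_)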

Regularization: Shrinkage

• Slow learning by shrinking tree predictions with 0 < learning_rate <= 1.
• A lower learning_rate requires a higher n_estimators.

Model interpretation

Which features are important?

>>> est.feature_importances_
array([ 0.01, 0.38, ...])

[Figure: relative importance of the California housing features (MedInc, AveRooms, Longitude, AveOccup, Latitude, AveBedrms, Population, HouseAge).]
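Not on the slides: a small sketch that pairs the importance scores with the feature names and sorts them, assuming est is the fitted California housing model and names holds its eight feature names.

import numpy as np

importances = est.feature_importances_
order = np.argsort(importances)[::-1]  # most important feature first
for idx in order:
    print('%-12s %.3f' % (names[idx], importances[idx]))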

Model interpretation

What is the effect of a feature on the response?

from sklearn.ensemble import partial_dependence as pd

features = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms',
            ('AveOccup', 'HouseAge')]
fig, axs = pd.plot_partial_dependence(est, X_train, features,
                                      feature_names=names)

[Figure: partial dependence of house value on the non-location features of the California housing dataset: one-way plots for MedInc, AveOccup, HouseAge, and AveRooms, plus a two-way plot for (AveOccup, HouseAge).]
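A hedged companion to the plot above: the same partial_dependence module also exposes a function that returns the raw averaged predictions instead of a figure (feature index 0 for MedInc is an assumption about the column order; in current scikit-learn releases this functionality lives in sklearn.inspection with a different API).

# raw partial dependence of the response on a single feature (index 0 assumed to be MedInc)
pdp, axes = pd.partial_dependence(est, target_variables=[0], X=X_train,
                                  grid_resolution=50)
print(pdp.shape)      # (1, 50): averaged predictions over the grid
print(axes[0][:5])    # first few grid points for MedInc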

Model interpretation

Automatically detects spatial effects.

[Figure: two maps of the partial dependence of median house value on longitude and latitude for the California housing dataset.]
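The maps on this slide were presumably produced with custom plotting; a rough, hedged approximation of the same idea is a two-way partial dependence plot over the location features (the 'Latitude' and 'Longitude' names are assumed to be present in names).

# two-way partial dependence over the location features
fig, axs = pd.plot_partial_dependence(est, X_train,
                                      [('Latitude', 'Longitude')],
                                      feature_names=names)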

Summary

• Flexible non-parametric classification and regression technique
• Applicable to a variety of problems
• Solid, battle-worn implementation in scikit-learn

Thanks! Questions?

Benchmarks

[Figure: error, train time, and test time of gbm vs. sklearn-0.15 on the bioresp, YahooLTRC, Spam, Solar, Madelon, Expedia, Example 10.2, Covtype, California, Boston, and Arcene datasets.]

Tips & Tricks 1

Input layout
Use dtype=np.float32 to avoid memory copies and Fortran layout for a slight runtime benefit:

X = np.asfortranarray(X, dtype=np.float32)

Tips & Tricks 2

Feature interactions
GBRT automatically detects feature interactions, but explicit interaction features often help (a sketch of adding one follows the figure below). Trees required to approximate X1 − X2: 10 (left), 1000 (right).

[Figure: two 3-D surfaces of the approximation of x − y over the unit square, one fit with 10 trees and one with 1000 trees.]
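Not from the slides: a hedged sketch of adding the difference of two features as an explicit column, so that shallow trees can split on it directly; the synthetic data and scores are purely illustrative.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(1000, 2))
y = X[:, 0] - X[:, 1]  # the quantity the trees have to approximate

# append the explicit interaction/difference feature as a third column
X_ext = np.column_stack([X, X[:, 0] - X[:, 1]])

est_plain = GradientBoostingRegressor(n_estimators=10, max_depth=1, learning_rate=1.0).fit(X, y)
est_ext = GradientBoostingRegressor(n_estimators=10, max_depth=1, learning_rate=1.0).fit(X_ext, y)

# compare how well 10 stumps do with and without the explicit difference column
print(est_plain.score(X, y), est_ext.score(X_ext, y))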

Tips & Tricks 3

Categorical variables
Sklearn requires that categorical variables are encoded as numerics. Tree-based methods work well with ordinal encoding:

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'icao': ['CRJ2', 'A380', 'B737', 'B737']})
# ordinal encoding
df_enc = pd.DataFrame(data={'icao': np.unique(df.icao, return_inverse=True)[1]})
X = np.asfortranarray(df_enc.values, dtype=np.float32)
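A hedged variant (not on the slides): pandas.factorize produces the same kind of ordinal codes and extends naturally to several categorical columns at once; the extra 'carrier' column is illustrative.

import numpy as np
import pandas as pd

df = pd.DataFrame({'icao': ['CRJ2', 'A380', 'B737', 'B737'],
                   'carrier': ['DL', 'AF', 'UA', 'UA']})

# factorize each categorical column into integer codes
df_enc = df.apply(lambda col: pd.factorize(col)[0])
X = np.asfortranarray(df_enc.values, dtype=np.float32)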