An Introduction to Recursive Partitioning

Carolin Strobl, James Malley and Gerhard Tutz


Technical Report Number 55, 2009
Department of Statistics, University of Munich
http://www.stat.uni-muenchen.de

An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests

Carolin Strobl
Department of Statistics
Ludwig-Maximilians-Universität
Munich, Germany

James Malley
Center for Information Technology
National Institutes of Health
Bethesda, MD, USA

Gerhard Tutz
Department of Statistics
Ludwig-Maximilians-Universität
Munich, Germany

Abstract

Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Random forests in particular, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years. High-dimensional problems are common not only in genetics but also in some areas of psychological research, where only a few subjects can be measured due to time or cost constraints, yet a large amount of [...]

• Select some subjects. (Otherwise fitting will take a while, because all combinations of subjects need to be compared for parameter instabilities in their regression models.)

> dat_sleep dat_sleep$Subject mymob 10.227

[Figure: tree plots. Each inner node shows a split variable with its p-value and cutpoint; the split variables include brain_pH (p = 0.005), gene_2807 (p = 0.002), simulated_gene_1 (p = 0.003 and p = 0.001), gene_1440 (p = 0.002), gene_5236 (p = 0.026), gene_8717 (p = 0.026), gene_9489 (p = 0.071), gene_13669 (p = 0.124) and gene_11676 (p = 0.27). Each terminal node shows the class proportions y, e.g. y = (0.867, 0.133).]

• Compute and plot the permutation importance of each predictor variable.

> set.seed(2908)
> myvarimp ...
> barplot(myvarimp[90:100], space=0.75, xlim=c(0,0.035),
+         names.arg=rownames(myvarimp)[90:100], horiz=TRUE, cex.names=0.45,
+         cex=0.45, las=1)
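The same call pattern can be tried on simulated data. The following is a minimal sketch, not the analysis reported here: it assumes the party package is installed, and the data set `dat_sim`, the objects `cf_sim` and `vi_sim`, and the settings `ntree = 100` and `mtry = 2` are hypothetical choices made only for illustration. In the versions of party we are aware of, `varimp()` returns a named numeric vector.

```r
# Minimal sketch on SIMULATED data; assumes the 'party' package is installed.
library(party)

set.seed(2908)
n <- 100

# Hypothetical data: x1 carries signal, x2 and x3 are pure noise.
dat_sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat_sim$status <- factor(ifelse(dat_sim$x1 + rnorm(n, sd = 0.5) > 0,
                                "case", "control"))

# Forest of conditional inference trees with the "unbiased" settings.
# ntree and mtry are arbitrary illustrative values.
cf_sim <- cforest(status ~ ., data = dat_sim,
                  controls = cforest_unbiased(ntree = 100, mtry = 2))

# Permutation importance: average decrease in out-of-bag accuracy
# after randomly permuting each predictor.
vi_sim <- varimp(cf_sim)

# Horizontal bar plot, one bar per predictor.
barplot(sort(vi_sim), horiz = TRUE, las = 1,
        xlab = "Permutation importance")
```

Because the permutation step is random, the seed is set before fitting so that the importance values are reproducible. Values close to zero, or slightly negative, are what one would expect for the noise predictors.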

[Figure: horizontal bar plot of the permutation importance (x-axis from 0.000 to 0.035) for gene_17678, gene_5491, gene_7193, gene_8674, gene_4087, gene_11676, gene_12588, gene_21180, gene_9569, simulated_gene_2 and gene_10430.]

(Only a few genes are displayed here to save space. All plot options except the first are purely aesthetic.)

• Predict the response class or the class probabilities for some selected subjects.


> subjects ...
> y ...
> y_hat ...
> y_hat ...
> y_hat_oob ...
> sum(dat_genes$status==y_hat)/nrow(dat_genes)
[1] 0.9016393
> sum(dat_genes$status==y_hat_oob)/nrow(dat_genes)
[1] 0.6721311
> table(dat_genes$status, y_hat)
                  y_hat
                   Bipolar disorder Healthy control
  Bipolar disorder               28               2
  Healthy control                 4              27


> table(dat_genes$status, y_hat_oob)
                  y_hat_oob
                   Bipolar disorder Healthy control
  Bipolar disorder               20              10
  Healthy control                10              21
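The two accuracy values reported above can be recovered directly from the two cross-tables: each is the sum of the diagonal (correctly classified subjects) divided by the total number of subjects. The short sketch below retypes the printed counts by hand; the object names `conf_learn`, `conf_oob` and the helper `accuracy()` are hypothetical and not part of the original session.

```r
# Counts copied from the two tables printed above
# (rows: observed status, columns: predicted class).
classes <- c("Bipolar disorder", "Healthy control")

conf_learn <- matrix(c(28,  2,
                        4, 27),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(status = classes, y_hat = classes))

conf_oob <- matrix(c(20, 10,
                     10, 21),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(status = classes, y_hat_oob = classes))

# Proportion of correctly classified subjects: diagonal over total.
accuracy <- function(conf) sum(diag(conf)) / sum(conf)

acc_learn <- accuracy(conf_learn)  # (28 + 27) / 61
acc_oob   <- accuracy(conf_oob)    # (20 + 21) / 61

print(round(c(learning = acc_learn, oob = acc_oob), 7))
```

The gap between the two values is expected: predictions for the learning sample are made by trees that have already seen those subjects and are therefore optimistic, whereas each out-of-bag prediction uses only trees that were grown without the subject in question.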
