Carolin Strobl, James Malley and Gerhard Tutz
An Introduction to Recursive Partitioning
Technical Report Number 55, 2009
Department of Statistics
University of Munich
http://www.stat.uni-muenchen.de
An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests

Carolin Strobl
Department of Statistics
Ludwig-Maximilians-Universität
Munich, Germany

James Malley
Center for Information Technology
National Institutes of Health
Bethesda, MD, USA

Gerhard Tutz
Department of Statistics
Ludwig-Maximilians-Universität
Munich, Germany
Abstract

Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years. High-dimensional problems are common not only in genetics, but also in some areas of psychological research, where only few subjects can be measured due to time or cost constraints, yet a large amount of data is recorded for each of them, e.g., EEG data, that are too high-dimensional for the application of standard regression methods.

• Select some subjects. (Otherwise fitting will take a while, because all combinations of subjects need to be compared for parameter instabilities in their regression models.)

> data("sleepstudy", package = "lme4")
> dat_sleep <- subset(sleepstudy, Subject %in% levels(Subject)[1:6])  # e.g., keep only six subjects
> dat_sleep$Subject <- factor(dat_sleep$Subject)
> mymob <- mob(Reaction ~ Days | Subject, data = dat_sleep, model = linearModel)
[Plots of individual conditional inference trees for the bipolar disorder data, with splits in brain_pH (p = 0.005), gene_11676, gene_9489, gene_13669 and gene_2807; each terminal node reports the estimated proportions of the two response classes.]
[Plots of further individual trees, with splits in simulated_gene_1 (p = 0.003 and p = 0.001), gene_5236, gene_1440 and gene_8717; each terminal node reports the estimated proportions of the two response classes.]
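Trees of this kind can be grown with the ctree() function from the party package. The following minimal sketch uses simulated data; the data set, the variable names and all parameter values are illustrative only, not those of the original analysis:

```r
# Minimal sketch: grow a conditional inference tree on simulated data.
# Assumes the 'party' package is installed; all names are illustrative.
library(party)

set.seed(123)
n <- 200
# One informative predictor (gene_a) and one noise predictor (gene_b)
dat <- data.frame(
  gene_a = c(rnorm(n/2, mean = 0), rnorm(n/2, mean = 1.5)),
  gene_b = rnorm(n),
  status = factor(rep(c("Healthy control", "Bipolar disorder"), each = n/2))
)

# Splits are selected via conditional inference tests, which is why
# each inner node in the plots above reports a p-value.
mytree <- ctree(status ~ gene_a + gene_b, data = dat)
print(mytree)
plot(mytree)  # produces a node/split diagram like the ones shown above
```

The fitted object can be used with predict() in the same way as the forest objects discussed below.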
• Compute and plot the permutation importance of each predictor variable.

> set.seed(2908)
> myvarimp <- varimp(myforest)  # 'myforest' denotes the fitted cforest object
> barplot(myvarimp[90:100], space=0.75, xlim=c(0,0.035),
+   names.arg=rownames(myvarimp)[90:100], horiz=TRUE, cex.names=0.45,
+   cex=0.45, las=1)
[Barplot of the permutation variable importances for a subset of predictors (gene_17678, gene_5491, gene_7193, gene_8674, gene_4087, gene_11676, gene_12588, gene_21180, gene_9569, simulated_gene_2, gene_10430); the importance axis ranges from 0.000 to 0.035.]
(Only a few genes are displayed here to save space. All plot options except the first are purely aesthetic.)

• Prediction in terms of the predicted response class or the predicted class probabilities for some selected subjects.
> y_hat <- predict(myforest)  # 'myforest' denotes the fitted cforest object
> y_hat_oob <- predict(myforest, OOB=TRUE)
> sum(dat_genes$status==y_hat)/nrow(dat_genes)
[1] 0.9016393
> sum(dat_genes$status==y_hat_oob)/nrow(dat_genes)
[1] 0.6721311
> table(dat_genes$status, y_hat)
                    y_hat
                     Bipolar disorder Healthy control
    Bipolar disorder               28               2
    Healthy control                 4              27
> table(dat_genes$status, y_hat_oob)
                    y_hat_oob
                     Bipolar disorder Healthy control
    Bipolar disorder               20              10
    Healthy control                10              21
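The drop from about 90% accuracy on the learning sample to about 67% out-of-bag shown above is the expected overoptimism of resubstitution estimates. As a reminder of the underlying computation, the following base-R sketch derives the accuracy and the confusion table from a pair of observed and predicted class vectors; the vectors themselves are made up for illustration:

```r
# Base-R sketch: accuracy and confusion table from observed vs. predicted
# class labels. The two vectors are invented for illustration.
observed  <- factor(c("case", "case", "control", "control", "case", "control"))
predicted <- factor(c("case", "control", "control", "control", "case", "case"))

# Proportion of correctly classified observations
accuracy <- sum(observed == predicted) / length(observed)
print(accuracy)  # 4 of 6 correct -> 0.6666667

# Cross-tabulation of observed (rows) vs. predicted (columns) classes
print(table(observed, predicted))
```

The same two lines, applied to the predictions of a fitted forest, reproduce the figures reported above.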