A streaming parallel decision tree algorithm

Yael Ben-Haim and Elad Yom-Tov
IBM Haifa Research Lab, 165 Aba Hushi St., Haifa 31905, Israel
{yaelbh,yomtov}@il.ibm.com

ABSTRACT
A new algorithm for building decision tree classifiers is proposed. The algorithm is executed in a distributed environment and is especially designed for classifying large datasets and streaming data. It is empirically shown to be as accurate as standard decision tree classifiers, while being scalable to infinite streaming data and multiple processors.

1. INTRODUCTION

We propose a new algorithm for building decision tree classifiers for classifying both large datasets and (possibly infinite) streaming data. As recently noted [4], the challenge which distinguishes large-scale learning from small-scale learning is that training time is limited compared to the amount of available data. Thus, in our algorithm both training and testing are executed in a distributed environment. We refer to the new algorithm as the Streaming Parallel Decision Tree (SPDT).

Decision trees are simple yet effective classification algorithms. One of their main advantages is that they provide human-readable classification rules. Decision trees have several drawbacks, however, especially when trained on large data, where the need to sort all numerical attributes becomes costly in terms of both running time and memory. The sorting is needed in order to decide where to split a node. The various techniques for handling large data can be roughly grouped into two approaches: pre-sorting the data (SLIQ [12] and its successors SPRINT [17] and ScalParC [11]), or replacing sorting by approximate representations of the data, such as sampling and/or histogram building (e.g., BOAT [7], CLOUDS [1], and SPIES [10]). While pre-sorting techniques are more accurate, they can accommodate neither very large datasets nor infinite streaming data.

Faced with the challenge of handling large data, a large body of work has been dedicated to parallel decision tree algorithms [17], [11], [13], [10], [19], [18], [8]. There are several ways to parallelize decision trees (described in detail in [2], [19], [13]). In horizontal parallelism, the data is partitioned such that different processors see different examples (we refer to processing nodes as processors, to avoid confusion with tree nodes). In vertical parallelism, different processors see different attributes. Task parallelism involves distribution of the tree nodes among the processors. Finally, hybrid parallelism combines horizontal or vertical parallelism in the first stages of tree construction with task parallelism towards the end. Like their serial counterparts, parallel decision trees overcome the sorting obstacle by applying pre-sorting, distributed sorting, and approximations. Following our interest in infinite streaming data, we focus on approximate algorithms.

In streaming algorithms, the dominant approach is to read a limited batch of data and use each such batch to split tree nodes. We refer to the processing of each such batch as an iteration of the algorithm. The SPIES algorithm [10] is designed for streaming data, but requires holding each batch in memory because it may need several passes over the batch. pCLOUDS [18] relies on assumptions about the behavior of the impurity function, which are empirically justified but can be false for a particular dataset. We note that none of the experiments reported in previous works involved both a large number of examples and a large number of attributes.

2. ALGORITHM DESCRIPTION

Our proposed algorithm builds the decision tree in a breadth-first manner, using horizontal parallelism. At the core of our algorithm is an on-line method for building histograms from the streaming data at the processors. These histograms are then used for making decisions on new tree nodes at the master processor. We empirically show that our proposed algorithm is as accurate as traditional, single-processor algorithms, while being scalable to infinite streaming data and multiple processors.

2.1 Tree growing algorithm

We construct a decision tree based on a set of training examples {(x1, y1), ..., (xn, yn)}, where x1, ..., xn ∈ R^d are the feature vectors and y1, ..., yn ∈ {1, ..., c} are the labels. Every internal node in the tree possesses two ordered child nodes and a decision rule of the form x(i) < a, where x(i) is the ith feature and a is a real number. Feature vectors that satisfy the decision rule are directed to the node's left child node, and the other vectors are directed to the right child node. Every example x thus has a path from the root to one of the leaves, denoted l(x). Every leaf has a label t, so that an example x is assigned the label t(l(x)). The label is accompanied by a real number that represents the confidence in the label's correctness. (Note that since the number of different confidence levels is upper-bounded by the number of leaves, the decision tree does not provide continuous-valued outputs.)

Initially, the tree consists of a single node. The tree is grown iteratively, such that in each iteration a new level of nodes is appended to the tree. We apply a distributed architecture that consists of Nw processors. Each processor can observe 1/Nw of the data, but has a view of the complete classification tree built so far.
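To make the tree structure concrete, the following is a minimal sketch of a node record and of routing an example to its leaf l(x). The names TreeNode and route, and the exact set of fields, are ours for illustration and are not taken from the paper's implementation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    """A node of the decision tree.

    Internal nodes hold a decision rule x[feature] < threshold; leaves hold
    a label and a confidence value in the label's correctness.
    """
    feature: Optional[int] = None       # index i in the rule x(i) < a
    threshold: Optional[float] = None   # the value a
    left: Optional["TreeNode"] = None   # child for x[feature] < threshold
    right: Optional["TreeNode"] = None  # child for x[feature] >= threshold
    label: Optional[int] = None         # set only at leaves
    confidence: float = 0.0             # set only at leaves

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def route(root: TreeNode, x) -> TreeNode:
    """Return the leaf l(x) reached by feature vector x."""
    node = root
    while not node.is_leaf():
        node = node.left if x[node.feature] < node.threshold else node.right
    return node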

At each iteration, each processor uses the data points it observes to build a histogram for each class, terminal node (leaf), and feature. Each data point is classified to the correct leaf of the current tree and is used to update the relevant histograms (Section 2.2 describes the histogram algorithms). After observing a predefined number of data points (or, in the case of finite data, after seeing the complete data), the histograms are communicated to a master processor, which integrates them and decides how to split the leaves using the chosen split criterion (see, e.g., [5, 16]): for each bin location in the histogram of each dimension, the (approximate) number of points from each class to the left and to the right of this location is counted. These counts are then used to compute the purity of the leaf's prospective child nodes, if the leaf is split at this dimension and location. The feature i and location a for which the child nodes' purities are maximized constitute the decision rule x(i) < a. The leaf becomes an internal node with the chosen decision rule, and two new nodes (its child nodes) are created. If a leaf is already pure enough, the splitting is stopped and the leaf is assigned a label and a confidence level, both determined by the number of examples from each class that reached it.

Decision trees are frequently pruned during or after training to obtain smaller trees and better generalization. We adapted the MDL-based pruning algorithm of [12]. This algorithm involves simple calculations during node splitting that reflect the node's purity. In a bottom-up pass over the complete tree, some subtrees are chosen to be pruned, based on estimates of the expected error rate before and after pruning.
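As an illustration of the split search described above, the following simplified sketch evaluates a single feature of a single leaf: it treats every bin position of the merged per-class histograms as a candidate threshold and scores it by the weighted Gini impurity of the prospective child nodes. The function names (best_split, gini), the use of Gini as the impurity measure, and the crude bin-counting approximation of the left/right class counts are our own illustrative choices; the paper only requires some standard split criterion (e.g., [5, 16]) computed from the histograms.

from typing import Dict, List, Optional, Tuple

Histogram = List[Tuple[float, float]]  # list of bins (position p_i, count m_i)

def gini(counts: List[float]) -> float:
    """Gini impurity of a node given its per-class example counts."""
    total = sum(counts)
    if total == 0.0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def best_split(class_hists: Dict[int, Histogram]) -> Tuple[Optional[float], float]:
    """Choose the threshold a (for one feature at one leaf) that minimizes the
    weighted Gini impurity of the two prospective child nodes.

    class_hists maps a class label to the merged histogram of that class for
    this feature at this leaf.  Returns (best_threshold, best_impurity).
    """
    # Candidate thresholds: all bin positions seen in any class histogram.
    candidates = sorted({p for hist in class_hists.values() for p, _ in hist})
    totals = {c: sum(m for _, m in hist) for c, hist in class_hists.items()}

    best_a, best_score = None, float("inf")
    for a in candidates:
        # Approximate number of points of each class to the left of a.
        left = {c: sum(m for p, m in hist if p < a)
                for c, hist in class_hists.items()}
        right = {c: totals[c] - left[c] for c in class_hists}
        n_left, n_right = sum(left.values()), sum(right.values())
        n = n_left + n_right
        if n_left == 0 or n_right == 0:
            continue  # degenerate split: all points on one side
        score = ((n_left / n) * gini(list(left.values()))
                 + (n_right / n) * gini(list(right.values())))
        if score < best_score:
            best_a, best_score = a, score
    return best_a, best_score

In the full algorithm this search would be repeated for every feature and every unsplit leaf, and the master would keep the overall best (feature, threshold) pair as the new decision rule.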

2.2 On-line histogram building

A histogram is a set of r pairs (called bins) (p1, m1), ..., (pr, mr), where r is a preset constant integer, p1, ..., pr are real numbers, and m1, ..., mr are integers. The histogram is a compressed and approximate representation of a large set S of real numbers, so that |S| = m1 + ... + mr and mi is the number of points of S in the surroundings of pi. The histogram data structure supports two procedures, named update and merge. The histogram building algorithm is a slight adaptation of the on-line clustering algorithm developed by Guedalia et al. [9], with the addition of a procedure for merging histograms.

The update procedure: Given a histogram (p1, m1), ..., (pr, mr) with p1 < ... < pr and a point p, the update procedure adds p to the set S represented by the histogram:

• If p = pi for some i, increment mi by 1. Otherwise:
• Add the bin (p, 1) to the histogram, resulting in a histogram of r + 1 bins (q1, k1), ..., (qr+1, kr+1), q1 < ... < qr+1.
• Find an index i such that qi+1 − qi is minimal.
• Replace the bins (qi, ki) and (qi+1, ki+1) by the single bin ((qi·ki + qi+1·ki+1) / (ki + ki+1), ki + ki+1).

The merge procedure: Given two histograms, the merge procedure creates a new histogram that represents the union S1 ∪ S2 of the sets S1 and S2 represented by the input histograms. The algorithm is similar to the update procedure: in the first step, the bins of the two histograms are concatenated into a single histogram with many bins; in the second step, the two closest bins are merged into a single bin, as above, and this step is repeated until the histogram has r bins.
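The two procedures can be written down compactly. The following code is our own illustrative rendering of the textual description above (with the bin budget r passed as max_bins), not the paper's implementation.

from bisect import insort
from typing import List, Tuple

Bin = Tuple[float, float]  # (bin position p, bin count m)

def update(hist: List[Bin], p: float, max_bins: int) -> List[Bin]:
    """Return the histogram with point p added, keeping at most max_bins bins."""
    hist = sorted(hist)
    for i, (pos, cnt) in enumerate(hist):
        if pos == p:                # exact match: just increment the count
            hist[i] = (pos, cnt + 1)
            return hist
    insort(hist, (p, 1.0))          # otherwise insert a new unit bin
    return _shrink(hist, max_bins)

def merge(hist1: List[Bin], hist2: List[Bin], max_bins: int) -> List[Bin]:
    """Merge two histograms into one with at most max_bins bins."""
    return _shrink(sorted(hist1 + hist2), max_bins)

def _shrink(hist: List[Bin], max_bins: int) -> List[Bin]:
    """Repeatedly merge the two closest adjacent bins until max_bins remain."""
    while len(hist) > max_bins:
        # Find the adjacent pair (q_i, q_{i+1}) with the smallest gap.
        i = min(range(len(hist) - 1), key=lambda j: hist[j + 1][0] - hist[j][0])
        (q1, k1), (q2, k2) = hist[i], hist[i + 1]
        # Replace them by a single bin at their weighted-average position.
        hist[i : i + 2] = [((q1 * k1 + q2 * k2) / (k1 + k2), k1 + k2)]
    return hist

In this sketch, each processor would keep one such histogram per (leaf, feature, class) triple and call update for every data point it observes; the master would then merge the Nw per-processor histograms before searching for splits.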

3. EMPIRICAL RESULTS

We compared the error rate of the SPDT algorithm with that of a standard decision tree on seven medium-sized datasets taken from the UCI repository [3]: Adult, Isolet, Letter recognition, Nursery, Page blocks, Pen digits, and Spambase. The characteristics and error rates of all datasets are summarized in Table 1. Ten-fold cross validation was applied where there was no natural train/test partition. We used an 8-CPU Power5 machine with 16GB of memory, running Linux. Our algorithm was implemented within the IBM Parallel Machine Learning toolbox [15], which runs using MPICH2.

The comparison shows that the approximations undertaken by the SPDT algorithm do not necessarily have a detrimental effect on its error rate. The F_F statistic combined with Holm's procedure (see [6]), at a 95% confidence level, shows that no statistically significant difference in performance could be detected, except for SPDT with eight processors. For relatively small data, using eight processors means that each processor sees little data, and thus the histograms suffer in accuracy. This may explain the degradation in performance when using eight processors.

Dataset        Examples        Features   Standard tree   SPDT 1 proc.   SPDT 2 proc.   SPDT 4 proc.   SPDT 8 proc.
Adult          32561 (16281)   105        17.67           15.75          15.58*         16.16          16.50
Isolet         6238 (1559)     617        18.70           14.56*         17.90          19.69          19.31
Letter         20000           16         7.48*           8.65           9.28           10.13          10.07
Nursery        12960           25         1.01*           2.58           2.67           2.82           3.16
Page blocks    5473            10         3.13            3.07*          3.18           3.51           3.44
Pen digits     7494 (3498)     16         4.6*            5.37           5.43           5.20           5.83
Spambase       4601            57         8.37*           10.52          11.11          11.29          11.61

Table 1: Error rates for the medium-sized datasets. The number of examples in parentheses is the number of test examples (where a train/test partition exists). The lowest error rate for each dataset is marked with an asterisk.

Dataset        Error rate before pruning   Tree size before pruning   Error rate after pruning   Tree size after pruning
Adult          16.50                       1645                       14.34                      409
Isolet         19.31                       221                        17.77                      141
Letter         10.07                       135                        9.26                       67
Nursery        3.16                        178                        3.21                       167
Page blocks    3.44                        55                         3.44                       36
Pen digits     5.83                        89                         5.83                       81
Spambase       11.61                       572                        11.45                      445

Table 2: Error rates and tree sizes (number of tree nodes) before and after pruning, with eight processors.

It is also interesting to study the effect of pruning on the error rate and tree size. Using the procedure described above, we pruned the trees obtained by SPDT. Table 2 shows that pruning usually improves the error rate (though the improvement is not statistically significant under a sign test), while reducing the tree size by 80% on average.

We tested SPDT for speedup and scalability on the alpha and beta datasets from the Pascal Large Scale Learning Challenge [14]. Both datasets have 500000 examples and 500 dimensions, out of which we extracted subsets of sizes 100, 1000, 10000, 100000, and 500000 examples. Figure 1 shows the speedup for the different dataset sizes. We further tested speedup on five more datasets taken from the Pascal challenge: delta, epsilon, fd, ocr, and dna. Defining dataset size as the number of examples multiplied by the number of dimensions, we found that dataset size and speedup are highly correlated (Spearman correlation of 0.919). This fits the expected theoretical complexity of the algorithm, which is dominated by the histogram building process.

For scalability, we measured the running time as a function of the dataset size. On a logarithmic scale, we obtain approximate regression lines (average R^2 = 0.9982) with slopes improving from 1.1 for a single processor to 0.8 for eight processors. Thus, our proposed algorithm is especially suited for cases where large data is available and processing can be shared among many processors.

Figure 1: Speedup of the SPDT algorithm for the alpha (left) and beta (right) datasets (curves for 100, 1000, 10000, 100000, and 500000 examples, plotted against the number of processors).

4. REFERENCES

[1] K. Alsabti, S. Ranka, and V. Singh. CLOUDS: Classification for large or out-of-core datasets. In Conference on Knowledge Discovery and Data Mining, August 1998.
[2] N. Amado, J. Gama, and F. Silva. Parallel implementation of decision tree learning algorithms. In The 10th Portuguese Conference on Artificial Intelligence on Progress in Artificial Intelligence, Knowledge Extraction, Multi-agent Systems, Logic Programming and Constraint Solving, pages 6–13, December 2001.
[3] C. L. Blake, E. J. Keogh, and C. J. Merz. UCI repository of machine learning databases, 1998.
[4] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, volume 20. MIT Press, Cambridge, MA, 2008. To appear.
[5] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Monterey, CA, 1984.
[6] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
[7] J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Y. Loh. BOAT — optimistic decision tree construction. In ACM SIGMOD International Conference on Management of Data, pages 169–180, June 1999.
[8] S. Goil and A. Choudhary. Efficient parallel classification using dimensional aggregates. In Workshop on Large-Scale Parallel KDD Systems, SIGKDD, pages 197–210, August 1999.
[9] I. D. Guedalia, M. London, and M. Werman. An on-line agglomerative clustering method for nonstationary data. Neural Computation, 11(2):521–540, 1999.
[10] R. Jin and G. Agrawal. Communication and memory efficient parallel decision tree construction. In The 3rd SIAM International Conference on Data Mining, May 2003.

[11] M. V. Joshi, G. Karypis, and V. Kumar. ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In The 12th International Parallel Processing Symposium, pages 573–579, March 1998.
[12] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In The 5th International Conference on Extending Database Technology, pages 18–32, 1996.
[13] G. J. Narlikar. A parallel, multithreaded decision tree builder. Technical Report CMU-CS-98-184, Carnegie Mellon University, 1998.
[14] Pascal Large Scale Learning Challenge, 2008.
[15] IBM Parallel Machine Learning Toolbox.
[16] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[17] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In The 22nd International Conference on Very Large Databases, pages 544–555, September 1996.
[18] M. K. Sreenivas, K. Alsabti, and S. Ranka. Parallel out-of-core divide-and-conquer techniques with applications to classification trees. In The 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing, pages 555–562, 1999.
[19] A. Srivastava, E.-H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery, 3(3):237–261, September 1999.