Penalty Functions for Genetic Programming Algorithms

José L. Montaña¹, César L. Alonso², Cruz E. Borges¹, and Javier de la Dehesa¹

¹ Departamento de Matemáticas, Estadística y Computación, Universidad de Cantabria, 39005 Santander, Spain. {montanjl, borgesce}@unican.es
² Centro de Inteligencia Artificial, Universidad de Oviedo, Campus de Viesques, 33271 Gijón, Spain. [email protected]

Abstract. Very often symbolic regression, as addressed in Genetic Programming (GP), is equivalent to approximate interpolation. This means that, in general, GP algorithms try to fit the sample as well as possible, but no notion of generalization error is considered. As a consequence, overfitting, code-bloat and noisy data are problems which are not satisfactorily solved under this approach. Motivated by this situation, we review the problem of symbolic regression from the perspective of Machine Learning, a well-founded mathematical toolbox for predictive learning. We perform empirical comparisons between classical statistical methods (AIC and BIC) and methods based on Vapnik-Chervonenkis (VC) theory for regression problems under genetic training. The empirical comparisons of the different methods suggest practical advantages of VC-based model selection. We conclude that VC theory provides a methodological framework for complexity control in Genetic Programming even when its technical results seem not to be directly applicable. As the main practical outcome, precise penalty functions founded on the notion of generalization error are proposed for evolving GP-trees.

Key words: Genetic Programming, Symbolic Regression, Inductive Learning, Regression, Model Selection


1 Introduction

In recent years, Genetic Programming (GP) has been applied to a range of complex learning problems, including that of symbolic regression, in a variety of fields like quantum computing, electronic design, sorting, searching, game playing, etc. For dealing with these problems, GP evolves a population composed of symbolic expressions built from a set of functionals F = {f_1, ..., f_k} and a set of terminals T = {x_1, ..., c_1, ..., c_t} (including the variables and the constants). Once the functionals and the terminals have been selected, the regression task


can be thought of as a supervised learning problem where the hypothesis class H is the tree-structured search space described by the set of leaves T and the set of nodes F. Analogously, the GP algorithm evolving symbolic expressions that represent the concepts of class H can be regarded as a supervised learning algorithm that selects the best model inside the class H. Following this view of GP as a supervised learning task, we use tools from Statistical Learning Theory (SLT) ([10]) for model selection in Genetic Programming. This point of view has been previously proposed in [9] (see also [2]). In those works the core of Vapnik-Chervonenkis theory is translated into the GP domain with the aim of addressing the code-bloat problem. A further development of this point of view, including some experimental discussion, can be found in [7].

In the present paper we focus our attention on problems presenting noisy data. We try to identify the shape of good penalty terms in order to minimize the generalization error, that is, the error over unseen points. Usually, analytic model selection criteria like AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) estimate the generalization error as a function of the empirical error plus a penalization term related to some measure of model complexity. This function is then minimized over the class of concepts H. Since most model selection criteria, in particular analytic model selection and Structural Risk Minimization based on VC theory, rely on certain assumptions, mainly linearity and exact computation of the classification capacity of the class of concepts H, it is important to perform empirical comparisons in order to understand their practical usefulness in settings where these assumptions may not hold, which is the case of Genetic Programming.

The paper is organized as follows. Section 2 presents some useful tools from statistical learning theory. Section 3 describes classical model selection criteria (AIC and BIC) and the structural risk minimization approach with a new measure of model complexity founded on a VC analysis of GP. Section 4 describes the experimental setting and results. Finally, Section 5 contains some conclusions.

2 Statistical Learning Theory

In the seventies, the work by Vapnik and Chervonenkis ([12], [10], [11]) provided a remarkable family of bounds on the performance of a learning machine. The Vapnik-Chervonenkis dimension (VC dimension) is a measure of the capacity of a family of functions (or learning machines) f ∈ H as classifiers. In a binary classification problem, an instance x is classified by a label y ∈ {−1, 1}. Given a vector of n instances, $(x_1, \dots, x_n)$, there are $2^n$ possible classification tuples $(y_1, \dots, y_n)$, with $y_i \in \{-1, 1\}$. If for each classification tuple $(y_1, \dots, y_n)$ there is a classifier f ∈ H with $f(x_i) = y_i$, for 1 ≤ i ≤ n, we say that $(x_1, \dots, x_n)$ is shattered by the class H. The VC dimension of a class H is defined as the maximum number of points that can be shattered by H. If the VC dimension is h, this means that there exists at least some set of h points which


can be shattered. For instance, the VC dimension of lines in the plane is 3; more generally, the VC dimension of hyperplanes in $\mathbb{R}^n$ is n + 1.

In general, the error $\varepsilon(f)$ of a learning machine or classifier f is written as

$\varepsilon(f) = \int Q(x, f; y)\, d\mu$,    (1)

where Q measures some notion of loss between f(x) and y, and µ is the distribution from which the examples (x, y) are drawn; usually x is called the instance and y the label. For example, for classification problems, the misclassification error is given by taking Q(x, f; y) = |y − f(x)|. Similarly, for regression tasks one takes Q(x, f; y) = (y − f(x))² (mean square error). Many of the classic applications of learning machines can be explained inside this formalism. The starting point of statistical learning theory is that we might not know µ. At this point one replaces the theoretical error $\varepsilon(f)$ by the empirical error, estimated from a finite sample $\{(x_i, y_i)\}_{i=1}^{n}$ as:

$\varepsilon_n(f) = \frac{1}{n} \sum_{i=1}^{n} Q(x_i, f; y_i)$.    (2)

Now, the results by Vapnik state that the error $\varepsilon(f)$ can be estimated independently of the distribution µ(x, y) by means of the following bound:

$\varepsilon(f) \le \varepsilon_n(f) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\eta/4)}{n}}$,    (3)

where η is the probability that the bound is violated and h is the VC dimension of the family of classifiers H from which the function f is selected. The second term of the right-hand side is called the VC confidence (for example, for h = 200, n = 100000 and η = 0.95 the VC confidence is about 0.12). While the existence of the bounds in Equation 3 is impressive, very often these bounds remain meaningless. For instance, the VC dimension of the family of Support Vector Machines embedded in m dimensions with polynomial kernels is infinite; however, if we also bound the degree of the polynomials, then the VC dimension is finite and depends on the dimension m and the degree d. Indeed, it is bounded by $(4ed)^m$. Note that this quantity grows very quickly, which again makes the bound in Equation 3 useless.
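As a quick numerical check of Equation 3, here is a minimal Python sketch of the VC confidence term; we assume natural logarithms, and the function name is ours:

    import math

    def vc_confidence(h: int, n: int, eta: float) -> float:
        """VC confidence term of Equation 3 (natural logarithms assumed)."""
        return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

    print(vc_confidence(h=200, n=100_000, eta=0.95))  # ~0.126, the 0.12 quoted above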

2.1 VC Dimension of GP

The VC dimension h depends on the class of classifiers, equivalently on a fully specified learning machine. Hence, it does not make sense to calculate the VC dimension of GP in general; however, it does make sense once we choose a particular class of computer programs as classifiers (i.e. a particular genotype). For the simplified genotype that only uses algebraic analytic operators of bounded degree (polynomials, square roots and, in general, power series with fractional exponents), some chosen computer program structure and a bound on the non-scalar height


of the program, the VC dimension remains polynomial in the non-scalar height of the program and in the number of parameters of the learning machine. To make the notion of non-scalar height clear, it is enough to define the notion of a non-scalar (or non-linear) node. We say that a node of a GP-tree is non-scalar if it is not a linear combination of its children. With this notion we can state the following general result, whose proof is a refinement of the techniques introduced in [6], taking into account that the degree only increases at non-scalar nodes.

Theorem 1. Let $C_{k,n}$ be the concept class whose elements are GP-trees $T_{k,n}$ having k + n terminals (k constants and n real variables) and non-scalar height h = h(k, n). Assume that the GP-tree $T_{k,n}$ has at most q analytic algebraic nodes of degree bounded by D ≥ 2 and number of children bounded by β. Then, the VC dimension of $C_{k,n}$ is in the class

$O((\log_2 D + \log_2 \max\{\beta, 2\})\, k (n + k + \beta q)^2 h^2)$.

Hence, the GP approach with analytic algebraic functionals and "short" programs (of height polynomial in the dimension of the space of events) has small VC dimension. For a class of models H with finite complexity (for instance, in the case of GP, trees with bounded size or height), the model can be chosen by minimizing the empirical error:

$\varepsilon_n(f) = \frac{1}{n} \sum_{i=1}^{n} Q(x_i, f; y_i)$    (4)

The problem of model selection (also called complexity control) arises when a class of models consists of models of varying complexity (for instance, in the case of Genetic Programming, trees of varying size or height). Then the problem of regression estimation requires the optimal selection of model complexity (i.e., the size or the height) in addition to model estimation via minimization of the empirical risk as defined in Equation 4.

3 Penalty Functions

In general, analytical estimates of the error (Equation 1) as a function of the empirical error (Equation 4) take one of the following forms:

$\varepsilon(f) = \varepsilon_n(f) \cdot \mathrm{pen}(h, n)$    (5)

$\varepsilon(f) = \varepsilon_n(f) + \mathrm{pen}(h/n, \sigma^2)$,    (6)

where f is the model, pen is called the penalization factor, h is the model complexity, n is the size of the training set and σ, when used, is the standard deviation of the additive noise (Equation 6). The first two analytical estimates of error that we use in this work are the following well-known representative statistical methods:


– Akaike Information Criterion (AIC), which is as follows (see [1]):

$\varepsilon(f) = \varepsilon_n(f) + \frac{2h}{n} \sigma^2$    (7)

– Bayesian Information Criterion (BIC) (see [3]):

$\varepsilon(f) = \varepsilon_n(f) + (\ln n) \frac{h}{n} \sigma^2$    (8)

As described in [4], when using a linear estimator with h parameters, one first estimates the noise variance from the training data (x_i, y_i) as:

$\sigma^2 = \frac{n}{n - h} \cdot \frac{1}{n} \sum_{1 \le i \le n} (y_i - \hat{y}_i)^2$,    (9)

where $\hat{y}_i$ is the estimation of the value $y_i$ by the model f, i.e. $\hat{y}_i = f(x_i)$.
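For illustration, here is a minimal Python sketch of Equations 7-9; the function and variable names are ours, not from any reference implementation:

    import numpy as np

    def noise_variance(y, y_hat, h):
        """Estimate sigma^2 from the training residuals, following Equation 9."""
        n = len(y)
        residuals = np.asarray(y) - np.asarray(y_hat)
        return (n / (n - h)) * np.mean(residuals ** 2)

    def aic(emp_err, h, n, sigma2):
        """Penalized error estimate of Equation 7."""
        return emp_err + (2 * h / n) * sigma2

    def bic(emp_err, h, n, sigma2):
        """Penalized error estimate of Equation 8."""
        return emp_err + np.log(n) * (h / n) * sigma2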

One can then use Equation 9 in conjunction with AIC or BIC for each (fixed) model complexity. For both methods, the model complexity h is estimated as the number of free parameters of the model f. The third model selection method used in this paper is based on Structural Risk Minimization (SRM) (see [10]):

$\varepsilon(f) = \varepsilon_n(f) \cdot \left(1 - \sqrt{p - p \ln p + \frac{\ln n}{2n}}\right)^{-1}$,    (10)

where p = h/n and h stands for the Vapnik-Chervonenkis (VC) dimension as a measure of model complexity. Note that under the SRM approach it is not necessary to estimate the noise variance; however, an estimation of the VC dimension is required.
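A sketch of Equation 10 in the same style; returning infinity when the penalization factor becomes non-positive is our convention for a vacuous bound:

    import math

    def srm(emp_err, h, n):
        """Penalized error estimate of Equation 10; h is the VC dimension."""
        p = h / n
        factor = 1 - math.sqrt(p - p * math.log(p) + math.log(n) / (2 * n))
        return emp_err / factor if factor > 0 else float("inf")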

3.1 The Genetic Programming Approach

The above model selection criteria are used in the framework of linear estimators, where the model complexity h is the number of free parameters of the model (for instance, in the familiar case where the models are polynomials, h is the degree of the polynomial). In our attempt to carry the above methods over to GP, we use GP-trees as the evolving structures that represent symbolic expressions or models. The internal nodes of every GP-tree are labeled with functionals from a set F = {f_1, ..., f_k} and the leaves of the GP-tree are labeled with terminals from T = {x_1, ..., c_1, ..., c_t}. As a GP-tree represents some symbolic expression f, we use Equations 7, 8 and 10 as the different fitness functions for f in our study. For our GP version of AIC and BIC we keep, as model complexity, the number of free parameters of the model f represented by the tree. On the other hand, in the case of SRM we introduce a new estimator for the VC dimension of GP-trees. This estimator is the number of non-scalar nodes of the tree,


that is, nodes which are not labeled with the {+, −} operators. This is a measure of the non-linearity of the considered model and can be seen as a generalization of the notion of degree to the case of GP-trees. This notion is related to the VC dimension of the set of models given by GP-trees using a bounded number of non-scalar operators. The exact relationship between the non-scalar size of a GP-tree (more generally, of a computer program) and its VC dimension is shown in Theorem 1, stated in Section 2. The recombination operators used are the well-known standard crossover and mutation operators for trees in Genetic Programming (see [5]). An extensive experimentation has then been carried out in order to compare the performance of these three model selection criteria in the Genetic Programming framework.
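For concreteness, here is a minimal Python sketch of this complexity estimate; the Node class and all names are ours, not the paper's implementation:

    class Node:
        """A GP-tree node: an operator or terminal label plus child subtrees."""
        def __init__(self, label, children=()):
            self.label = label
            self.children = list(children)

    def non_scalar_nodes(tree):
        """Count internal nodes not labeled with a {+, -} operator:
        the VC-dimension estimate used by the SRM fitness."""
        own = 1 if tree.children and tree.label not in ("+", "-") else 0
        return own + sum(non_scalar_nodes(c) for c in tree.children)

    # (x * x) + sin(x) has two non-scalar nodes: '*' and 'sin'
    tree = Node("+", [Node("*", [Node("x"), Node("x")]), Node("sin", [Node("x")])])
    print(non_scalar_nodes(tree))  # 2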

4 Experimentation

4.1 Experimental Settings

We consider instances of the symbolic regression problem for our experimentation. We have executed the algorithms over two groups of target functions. The first group includes the following three functions, which were also used in [4] for experimentation:

Discontinuous piecewise polynomial function:

$g_1(x) = 4x^2(3 - 4x)$,  $x \in [0, 0.5]$
$g_1(x) = (4/3)x(4x^2 - 10x + 7) - 3/2$,  $x \in (0.5, 0.75]$
$g_1(x) = (16/3)x(x - 1)^2$,  $x \in (0.75, 1]$    (11)

Sine-square function:

$g_2(x) = \sin^2(2\pi x)$,  $x \in [0, 1]$    (12)

Two-dimensional sin function:

$g_3(x) = \frac{\sin\sqrt{x_1^2 + x_2^2}}{\sqrt{x_1^2 + x_2^2}}$,  $x_1, x_2 \in [-5, 5]$    (13)

The second group of functions is constituted by five functions of several classes: trigonometric functions, polynomial functions and one exponential function. These functions are the following:

$f_1(x) = x^4 + x^3 + x^2 + x$,  $x \in [-5, 5]$
$f_2(x) = e^{-\sin 3x + 2x}$,  $x \in [-\pi/2, \pi/2]$
$f_3(x) = e^{x^2} + \pi x$,  $x \in [-\pi, \pi]$
$f_4(x) = \cos(2x)$,  $x \in [-\pi, \pi]$
$f_5(x) = \min\{x/2, \sin(x) + 1\}$,  $x \in [0, 15]$    (14)

For the first group of target functions we use the set of functionals F = {+, −, ∗, //}, augmented with the sign operator for the target function g_1. In the above set F, "//" denotes protected division, i.e. x//y returns x/y if y ≠ 0 and 1 otherwise. The terminal set T consists of the variables of the corresponding function and includes the set of constants {0, 1}. For the second group of functions, the basic set of operations F is augmented with other operators; the function set for each target is shown in Table 1.

Table 1: Function set for the second group of target functions

  Function   Function set
  f1         F ∪ {sqrt}
  f2         F ∪ {sqrt, sin, cos, exp}
  f3         F ∪ {sin, cos}
  f4         F ∪ {sqrt, sin}
  f5         F ∪ {sin, cos}
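The operator sets could be sketched in Python as follows; the protected square root shown here is our assumption (the paper only specifies protected division):

    import math

    # Basic functional set F; '//' is the protected division of the text.
    F = {
        "+": lambda x, y: x + y,
        "-": lambda x, y: x - y,
        "*": lambda x, y: x * y,
        "//": lambda x, y: x / y if y != 0 else 1.0,
    }

    # Extra operators used for the second group (Table 1). Protecting sqrt
    # against negative arguments is our assumption, not the paper's.
    EXTRA = {
        "sqrt": lambda x: math.sqrt(abs(x)),
        "sin": math.sin,
        "cos": math.cos,
        "exp": math.exp,
    }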

We use GP-trees with height bounded by 8. As mentioned above, the model complexity h is measured by the number of non-scalar nodes of the tree for the SRM method (fitness Equation 10), and by the number of free parameters of the model f represented by the tree for the AIC and BIC methods (fitness Equations 7 and 8, respectively). The rest of the parameters of the genetic training process are the following: population size M = 100; maximum number of generations G = 1000; crossover probability p_c = 0.9 and mutation probability p_m = 0.1. Tournament selection and the standard crossover and mutation operators for tree-like structures are used. For all the executions, the genetic training process finishes after 10^7 operations have been computed. Observe that the number of computed operations equals the number of internal tree nodes visited during the process. Training sets of n = 30 examples are generated, where the x-values are drawn from a uniform distribution over the input domain. The y-values are corrupted with noise (Equation ??); the noise standard deviation σ was fixed to 0.2.

The experimentation scheme is as follows. For each model selection criterion (AIC, BIC, SRM) and each target function, we use a simple competitive coevolution strategy where 10 populations of 100 individuals evolve independently on the same training set. Then we select the model proposed by the best of these 10 executions. This process completes one experiment. We have performed 100 experiments, and for each one a different random realization of 30 training samples was considered. Hence, for each target function, we have executed each algorithm 1000 times and finally 100 models have been selected. These models are the best ones of each group of 10 populations sharing the same training set.
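A minimal sketch of the training-set generation, assuming the noise is additive Gaussian (the precise noise model sits behind the broken equation reference above, so the Gaussian choice is our assumption):

    import numpy as np

    rng = np.random.default_rng(0)

    def make_training_set(g, low, high, n=30, sigma=0.2):
        """Draw n x-values uniformly on [low, high] and corrupt g(x)
        with additive Gaussian noise of standard deviation sigma."""
        x = rng.uniform(low, high, size=n)
        y = g(x) + rng.normal(0.0, sigma, size=n)
        return x, y

    # Example: a noisy sample of the target f4(x) = cos(2x) on [-pi, pi]
    x_train, y_train = make_training_set(lambda x: np.cos(2 * x), -np.pi, np.pi)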

4.2 Experimental Results

When the competitive genetic training process finishes, the best individual is selected as the proposed model for the corresponding target function. In order to measure the quality of the selected model, it makes sense to consider a new


set of points generated without noise from the target function. This new set of examples is known as the test set or validation set. So, let $(x_i, y_i)_{1 \le i \le n_{test}}$ be a validation set for the target function g(x) (i.e. $y_i = g(x_i)$) and let f(x) be the model estimated from the training data. Then the prediction risk $\varepsilon_{n_{test}}$ is defined as the mean square error (MSE) between the values of f and the true values of the target function g over the validation set:

$\varepsilon_{n_{test}} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} (f(x_i) - y_i)^2$    (15)
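In code, the prediction risk of Equation 15 is a plain MSE against the noise-free target; a minimal sketch (names are ours):

    import numpy as np

    def prediction_risk(f, g, x_test):
        """Equation 15: MSE of model f against the noise-free target g."""
        return float(np.mean((f(x_test) - g(x_test)) ** 2))

    # e.g. a 200-point validation grid on the domain of f4, as in the experiments
    x_test = np.linspace(-np.pi, np.pi, 200)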

Frequently, when different Genetic Programming strategies solving symbolic regression instances are compared, the quality of the selected model is evaluated by means of its fitness value over the training set. But with this quality measure, a fitness value close to zero does not necessarily imply a good model for the target function: it is not possible to distinguish between good executions and overfitting executions. In fact, when the training set is corrupted with noise (which is our case), final selected models with very low fitness values are probably not as good as they seem. Figures 1 and 2 show the comparison results for AIC, BIC and SRM after the 100 experiments were completed. The empirical distribution of the prediction risk for each model selection method is displayed using standard box plot notation with marks at 25%, 50% and 75% of the empirical distribution. The first and last marks in each case stand for the prediction risk of the best and the worst selected model, respectively. In all cases the size $n_{test}$ of the validation set is 200. Note that the scaling is different for each function.

A first analysis of the figures concludes that SRM model selection performs clearly better than AIC and BIC. Note that, for all of the studied target functions, the SRM strategy obtains prediction risk values for the best execution that are lower than those obtained by the AIC and BIC strategies. This also happens for most of the target functions if we consider the 25%, 50% or 75% marks of the empirical distribution of the selected models. On the other hand, AIC and BIC perform quite similarly. This could be because both fitness functions (Equations 7 and 8) have the same structure (Equation 6) with very similar additive penalization terms.

Table 2 displays the best prediction risk value for each target function and model selection criterion, and Tables 3 and 4 show, respectively, the mean of the prediction risk values over the 5 and the 25 best experiments. The results shown in Table 2 confirm that SRM produces the best obtained model for all target functions. This is clearer for the functions f2, f3 and f5. From the results displayed in Tables 3 and 4 it can be deduced that, considering a reasonable set of executions of the studied model selection strategies for each target function, the mean quality of the SRM strategy is the best one. Finally, in Table 5 we present the mean prediction risk over all the performed executions. Since we have 10 executions for each experiment and 100 experiments were completed, each result in this table corresponds to a mean value over 1000 runs.

[Figure 1: box plots, one panel per target function f1 to f5, with one box per method (AIC, BIC, VC).]

Fig. 1: Empirical distribution of the prediction risk $\varepsilon_{n_{test}}$ for target functions f1 to f5, over 100 experiments.

[Figure 2: box plots, one panel per target function g1 to g3, with one box per method (AIC, BIC, VC).]

Fig. 2: Empirical distribution of the prediction risk for target functions g1 to g3, over 100 experiments.

Table 2: Prediction risk values of the best obtained model for each target function

  Function   AIC        BIC        SRM
  f1         2.44E-07   2.40E-07   2.13E-07
  f2         3.47E+00   1.33E+00   8.29E-02
  f3         1.26E+00   1.26E+00   6.71E-01
  f4         4.44E-01   4.44E-01   3.87E-01
  f5         3.84E-01   2.15E-01   2.64E-02
  g1         2.94E-02   3.07E-02   1.33E-02
  g2         1.26E-01   1.15E-01   1.04E-01
  g3         3.21E-02   3.21E-02   1.85E-02

Table 3: Mean prediction risk of the best 5% of the models, for each target function

  Function   AIC        BIC        SRM
  f1         3.01E-07   2.74E-07   2.42E-07
  f2         4.57E+00   3.48E+00   5.06E-01
  f3         1.31E+00   1.31E+00   6.91E-01
  f4         4.50E-01   4.50E-01   4.36E-01
  f5         4.05E-01   3.34E-01   3.84E-02
  g1         3.13E-02   3.33E-02   2.30E-02
  g2         1.33E-01   1.55E-01   1.12E-01
  g3         4.24E-02   4.24E-02   3.67E-02

Table 4: Mean prediction risk of the best 25% of the models, for each target function

  Function   AIC        BIC        SRM
  f1         1.74E+01   3.24E+01   2.89E-07
  f2         6.37E+00   9.76E+00   8.27E-01
  f3         1.42E+00   1.43E+00   1.04E+00
  f4         4.66E-01   4.66E-01   4.64E-01
  f5         4.24E-01   4.10E-01   7.47E-02
  g1         3.67E-02   4.00E-02   3.35E-02
  g2         1.71E-01   1.81E-01   1.18E-01
  g3         5.64E-02   5.64E-02   5.82E-02

Table 5: Mean prediction risk of the 100 executions for each pair strategy-target function

  Function   AIC        BIC        SRM
  f1         1.62E+03   1.57E+03   3.58E-07
  f2         1.93E+01   2.11E+01   2.63E+00
  f3         1.62E+00   1.62E+00   1.50E+00
  f4         5.00E-01   5.00E-01   5.03E-01
  f5         4.60E-01   4.55E-01   1.66E-01
  g1         1.23E-01   1.79E-01   4.61E-02
  g2         2.03E-01   2.05E-01   1.36E-01
  g3         1.25E-01   1.25E-01   1.33E-01


The results in Table 5 confirm again that SRM presents the best results for most of the target functions. There is a remarkable difference in performance for the polynomial function f1, which can also be seen in Table 4. Taking into account the above experimentation, we can conclude that the Structural Risk Minimization method (based on VC theory) as a model selection criterion under genetic training from noisy data, combined with our model complexity measure counting the number of non-scalar nodes of the tree, clearly outperforms classical statistical methods such as AIC or BIC. In general, SRM obtains better solutions than AIC or BIC in almost all studied cases.

5 Conclusions

In this paper we have presented an empirical comparative study of three model selection criteria for learning problems with GP-trees. The first two methods (AIC and BIC) are classical statistical methods, and the third one (SRM) is a selection criterion based on VC theory with a new model complexity measure. The strategy used for selecting the model was a genetic training method over a finite set of noisy examples. For measuring the quality of the selected model after the training process, a validation set of noise-free examples was generated and the mean square error of the model over the validation set was computed. An extensive experimentation over several symbolic regression problem instances suggests that the SRM selection criterion performs better than the other two considered methods. However, the AIC and BIC methods, which are usually employed in combination with least-squares fitting techniques, also perform quite well under genetic training.

As a final remark, we note that it is impossible to draw conclusions based on empirical comparisons unless one is sure that the model selection criteria use accurate estimates of model complexity. There exist experimental methods for measuring the VC dimension of an estimator ([13], [8]); however, they are difficult to apply for general practitioners. In this paper we have used a new complexity measure for GP-trees. Essentially, under this approach we combine the known analytical form of a model selection criterion with an appropriately tuned measure of model complexity taken as a function of some complexity parameter (i.e. the value h that measures the non-linearity of the considered tree). This practical alternative essentially amounts to coming up with empirical 'common-sense' estimates of model complexity to be used in model selection criteria.

References

1. H. Akaike. Statistical predictor identification. Ann. Inst. Statist. Math., 22:203-217, 1970.
2. N. M. Amil, N. Bredeche, C. Gagné, S. Gelly, M. Schoenauer, and O. Teytaud. A statistical learning perspective of genetic programming. In EuroGP, pages 327-338, 2009.
3. J. Bernardo and A. Smith. Bayesian Theory. John Wiley & Sons, 1994.


4. V. Cherkassky and Y. Ma. Comparison of model selection for regression. Neural Computation, 15:1691-1714, 2003.
5. J. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, 1992.
6. J. L. Montaña. VCD bounds for some GP genotypes. In ECAI, pages 167-171, 2008.
7. J. L. Montaña, C. L. Alonso, C. E. Borges, and J. L. Crespo. Adaptation, performance and Vapnik-Chervonenkis dimension of straight line programs. In EuroGP, pages 315-326, 2009.
8. X. Shao, V. Cherkassky, and W. Li. Measuring the VC-dimension using optimized experimental design. Neural Computation, 12:1969-1986, 2000.
9. O. Teytaud, S. Gelly, N. Bredeche, and M. Schoenauer. A statistical learning theory approach of bloat. In Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, pages 1784-1785, 2005.
10. V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
11. V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264-280, 1971.
12. V. Vapnik and A. Chervonenkis. Ordered risk minimization. Automation and Remote Control, 34:1226-1235, 1974.
13. V. Vapnik, E. Levin, and Y. Le Cun. Measuring the VC-dimension of a learning machine. Neural Computation, 6:851-876, 1994.