Machine Learning Tutorial: Machine Learning Basic Concepts – TU Darmstadt

Machine Learning Tutorial CB, GS, REC

Section 1

Machine Learning basic concepts

Machine Learning Tutorial for the UKP lab, June 10, 2011

This deck includes some slides/slide parts/text taken from online materials created by the following people: Greg Grudic, Alexander Vezhnevets, Hal Daumé III

What is Machine Learning?

- "The goal of machine learning is to build computer systems that can adapt and learn from their experience." – Tom Dietterich


A Generic System

[Figure: a system box receiving inputs x1 … xN, containing hidden variables h1, h2, ..., hK, and producing outputs y1 … yM]

- Input variables:  x = (x1, x2, ..., xN)
- Hidden variables: h = (h1, h2, ..., hK)
- Output variables: y = (y1, y2, ..., yM)

When are ML algorithms NOT needed?

- When the relationships between all system variables (input, output, and hidden) are completely understood!
- This is NOT the case for almost any real system!


The Sub-Fields of ML

- Supervised Learning
- Reinforcement Learning
- Unsupervised Learning


Supervised Learning

- Given: training examples $\{(x_1, f(x_1)), (x_2, f(x_2)), \ldots, (x_P, f(x_P))\}$ for some unknown function (system) $y = f(x)$
- Find $f(x)$
- Predict $y' = f(x')$, where $x'$ is not in the training set

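A minimal sketch of this setting in Python with scikit-learn (not part of the original slides): the toy data, the labels, and the choice of LogisticRegression as the hypothesis class are all illustrative assumptions.

```python
# Minimal supervised-learning sketch: learn f from labeled pairs (x, f(x)),
# then predict on an unseen x'.
from sklearn.linear_model import LogisticRegression

X_train = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.8]]  # examples x1..xP
y_train = [0, 1, 1, 0]                                      # labels f(x1)..f(xP)

model = LogisticRegression()   # hypothesis class for f
model.fit(X_train, y_train)    # "find f(x)" from the training examples

x_new = [[0.7, 0.3]]           # x' not in the training set
print(model.predict(x_new))    # y' = f(x')
```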

Model, model quality

- Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
- Learned hypothesis: model of the problem/task T
- Model quality: accuracy/performance measured by P


Data / Examples / Sample / Instances

- Data: experience E in the form of examples / instances, characteristic of the whole input space
  - representative sample
  - independent and identically distributed (no bias in selection / observations)

- Good example
  - 1000 abstracts chosen randomly out of 20M PubMed entries (abstracts)
  - probably i.i.d.
  - representative?
  - if annotation is involved, it is always a question of compromises

- Definitely bad example
  - all abstracts that have John Smith as an author

- Instances have to be comparable to each other

Data / Examples / Sample / Instances

- Example: a set of queries and a set of top retrieved documents for each (characterized via tf, idf, tf*idf, PRank, BM25 scores) → try predicting relevance for reranking!
  - the top retrieved set is dependent on the underlying IR system!
  - issues with representativeness, but for reranking this is fine
  - the characterization is dependent on the query (except PRank), i.e. only certain pairs (for the same Q) are meaningfully comparable (cf. independent examples for the same Q)
  - we have to normalize the features per query to have the same mean/variance (see the sketch below)
  - we have to form pairs and compare e.g. the difference of feature values

- Toy example:
  - Q = "learning":    rank 1: tf = 15,  rank 100: tf = 2
  - Q = "overfitting": rank 1: tf = 2,   rank 10:  tf = 0
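A hedged sketch of per-query feature normalization (not from the slides); the pandas-based approach and the column names are assumptions.

```python
# Per-query z-score normalization: make tf values comparable across
# queries by giving them zero mean and unit variance within each query.
import pandas as pd

df = pd.DataFrame({
    "query": ["learning", "learning", "overfitting", "overfitting"],
    "tf":    [15, 2, 2, 0],
})

df["tf_norm"] = df.groupby("query")["tf"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0))
print(df)
```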

Features

- The available examples (experience) have to be described to the algorithm in a consumable format
  - Here: examples are represented as vectors of pre-defined features
  - E.g. for credit risk assessment, typical features can be: income range, debt load, employment history, real estate properties, criminal record, city of residence, etc.

- Common feature types (a toy encoding is sketched below):
  - binary   (criminal record, Y/N)
  - nominal  (city of residence, X)
  - ordinal  (income range, 0-10K, 10-20K, …)
  - numeric  (debt load, $)
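An illustrative sketch (not from the slides) of encoding one credit-risk applicant as a feature vector; the feature names, values, and the use of scikit-learn's DictVectorizer are assumptions.

```python
# Encoding one (made-up) credit-risk applicant as a feature vector.
from sklearn.feature_extraction import DictVectorizer

applicant = {
    "criminal_record": 0,      # binary: Y/N -> 1/0
    "city": "Darmstadt",       # nominal
    "income_range": 2,         # ordinal: 0 = 0-10K, 1 = 10-20K, 2 = 20-30K, ...
    "debt_load": 12500.0,      # numeric, in $
}

# Nominal string features get one-hot encoded; numeric ones pass through.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform([applicant])
print(vec.get_feature_names_out())
print(X)
```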

Machine Learning Tutorial CB, GS, REC

Section 2

Experimental practice … by now you have learned what machine learning is; in the supervised approach you need (carefully selected / prepared) examples that you describe through features; the algorithm then learns a model of the problem based on the examples (usually some kind of optimization is performed in the background); and as a result, improvement is observed in terms of some performance measure …

Machine Learning Tutorial for the UKP lab June 10, 2011

Model parameters

- 2 kinds of parameters
  - one the user sets for the training procedure in advance – hyperparameter
    - the degree of the polynomial to fit in regression
    - in a neural network: number/size of hidden layers
    - number of instances per leaf in a decision tree
  - one that actually gets optimized through the training – parameter
    - regression coefficients
    - network weights
    - size/depth of the decision tree (in Weka; other implementations might allow you to control that)
  - we usually do not talk about the latter, but refer to hyperparameters as parameters

- Hyperparameters
  - the fewer the algorithm has, the better
  - Naive Bayes the best? No parameters!
  - usually algorithms with better discriminative power are not parameter-free
  - typically set to optimize performance (on a validation set, or through cross-validation)
    - manual, grid search, simulated annealing, gradient descent, etc.
  - common pitfall: select the hyperparameters via CV, e.g. 10-fold, and report the cross-validation results (see the sketch below for a cleaner setup)
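A hedged sketch (not from the slides) of hyperparameter tuning that avoids the pitfall above: tune via cross-validation on the training part only, then report on a blind test split. The dataset, the parameter grid, and the scikit-learn workflow are illustrative assumptions.

```python
# Tune a hyperparameter by 10-fold CV on the training data, then report
# accuracy on a held-out test split that took no part in the tuning.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# min_samples_leaf plays the role of the "instances per leaf" hyperparameter
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"min_samples_leaf": [1, 2, 5, 10, 20]},
                    cv=10)
grid.fit(X_tr, y_tr)

print("best hyperparameter:", grid.best_params_)
print("blind-test accuracy:", grid.score(X_te, y_te))
```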

Cross-validation, Illustration

[Figure: the sample X = {x1, ..., xk} is split into 5 folds X1 … X5; in each iteration one fold (e.g. X4) serves as Test and the remaining folds as Train]

- The result is an average over all iterations

Cross-Validation

- n-fold CV: common practice for making (hyper)parameter estimation more robust
  - round-robin training/testing n times, with (n-1)/n of the data to train and 1/n of the data to evaluate the model
  - typical: random splits, without replacement (each instance is tested exactly once)
  - the other way: random subsampling cross-validation

- n-fold CV: common practice to report average performance, deviation, etc.
  - No Unbiased Estimator of the Variance of K-Fold Cross-Validation (Bengio and Grandvalet, 2004)
  - bad practice? problem: training sets largely overlap, test errors are also dependent
  - tends to underestimate the real variance of CV (thus e.g. confidence intervals are to be treated with extreme caution)
  - 5x2 CV is a better option: do 2-fold CV and repeat 5 times, calculate the average – less overlap between the training sets

- Folding via natural units of processing for the given task (see the sketch below)
  - typically, document boundaries – best practice is doing it yourself!
  - the ML package / CSV representation is not aware of e.g. document boundaries!
  - the PPI case
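A hedged sketch (not from the slides) of folding along natural units using scikit-learn's GroupKFold; the toy instances and document IDs are assumptions.

```python
# Build CV folds along document boundaries so that instances from the
# same document never appear in both the train and the test fold.
from sklearn.model_selection import GroupKFold

X       = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]   # one row per instance
y       = [0, 1, 0, 1, 0, 1]
doc_ids = ["d1", "d1", "d2", "d2", "d3", "d3"]          # natural processing units

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=doc_ids):
    print("train:", train_idx, "test:", test_idx)
```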

Cross-Validation

- Ideally, the valid settings are:
  - take off-the-shelf algorithms, avoid parameter tuning, and compare the results, e.g. via cross-validation
    - n.b. you probably do the folding yourself, trying to minimize biases!
  - do parameter tuning (n.b. selecting/tuning your features is also tuning!), but then normally you have to have a blind set (from the beginning)
    - e.g. have a look at shared tasks, e.g. CoNLL – a practical way to learn experimental best practice is to align with the predefined standards (you might even benefit from comparative results, etc.)

- You might want to do something different – be aware of these & the consequences

The ML workflow

- Common ML experimenting pipeline
  1. define the task
     - instance, target variable/labels; collect and label/annotate data
     - e.g. credit risk assessment: one instance = a credit request, label = good/bad credit, data = requests from the previous year
  2. define and collect/calculate features; define train / validation (development) / test (evaluation) data
  3. pick a learning algorithm (e.g. decision tree), train the model
     - train on the training set
     - optimize/set model hyperparameters (e.g. number of instances per leaf, use pruning, …) according to performance on the validation data
       - cross-validation: use all training data as validation data
     - test model accuracy on the (blind) test set
  4. ready to use the model to predict unseen instances, with an expected accuracy similar to that seen on the test set

Try this in Weka

=== Run information ===
Relation:   segment
Instances:  1500
Attributes: 20
Test mode:  split 80.0% train, remainder test

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Correctly Classified Instances     290    96.6667 %
Incorrectly Classified Instances    10     3.3333 %

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
Correctly Classified Instances     281    93.6667 %
Incorrectly Classified Instances    19     6.3333 %
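An analogous experiment sketched in scikit-learn rather than Weka (an assumption, not the slides' setup): J48's -M option (minimum instances per leaf) roughly corresponds to min_samples_leaf, and the dataset differs, so the exact numbers will not match the output above.

```python
# Mirror the J48 -M 2 vs. -M 12 comparison on an 80/20 split.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

for m in (2, 12):
    clf = DecisionTreeClassifier(min_samples_leaf=m, random_state=0)
    clf.fit(X_tr, y_tr)
    print(f"min_samples_leaf={m}: test accuracy = {clf.score(X_te, y_te):.4f}")
```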

Model complexity

- Fitting a polynomial regression:

  $a(x) = \sum_{n=0}^{M} \alpha_n x^n$

- By, for instance, least squares:

  $\alpha = \arg\min_{\alpha} \sum_{j=1}^{l} \Big( y_j - \sum_{n=0}^{M} \alpha_n x_j^n \Big)^2$

[Figure: fits of the polynomial to the sample points (t vs. x) for M = 0, M = 1, M = 3 and M = 9]
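A hedged numpy sketch (not the slides' data) of the same idea: fit polynomials of increasing degree M by least squares and watch the training error shrink toward interpolation. The sine-plus-noise data mimics the classic textbook example and is an assumption.

```python
# Fit polynomials of degree M to noisy 1-D data by least squares.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

for M in (0, 1, 3, 9):
    coeffs = np.polyfit(x, t, deg=M)              # least-squares fit, degree M
    rmse = np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2))
    print(f"M={M}: training RMSE = {rmse:.3f}")   # shrinks with M; M=9 (nearly) interpolates
```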

Data size and model complexity

- Important concept: discriminative power of the algorithm
  - linear vs. nonlinear model
  - some theoretical aspects: a 1-hidden-layer NN with unlimited hidden nodes can perfectly model any smooth function/surface

Data size and model complexity

- Overfitting: the model perfectly learns to classify the training data, but has no (bad) generalization ability
  - results in high test error (useless model)
  - typical for small sample sizes and powerful models

- Underfitting: the model is not capable of learning the (complex) patterns in the training set

- Reasons for underfitting and overfitting:
  - lack of discriminative power
  - small sample size
  - noise in the data (labels or features)
  - the generalization ability of the algorithm has to be chosen w.r.t. the sample size

- Size ("complexity") of the learnt model grows with data size
  - if the data is consistent, this is OK

Predictions – Confusion matrix

- TP: p classified as p
- FP: n classified as p
- TN: n classified as n
- FN: p classified as n
- Good prediction: TP + TN
- Error: FP (false alarm) + FN (miss)

Evaluation measures

- Accuracy
  - The rate of correct predictions made by the model over a data set (cf. coverage).
  - (TP+TN) / (TP+FN+FP+TN)

- Error rate
  - The rate of incorrect predictions made by the model over a data set (cf. coverage).
  - (FP+FN) / (TP+FN+FP+TN)

- [Root]?[Mean|Absolute][Squared]?Error
  - The difference between the predicted and actual values, e.g.

    $RMSE = \sqrt{\frac{1}{n} \sum \big( f(x) - y \big)^2}$

- Algorithms (e.g. those in Weka) typically optimize these
  - there might be a mismatch between the optimization objective and the actual evaluation measure
  - optimizing different measures is research on its own (e.g. in ML for IR, a.k.a. learning to rank)
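A small sketch (not from the slides) computing these measures from made-up counts and predictions.

```python
# Accuracy / error rate from confusion counts, RMSE from predictions;
# all numbers are made up for illustration.
import math

TP, FP, TN, FN = 50, 5, 40, 5
accuracy = (TP + TN) / (TP + FN + FP + TN)
error_rate = (FP + FN) / (TP + FN + FP + TN)

preds  = [2.5, 0.0, 2.1, 7.8]   # f(x)
actual = [3.0, -0.5, 2.0, 7.0]  # y
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(preds, actual)) / len(preds))

print(accuracy, error_rate, rmse)
```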

Evaluation measures

TP: p classified as p    FP: n classified as p    TN: n classified as n    FN: p classified as n

- Precision
  - Fraction of correctly predicted positives among all predicted positives
  - TP/(TP+FP)

- Recall
  - Fraction of correctly predicted positives among all actual positives
  - TP/(TP+FN)

- F measure
  - weighted harmonic mean of Precision and Recall (usually equally weighted, β=1)

    $F_\beta = \frac{(1 + \beta^2) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}$

  - Only makes sense for a subset of classes (usually measured for a single class)
  - Computed over all classes, it equals the accuracy
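A sketch (not from the slides) of these three measures as a function; the counts reuse the made-up values above.

```python
# Precision, recall and F_beta from confusion counts.
def precision_recall_f(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

print(precision_recall_f(tp=50, fp=5, fn=5))   # equally weighted F1 (beta=1)
```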

Evaluation measures

- Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
  - A sequence of tokens with the same label is treated as a single instance
  - John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.
  - Why? We need complete phrases to be identified correctly
  - How? With an external evaluation script, e.g. conlleval for NER

- Example tagging:
  - John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.

- Multiple penalty:
  - 3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  - 2 FPs: Johns Hopkins (PER) and University (ORG)
  - 1 FN: Johns Hopkins University (ORG)
  - F(PER) = 0.67, F(ORG) = 0.5
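A hedged sketch of phrase-level matching (not the official conlleval script): collapse same-label token runs into phrases and compare the gold and predicted phrase sets for the example above.

```python
# Phrase-level TP/FP/FN for the example above (simplified conlleval-style logic).
def phrases(tagged):
    """Collapse consecutive tokens with the same non-O label into (phrase, label) pairs."""
    out, cur_toks, cur_lab = [], [], None
    for tok, lab in tagged + [("", "O")]:        # sentinel flushes the last phrase
        if lab != cur_lab:
            if cur_lab not in (None, "O"):
                out.append((" ".join(cur_toks), cur_lab))
            cur_toks, cur_lab = [], lab
        cur_toks.append(tok)
    return out

gold = [("John", "PER"), ("studied", "O"), ("at", "O"), ("the", "O"),
        ("Johns", "ORG"), ("Hopkins", "ORG"), ("University", "ORG"),
        ("before", "O"), ("joining", "O"), ("IBM", "ORG")]
pred = [("John", "PER"), ("studied", "O"), ("at", "O"), ("the", "O"),
        ("Johns", "PER"), ("Hopkins", "PER"), ("University", "ORG"),
        ("before", "O"), ("joining", "O"), ("IBM", "ORG")]

g, p = set(phrases(gold)), set(phrases(pred))
print("TP:", g & p)   # matched phrases
print("FP:", p - g)   # predicted but not in gold
print("FN:", g - p)   # gold but not predicted
```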

Loss types

1. The real loss function given to us by the world. Typically involves notions of money saved, time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.
2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance assessments, etc. We can perform these evaluations, but they are slow and costly. They require humans in the loop.
3. Automatic correlation-driven functions. Typical examples are BLEU, ROUGE, word error rate, mean average precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.
4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), F-score (for parsing, chunking and named entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.

Be careful what you are optimizing! Some measures (typically of Type 4) become dysfunctional when you are optimizing them!
- phrase P/R/F, e.g. in NER
- readability measures

Evaluation measures

- Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
  - Gold: John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.

- Example tagging 1:
  - John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.
  - 3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  - 2 FPs: Johns Hopkins (PER) and University (ORG)
  - 1 FN: Johns Hopkins University (ORG)
  - F(PER) = 0.67, F(ORG) = 0.5

- Example tagging 2:
  - John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG.
  - 3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  - 0 FP
  - 1 FN: Johns Hopkins University (ORG)
  - F(PER) = 1.0, F(ORG) = 0.67

- Optimizing phrase-F can encourage / prefer systems that do not mark entities!
  - most likely, this is bad!

ROC curve

- ROC – Receiver Operating Characteristic curve
- Curve that depicts the relation between recall (sensitivity) and false positives (1-specificity)

[Figure: ROC curves ranging from the worst case (diagonal) to the best case; y-axis: sensitivity (recall), x-axis: false positive rate, FP / (FP+TN)]

Evaluation measures

- Area under the ROC curve, AUC
  - As you vary the decision threshold, you can plot the recall vs. the false positive rate
  - The area under the curve measures how accurately your model separates positives from negatives
  - perfect ranking: AUC = 1.0
  - random decision: AUC = 0.5

- Similarly (e.g. in IR): area under the P/R curve
  - when there are too many (true) negatives
  - correctly identifying negatives is not interesting anyway
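A hedged sketch (not from the slides) of both areas with scikit-learn; the labels and scores are made up.

```python
# ROC AUC and area under the precision/recall curve from model scores.
from sklearn.metrics import average_precision_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]  # confidence for the positive class

print("ROC AUC:       ", roc_auc_score(y_true, y_score))
print("Area under P/R:", average_precision_score(y_true, y_score))
```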

Evaluation measures (Ranking)

- Precision @ K
  - number of true positives in the top K predictions / ranks

- MAP
  - The average of precisions computed at the position of each of the positives in the ranked list (P = 0 for positives not ranked at all)

- NDCG
  - For graded relevance / ranking
  - Highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically proportional to the position of the result.
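A sketch (not from the slides) of Precision@K and average precision for a single ranked list; MAP is the mean of AP over all queries, and the document IDs are made up.

```python
# Precision@K and average precision (AP) for one ranked list.
def precision_at_k(relevant, ranked, k):
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(relevant, ranked):
    hits, precisions = 0, []
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            precisions.append(hits / i)
    # positives never retrieved contribute precision 0 (they stay in the denominator)
    return sum(precisions) / len(relevant) if relevant else 0.0

relevant = {"d1", "d4", "d7"}
ranked   = ["d1", "d2", "d4", "d3", "d5"]
print(precision_at_k(relevant, ranked, k=3))   # 2/3
print(average_precision(relevant, ranked))     # (1/1 + 2/3) / 3
```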

Learning curve

- Measures how the accuracy / error of the model changes with
  - sample size
  - iteration number

- Smaller sample
  - worse accuracy
  - bias in the estimate more likely (is the sample representative?)
  - variance in the estimate

- Typical learning curve: accuracy rises steeply, then flattens out as more data is added
- If it looks different:
  - you are plotting error vs. size/iteration
  - you are doing something wrong!
  - overfitting (iteration, not sample size)!
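A hedged sketch (not from the slides) of computing a learning curve with scikit-learn; the dataset and learner are illustrative assumptions.

```python
# Cross-validated accuracy as a function of training-set size.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, s in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> CV accuracy {s:.3f}")
```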

Data or Algorithm?

- Compare the accuracy of various machine learning algorithms with a varying amount of training data (Banko & Brill, 2001):
  - Winnow
  - perceptron
  - naïve Bayes
  - memory-based learner

- Features:
  - bag of words: words within a window of the target word
  - collocations containing specific words and/or parts of speech

- Training corpus: 1 billion words from a variety of English texts (news articles, literature, scientific abstracts, etc.)

Take home messages (up until now)

- Supervised learning: based on a set of labeled examples (x, f(x)), learn the input-output mapping, i.e. f(x)

- 3 factors of successful machine learning models
  - much data
  - good features
  - well-suited learning algorithm

- ML workflow
  1. problem definition
  2. feature engineering; experimental setup (train, validation, test)
  3. selection of learning algorithm, (hyper)parameter tuning, training a final model
  4. predict unseen examples & fill tables / draw figures for the paper - test

- Careful with
  - data representation (i.i.d., comparability, …)
  - experimental setup (cross-validation, blind testing, …)
  - data size and algorithm selection (+ overfitting, underfitting, …)
  - evaluation measures