Machine Learning Tutorial: Machine Learning Basic Concepts – TU Darmstadt

Machine Learning Tutorial CB, GS, REC

Section 1

Machine Learning basic concepts

Machine Learning Tutorial for the UKP lab, June 10, 2011

This deck includes some slides/slide parts/text taken from online materials created by the following people: Greg Grudic, Alexander Vezhnevets, Hal Daumé III

What is Machine Learning?

- "The goal of machine learning is to build computer systems that can adapt and learn from their experience." – Tom Dietterich


A Generic System

[Figure: a system box receiving inputs x1 … xN, containing hidden variables h1, h2, ..., hK, and producing outputs y1 … yM]

- Input variables:  x = (x1, x2, ..., xN)
- Hidden variables: h = (h1, h2, ..., hK)
- Output variables: y = (y1, y2, ..., yM)

When are ML algorithms NOT needed?

- When the relationships between all system variables (input, output, and hidden) are completely understood!
- This is NOT the case for almost any real system!


The Sub-Fields of ML

- Supervised Learning
- Reinforcement Learning
- Unsupervised Learning


Supervised Learning

- Given: training examples $\{(x_1, f(x_1)), (x_2, f(x_2)), \ldots, (x_P, f(x_P))\}$ for some unknown function (system) $y = f(x)$
- Find $f(x)$
- Predict $y' = f(x')$, where $x'$ is not in the training set

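A minimal sketch of this setting in Python with scikit-learn (not part of the original slides): the toy data, the labels, and the choice of LogisticRegression as the hypothesis class are all illustrative assumptions.

```python
# Minimal supervised-learning sketch: learn f from labeled pairs (x, f(x)),
# then predict on an unseen x'.
from sklearn.linear_model import LogisticRegression

X_train = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.8]]  # examples x1..xP
y_train = [0, 1, 1, 0]                                      # labels f(x1)..f(xP)

model = LogisticRegression()   # hypothesis class for f
model.fit(X_train, y_train)    # "find f(x)" from the training examples

x_new = [[0.7, 0.3]]           # x' not in the training set
print(model.predict(x_new))    # y' = f(x')
```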

Model, model quality

- Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
- Learned hypothesis: model of the problem/task T
- Model quality: accuracy/performance measured by P


Data / Examples / Sample / Instances

- Data: experience E in the form of examples / instances, characteristic of the whole input space
  - representative sample
  - independent and identically distributed (no bias in selection / observations)

- Good example
  - 1000 abstracts chosen randomly out of 20M PubMed entries (abstracts)
  - probably i.i.d.
  - representative?
  - if annotation is involved, it is always a question of compromises

- Definitely bad example
  - all abstracts that have John Smith as an author

- Instances have to be comparable to each other

Data / Examples / Sample / Instances

- Example: a set of queries and a set of top retrieved documents for each (characterized via tf, idf, tf*idf, PRank, BM25 scores) → try predicting relevance for reranking!
  - the top retrieved set is dependent on the underlying IR system!
  - issues with representativeness, but for reranking this is fine
  - the characterization is dependent on the query (except PRank), i.e. only certain pairs (for the same Q) are meaningfully comparable (cf. independent examples for the same Q)
  - we have to normalize the features per query to have the same mean/variance (see the sketch below)
  - we have to form pairs and compare e.g. the difference of feature values

- Toy example:
  - Q = "learning":    rank 1: tf = 15,  rank 100: tf = 2
  - Q = "overfitting": rank 1: tf = 2,   rank 10:  tf = 0
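A hedged sketch of per-query feature normalization (not from the slides); the pandas-based approach and the column names are assumptions.

```python
# Per-query z-score normalization: make tf values comparable across
# queries by giving them zero mean and unit variance within each query.
import pandas as pd

df = pd.DataFrame({
    "query": ["learning", "learning", "overfitting", "overfitting"],
    "tf":    [15, 2, 2, 0],
})

df["tf_norm"] = df.groupby("query")["tf"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0))
print(df)
```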

Features

- The available examples (experience) have to be described to the algorithm in a consumable format
  - Here: examples are represented as vectors of pre-defined features
  - E.g. for credit risk assessment, typical features can be: income range, debt load, employment history, real estate properties, criminal record, city of residence, etc.

- Common feature types (a toy encoding is sketched below):
  - binary   (criminal record, Y/N)
  - nominal  (city of residence, X)
  - ordinal  (income range, 0-10K, 10-20K, …)
  - numeric  (debt load, $)
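An illustrative sketch (not from the slides) of encoding one credit-risk applicant as a feature vector; the feature names, values, and the use of scikit-learn's DictVectorizer are assumptions.

```python
# Encoding one (made-up) credit-risk applicant as a feature vector.
from sklearn.feature_extraction import DictVectorizer

applicant = {
    "criminal_record": 0,      # binary: Y/N -> 1/0
    "city": "Darmstadt",       # nominal
    "income_range": 2,         # ordinal: 0 = 0-10K, 1 = 10-20K, 2 = 20-30K, ...
    "debt_load": 12500.0,      # numeric, in $
}

# Nominal string features get one-hot encoded; numeric ones pass through.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform([applicant])
print(vec.get_feature_names_out())
print(X)
```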

Machine Learning Tutorial CB, GS, REC

Section 2

Experimental practice … by now you have learned what machine learning is; in the supervised approach you need (carefully selected / prepared) examples that you describe through features; the algorithm then learns a model of the problem based on the examples (usually some kind of optimization is performed in the background); and as a result, improvement is observed in terms of some performance measure …

Machine Learning Tutorial for the UKP lab June 10, 2011

Model parameters

- 2 kinds of parameters
  - one the user sets for the training procedure in advance – hyperparameter
    - the degree of the polynomial to fit in regression
    - in a neural network: number/size of hidden layers
    - number of instances per leaf in a decision tree
  - one that actually gets optimized through the training – parameter
    - regression coefficients
    - network weights
    - size/depth of the decision tree (in Weka; other implementations might allow you to control that)
  - we usually do not talk about the latter, but refer to hyperparameters as parameters

- Hyperparameters
  - the fewer the algorithm has, the better
  - Naive Bayes the best? No parameters!
  - usually algorithms with better discriminative power are not parameter-free
  - typically set to optimize performance (on a validation set, or through cross-validation)
    - manual, grid search, simulated annealing, gradient descent, etc.
  - common pitfall: select the hyperparameters via CV, e.g. 10-fold, and report the cross-validation results (see the sketch below for a cleaner setup)
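A hedged sketch (not from the slides) of hyperparameter tuning that avoids the pitfall above: tune via cross-validation on the training part only, then report on a blind test split. The dataset, the parameter grid, and the scikit-learn workflow are illustrative assumptions.

```python
# Tune a hyperparameter by 10-fold CV on the training data, then report
# accuracy on a held-out test split that took no part in the tuning.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# min_samples_leaf plays the role of the "instances per leaf" hyperparameter
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"min_samples_leaf": [1, 2, 5, 10, 20]},
                    cv=10)
grid.fit(X_tr, y_tr)

print("best hyperparameter:", grid.best_params_)
print("blind-test accuracy:", grid.score(X_te, y_te))
```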

Cross-validation, Illustration

[Figure: the sample X = {x1, ..., xk} is split into 5 folds X1 … X5; in each iteration one fold (e.g. X4) serves as Test and the remaining folds as Train]

- The result is an average over all iterations

Cross-Validation

- n-fold CV: common practice for making (hyper)parameter estimation more robust
  - round-robin training/testing n times, with (n-1)/n of the data to train and 1/n of the data to evaluate the model
  - typical: random splits, without replacement (each instance is tested exactly once)
  - the other way: random subsampling cross-validation

- n-fold CV: common practice to report average performance, deviation, etc.
  - No Unbiased Estimator of the Variance of K-Fold Cross-Validation (Bengio and Grandvalet, 2004)
  - bad practice? problem: training sets largely overlap, test errors are also dependent
  - tends to underestimate the real variance of CV (thus e.g. confidence intervals are to be treated with extreme caution)
  - 5x2 CV is a better option: do 2-fold CV and repeat 5 times, calculate the average – less overlap between the training sets

- Folding via natural units of processing for the given task (see the sketch below)
  - typically, document boundaries – best practice is doing it yourself!
  - the ML package / CSV representation is not aware of e.g. document boundaries!
  - the PPI case
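A hedged sketch (not from the slides) of folding along natural units using scikit-learn's GroupKFold; the toy instances and document IDs are assumptions.

```python
# Build CV folds along document boundaries so that instances from the
# same document never appear in both the train and the test fold.
from sklearn.model_selection import GroupKFold

X       = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]   # one row per instance
y       = [0, 1, 0, 1, 0, 1]
doc_ids = ["d1", "d1", "d2", "d2", "d3", "d3"]          # natural processing units

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=doc_ids):
    print("train:", train_idx, "test:", test_idx)
```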

Cross-Validation

- Ideally, the valid settings are:
  - take off-the-shelf algorithms, avoid parameter tuning, and compare the results, e.g. via cross-validation
    - n.b. you probably do the folding yourself, trying to minimize biases!
  - do parameter tuning (n.b. selecting/tuning your features is also tuning!), but then normally you have to have a blind set (from the beginning)
    - e.g. have a look at shared tasks, e.g. CoNLL – a practical way to learn experimental best practice is to align with the predefined standards (you might even benefit from comparative results, etc.)

- You might want to do something different – be aware of these & the consequences

The ML workflow

- Common ML experimenting pipeline
  1. define the task
     - instance, target variable/labels; collect and label/annotate data
     - e.g. credit risk assessment: one instance = a credit request, label = good/bad credit, data = requests from the previous year
  2. define and collect/calculate features; define train / validation (development) / test (evaluation) data
  3. pick a learning algorithm (e.g. decision tree), train the model
     - train on the training set
     - optimize/set model hyperparameters (e.g. number of instances per leaf, use pruning, …) according to performance on the validation data
       - cross-validation: use all training data as validation data
     - test model accuracy on the (blind) test set
  4. ready to use the model to predict unseen instances, with an expected accuracy similar to that seen on the test set

Try this in Weka

=== Run information ===
Relation:   segment
Instances:  1500
Attributes: 20
Test mode:  split 80.0% train, remainder test

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Correctly Classified Instances     290    96.6667 %
Incorrectly Classified Instances    10     3.3333 %

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
Correctly Classified Instances     281    93.6667 %
Incorrectly Classified Instances    19     6.3333 %
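An analogous experiment sketched in scikit-learn rather than Weka (an assumption, not the slides' setup): J48's -M option (minimum instances per leaf) roughly corresponds to min_samples_leaf, and the dataset differs, so the exact numbers will not match the output above.

```python
# Mirror the J48 -M 2 vs. -M 12 comparison on an 80/20 split.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

for m in (2, 12):
    clf = DecisionTreeClassifier(min_samples_leaf=m, random_state=0)
    clf.fit(X_tr, y_tr)
    print(f"min_samples_leaf={m}: test accuracy = {clf.score(X_te, y_te):.4f}")
```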

Model complexity

- Fitting a polynomial regression:

  $a(x) = \sum_{n=0}^{M} \alpha_n x^n$

- By, for instance, least squares:

  $\alpha = \arg\min_{\alpha} \sum_{j=1}^{l} \Big( y_j - \sum_{n=0}^{M} \alpha_n x_j^n \Big)^2$

[Figure: fits of the polynomial to the sample points (t vs. x) for M = 0, M = 1, M = 3 and M = 9]
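A hedged numpy sketch (not the slides' data) of the same idea: fit polynomials of increasing degree M by least squares and watch the training error shrink toward interpolation. The sine-plus-noise data mimics the classic textbook example and is an assumption.

```python
# Fit polynomials of degree M to noisy 1-D data by least squares.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

for M in (0, 1, 3, 9):
    coeffs = np.polyfit(x, t, deg=M)              # least-squares fit, degree M
    rmse = np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2))
    print(f"M={M}: training RMSE = {rmse:.3f}")   # shrinks with M; M=9 (nearly) interpolates
```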

Data size and model complexity

- Important concept: discriminative power of the algorithm
  - linear vs. nonlinear model
  - some theoretical aspects: a 1-hidden-layer NN with unlimited hidden nodes can perfectly model any smooth function/surface

Data size and model complexity

- Overfitting: the model perfectly learns to classify the training data, but has no (bad) generalization ability
  - results in high test error (useless model)
  - typical for small sample sizes and powerful models

- Underfitting: the model is not capable of learning the (complex) patterns in the training set

- Reasons for underfitting and overfitting:
  - lack of discriminative power
  - small sample size
  - noise in the data (labels or features)
  - the generalization ability of the algorithm has to be chosen w.r.t. the sample size

- Size ("complexity") of the learnt model grows with data size
  - if the data is consistent, this is OK

Predictions – Confusion matrix

- TP: p classified as p
- FP: n classified as p
- TN: n classified as n
- FN: p classified as n
- Good prediction: TP + TN
- Error: FP (false alarm) + FN (miss)

Evaluation measures

- Accuracy
  - The rate of correct predictions made by the model over a data set (cf. coverage).
  - (TP+TN) / (TP+FN+FP+TN)

- Error rate
  - The rate of incorrect predictions made by the model over a data set (cf. coverage).
  - (FP+FN) / (TP+FN+FP+TN)

- [Root]?[Mean|Absolute][Squared]?Error
  - The difference between the predicted and actual values, e.g.

    $RMSE = \sqrt{\frac{1}{n} \sum \big( f(x) - y \big)^2}$

- Algorithms (e.g. those in Weka) typically optimize these
  - there might be a mismatch between the optimization objective and the actual evaluation measure
  - optimizing different measures is research on its own (e.g. in ML for IR, a.k.a. learning to rank)
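A small sketch (not from the slides) computing these measures from made-up counts and predictions.

```python
# Accuracy / error rate from confusion counts, RMSE from predictions;
# all numbers are made up for illustration.
import math

TP, FP, TN, FN = 50, 5, 40, 5
accuracy = (TP + TN) / (TP + FN + FP + TN)
error_rate = (FP + FN) / (TP + FN + FP + TN)

preds  = [2.5, 0.0, 2.1, 7.8]   # f(x)
actual = [3.0, -0.5, 2.0, 7.0]  # y
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(preds, actual)) / len(preds))

print(accuracy, error_rate, rmse)
```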

Evaluation measures

TP: p classified as p    FP: n classified as p    TN: n classified as n    FN: p classified as n

- Precision
  - Fraction of correctly predicted positives among all predicted positives
  - TP/(TP+FP)

- Recall
  - Fraction of correctly predicted positives among all actual positives
  - TP/(TP+FN)

- F measure
  - weighted harmonic mean of Precision and Recall (usually equally weighted, β=1)

    $F_\beta = \frac{(1 + \beta^2) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}$

  - Only makes sense for a subset of classes (usually measured for a single class)
  - Computed over all classes, it equals the accuracy
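A sketch (not from the slides) of these three measures as a function; the counts reuse the made-up values above.

```python
# Precision, recall and F_beta from confusion counts.
def precision_recall_f(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

print(precision_recall_f(tp=50, fp=5, fn=5))   # equally weighted F1 (beta=1)
```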

Evaluation measures

- Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
  - A sequence of tokens with the same label is treated as a single instance
  - John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.
  - Why? We need complete phrases to be identified correctly
  - How? With an external evaluation script, e.g. conlleval for NER

- Example tagging:
  - John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.

- Multiple penalty:
  - 3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  - 2 FPs: Johns Hopkins (PER) and University (ORG)
  - 1 FN: Johns Hopkins University (ORG)
  - F(PER) = 0.67, F(ORG) = 0.5
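A hedged sketch of phrase-level matching (not the official conlleval script): collapse same-label token runs into phrases and compare the gold and predicted phrase sets for the example above.

```python
# Phrase-level TP/FP/FN for the example above (simplified conlleval-style logic).
def phrases(tagged):
    """Collapse consecutive tokens with the same non-O label into (phrase, label) pairs."""
    out, cur_toks, cur_lab = [], [], None
    for tok, lab in tagged + [("", "O")]:        # sentinel flushes the last phrase
        if lab != cur_lab:
            if cur_lab not in (None, "O"):
                out.append((" ".join(cur_toks), cur_lab))
            cur_toks, cur_lab = [], lab
        cur_toks.append(tok)
    return out

gold = [("John", "PER"), ("studied", "O"), ("at", "O"), ("the", "O"),
        ("Johns", "ORG"), ("Hopkins", "ORG"), ("University", "ORG"),
        ("before", "O"), ("joining", "O"), ("IBM", "ORG")]
pred = [("John", "PER"), ("studied", "O"), ("at", "O"), ("the", "O"),
        ("Johns", "PER"), ("Hopkins", "PER"), ("University", "ORG"),
        ("before", "O"), ("joining", "O"), ("IBM", "ORG")]

g, p = set(phrases(gold)), set(phrases(pred))
print("TP:", g & p)   # matched phrases
print("FP:", p - g)   # predicted but not in gold
print("FN:", g - p)   # gold but not predicted
```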

Loss types

1. The real loss function given to us by the world. Typically involves notions of money saved, time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.
2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance assessments, etc. We can perform these evaluations, but they are slow and costly. They require humans in the loop.
3. Automatic correlation-driven functions. Typical examples are BLEU, ROUGE, word error rate, mean average precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.
4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), F-score (for parsing, chunking and named entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.

Be careful what you are optimizing! Some measures (typically of Type 4) become dysfunctional when you are optimizing them!
- phrase P/R/F, e.g. in NER
- readability measures

Evaluation measures

- Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
  - Gold: John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.

- Example tagging 1:
  - John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.
  - 3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  - 2 FPs: Johns Hopkins (PER) and University (ORG)
  - 1 FN: Johns Hopkins University (ORG)
  - F(PER) = 0.67, F(ORG) = 0.5

- Example tagging 2:
  - John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG.
  - 3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  - 0 FP
  - 1 FN: Johns Hopkins University (ORG)
  - F(PER) = 1.0, F(ORG) = 0.67

- Optimizing phrase-F can encourage / prefer systems that do not mark entities!
  - most likely, this is bad!

ROC curve

- ROC – Receiver Operating Characteristic curve
- Curve that depicts the relation between recall (sensitivity) and false positives (1-specificity)

[Figure: ROC curves ranging from the worst case (diagonal) to the best case; y-axis: sensitivity (recall), x-axis: false positive rate, FP / (FP+TN)]

Evaluation measures

- Area under the ROC curve, AUC
  - As you vary the decision threshold, you can plot the recall vs. the false positive rate
  - The area under the curve measures how accurately your model separates positives from negatives
  - perfect ranking: AUC = 1.0
  - random decision: AUC = 0.5

- Similarly (e.g. in IR): area under the P/R curve
  - when there are too many (true) negatives
  - correctly identifying negatives is not interesting anyway
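A hedged sketch (not from the slides) of both areas with scikit-learn; the labels and scores are made up.

```python
# ROC AUC and area under the precision/recall curve from model scores.
from sklearn.metrics import average_precision_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]  # confidence for the positive class

print("ROC AUC:       ", roc_auc_score(y_true, y_score))
print("Area under P/R:", average_precision_score(y_true, y_score))
```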

Evaluation measures (Ranking)

- Precision @ K
  - number of true positives in the top K predictions / ranks

- MAP
  - The average of precisions computed at the position of each of the positives in the ranked list (P = 0 for positives not ranked at all)

- NDCG
  - For graded relevance / ranking
  - Highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically proportional to the position of the result.
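A sketch (not from the slides) of Precision@K and average precision for a single ranked list; MAP is the mean of AP over all queries, and the document IDs are made up.

```python
# Precision@K and average precision (AP) for one ranked list.
def precision_at_k(relevant, ranked, k):
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(relevant, ranked):
    hits, precisions = 0, []
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            precisions.append(hits / i)
    # positives never retrieved contribute precision 0 (they stay in the denominator)
    return sum(precisions) / len(relevant) if relevant else 0.0

relevant = {"d1", "d4", "d7"}
ranked   = ["d1", "d2", "d4", "d3", "d5"]
print(precision_at_k(relevant, ranked, k=3))   # 2/3
print(average_precision(relevant, ranked))     # (1/1 + 2/3) / 3
```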

Learning curve

- Measures how the accuracy / error of the model changes with
  - sample size
  - iteration number

- Smaller sample
  - worse accuracy
  - bias in the estimate more likely (is the sample representative?)
  - variance in the estimate

- Typical learning curve: accuracy rises steeply, then flattens out as more data is added
- If it looks different:
  - you are plotting error vs. size/iteration
  - you are doing something wrong!
  - overfitting (iteration, not sample size)!
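A hedged sketch (not from the slides) of computing a learning curve with scikit-learn; the dataset and learner are illustrative assumptions.

```python
# Cross-validated accuracy as a function of training-set size.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, s in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> CV accuracy {s:.3f}")
```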

Data or Algorithm?

- Compare the accuracy of various machine learning algorithms with a varying amount of training data (Banko & Brill, 2001):
  - Winnow
  - perceptron
  - naïve Bayes
  - memory-based learner

- Features:
  - bag of words: words within a window of the target word
  - collocations containing specific words and/or parts of speech

- Training corpus: 1 billion words from a variety of English texts (news articles, literature, scientific abstracts, etc.)

Take home messages (up until now)

- Supervised learning: based on a set of labeled examples (x, f(x)), learn the input-output mapping, i.e. f(x)

- 3 factors of successful machine learning models
  - much data
  - good features
  - well-suited learning algorithm

- ML workflow
  1. problem definition
  2. feature engineering; experimental setup (train, validation, test)
  3. selection of learning algorithm, (hyper)parameter tuning, training a final model
  4. predict unseen examples & fill tables / draw figures for the paper - test

- Careful with
  - data representation (i.i.d., comparability, …)
  - experimental setup (cross-validation, blind testing, …)
  - data size and algorithm selection (+ overfitting, underfitting, …)
  - evaluation measures