Kriti Puniyani - Meetup

Kriti Puniyani Carnegie Mellon University [email protected]

About me
- Graduate student at Carnegie Mellon University
- Statistical machine learning
  - Topic models
  - Sparse network learning
  - Optimization
- Domains of interest
  - Social media analysis
  - Systems biology
  - Genetics
  - Sentiment analysis
  - Text processing


Machine learning
- Computers that "learn with experience"
- Learn: to be able to predict "unseen" things
- Many applications
  - Search
  - Machine translation
  - Speech recognition
  - Vision: identify cars, people, sky, apples
  - Robot control
- Introductions:
  - http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
  - http://videolectures.net/mlss2010_lawrence_mlfcs/


Classification
- Is this the digit "9"?
- Will this patient survive?
- Will this user click on my ad?


Predict the next coin toss
- IRLS: http://www.cs.cmu.edu/~ggordon/IRLS-example/logistic.m
- Implement your own
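For the "implement your own" route, here is a minimal numpy sketch of IRLS (Newton's method) for fitting logistic regression. It is not the linked MATLAB code; the synthetic data and iteration counts are illustrative only.

```python
# A minimal IRLS / Newton's-method sketch for fitting logistic regression,
# assuming y is coded 0/1 and X already includes a column of ones for the intercept.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Maximize the log-likelihood of logistic regression with Newton/IRLS steps."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ w)                     # current P(Y=1 | X, w)
        W = p * (1.0 - p)                      # IRLS weights
        grad = X.T @ (y - p)                   # gradient of the log-likelihood
        hess = X.T @ (X * W[:, None])          # X^T diag(W) X
        step = np.linalg.solve(hess + 1e-10 * np.eye(d), grad)
        w = w + step
        if np.linalg.norm(step) < tol:
            break
    return w

# Toy usage on synthetic data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.random(200) < sigmoid(2.0 * X[:, 1] - 0.5)).astype(float)
print("estimated weights:", fit_logistic_irls(X, y))
```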


Decision surface is linear
(Figure: data points labelled Y=0 and Y=1, with the misclassified points marked as errors.)


Decision surface is linear
(Figure from http://www.cs.technion.ac.il/~rani/LocBoost/)


So far…
- Logistic regression is a binary classifier (a multinomial version exists)
- P(Y=1|X,w) is a logistic function
- Inference: compute P(Y=1|X,w), and do "rounding"
- Parameters are learnt by maximizing the log-likelihood of the data
- Decision surface is linear (a kernelized version exists)
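As a concrete illustration of the inference step, the sketch below computes P(Y=1|X,w) with the logistic function and "rounds" at 0.5; the weights are made up, not learnt from data.

```python
# A sketch of logistic-regression inference: compute P(Y=1 | X, w) with the
# logistic (sigmoid) function, then "round" at 0.5 to get a class label.
import numpy as np

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def predict(X, w, b, threshold=0.5):
    return (predict_proba(X, w, b) >= threshold).astype(int)

X = np.array([[0.2, 1.5], [-1.0, 0.3]])
w = np.array([2.0, -1.0])   # hypothetical learnt weights
b = 0.5
print(predict_proba(X, w, b))   # probabilities in (0, 1)
print(predict(X, w, b))         # hard 0/1 labels after rounding
```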


Improvements in the model
- Prevent over-fitting → Regularization
- Maximize accuracy directly → SVMs
- Non-linear decision surface → Kernel trick
- Multi-label data


Occam’s razor

The simplest explanation is most likely the correct one


New and improved learning
- "Best" w == maximize log-likelihood
- Maximum likelihood estimate (MLE)
- Small concern… over-fitting:
  - If the data is linearly separable, ||w|| → ∞


L2 regularization

||w||₂² = Σᵢ wᵢ²

max_w  l(w) − λ ||w||₂²

- Prevents over-fitting
- "Pushes" parameters towards zero
- Equivalent to a prior on the parameters: Normal distribution (0 mean, unit covariance)
- λ: tuning parameter (e.g. 0.1)
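A sketch of the same objective with scikit-learn, where the regularization strength is given as C = 1/λ; the synthetic data is for illustration only.

```python
# L2-regularized logistic regression in scikit-learn; its C is the inverse of
# the λ above, so stronger shrinkage (larger λ) means smaller C.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)

lam = 0.1
clf = LogisticRegression(penalty="l2", C=1.0 / lam)
clf.fit(X, y)
print("weights pushed towards zero:", clf.coef_)
```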


Patient Diagnosis
- Y = disease
- X = [age, weight, BP, blood sugar, MRI, genetic tests, …]
- Don't want all "features" to be relevant
- Weight vector w should be "mostly zeros"


L1 regularization

||w||₁ = Σᵢ |wᵢ|

max_w  l(w) − λ ||w||₁

- Prevents over-fitting
- "Pushes" parameters to zero
- Equivalent to a prior on the parameters: Laplace distribution
- As λ increases, more features get zero weights (irrelevant features)
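A sparse-logistic-regression sketch with scikit-learn showing the "more zeros as λ grows" behaviour; penalty="l1" needs a solver such as "liblinear", and the data is synthetic (only features 0 and 3 matter).

```python
# L1-regularized (sparse) logistic regression; C = 1/λ as before.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = (2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * rng.normal(size=300) > 0).astype(int)

for lam in [0.01, 0.1, 1.0]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / lam)
    clf.fit(X, y)
    n_zero = int(np.sum(np.abs(clf.coef_) < 1e-6))
    print(f"lambda={lam}: {n_zero} of 10 weights are zero")
```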


L1 v/s L2 example
- MLE estimate: [ 11    0.8 ]
- L2 estimate:  [ 10    0.6 ]  → shrinkage
- L1 estimate:  [ 10.2  0 ]    → sparsity

Mini-conclusion:
- L2 optimization is fast; L1 tends to be slower. If you have the computational resources, try both (at the same time)!
- ALWAYS run logistic regression with at least some regularization.
- Corollary: ALWAYS run logistic regression on features that have been standardized (zero mean, unit variance).
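A sketch of that corollary using scikit-learn's StandardScaler in a pipeline; the two synthetic features deliberately live on very different scales.

```python
# Standardize features (zero mean, unit variance) before regularized logistic
# regression, so the penalty treats every feature on the same scale.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(50.0, 10.0, 500),    # a large-scale feature (e.g. weight)
                     rng.normal(0.0, 0.01, 500)])    # a tiny-scale feature
y = (X[:, 1] > 0).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", C=1.0))
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```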


So far…
- Logistic regression
  - Model
  - Inference
  - Learning via maximum likelihood
  - L1 and L2 regularization

Next… SVMs!


Why did we use probability again?
- Aim: maximize "accuracy"
- Logistic regression: an indirect method that maximizes the likelihood of the data
- A much more direct approach is to maximize accuracy directly

Support Vector Machines (SVMs)


Maximize the margin
(Figure © 2005-2007 Carlos Guestrin)


Geometry review
(Figure: the line 2x1 + x2 − 2 = 0 separating the Y=1 and Y=−1 regions.)
- For a point on the line, (0.5, 1): d = 2·0.5 + 1 − 2 = 0
- Signed "distance" to the line from a point (x1⁰, x2⁰): d = 2x1⁰ + x2⁰ − 2


Geometry review
(Same figure: 2x1 + x2 − 2 = 0 separating Y=1 from Y=−1.)
- (1, 2.5): d = 2·1 + 2.5 − 2 = 2.5 > 0
- y(w·x + b) = 1 · 2.5 = 2.5 > γ


Geometry review
(Same figure: 2x1 + x2 − 2 = 0 separating Y=1 from Y=−1.)
- (0.5, 0.5): d = 2·0.5 + 0.5 − 2 = −0.5 < 0
- y(w·x + b) = y·d = (−1)·(−0.5) = 0.5
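A tiny sketch of the geometry above: the signed "distance" w·x + b to the line, and the quantity y(w·x + b) that is positive exactly when a point lies on its correct side.

```python
# Signed "distance" to the line 2x1 + x2 - 2 = 0 and the margin-style quantity
# y*(w.x + b) for the three example points on these slides.
import numpy as np

w = np.array([2.0, 1.0])   # coefficients of 2x1 + x2 - 2 = 0
b = -2.0

for x, y in [((0.5, 1.0), +1), ((1.0, 2.5), +1), ((0.5, 0.5), -1)]:
    d = float(w @ np.array(x) + b)
    print(f"x={x}, y={y:+d}: d={d:+.1f}, y*(w.x+b)={y * d:+.1f}")
```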


Support Vector Machines
Normalized margin – canonical hyperplanes
(Figure © 2005-2007 Carlos Guestrin, with points x+ and x− on the margins.)
- Support vectors are the points touching the margins.

w·x = Σⱼ w(j) x(j)
(Figure © 2005-2007 Carlos Guestrin)

Slack variables
(Figure © 2005-2007 Carlos Guestrin)
- SVMs are made robust by adding "slack variables" that allow training error to be non-zero.
- One slack variable for each data point; the slack variable == 0 for correctly classified points.
- Maximize the margin:  max γ − C Σᵢ ξᵢ




Slack variables

max γ − C Σᵢ ξᵢ

- Need to tune C:
  - high C == minimize mis-classifications
  - low C == maximize the margin
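A sketch of the C trade-off with a linear-kernel SVM in scikit-learn; the overlapping synthetic data and the C values are purely illustrative.

```python
# Large C penalizes slack (mis-classifications) heavily; small C favours a wider margin.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 1.2, size=(100, 2)),
               rng.normal(+1.0, 1.2, size=(100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: training accuracy = {clf.score(X, y):.2f}, "
          f"||w|| = {np.linalg.norm(clf.coef_):.2f}")
```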


SVM summary
- Model:
  - w·x + b > 0 if y = +1
  - w·x + b < 0 if y = −1
- Inference: ŷ = sign(w·x + b)
- Learning: maximize { (margin) − C (slack variables) }

Next … Kernel SVMs


The kernel trick
- Why a linear separator? What if the data looks like the figure below?
- The kernel trick allows you to use SVMs with non-linear separators.
- Different kernels:
  1. Polynomial
  2. Gaussian
  3. Exponential
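For reference, a sketch of the three kernels as plain functions; k(x, z) replaces the dot product x·z inside the SVM, and the degree and bandwidth values below are only examples.

```python
import numpy as np

def polynomial_kernel(x, z, degree=2, c=1.0):
    return (x @ z + c) ** degree

def gaussian_kernel(x, z, bandwidth=1.0):        # a.k.a. the RBF kernel
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * bandwidth ** 2))

def exponential_kernel(x, z, bandwidth=1.0):     # like Gaussian, but with the unsquared distance
    return np.exp(-np.linalg.norm(x - z) / bandwidth)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z), gaussian_kernel(x, z), exponential_kernel(x, z))
```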


(Figures: decision boundaries of logistic regression and a linear SVM on the non-linearly-separable data.)
Error ~ 40% in both cases


Kernel SVM with a polynomial kernel of degree 2
- Error = 7%
- Polynomial kernels of degree 2/4 do very well, but degrees 3/5 do very badly.
- The Gaussian kernel has a tuning parameter (bandwidth). Performance depends on picking the right bandwidth.
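A scikit-learn sketch of these two kernelized SVMs on a ring-shaped toy data set; SVC's gamma plays the role of the Gaussian bandwidth (roughly gamma = 1/(2·bandwidth²)), and all values here are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable, easy for a degree-2 or RBF kernel.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

poly_svm = SVC(kernel="poly", degree=2, C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

print("polynomial (degree 2) accuracy:", poly_svm.score(X, y))
print("Gaussian (RBF) accuracy:       ", rbf_svm.score(X, y))
```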


SVMs summary
- Maximize the margin between positive and negative examples.
- The kernel trick is widely implemented, allowing a non-linear decision surface.
- Not probabilistic.
- Software:
  - SVM-light: http://svmlight.joachims.org/
  - LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  - Weka, MATLAB, R


Demo

http://www.cs.technion.ac.il/~rani/LocBoost


Which to use?
- Linear SVMs and logistic regression work very similarly in most cases.
- Kernelized SVMs work better than linear SVMs (mostly).
- Kernelized logistic regression is possible, but implementations are not easily available.


Recommendations
1. First, try logistic regression. Easy, fast, stable. No "tuning" parameters.
2. Equivalently, you can first try linear SVMs, but you need to tune "C".
3. If the results are "good enough", stop.
4. Else try SVMs with Gaussian kernels. Need to tune the bandwidth and C by using validation data (see the sketch after this list).

If you have more time/computational resources, try random forests as well.

** Recommendations are opinions of the presenter, and not known facts.
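A sketch of step 4's tuning loop with scikit-learn's GridSearchCV, which handles the validation splits internally; the parameter grid and data are illustrative.

```python
# Tune C and the Gaussian bandwidth (gamma) on validation folds, then check on a test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X_train, y_train)
print("best parameters:", grid.best_params_)
print("validation accuracy:", grid.best_score_)
print("test accuracy:", grid.score(X_test, y_test))
```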


In conclusion…
- Logistic Regression
- Support Vector Machines

Other classification approaches…
- Random forests / decision trees
- Naïve Bayes
- Nearest Neighbors
- Boosting (AdaBoost)


Thank you
Questions?


Kriti Puniyani Carnegie Mellon University [email protected]

Is this athlete doing drugs?
- X = blood test to detect drugs
- Y = doped athlete?
- Two types of errors:
  - Athlete is doped, we predict "NO": false negative
  - Athlete is NOT doped, we predict "YES": false positive
- Penalize false positives more than false negatives (see the sketch after this list)
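One way to do that in practice is with per-class weights, as in the scikit-learn sketch below; the 10:1 cost ratio and the synthetic data are purely illustrative.

```python
# Make false positives (predicting "doped" for a clean athlete) cost more than
# false negatives by up-weighting the clean class during training.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# class 0 = clean, class 1 = doped (rare)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

clf = LogisticRegression(class_weight={0: 10.0, 1: 1.0})   # errors on clean athletes cost 10x more
clf.fit(X, y)
print("fraction predicted doped:", clf.predict(X).mean())
```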


Outline
- What is classification?
  - Parameters, data, inference, learning
  - Predicting coin tosses (0-dimensional X)
- Logistic regression
  - Predicting "speaker success" (1-dimensional X)
  - Formulation, optimization
  - Decision surface is linear
  - Interpreting coefficients
  - Hypothesis testing
  - Evaluating the performance of the model
  - Why is it called "regression": log-odds
  - L2 regularization
  - Patient survival (d-dimensional X)
  - L1 regularization
- Support Vector Machines
  - Linear SVMs + formulation
  - What are "support vectors"
  - The kernel trick
- Demo: logistic regression v/s SVMs v/s kernel tricks


Overfitting: a more serious problem
- The same decision boundary can be written with arbitrarily large weights:
  - 2x + y − 2 = 0  →  w = [2 1 −2]
  - 4x + 2y − 4 = 0  →  w = [4 2 −4]
  - 400x + 200y − 400 = 0  →  w = [400 200 −400]
- Absolutely need L2 regularization (see the sketch after this list)
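A small sketch of why this matters: all three weight vectors give the same boundary, but scaling w up pushes P(Y=1|X,w) towards 0 or 1, so the unregularized likelihood keeps rewarding larger ||w|| on separable data. The example point is made up.

```python
# Same line, increasingly extreme probabilities as the weights are scaled up.
import numpy as np

def p_y1(x, w):
    return 1.0 / (1.0 + np.exp(-(w[0] * x[0] + w[1] * x[1] + w[2])))

x = (1.2, 0.1)   # a point slightly on the positive side of 2x + y - 2 = 0
for w in ([2, 1, -2], [4, 2, -4], [400, 200, -400]):
    print(f"w={w}: P(Y=1|x) = {p_y1(x, w):.4f}")
```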
