Kriti Puniyani Carnegie Mellon University
[email protected]
About me
Graduate student at Carnegie Mellon University
Statistical machine learning: topic models, sparse network learning, optimization
Domains of interest: social media analysis, systems biology, genetics, sentiment analysis, text processing
Machine learning
Getting computers to "learn with experience"
Learn: to be able to predict "unseen" things
Many applications: search, machine translation, speech recognition, vision (identify cars, people, sky, apples), robot control
Introductory references:
http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
http://videolectures.net/mlss2010_lawrence_mlfcs/
Classification
Is this the digit "9"?
Will this patient survive ?
Will this user click on my ad ?
IRLS (iteratively reweighted least squares) for fitting logistic regression: http://www.cs.cmu.edu/~ggordon/IRLS-example/logistic.m, or implement your own
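As a minimal "implement your own" sketch (not the IRLS code linked above), logistic regression can also be fit by gradient ascent on the log-likelihood; the NumPy code below assumes binary labels y in {0, 1}, and the function name and learning rate are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Gradient ascent on the log-likelihood of logistic regression."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)          # P(Y=1 | x, w) for every example
        w += lr * (X.T @ (y - p)) / n   # gradient of the (average) log-likelihood w.r.t. w
        b += lr * np.sum(y - p) / n     # ... and w.r.t. the intercept b
    return w, b
```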
Decision surface is linear
[Figure: a linear decision boundary between the Y=0 and Y=1 points; points on the wrong side are errors]
Decision surface is linear
http://www.cs.technion.ac.il/~rani/LocBoost/
So far..
Logistic regression is a binary classifier (a multinomial version exists)
P(Y=1|X,w) is a logistic function
Inference: compute P(Y=1|X,w) and do "rounding"
Parameters are learnt by maximizing the log-likelihood of the data
Decision surface is linear (a kernelized version exists)
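Written out (with the intercept b kept separate), the model and "rounding" rule above are:

```latex
P(Y=1 \mid x, w) = \frac{1}{1 + e^{-(w^\top x + b)}},
\qquad
\hat{y} =
\begin{cases}
1 & \text{if } P(Y=1 \mid x, w) \ge 0.5\\
0 & \text{otherwise.}
\end{cases}
```

The boundary P(Y=1|x,w) = 0.5 is exactly the hyperplane w·x + b = 0, which is why the decision surface is linear.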
Improvements in the model
Prevent over-fitting: regularization
Maximize accuracy directly: SVMs
Non-linear decision surface: the kernel trick
Multi-label data
Occam’s razor
The simplest explanation is most likely the correct one
New and improved learning
"Best" w == maximize the log-likelihood: the Maximum Likelihood Estimate (MLE)
Small concern … over-fitting:
If the data is linearly separable, ||w|| → ∞
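Concretely, for training data (x_i, y_i) with y_i in {0, 1}, the log-likelihood being maximized (the l(w) that appears in the regularized objectives below) is:

```latex
l(w) = \sum_{i=1}^{n} \Big[\, y_i \log P(Y=1 \mid x_i, w) + (1-y_i)\log\big(1 - P(Y=1 \mid x_i, w)\big) \Big]
```

On linearly separable data this can be pushed arbitrarily close to 0 by scaling w up, which is why the weights blow up without regularization.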
L2 regularization
$\| w \|_2^2 = \sum_i w_i^2$
$\max_w \; l(w) - \lambda \| w \|_2^2$
Prevents over-fitting
"Pushes" parameters towards zero
Equivalent to a prior on the parameters: a Normal distribution (zero mean, unit covariance)
λ : tuning parameter (e.g. 0.1)
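A minimal sketch of L2-regularized logistic regression with scikit-learn (an assumption; X_train, y_train, X_test are placeholders). Note that scikit-learn's C plays the role of 1/λ, so a smaller C means stronger regularization.

```python
from sklearn.linear_model import LogisticRegression

lam = 0.1                                   # the lambda tuning parameter above
clf = LogisticRegression(penalty="l2", C=1.0 / lam)
clf.fit(X_train, y_train)                   # X_train: (n_samples, n_features), y_train: 0/1 labels
probs = clf.predict_proba(X_test)[:, 1]     # P(Y=1 | x, w)
preds = clf.predict(X_test)                 # "rounded" predictions
```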
Patient diagnosis
Y = disease, X = [age, weight, BP, blood sugar, MRI, genetic tests, …]
We don't expect all the "features" to be relevant: the weight vector w should be "mostly zeros".
L1 regularization
$\| w \|_1 = \sum_i | w_i |$
$\max_w \; l(w) - \lambda \| w \|_1$
Prevents over-fitting
"Pushes" parameters to zero
Equivalent to a prior on the parameters: a Laplace distribution
As λ increases, more features get zero weights (are deemed irrelevant)
L1 v/s L2 example
MLE estimate: [ 11  0.8 ]
L2 estimate: [ 10  0.6 ]  (shrinkage)
L1 estimate: [ 10.2  0 ]  (sparsity)
Mini-conclusion: L2 optimization is fast, L1 tends to be slower. If you have the computational resources, try both (at the same time)!
ALWAYS run logistic regression with at least some regularization.
Corollary: ALWAYS run logistic regression on features that have been standardized (zero mean, unit variance).
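A sketch of the recommendation above, using scikit-learn (an assumption): standardize the features, then fit both penalties and compare how many weights are exactly zero. X_train and y_train are placeholders, and the "liblinear" solver choice for L1 is also an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_std = StandardScaler().fit_transform(X_train)   # zero mean, unit variance per feature

for penalty in ("l2", "l1"):
    solver = "liblinear" if penalty == "l1" else "lbfgs"
    clf = LogisticRegression(penalty=penalty, C=1.0, solver=solver)
    clf.fit(X_std, y_train)
    print(penalty, "zero weights:", int(np.sum(clf.coef_ == 0)))   # L1 gives exact zeros (sparsity)
```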
So far…
Logistic regression: model, inference, learning via maximum likelihood, L1 and L2 regularization
Next…. SVMs!
Why did we use probability again?
Aim: maximize "accuracy"
Logistic regression: an indirect method that maximizes the likelihood of the data
A more direct approach is to maximize accuracy itself:
Support Vector Machines (SVMs)
Maximize the margin
[Figure: two classes separated by a linear boundary; the margin is the gap between the boundary and the closest points. © 2005-2007 Carlos Guestrin]
Geometry review
The line $2x_1 + x_2 - 2 = 0$ separates Y = 1 from Y = -1
For a point on the line, (0.5, 1): d = 2*0.5 + 1 - 2 = 0
Signed "distance" to the line from a point $(x_1^0, x_2^0)$: $d = 2x_1^0 + x_2^0 - 2$
Geometry review
The line $2x_1 + x_2 - 2 = 0$ separates Y = 1 from Y = -1
Point (1, 2.5): d = 2*1 + 2.5 - 2 = 2.5 > 0; with label y = 1, y(w.x + b) = 1*2.5 = 2.5 > γ
Geometry review
The line $2x_1 + x_2 - 2 = 0$ separates Y = 1 from Y = -1
Point (0.5, 0.5): d = 2*0.5 + 0.5 - 2 = -0.5 < 0; with label y = -1, y(w.x + b) = y*d = -1 * -0.5 = 0.5
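A tiny sketch reproducing the arithmetic from these three slides, with w = [2, 1] and b = -2 for the line 2x_1 + x_2 - 2 = 0:

```python
import numpy as np

w, b = np.array([2.0, 1.0]), -2.0                 # the line 2*x1 + 1*x2 - 2 = 0

examples = [((0.5, 1.0), +1),                     # on the line: d = 0
            ((1.0, 2.5), +1),                     # positive side
            ((0.5, 0.5), -1)]                     # negative side

for x, y in examples:
    d = float(w @ np.array(x) + b)                # signed "distance" w.x + b
    print(x, "d =", d, " y*(w.x + b) =", y * d)   # positive when the point is correctly classified
```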
Support Vector Machines: normalized margin and canonical hyperplanes
Support vectors are the points touching the margins.
[Figure: the separating hyperplane and its margins, with the margin-touching points x+ and x-. © 2005-2007 Carlos Guestrin]
$w \cdot x = \sum_j w^{(j)} x^{(j)}$
Slack variables
SVMs are made robust by adding "slack variables" that allow training error to be non-zero.
One slack variable ξ_i per data point; the slack variable is 0 for correctly classified points.
Maximize the margin: $\max \; \gamma - C \sum_i \xi_i$
[Figure © 2005-2007 Carlos Guestrin]
Slack variables
$\max \; \gamma - C \sum_i \xi_i$
Need to tune C: high C == minimize mis-classifications, low C == maximize the margin
SVM summary
Model: w.x + b > 0 if y = +1; w.x + b < 0 if y = -1
Inference: ŷ = sign(w.x + b)
Learning: maximize { (margin) - C (slack variables) }
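A sketch of a linear SVM with scikit-learn (an assumption); C is the slack-penalty knob from the objective above, and X_train, y_train, X_test are placeholders.

```python
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1.0)                     # C trades margin width against slack (mis-classifications)
svm.fit(X_train, y_train)                  # labels in {-1, +1} or {0, 1}
scores = svm.decision_function(X_test)     # the raw w.x + b values
preds = svm.predict(X_test)                # equivalent to sign(w.x + b)
```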
Next … Kernel SVMs
The kernel trick
Why a linear separator? What if the data looks like below?
The kernel trick allows you to use SVMs with non-linear separators.
Different kernels: 1. Polynomial  2. Gaussian  3. Exponential
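A sketch of kernelized SVMs via scikit-learn's SVC (an assumption); the polynomial degree and the Gaussian bandwidth (gamma, roughly an inverse bandwidth) are the tuning knobs discussed on the next slides, and X_train, y_train, X_test, y_test are placeholders.

```python
from sklearn.svm import SVC

poly_svm = SVC(kernel="poly", degree=2, C=1.0)    # polynomial kernel of degree 2
rbf_svm = SVC(kernel="rbf", gamma=0.5, C=1.0)     # Gaussian (RBF) kernel; gamma ~ inverse bandwidth

for clf in (poly_svm, rbf_svm):
    clf.fit(X_train, y_train)
    print(clf.kernel, clf.score(X_test, y_test))  # accuracy on held-out data
```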
[Figure: two panels, "Logistic" and "Linear SVM", fit to the same data]
Error ~ 40% in both cases
Kernel SVM with a polynomial kernel of degree 2: Error = 7%
Polynomial kernels of degree 2/4 do very well, but degree 3/5 do very badly.
The Gaussian kernel has a tuning parameter (bandwidth); performance depends on picking the right bandwidth.
SVMs summary
Maximize the margin between positive and negative examples.
The kernel trick is widely implemented, allowing a non-linear decision surface.
Not probabilistic.
Software: SVM-light http://svmlight.joachims.org/, LIBSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/, Weka, Matlab, R
Demo
http://www.cs.technion.ac.il/~rani/LocBoost
Which to use?
Linear SVMs and logistic regression work very similarly in most cases.
Kernelized SVMs (mostly) work better than linear SVMs.
Kernelized logistic regression is possible, but implementations are not easily available.
Recommendations
1. First, try logistic regression. Easy, fast, stable. No "tuning" parameters.
2. Equivalently, you can first try linear SVMs, but you need to tune "C".
3. If results are "good enough", stop.
4. Else try SVMs with Gaussian kernels. Need to tune the bandwidth and C, using validation data (sketched below).
If you have more time/computational resources, try random forests as well.
** Recommendations are opinions of the presenter, and not known facts.
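A sketch of step 4's tuning, using a cross-validated grid search over C and the Gaussian-kernel bandwidth with scikit-learn (an assumption; any loop over a validation set would do, and the grid values are illustrative).

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.01, 0.1, 1, 10]}       # gamma ~ inverse bandwidth of the Gaussian kernel

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X_train, y_train)                     # placeholders for your training data
print(search.best_params_, search.best_score_)
```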
In conclusion…
Logistic regression, Support Vector Machines
Other classification approaches: random forests / decision trees, Naïve Bayes, nearest neighbors, boosting (AdaBoost)
Thank you Questions?
Kriti Puniyani Carnegie Mellon University
[email protected]
Is this athlete doing drugs?
X = blood test to detect drugs, Y = is the athlete doped?
Two types of errors:
The athlete is doped, we predict "NO": a false negative
The athlete is NOT doped, we predict "YES": a false positive
Penalize false positives more than false negatives
Outline
What is classification? Parameters, data, inference, learning
Predicting coin tosses (0-dimensional X)
Logistic regression
  Predicting "speaker success" (1-dimensional X)
  Formulation, optimization
  Decision surface is linear
  Interpreting coefficients
  Hypothesis testing
  Evaluating the performance of the model
  Why is it called "regression": log-odds
  L2 regularization
  Patient survival (d-dimensional X)
  L1 regularization
Support Vector Machines
  Linear SVMs + formulation
  What are "support vectors"
  The kernel trick
Demo: logistic regression v/s SVMs v/s kernel tricks
Overfitting: a more serious problem
The same line can be written as 2x + y - 2 = 0, 4x + 2y - 4 = 0, or 400x + 200y - 400 = 0
i.e. w = [2 1 -2], w = [4 2 -4], w = [400 200 -400]
Absolutely need L2 regularization