Text Quantification

Fabrizio Sebastiani
Qatar Computing Research Institute, Qatar Foundation
PO Box 5825 – Doha, Qatar
E-mail: [email protected]
http://www.qcri.com/

RUSSIR 2015, St. Petersburg, RU – August 24–28, 2015
Download the most recent version of these slides at http://bit.ly/1PbcrBv

What is quantification?¹

[Figure from Dodds et al.¹]

1 Dodds, P., et al.: Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter. PLoS ONE 6(12), 2011.

What is quantification? (cont'd)

[Figure]

What is quantification? (cont'd)

- Classification: a ubiquitous enabling technology in data science
- In many applications of classification, the real goal is determining the relative frequency of each class in the unlabelled data; this task is called quantification
- E.g.:
  - Among the blog posts concerning the next presidential elections, what is the percentage of pro-Democrat posts?
  - Among the posts about the iPhone6 posted on forums, what is the percentage of "very positive" ones?
  - How do these percentages evolve over time?
- This task has applications in IR, ML, DM, and NLP, and has given rise to learning methods and evaluation measures specific to it
- We will mostly be interested in quantification from text

Outline

1. Introduction
2. Applications of quantification in IR, ML, DM, NLP
3. Evaluation of quantification algorithms
4. Supervised learning methods for quantification
5. Advanced topics
6. Resources + shared tasks
7. Conclusions

Outline

1. Introduction
   1.1 Distribution drift
   1.2 The "paradox of quantification"
   1.3 Historical development
   1.4 Related tasks
   1.5 Notation and terminology
2. Applications of quantification in IR, ML, DM, NLP
3. Evaluation of quantification algorithms
4. Supervised learning methods for quantification
5. Advanced topics
6. Resources + shared tasks
7. Conclusions

Classification: A Primer

- Classification (aka "categorization") is the task of assigning data items to groups ("classes") whose existence is known in advance
- Examples:
  1. Assigning newspaper articles to one or more of Home News, Politics, Economy, Lifestyles, Sports
  2. Assigning email messages to exactly one of Legitimate, Spam
  3. Assigning product reviews to exactly one of Disastrous, Poor, Average, Good, Excellent
  4. Assigning one or more classes from the ACM Classification Scheme to a computer science paper
  5. Assigning photographs to one of Still Life, Portrait, Landscape, Events
  6. Predicting tomorrow's weather as one of Sunny, Cloudy, Rainy

Classification: A Primer (cont'd)

- Classification is different from clustering, since in the latter case the groups (and sometimes their number) are not known in advance
- Classification requires subjective judgment: assigning natural numbers to either Prime or NonPrime is not classification
- Classification is thus prone to error; we may experimentally evaluate the error made by a classifier against a set of manually classified objects

Classification: A Primer (cont'd)

- (Automatic) classification is usually tackled via supervised machine learning: a general-purpose learning algorithm trains a classifier (using a set of manually classified items) to recognize the characteristics an item should have in order to be attributed to a given class
- The "learning" metaphor is advantageous, since
  - no domain knowledge is required to build a classifier (it is cheaper to manually classify some items for training than to encode domain knowledge by hand into the classifier)
  - it is easy to revise the classifier (a) if new training items become available, or (b) if new classes need to be considered
- Popular classes of supervised learning algorithms: SVMs, boosting, decision trees, k-NN, Naïve Bayes, genetic algorithms, neural networks, etc.

Introduction (cont'd)

- Quantification goes under different names in different fields:
  1. "prevalence estimation" (in statistics and epidemiology)
  2. "class prior estimation" (in machine learning)
  3. "quantification" (in data mining)
- "relative frequency" ≡ "prevalence" ≡ "a priori probability" ≡ "prior (probability)"
- Slight differences among 1., 2., and 3. are that
  - there are no training data and no classifiers in 1., while there are in 2. and 3.
  - the task is of independent interest in 1. and 3., while it is only ancillary (i.e., functional to generating better classifiers) in 2.

Introduction (cont'd)

- Quantification may also be defined as the task of approximating a true distribution by a predicted distribution

[Figure: a predicted class distribution plotted against the true class distribution]

Related tasks

- MLMC classification is usually reduced to binary by solving n independent binary classification problems
- Ordinal classification (aka "ordinal regression"): each item has exactly one class from C = (c1 ≺ ... ≺ cn), where ≺ is a total order and n > 2
- (Metric) regression: each item has a real-valued score from the range [α, β]
- For each such brand of classification we will be interested in its "quantification equivalent" (Q-equivalent), i.e., in solving and evaluating that classification task at the aggregate level

Notation and terminology (cont'd)

x                  vectorial representation of item x
C = {c1, ..., cn}  set of classes
pS(cj)             true prevalence (aka "prior probability") of cj in set S
p̂S(cj)             estimated prevalence of cj in set S
p̂SM(cj)            estimate p̂S(cj) obtained via method M
p(cj|x)            posterior probability of cj returned by the classifier
p(δj)              probability that the classifier attributes cj to a random item
pS(δj)             fraction of items in S labelled as cj by the classifier

Outline

1. Introduction
2. Applications of quantification in IR, ML, DM, NLP
   2.1 Dimensions of quantification
   2.2 Applications to the social / political sciences
   2.3 Applications to epidemiology
   2.4 Applications to market research
   2.5 Applications to resource allocation
   2.6 (Meta-)applications to classification
   2.7 Applications to word sense disambiguation
   2.8 Miscellaneous applications
3. Evaluation of quantification algorithms
4. Supervised learning methods for quantification
5. Advanced topics
6. Resources + shared tasks
7. Conclusions

Dimensions of quantification

- Text quantification, like text classification, may be performed across various dimensions (i.e., criteria):
  - by topic: applications to the social sciences, epidemiology, market research, resource allocation, word sense disambiguation
  - by sentiment ("sentiment classification"): applications to the social sciences, political sciences, market research, ...
  - by language ("language identification"): e.g., estimating language diversity
- Applications of quantification found in the literature may be distinguished into
  - those that apply methods especially designed for quantification
  - those that, unaware of the existence of specific methods for quantification, apply standard classification methods with "classify and count"
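The "classify and count" strategy can be sketched in a few lines of Python. This is a toy illustration, not code from the tutorial: the keyword-based classifier and the example documents are invented for the sake of the example.

```python
from collections import Counter

def classify_and_count(classifier, items, classes):
    """Naive quantification baseline: classify every item individually,
    then report the fraction of items assigned to each class."""
    counts = Counter(classifier(x) for x in items)
    return {c: counts[c] / len(items) for c in classes}

# Hypothetical rule-based sentiment classifier and toy documents
clf = lambda text: "pos" if "good" in text else "neg"
docs = ["good phone", "bad battery", "really good", "meh"]
print(classify_and_count(clf, docs, ["pos", "neg"]))
# {'pos': 0.5, 'neg': 0.5}
```

As later sections discuss, this baseline is biased whenever the classifier's error rates on the two classes differ, which is what motivates methods designed specifically for quantification.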

Applications to the social / political sciences

[Figures]

Applications to the social / political sciences (cont'd)

- Social science is a discipline in which individual cases hardly matter, and where the goal is obtaining quantitative indicators about a population (e.g., by age group, gender, ethnic group, geographical region, time interval, ...)

  "[Others] may be interested in finding the needle in the haystack, but social scientists are more commonly interested in characterizing the haystack." (Hopkins and King, 2010)

- Further applications include
  - predicting election results by estimating the prevalence of blog posts (or tweets) supporting a given candidate or party³
  - estimating the emotional responses of the population to a natural disaster (Mandel et al., 2012)
- Computational social science is the big new paradigm spurred by the availability of "big data" from social networks

3 Hopkins, D. J. and G. King: 2010, A Method of Automated Nonparametric Content Analysis for Social Science. American Journal of Political Science 54(1), 229–247.

Applications to epidemiology

- Epidemiology is concerned with tracking the incidence of diseases across spatio-temporal contexts and across other variables (e.g., gender, age group, religion, job type, ...)
- Text quantification: supporting epidemiological research by estimating the prevalence of clinical reports where a specific pathology is diagnosed⁴

4 Baccianella, S., A. Esuli, F. Sebastiani: 2013, Variable-Constraint Classification and Quantification of Radiology Reports under the ACR Index. Expert Systems and Applications 40(9), 3441–3449.

Applications to epidemiology (cont'd)

- Quantification: supporting epidemiology via verbal autopsies, i.e., tracking causes of death in populations where medical death certification is missing⁵
  - "verbal autopsy": estimating the cause of death from verbal symptom reports obtained from relatives (in the form of, e.g., binary answers to questions about symptoms)
  - a supervised learning task: training data (xi, yi) are death records from a nearby hospital in which both symptom reports obtained from caregivers (xi) and medical death certification (yi) are available
- Verbal autopsies are
  - cheaper and more effective than having physicians guess the cause of death from verbal reports
  - of crucial importance for international health policy-making and for channelling research efforts

5 King, G. and Y. Lu: 2008, Verbal Autopsy Methods with Multiple Causes of Death. Statistical Science 23(1), 78–91.

Applications to market research

- Survey coding is the task of classifying natural language responses ("open-ends", aka "verbatims") elicited via open-ended questions in questionnaires⁶
- Main applications:
  1. Market Research
  2. Customer/Employee Relationship Management (CRM/ERM)
  3. Social Science
  4. Political Science (opinion polls)

6 Esuli, A. and F. Sebastiani: 2010a, Machines that Learn how to Code Open-Ended Survey Data. International Journal of Market Research 52(6), 775–800.

Applications to market research (cont'd)

- Example 1 (CRM): "How satisfied are you with our mobile phone services?"
  Asked by: telecom company
  Class of interest: MayDefectToCompetition
  Goal: classification (at the individual level)

- Example 2 (MR): "What do you think of the recent ad for product X?"
  Asked by: MR agency
  Class of interest: LovedTheCampaign
  Goal: quantification (at the aggregate level)

Applications to resource allocation

- Customer support center: planning the amount of human resources to allocate to different types of issues
- Can be done by estimating the prevalence of customer calls related to a given issue⁷
- The same can be done for customer feedback obtained via email
- Important in order to improve customer support (determine priorities, track costs, plan product fixes / re-engineering): "rising problems can be identified before they become epidemics" (Forman, 2008)

7 Forman, G., E. Kirshenbaum, and J. Suermondt: Pragmatic text mining: Minimizing human effort to quantify many issues in call logs. KDD 2006, pp. 852–861.

Applications to word sense disambiguation

- Word Sense Disambiguation (WSD) is the task of determining, given an occurrence of a word w in a text, which sense of w is meant
- WSD is a text classification task, where
  - the linguistic context of the occurrence is the text
  - the different senses of the word are the classes
- Words have sense priors, i.e., different senses have different prevalences in language; WSD algorithms do exploit these priors
- The same word may have different priors in different domains; if a WSD algorithm has been trained on domain d1, applying it on domain d2 may yield suboptimal results
- Quantification may be used to estimate the word sense priors⁸ of the new domain d2, and use them to re-tune the classifier trained on domain d1 (a case of domain adaptation)

8 Chan, Y. S. and H. T. Ng: Estimating Class Priors in Domain Adaptation for Word Sense Disambiguation. ACL 2006, pp. 89–96.

(Meta-)applications to classification

- Accurate quantification may improve classification accuracy since, in the presence of distribution drift, classification accuracy may suffer
- E.g., in a Naïve Bayesian classifier

  p(cj|x) = p(x|cj) p(cj) / p(x)

  the posterior probabilities have been "calibrated" for Tr
- Probabilities are calibrated for a set S when

  pS(cj) = ES[cj] = (1/|S|) Σ_{x∈S} p(cj|x)

  which means that in the presence of distribution drift they cannot be calibrated for both Tr and Te
- By estimating class prevalence in Te we can adjust the classifier itself so as to yield better classification accuracy⁹

9 Saerens, M., P. Latinne, C. Decaestecker: 2002, Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation 14(1), 21–41.

(Meta-)applications to classification (cont'd)

- Posterior probabilities p(cj|x) can be re-calibrated as

  p(cj|x) = [ (p̂Te(cj) / pTr(cj)) · pTr(cj|x) ] / Σ_{cj∈C} [ (p̂Te(cj) / pTr(cj)) · pTr(cj|x) ]

  where the pTr(cj|x) are the posteriors before calibration
- Also investigated for semi-supervised learning (Xue and Weiss, KDD 2009)
- Here quantification is "ancillary" to classification, and not a goal in itself
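The re-calibration step above can be sketched directly from the formula: scale each training-set posterior by the ratio of estimated test prevalence to training prevalence, then renormalize. Function and variable names below are my own; the example assumes the estimated test prevalences are already available.

```python
def recalibrate(posteriors, p_tr, p_te_hat):
    """Re-calibrate posteriors p_Tr(c|x) for a new class distribution:
    scale each by p̂_Te(c) / p_Tr(c), then renormalize to sum to 1
    (the adjustment of Saerens et al., 2002)."""
    scaled = {c: p_te_hat[c] / p_tr[c] * posteriors[c] for c in posteriors}
    z = sum(scaled.values())
    return {c: v / z for c, v in scaled.items()}

# Posteriors from a classifier trained on a balanced set, applied to a
# test set where the positive class is estimated to be much rarer
post = {"pos": 0.8, "neg": 0.2}
print(recalibrate(post, {"pos": 0.5, "neg": 0.5}, {"pos": 0.1, "neg": 0.9}))
```

Note how the re-calibrated posterior for "pos" drops well below 0.8: the rarer the class is estimated to be in Te, the more evidence the classifier must supply before the item is attributed to it.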

Miscellaneous applications

- Ante litteram: in the late 1600s the Catholic Church tracked the proportion of printed texts which were non-religious
- Quantifying the proportion of damaged cells in biological samples, e.g., sperm for artificial insemination (González-Castro et al., 2013) or human tissues for oncology (Decaestecker et al., 1997)
- Measuring the prevalence of different types of pets' activity as detected by wearable devices (Weiss et al., 2013)
- Estimating the skeleton age distribution in paleodemography, the study of ancient human mortality, fertility, and migration (Hoppa and Vaupel, 2002)
- Real-time estimation of collective sentiment about TV shows from tweets (Amati et al., 2014)
- ...

Outline

1. Introduction
2. Applications of quantification in IR, ML, DM, NLP
3. Evaluation of quantification algorithms
   3.1 Evaluation measures for quantification
       - Measures for evaluating binary and SLMC quantification
       - Measures for evaluating ordinal quantification
       - Multi-objective measures
   3.2 Experimental protocols for evaluating quantification
4. Supervised learning methods for quantification
5. Advanced topics
6. Resources + shared tasks
7. Conclusions

Measures for evaluating binary + SLMC quantification

1. Absolute Error
2. Relative Absolute Error
3. Kullback-Leibler Divergence

Absolute Error

- Absolute Error (AE – sometimes called "Mean Absolute Error") is

  AE(p̂, p) = (1/|C|) Σ_{cj∈C} |p̂(cj) − p(cj)|    (1)

- Ranges between 0 (best) and 2(1 − min_{cj∈C} p(cj)) / |C| (worst)
- A normalized version of AE that always ranges between 0 (best) and 1 (worst) can thus be obtained as

  NAE(p̂, p) = Σ_{cj∈C} |p̂(cj) − p(cj)| / 2(1 − min_{cj∈C} p(cj))    (2)
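Eqs. (1) and (2) translate directly into code. A minimal sketch (the function names and the example prevalences are mine), representing a distribution as a dict mapping classes to prevalences:

```python
def absolute_error(p_hat, p):
    """AE, Eq. (1): mean absolute difference between estimated
    and true class prevalences."""
    return sum(abs(p_hat[c] - p[c]) for c in p) / len(p)

def normalized_ae(p_hat, p):
    """NAE, Eq. (2): AE rescaled by its maximum attainable value,
    so the result always lies in [0, 1]."""
    return sum(abs(p_hat[c] - p[c]) for c in p) / (2 * (1 - min(p.values())))

true_prev = {"pos": 0.7, "neg": 0.3}
pred_prev = {"pos": 0.6, "neg": 0.4}
print(absolute_error(pred_prev, true_prev))   # AE ≈ 0.1
print(normalized_ae(pred_prev, true_prev))    # NAE ≈ 0.143
```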

Absolute Error (cont'd)

- Pros:
  - enforces the notion that positive and negative bias are equally undesirable, and can thus be used as a general metric of Q-accuracy
  - intuitive, appealing to non-initiates too
- Cons:
  - does not relativize to true class prevalence: predicting p̂Te(cj) = 0.01 when pTe(cj) = 0.02 should be considered more serious than predicting p̂Te(cj) = 0.49 when pTe(cj) = 0.50

Relative Absolute Error

- Relative Absolute Error (RAE) is

  RAE(p̂, p) = (1/|C|) Σ_{cj∈C} |p̂(cj) − p(cj)| / p(cj)    (3)

- Ranges between 0 (best) and

  ( |C| − 1 + (1 − min_{cj∈C} p(cj)) / min_{cj∈C} p(cj) ) / |C|    (worst)

- A normalized version of RAE that always ranges between 0 (best) and 1 (worst) can thus be obtained as

  NRAE(p̂, p) = Σ_{cj∈C} (|p̂(cj) − p(cj)| / p(cj)) / ( |C| − 1 + (1 − min_{cj∈C} p(cj)) / min_{cj∈C} p(cj) )    (4)

Relative Absolute Error (cont'd)

- May be undefined due to the presence of zero denominators
- To solve this we can smooth p(cj) and p̂(cj) via additive smoothing; the smoothed version of p(cj) is

  ps(cj) = (ε + p(cj)) / (ε|C| + Σ_{cj∈C} p(cj))    (5)

  where ε = 1/(2|Te|) is often used as a smoothing factor
- Pros:
  - all of the above, plus: relativizes to true class prevalence
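RAE with the additive smoothing of Eq. (5) can be sketched as follows (function names and example values are mine; ε = 1/(2|Te|) as suggested above):

```python
def smooth(p, eps):
    """Additive smoothing, Eq. (5): p_s(c) = (eps + p(c)) / (eps*|C| + sum_c p(c)).
    Guarantees every smoothed prevalence is strictly positive."""
    z = eps * len(p) + sum(p.values())
    return {c: (eps + v) / z for c, v in p.items()}

def relative_absolute_error(p_hat, p, n_test):
    """RAE, Eq. (3), computed on smoothed distributions so that
    zero true prevalences do not make the measure undefined."""
    eps = 1.0 / (2 * n_test)          # commonly used smoothing factor
    p, p_hat = smooth(p, eps), smooth(p_hat, eps)
    return sum(abs(p_hat[c] - p[c]) / p[c] for c in p) / len(p)

true_prev = {"a": 0.0, "b": 1.0}      # p(a) = 0: unsmoothed RAE is undefined
pred_prev = {"a": 0.1, "b": 0.9}
print(relative_absolute_error(pred_prev, true_prev, n_test=100))
```

Note how the term for class "a" dominates the result: relative error heavily penalizes misestimating rare classes, which is exactly the behaviour AE lacks.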

Kullback-Leibler Divergence

- KLD (aka normalized cross-entropy)¹⁰ is

  KLD(p̂, p) = Σ_{cj∈C} p(cj) log( p(cj) / p̂(cj) )    (6)

- An information-theoretic measure of the inefficiency incurred when estimating a true distribution p over a set C of classes by means of a predicted distribution p̂
- Not symmetric
- Ranges between 0 (best) and +∞ (worst)

10 Forman, G.: Counting Positives Accurately Despite Inaccurate Classification. ECML 2005, pp. 564–575.

Kullback-Leibler Divergence (cont'd)

- A normalized version of KLD ranging between 0 (best) and 1 (worst) may be defined as

  NKLD(p̂, p) = (e^{KLD(p̂,p)} − 1) / e^{KLD(p̂,p)}    (7)

[Figure: curves of the normalization function for σ = 0.20, 0.42, 1.00, 2.00, 3.00]

Kullback-Leibler Divergence (cont'd)

- Pros:
  - all of the above, plus: well studied within information theory and language modelling
- Cons:
  - hardly intuitive, difficult to explain to non-initiates ...
  - undefined when p̂Te is 0 for at least one class; smoothing thus needed, with

    ps(cj) = (ε + p(cj)) / (ε|C| + Σ_{cj∈C} p(cj))
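Eqs. (6) and (7) can be sketched together (function names and the example prevalences are mine; smoothing, as discussed above, is assumed to have been applied already, so the guard on zero true prevalence only implements the convention 0·log 0 = 0):

```python
from math import exp, log

def kld(p_hat, p):
    """KLD, Eq. (6): sum of p(c) * log(p(c) / p̂(c)); 0 iff p̂ = p."""
    return sum(p[c] * log(p[c] / p_hat[c]) for c in p if p[c] > 0)

def nkld(p_hat, p):
    """NKLD, Eq. (7): squash KLD from [0, +inf) into [0, 1)
    via x -> (e^x - 1) / e^x."""
    k = kld(p_hat, p)
    return (exp(k) - 1) / exp(k)

true_prev = {"pos": 0.8, "neg": 0.2}
print(kld({"pos": 0.8, "neg": 0.2}, true_prev))   # 0.0 (perfect estimate)
print(nkld({"pos": 0.5, "neg": 0.5}, true_prev))  # > 0, < 1
```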

Kullback-Leibler Divergence (cont'd)

- KLD has somehow become the "standard" measure for binary, MLMC, and SLMC quantification
- KLD is a member of the family of "f-divergences"; other such members might be appropriate measures for evaluating quantification; e.g., the Hellinger distance¹¹

  HD(p̂, p) = ( Σ_{cj∈C} (p̂(cj)^{1/2} − p(cj)^{1/2})² )^{1/2}

11 González-Castro, V., R. Alaiz-Rodríguez, E. Alegre: Class distribution estimation based on the Hellinger distance. Information Sciences 218 (2013), 146–164.
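The Hellinger distance as written above is a one-liner (function name and example values are mine); unlike KLD it is symmetric and remains defined when a prevalence is zero:

```python
from math import sqrt

def hellinger(p_hat, p):
    """Hellinger distance: sqrt of the sum of squared differences
    between the square roots of corresponding prevalences."""
    return sqrt(sum((sqrt(p_hat[c]) - sqrt(p[c])) ** 2 for c in p))

print(hellinger({"pos": 0.5, "neg": 0.5}, {"pos": 0.5, "neg": 0.5}))  # 0.0
print(hellinger({"pos": 0.1, "neg": 0.9}, {"pos": 0.0, "neg": 1.0}))  # defined despite p(pos) = 0
```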

Measures for evaluating ordinal quantification

- Ordinal classification ≡ SLMC classification when there is a total order on the n classes
- Important in the social sciences, where ordinal scales are often used to elicit human evaluations (e.g., product reviews)
- The only known measure for ordinal quantification is the Earth Mover's Distance¹² (aka "Wasserstein metric")

  EMD(p̂, p) = Σ_{j=1}^{|C|−1} | Σ_{i=1}^{j} p̂(ci) − Σ_{i=1}^{j} p(ci) |    (8)

- The EMD may be seen as measuring the "minimum effort" to turn the predicted distribution into the true distribution, where the effort is measured by
  - the probability masses that need to be moved between one class and the other;
  - the "distance" travelled by these probability masses

12 Esuli, A. and F. Sebastiani: 2010, Sentiment quantification. IEEE Intelligent Systems 25(4), 72–75.
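Eq. (8) is a direct sum over the cumulative distributions. A minimal sketch (function name, class names, and prevalences are mine):

```python
def emd(p_hat, p, ordered_classes):
    """EMD for ordinal quantification, Eq. (8): sum over j = 1..|C|-1 of the
    absolute difference between the two cumulative distributions at class j."""
    total, cum_hat, cum_true = 0.0, 0.0, 0.0
    for c in ordered_classes[:-1]:   # the j = |C| term is omitted: both cumulatives equal 1
        cum_hat += p_hat[c]
        cum_true += p[c]
        total += abs(cum_hat - cum_true)
    return total

classes = ["Poor", "Average", "Good"]            # total order on the classes
true_prev = {"Poor": 0.2, "Average": 0.5, "Good": 0.3}
pred_prev = {"Poor": 0.3, "Average": 0.4, "Good": 0.3}
print(emd(pred_prev, true_prev, classes))        # ≈ 0.1
```

Note that mass misplaced by one rank (here, 0.1 moved from Average to Poor) costs less than the same mass misplaced across the whole scale, which is exactly the ordinal sensitivity the flat measures lack.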

Measures for evaluating ordinal quantification (cont'd)

[Figure: a predicted class distribution plotted against the true class distribution on an ordinal scale]