Language with character: A stratified corpus comparison of individual differences in e-mail communication

Jon Oberlander

Alastair J. Gill

School of Informatics, University of Edinburgh

February 21, 2005

Correspondent: Jon Oberlander
Email: [email protected]
Pmail: School of Informatics, University of Edinburgh, 2 Buccleuch Place, Edinburgh, EH8 9LW, Scotland
Phone: +44 131 650 4439
Fax: +44 131 650 4587
Running Head: Language with character
Keywords: Linguistics; psychology; communication; discourse; language production; corpus analysis; individual differences; personality; computer-mediated communication.

Abstract

The goal is to learn more about language production capacities by studying individual differences between adults. We focus on the links between personality traits and language use in e-mail communication. We previously gathered a corpus of e-mail messages written by individuals of known personality, as measured by Eysenck’s EPQ-R instrument. Here, it is argued that top-down content analysis techniques are not sufficient to reveal expected patterns in the data. We therefore use a bottom-up stratified corpus comparison to isolate linguistic features associated with different personality types, via both word and part-of-speech n-gram analysis. It is shown that Extraversion is associated with linguistic features involving fluency, positivity and implicitness, and Neuroticism with self-concern, negativity and implicitness. To explain these findings, the paper discusses a model which links affect to language production processes.

1 Introduction

Give two people a communication task—like e-mailing a friend about recent activities—and they are likely to accomplish it in different ways. Some differences depend on their recent experiences, or on what they think interests the recipient. Others might depend on their characteristic style, or personality. Our primary aim is to learn more about language production capacities, by using a comparative technique. Cognitive science has benefitted greatly from such techniques. For instance, we can compare: generalisation abilities in humans and primates; or linguistic competence in adults and children; or reading processes in languages with different scripts; or problem-solving in adults with varying working memory capacities. In each case, comparison between different types of individuals has helped to illuminate what is core to a cognitive capability, and what is variable. Comparative approaches have also helped show that ostensibly similar surface behaviours may be produced via very different underlying representations or processes. The specific comparative technique we recruit here is the study of individual differences between adults. There are, of course, many dimensions along which adults vary; for instance, in their work on memory and problem-solving, Carpenter et al. (1990) start from scores on Raven Progressive Matrices, from IQ testing. Here, we choose to investigate systematic differences in character, or personality. There are three reasons for this. First, within cognitive science, there is growing interest in the links between cognition and affect (Damasio, 1994). Emotions, such as happiness, can be short-lived, and hence hard to study. By contrast, personality traits can be interpreted as dispositions to experience particular emotions, and are sufficiently stable over time to be amenable to investigation (Matthews et al., 2003). 
Secondly, there is a body of work on the relations between personality and (spoken) language, and hence there are existing, testable hypotheses. For instance, it might seem that outgoing people speak more loudly than others (Scherer, 1979). Thirdly, a number of personality researchers have been moving towards cognitive models, which associate personality traits (such as trait-anxiety) with particular information-processing biases (Matthews et al., 2000). It is therefore timely for linguists and cognitive psychologists to consider whether traffic in the opposite direction might also prove productive. Naturally, if we do find systematic personality-based differences in language production, we will want to know how to fit them to existing cognitive models.

Beyond all this, a secondary aspect of the current study is that we will learn more about how people express themselves in e-mail. It is a ubiquitous means of written communication, and unlike most writing, it is widely regarded as having much of the spontaneity of speech (Bälter, 1998; Baron, 1998; Colley and Todd, 2002). As a relatively unplanned form of writing, and one in which conventions are quite fluid, e-mail is a genre in which we may expect to find real differences in how diverse individuals accomplish language production tasks.

This paper is structured as follows. We introduce trait theories of personality, summarise some previous findings on language and personality, and select hypotheses for further testing. Some of these previous results have been obtained using content-analysis techniques. We argue that three problems arise when applying such techniques to a corpus of e-mail data. A solution is to exploit bottom-up techniques from computational corpus linguistics. Regularities are uncovered, and we relate these to our hypotheses, suggesting how the results might be integrated into a broader picture of language production and communication.

2 Background

2.1 Theories of Personality

One view of personality sees it in terms of essential traits, or factors. There are two main trait models: Eysenck’s three-factor model (Eysenck and Eysenck, 1991; Eysenck et al., 1985), and the five-factor model (Digman, 1990; Costa and McCrae, 1992; Wiggins and Pincus, 1992; Goldberg, 1993). Each factor gives a continuous, orthogonal scale ranging from ‘low’ to ‘high’. In practice, there may be some relationship between traits, especially
for extreme scorers (cf. Eysenck, 1970; Buckingham et al., 2001; Matthews et al., 2003). The models differ in their theoretical roots. Eysenck’s is claimed to have a ‘biological basis’ (Eysenck, 1970; Eysenck and Eysenck, 1991) underpinning a trait’s validity—along with, for example, its cultural invariance. With the five-factor model, the ‘lexical hypothesis’ is used to derive factors which group statistically, and validity is demonstrated via further replication of these factors (McCrae and Costa, 1987, 1997; Funder, 2001; Matthews et al., 2003). Two core traits are shared by the main models: Extraversion (or Extraversion–Introversion) and Neuroticism (Emotionality–Stability) (Matthews et al., 2003; Lippa and Dietz, 2000). The three-factor model adds Psychoticism, while the five-factor model adds Openness, Agreeableness and Conscientiousness. Recent work suggests that questions like ‘how many traits?’ are not always productive: Larstone et al. (2002) note that ‘each instrument is an imperfect measure of personality that shares components of variance with the other while also tapping specific dimensions’ (cf. Depue and Collins, 1999). It is not our goal to choose between these theories (cf. Deary and Matthews, 1993). For convenience, we adopt the three-factor EPQ-R (Eysenck et al., 1985; Eysenck and Eysenck, 1991) as our measuring instrument. It has at least three advantages. First, fewer factors give a simpler initial model, with these three factors (E, N, P) regarded as having ‘considerable external validity’ (Kline, 1993, p304). Secondly, interpersonal measures are a viable alternative (Wiggins, 1979; Kiesler, 1983), but can be incorporated into more general models (McCrae and Costa, 1989; Trapnell and Wiggins, 1990). Finally, we prefer a theory that has potential for implementational grounding. Given this, we now outline the three traits. First, Eysenck related level of Extraversion to the degree of inhibition and excitation in the central nervous system. 
In their NEO-PI-R model, Costa and McCrae (1992) divide this personality dimension into six facets: Warmth, Gregariousness, Assertiveness, Activity, Excitement-Seeking, and Positive Emotions. Eysenck and Eysenck (1975) describe Extraversion thus:

The typical extravert is sociable, likes parties, has many friends, needs to have people to talk to . . . The typical introvert is a quiet, retiring sort of person, introspective, fond of books rather than people; he is reserved and distant except to intimate friends. [p9]

Secondly, Eysenck and Eysenck (1975) view level of Neuroticism as largely inherited, and suggest that it is related to the degree of lability of the autonomic nervous system. Costa and McCrae (1992) give the factor six facets: Anxiety, Angry Hostility, Depression, Self-Consciousness, Impulsiveness, and Vulnerability. Costa and McCrae (1984) showed that all the facets are related to psychological well-being: negative affect and lower life satisfaction. According to Eysenck and Eysenck (1975), the high Neuroticism scorer is:

an anxious, worrying individual, moody and frequently depressed. [. . . ] The stable individual, on the other hand, . . . is usually calm, even-tempered, controlled and unworried. [pp9–10]

Finally, Psychoticism was a late addition to the EPI model (Eysenck et al., 1985, cf. Eysenck and Eysenck, 1964, 1975). Although related to behavioural disorders, it is designed to measure individuals in a ‘normal’ population; to avoid evaluative connotations, it is sometimes termed ‘toughmindedness’. We do not discuss Psychoticism further in this paper.

2.2 Personality and Language

The majority of work to date has focussed on speech rather than writing, and the emphasis has been on Extraversion, and to a lesser extent Neuroticism. The reasons for this are as follows. Speech is ubiquitous, with many easily observable paralinguistic features, such as pronunciation, intonation or loudness. These can be seen to vary across individuals for a variety of reasons, including social or geographical differences. From a sociolinguistic perspective, spontaneous spoken language between close family or friends is often seen as providing the most revealing form of data (e.g., Labov, 1972; Chambers and Trudgill, 1980), and this is why many studies have focussed upon speech (see e.g., Scherer, 1979, for a review). In addition, Extraversion is the more salient and visible trait (Funder, 1995), compared to Neuroticism (Lippa and Dietz, 2000). Taking these points together, a focus on Extraversion in spoken language is understandable.

Based on the characterisations in Section 2.1, we might expect that Extraverts: think aloud; talk more; are less self-focussed; and tend to skip from topic to topic. And we would expect that Introverts: monopolise the conversation on topics important to them; are more self-focussed; and tend towards discussing one topic in depth (cf. Teiger and Barron-Teiger, 1998). In fact, Furnham (1990) considered the fundamental characteristics of Extraverts, and proposed the following properties of their language. It is less formal. It has a more restricted, rather than elaborated, code. It uses vocabulary more loosely, where this is defined in terms of how correctly words are used, and how unusual they are. It is more implicit (defined as showing a preference for pronouns, adverbs and verbs) as opposed to explicit (with a preference for nouns, modifiers and prepositions). Extravert speech will tend towards non-standard accents; speech rate will be higher; and there will be more dysfluencies. Heylighen and Dewaele (2002) argue that Extraverts produce implicit (or informal) language because it requires less effort, while relying more on the context for interpretation.

Given that our focus is on written e-mail, we touch briefly on speech features, before discussing grammar, lexical content, and dialogue behaviour. Reviews of various aspects of this work can be found in Scherer (1979), Furnham (1990), Smith (1992), Dewaele and Furnham (1999), and Pennebaker and King (1999).

2.2.1 Speech

Speech of American Extraverts is perceived to be louder and more nasal in voice (Scherer, 1978). For English as second language, we find that high Extravert speakers score lower for pronunciation (Busch, 1982). Extraverts have higher speech rates (Siegman, 1987), in both informal and formal settings (Dewaele, 1998; Dewaele and Furnham, 2000). Extraverts also show an inverse relationship with silence quotient (derived from silent pauses and speech rate, Siegman, 1978; but cf. Dewaele, 1998 for issues of silent pauses and measurement of speech rate). In more complex verbal tasks, Introverts’ pauses before speaking were significantly longer than Extraverts’ (Ramsay, 1968). Extravert children and teenagers showed greater verbal fluency for simple and complex recall tasks (Tapasak et al., 1979). Additionally it has been found that in formal situations Extraverts show less hesitation (‘er’), but also make a higher proportion of semantic errors (Dewaele and Furnham, 2000).

2.2.2 Grammar

Extravert speech shows higher counts of pronouns, adverbs, verbs and total number of words (taking ‘zestful’ to be a synonym for Extravert, cf. Furnham, 1990; Dewaele and Furnham, 1999). These characteristics of Extravert language are also found for non-native speakers. Using factor analysis of syntactic tokens from L2 speakers, Dewaele and Furnham (2000) confirm an Extravert preference for implicit language, and Introvert preference for explicit language. This finding holds in both informal and formal situations, and mirrors previous analyses of the individual linguistic categories (Dewaele, 2001). Additionally, Heylighen and Dewaele (2002) note that Introvert language features tend to be closely related to those of formal language. A further finding is that Extraverts demonstrate lower lexical richness in formal situations (controlling for length; Dewaele, 1993; Dewaele and Furnham, 2000). Cope (1969) also notes a lower lexical diversity (measured as type-token ratio), for Extravert native English speakers. However, this is less reliable, given that Extraverts also use a greater total number of words, and thus may be a length effect (cf. Dewaele, 1993; Dewaele and Furnham, 2000). It is worth noting that low type-token ratio is also related to language produced in anxiety-promoting situations (Howeler, 1972), and in the perception of greater anxiety (Bradac, 1990). Given that Neurotic individuals are considered to be prone to anxiety, they might also show lower lexical diversity. Since, with Extraverts, this is related to a preference for implicit or informal language, it could well be that Neurotics also have a preference for implicit language; however, this hypothesis has not been tested, to our knowledge.

2.2.3 Content

We focus on significant results which have been obtained using the Linguistic Inquiry and Word Count (LIWC; Pennebaker and Francis, 1999, see also the more recent LIWC2001; Pennebaker et al., 2001) text analysis program. LIWC is primarily concerned with lexical content. Although it counts some syntactic features, such as pronouns, and verbs of various tenses, these are not derived from a part-of-speech analysis of the data. We return to methodological issues concerning LIWC, and approaches like it, in section 4.2. Pennebaker and King (1999) applied LIWC analysis to texts written by authors for whom (five-factor) personality information was available. Via factor analysis, they derived a small set of linguistic factors grouping the LIWC features, and correlated them with scores on personality dimensions. There are some clear parallels between this factor-analytic method, and that adopted by Biber (see e.g., Biber, 1995). However, Biber used a broader set of linguistic features, and a dictionary derived from the Brown corpus, his goal being to analyse pre-existing corpora to locate factors associated with register variation across linguistic genres. Putting Pennebaker and King (1999)’s language factors to one side, and examining relationships between the personality dimensions and individual LIWC variables shows the following. High Extraverts use: more social process (like talk or friend) and positive emotion words (happy, good); and fewer negations (no, never), tentative words (maybe, perhaps), exclusives (but, without), inclusives (and, with), causation words (because, hence), negative emotion words (hate, worthless), and articles (a, the). High Neurotics use: more first person singular (I, my) and negative emotion words; and fewer positive emotion words, and articles. High Openness scorers use: more articles, longer words and insight words (think, know); and fewer first person singular, present tense verbs, and causation words. 
High Agreeableness scorers use: more first person singular and positive emotion words; and fewer articles and negative emotion words. High Conscientiousness scorers use: more positive emotion words; and fewer negations, discrepancies (should, would), negative emotion, causation, and exclusive words.

2.2.4 Dialogue

Extraverts show greater desire to communicate and initiate interactions (McCroskey and Richmond, 1990), and this is also found in computer-mediated communication (Yellen et al., 1995). Although overall amount of speaking or text produced by Extraverts is greater, studies of second language speakers have shown that the length of the longest utterances is actually shorter, especially in informal situations (Dewaele, 1995; Dewaele and Furnham, 2000). Additionally, Dewaele (2002b) finds that in L3 English production, Extraversion showed a strong negative relationship to communicative anxiety, whilst Neuroticism showed a positive relationship. Analysis of speech acts shows that Extraverts: initiate more individual and group laughter; use more self-referent statements; and talk more (Gifford and Hine, 1994). Other studies confirm that Extraverts use a greater total number of words (Campbell and Rushton, 1978; Carment et al., 1965). In a study of conversational dyads, coding of the speech acts found that Introverts used more hedges and problem talk, but that Extraverts expressed more pleasure talk, agreement, and compliments, with content focusing more on extracurricular activities. However, significant differences were not found between the groups for talk time or number of speech acts (Thorne, 1987).

3 Hypotheses

It is clear that there are plenty of predictions to be tested on a corpus of e-mail text, especially for Extraversion. In fact, there are so many that we need to find some way to collapse some of these together, to isolate a reasonably concise set of expectations which will help guide subsequent discussion. One point is that we require predictions for two factors. Another point is that we can exploit linguistic approximations to translate diverse predictions into a common format. For instance, there have been findings on the use of articles (like the and a or numbers); and there are others on the use of nouns. We expect that articles require nouns (although not vice versa). Hence, an expectation of more article use can be approximated by an expectation of more noun use. With reductions like these, we end up with the following ideas to test. They are framed in terms of what to expect at the High end of a given personality dimension; expectations for the Low end can be derived from these in the obvious way.

3.1 Extraversion Hypotheses

Fluency: Extraverts will write more words overall. A written reflex of spoken fluency will be that, rather than using full stops between clauses, they will use greater amounts of informal or non-standard punctuation (ellipsis, exclamation, hyphenation), together with conjunctions which help form longer constituents and clauses.[1]

Positivity: Extraverts will use more terms indicating positive affect (pleasure, agreement, compliments, ability), and fewer terms indicating negative affect. They will use fewer tentative expressions, such as hedges like possibly, and problem talk, including negations and causation words. A positive view of ability may also be reflected in verb-modification patterns.

Implicitness: Extravert language will contain: more adverbs, pronouns, and verbs; and fewer nouns, adjectives (modifiers) and prepositions. Personal pronouns are also expected because of the social tendency to refer to self and other people; verbs are also expected via reference to actions. Fewer nouns patterns with fewer articles.

3.2 Neuroticism Hypotheses

Self-concern: High Neurotics are worriers, and we expect this self-preoccupation to be expressed through a preference for first person singular pronouns over other second or third person, or plural pronouns. There should be more inclusive words, associated with a desire for attachment.

Negativity: High Neurotics’ language will use more terms associated with emotion, particularly negative affect. Conceivably, it might also generally use more terms for positive affect, but previous findings suggest there will be fewer positive affect terms. A negative view of ability may also be reflected in verb-modification patterns.

Implicitness: High Neurotic language will contain: more adverbs, pronouns, and verbs; and fewer nouns, adjectives (modifiers) and prepositions. In fact, more emotion in general is also associated with intensified language, such as adverbs and adjectives. So we do expect more adverbs, but the prediction of more adjectives would not fit the general implicitness hypothesis. Pronoun use will be greater, but differentiated as noted above. Fewer nouns patterns with fewer articles.

[1] But note that LIWC results suggest that Extraverts will use fewer inclusives and exclusives, and these categories overlap with both conjunctions and prepositions.

4 Problems with dictionary-based content analysis

In work reported elsewhere (Gill, 2003), we have analysed a corpus of anonymised, elicited e-mail data using dictionary-based techniques. The corpus is described in more detail in Section 5. The relevance of this is as follows: we used LIWC to attempt to replicate the findings of Pennebaker and King (1999), and we have used another dictionary, the MRC Psycholinguistic database (Wilson, 1987), to provide a different perspective. The dictionaries give a set of (potentially stemmed and lemmatised) forms, to allow occurrences in a corpus to be counted. However, we would argue that dictionary-based techniques such as the ones we employed have significant limitations. First, we note a key feature of our results; then we explore general problems associated with dictionaries.

4.1 Dictionary-based analysis of an e-mail corpus

After replicating Pennebaker and King (1999)’s factor analysis, we carried out multiple regression analyses using both the LIWC dictionary and the MRC dictionary. We summarise
here the results of the LIWC analysis, which attempts to abstract away from features of genre or topic. We found that both the topic-controlled LIWC dictionary and the MRC dictionary allowed us to explain a certain amount of variance in the data. Using this version of the LIWC dictionary, we could explain 8% of the variance in Extraversion score and 11% of Neuroticism score; the MRC dictionary helped explain 5% and 14% of the variances, respectively. The topic-controlled version of the LIWC follows Pennebaker and King (1999) in removing words associated with personal concerns. However, Pennebaker and King also control for genre, requiring that any linguistic variable to be included in the analysis has a minimum frequency of 1% in the corpus. If we do so, the variance in scores which is explained falls to: 0% for Extraversion and 11% for Neuroticism. In passing, we note that level of Neuroticism correlates positively with use of inclusive words and first person pronouns. So there is a problem here, and it is worst with Extraversion. The conservative LIWC analysis leaves no explanation of variance. This is striking, since the great majority of predictions concerning language and personality involve Extraversion, and its association with fluency, positivity and implicitness. There are two obvious possibilities. One is that the e-mail corpus simply does not possess the normal features associated with Extraversion. The other is that the dictionaries we are using are missing some of the relevant linguistic indicators. The former option does not seem right. As part of the dictionary-based study, we replicated the factor structure uncovered by Pennebaker and King, with some minor differences. So the e-mail genre is relatively similar to the range of texts previously studied. Thus, there may instead be problems in the application of the dictionary-based content-analysis techniques.

4.2 Three problems with dictionaries

A number of problems affect approaches which rely on counting words in a text which match against a pre-defined dictionary of words or stems. First, content-specificity relates to which words or stems are included in the dictionary. Secondly, coverage relates to how many words are included in the dictionary. Finally, context-insensitivity relates to how words are matched.

First, then, these content analysis approaches are ‘top-down’ methods, and are intrinsically limited by the items which are defined in the dictionaries. This is a particular issue for the LIWC analysis, because its dictionaries have been selected for their psychological relevance; the MRC database is designed to have a much broader coverage, extending across the majority of the vocabulary. In the case of the LIWC, with the exception of the Linguistic Dimensions, these dictionary categories may not generalise well across genre or topic. Pennebaker and King (1999) note that in some of their validation studies, ‘the types of words people used varied tremendously depending on the [. . . ] topic’ [p1300]. Similarly, in a replication of a study which linked therapeutic outcomes with LIWC analysis of diary entries, Stephenson et al. (1997) conclude ‘the particular linguistic correlates of progress vary from one treatment setting to another’ [p409]. Indeed, Mehl and Pennebaker (2003) have recently acknowledged that linguistic style (as measured by articles, prepositions, first person pronouns, present tense verbs, and positive and negative emotion words) showed greater consistency than content. In response to such content-specific limitations of ‘top-down’ content analysis, Campbell and Pennebaker (2003) adopt latent semantic analysis as a ‘bottom-up’ alternative method. Although this is a data-driven approach, it expresses its findings in terms of vector measures for the texts. This is therefore rather more opaque than, for example, multi-dimensional analysis, which does at least allow the examination of linguistic features which compose the factors.
Secondly, Ball (1994) notes that a problem for all top-down approaches is that of ‘recall’, which relates to the technique’s success in identifying and counting features. This is particularly relevant to LIWC, due to the relatively small size of its dictionaries: despite the inclusion of words and word-stems to broaden potential matches, there are only around 2,000 words, compared with the 40,000 of the MRC database. Furthermore, the simple pattern-matching technique used to identify input words with the dictionaries in both techniques depends upon the input texts being ‘cleaned’ or edited to specific guidelines. Any failure to clean results in the words not being recognised or counted. A corollary is that the incorporation of systematic non-standard features (such as words or spellings) in the analysis is precluded.

Finally, a further limitation of content analysis techniques is directly acknowledged by Pennebaker and King (1999) in relation to LIWC. It can identify which words are used, but not how they are used. Hazards include ‘context, irony, sarcasm, or [. . . ] multiple meanings of words’ [p1297]. Disambiguation of word senses is less of a problem for the MRC psycholinguistic analysis, since this uses part-of-speech information, but contextual information has still been ignored in these analyses.
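
The coverage problem can be illustrated with a minimal Python sketch of dictionary-based counting. It is not the LIWC or MRC implementation: the category names, word lists and sample text are hypothetical, and the trailing `*` used to mark a stem is an assumed convention chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical toy dictionary; a trailing '*' marks a stem (assumed convention).
TOY_DICTIONARY = {
    "positive_emotion": ["happy", "good", "enjoy*"],
    "negation": ["no", "never", "not"],
}


def matches(token, entry):
    """True if the token matches a dictionary entry (exact word or stem)."""
    if entry.endswith("*"):
        return token.startswith(entry[:-1])
    return token == entry


def category_rates(text, dictionary):
    """Return (rates, unmatched): share of tokens per category, and uncounted tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    unmatched = []
    for token in tokens:
        hit = False
        for category, entries in dictionary.items():
            if any(matches(token, entry) for entry in entries):
                counts[category] += 1
                hit = True
        if not hit:
            unmatched.append(token)
    total = len(tokens) or 1  # avoid division by zero on empty input
    rates = {category: counts[category] / total for category in dictionary}
    return rates, unmatched


# Invented sample text containing a deliberate non-standard spelling.
rates, unmatched = category_rates("had a goood week, really enjoyed it", TOY_DICTIONARY)
```

In this sketch the stem entry catches "enjoyed", but the expressive spelling "goood" falls outside the dictionary and is left uncounted, which is the kind of loss described above for texts that have not been cleaned.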

4.3 A solution

Therefore, in the next section, instead of top-down approaches, we follow Tribble (2000) in adopting data-driven techniques from computational corpus linguistics; specifically the analysis of n-grams. This has previously been put to a variety of uses. For our purposes, it is especially relevant that n-grams have been used to identify characteristic multi-word terms which distinguish specific types of texts (Damerau, 1993). Since n-gram analysis is a data-driven approach, it will allow us to identify features which are characteristic of different personality groups, irrespective of the specific content of the text. The problem of coverage is also solved, because all expressions are potentially relevant, not just those in a pre-defined dictionary. And the problem of context-insensitivity is at least partially alleviated, because in calculating the probability of groups of terms, or n-grams, occurring together, it captures some of the contextual information of language use. It thus provides us with potential insight into differences in language structuring and the use of formulaic language (Wray and Perkins, 2000).
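
As a rough illustration of what n-gram extraction involves, the following Python sketch counts n-grams over tokenised texts and converts the counts to relative frequencies. It is a simplification under stated assumptions, not the analysis reported in this paper: the function names and the two sample messages are invented, and no significance testing is attempted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequencies(texts, n):
    """Relative frequency of each n-gram across a list of tokenised texts."""
    counts = Counter()
    for tokens in texts:
        counts.update(ngrams(tokens, n))  # n-grams never span two texts
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical tokenised messages, invented for illustration only.
sample = [
    "i went to the pub".split(),
    "i went to the library".split(),
]
bigram_freqs = relative_frequencies(sample, 2)
```

Because the functions operate on arbitrary token sequences, the same sketch applies unchanged to sequences of part-of-speech tags, and nothing in it depends on a pre-defined word list.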

5 Corpus collection

5.1 Participants

105 current or recently graduated university students participated in this experiment, of whom 37 were male and 68 female. The mean age of subjects was 24.34, with 53 studying (or having studied) at an undergraduate level, and 52 at a postgraduate level. All participants were recruited via e-mail from the experimenter, and spoke English as their first language. A sociobiographical questionnaire and Eysenck Personality Questionnaire-Revised (short version) (Eysenck et al., 1985) were administered to give information about the subjects’ background and scores on the personality dimensions of Psychoticism (Mean score: 2.90, SD 1.7; Normative score: M = 3.08, F = 2.35), Extraversion (Mean score: 7.91, SD 3.3; Normative score: M = 6.36, F = 7.60), Neuroticism (Mean score: 5.51, SD 3.2; Normative score: M = 4.95, F = 5.90), and Lie Scale (Mean score: 3.48, SD 2.2; Normative score: M = 3.86, F = 2.71).

5.2 Materials

The experiment was conducted on-line via an HTML form which subjects filled in and then submitted over the internet. The web page had a simple design to minimise the chances of a subject ‘getting lost’. It first gave an introduction and an estimate of the time required to complete the form, along with contact details, and it indicated that all responses would be treated in confidence and suitably anonymised. The next part of the form was for the collection of sociobiographical and personality information, with the results just noted. The final part consisted of the two message writing tasks. Subjects were first instructed: ‘If during either of the following writing tasks, you are worried about writing anything too personal, simply substitute names of people and places as appropriate.’ The writing task was then completed using a large scrollable text box which subjects could type into, with the following instructions provided for the first writing task:

‘Imagine you haven’t seen a good friend for quite some time, and in order to keep them up to date with your news you decide to write them an e-mail. In the message you should write about what has happened to you, or what you have done in the past week, trying to remember and write down as much as possible, as quickly as possible. Your message should be written in normal English prose (that is, standard sentences, although don’t worry if your grammar is not perfect). Once you have started writing a sentence, you should complete it and not go back to alter or edit it. Also, don’t worry too much about spelling, and don’t bother addressing it to anyone or signing it. Just write down the main body of the text. You should spend 10 minutes on this task.’

The second writing task was very similar to the first, except that subjects were instructed to write about their plans for the week ahead. On final submission of the form, the subject was thanked, and the form processed to check for any missing obligatory information. On acceptance of the completed form, the subject was given the contact details of the experimenter, for any follow-up.

5.3 Preparation

Under these conditions, 105 subjects provided 2 e-mail texts each, giving around 65,000 words in total. Apart from anonymisation, pre-editing of these e-mail texts was kept to a minimum so as to retain as much individuality as possible (for example, non-standard words and spellings to imitate sounds). Such informal linguistic strategies, along with a relaxed attitude to typographical errors, are regarded as a feature of e-mail (Baron, 1998; Colley and Todd, 2002). However, a distinction was made between intentional non-standard spellings for communicative effect, and spelling errors. A basic spell-check was carried out (using the standard emacs spell-checker; Stallman, 1994) and the resulting texts were hand corrected to ensure unintentional spelling errors had been fixed. Copies of texts at each stage of editing were retained for reference, or future analysis if required (Sinclair, 1991).

6 Stratified Corpus Comparison

To analyse the prepared corpus, we use techniques from comparative corpus linguistics, and define a ‘reference corpus’ from authors with a personality profile which is not extreme on any of the measured dimensions. We can then compare authors from each end (‘High’ or ‘Low’) of each personality dimension with this ‘Neutral’ (or ‘Mid’) group. To control for individuals who may be extreme on more than one dimension, we also ensure that authors representative of the extreme groups are ‘Neutral’ on the other dimensions. The primary goal is to identify words (unigrams) or strings of words (n-grams) which form reliable collocations for one group, but not for another; these can then be considered distinctive collocations. A three-way stratified corpus comparison allows us to trace the behaviour of linguistic features along a dimension. By contrast, other studies have usually divided the data using binary categories, such as native/non-native, young/old, or higher/lower class language users (Milton, 1998; Granger and Rayson, 1998; Aarts and Granger, 1998; Rayson and Hodges, 1997).
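The stratification logic described here can be sketched in a few lines. This is a minimal illustration, not the authors' script: the function name `assign_group`, the dictionary of standardised scores, the example authors, and the treatment of a score exactly at the cut-off as non-extreme are all assumptions made for the example. The cut-off of one standard deviation is the one given in the Procedure section.

```python
from typing import Dict, Optional, Tuple


def assign_group(z_scores: Dict[str, float],
                 cutoff: float = 1.0) -> Optional[Tuple[str, str]]:
    """Assign an author to a stratum from standardised trait scores.

    z_scores maps a dimension label (e.g. "E", "N", "P") to the author's
    score expressed in standard deviations from the sample mean.
    Returns ("Mid", "all") for the reference corpus, ("High", dim) or
    ("Low", dim) for an author extreme on exactly one dimension, and
    None for an author extreme on more than one dimension (excluded).
    """
    extreme = {dim: z for dim, z in z_scores.items() if abs(z) > cutoff}
    if not extreme:
        return ("Mid", "all")
    if len(extreme) == 1:
        dim, z = next(iter(extreme.items()))
        return ("High" if z > 0 else "Low", dim)
    return None


# Hypothetical authors, for illustration only.
authors = {
    "a01": {"E": 1.4, "N": 0.2, "P": -0.3},   # extreme on E only
    "a02": {"E": 0.1, "N": -1.6, "P": 0.5},   # extreme on N only
    "a03": {"E": 0.3, "N": 0.4, "P": -0.2},   # neutral throughout
    "a04": {"E": 1.2, "N": 1.5, "P": 0.0},    # extreme on two dimensions
}
groups = {name: assign_group(z) for name, z in authors.items()}
```

Excluding authors who are extreme on more than one dimension is what keeps each High or Low sub-corpus comparable with the Mid reference group on the remaining dimensions.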

6.1 Method

6.1.1 Procedure

The full e-mail corpus of texts was stratified into sub-corpora as follows. High and Low personality group samples were created by selecting authors scoring more than 1 standard deviation above or below the mean EPQ-R score for each dimension. Authors had to be within 1 standard deviation on the dimensions other than the one for which they were extremely high or low. Furthermore, all texts by authors who were within 1 standard deviation across all personality dimensions were assigned to the personality-neutral Mid sub-corpus.

The resulting sizes of the sub-corpora are as follows. There are around 6,000 words (11 authors) for the High Extraverts, and over 2,000 words (4 authors) for the Low Extraverts. There are just over 3,000 words (6 authors) for the High Neurotics and around 6,000 words (9 authors) for the Low Neurotics. The Mid group contains over 9,000 words (23 authors). Stratification thus leaves us with just over 50% of the subjects and words we started with, but there are relatively few subjects in some sub-groups. This would be a concern if the results were to show that these individuals were unrepresentative; that is, if linguistic behaviour on a dimension deviated greatly from that reported in Section 2.2 and predicted in Section 3. As we shall shortly see, this problem does not arise; we will briefly revisit the topic in Section 8.

6.1.2 Analysis

First, we use a version of the corpus which has been tokenised using the CLAWS tagger (available via the Wmatrix tool; Rayson, 2003), and lemmatised. The tokenisation splits multi-word units into their constituent parts (for example, can’t is divided into ca and n’t), and also provides some basic annotation, for example marking clause boundaries (represented as ), and ellipsis (). By lemmatising (or stemming), minor variants of words can be collapsed together, increasing the power of the analysis. In such a processed corpus, words such as play, plays, played, or playing are all realised in the base form of the verb, play; punctuation markers (like and ) are collapsed into hpi. More importantly, in our data there are instances of proper nouns, for example names of places (Edinburgh), names of people (Dave), or days of the week (Saturday). These provide too much specificity to allow broader patterns of language usage to emerge, or for the results to be easily generalised. Therefore a further script was used to collapse proper names into np1, except for names of days, which were collapsed into npd1.

Secondly, to identify robust collocations in the tagged sub-corpora, we calculate 1–5 word n-grams. We do not use a rank or frequency cut-off during calculation, but only present features with a frequency ≥5. This enables an accurate log-likelihood statistic (G²) of their occurrence between groups to be calculated (cf. Rayson, 2003). We use n-gram software (Banerjee and Pedersen, 2003) to compute G² for 2- and 3-grams.

Finally, to identify those robust collocations which distinguish one group from another, we need to make a three-way comparison of the linguistic features across the High-Mid-Low corpora for each group. We calculate the relationships between the three groups, and for each feature in each corpus we identify its frequency and relative frequency, and then where relevant the relative-frequency ratios and log-likelihood between High-Low, High-Mid and Low-Mid groups. This allows us to compare the relative usage and statistical significance of the difference in the use of features between groups.
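To make the comparison statistics concrete, the sketch below counts n-grams and computes the log-likelihood statistic and relative-frequency ratio for one feature across two sub-corpora. It is an illustration, not the software used in the study: the function names are hypothetical, and the G² formula is the standard two-corpus form (observed counts against expected counts proportional to corpus size), which is assumed here. The worked figures take the past-participle (VPP) counts for the Low and Mid Extravert groups from Table 3, with corpus sizes assumed to be the column totals of that table.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every 1..max_n-gram (as a tuple of tokens) in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def log_likelihood(a, b, c, d):
    """G2 for a feature seen a times in a corpus of c tokens
    and b times in a corpus of d tokens."""
    total = a + b
    expected_a = c * total / (c + d)
    expected_b = d * total / (c + d)
    g2 = 0.0
    for observed, expected in ((a, expected_a), (b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def rf_ratio(a, b, c, d):
    """Ratio of relative frequencies; None where either count is zero
    (shown as '-' in the tables)."""
    if a == 0 or b == 0:
        return None
    return (a / c) / (b / d)


# Worked example: VPP in the Low (66 of 2538) and Mid (202 of 10917)
# Extravert groups; sizes are the assumed column totals of Table 3.
g2_vpp = log_likelihood(66, 202, 2538, 10917)   # close to the 5.43 in Table 3
ratio_vpp = rf_ratio(66, 202, 2538, 10917)      # close to the 1.41 in Table 3

# Toy token list, for illustration only.
toy_counts = ngram_counts(["i", "play", "hpi", "i", "play"], max_n=2)
```

Because a zero count contributes nothing to the sum, a feature absent from one sub-corpus still receives a finite G², which is why features with a Mid frequency of 0 can appear with high values in the tables.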

6.2 Results

Here we present the results from the three-way stratified analysis in tabular form, with features ordered by log-likelihood (G²) value. Since we only examine expected frequencies of 5 or more (which compare more reliably with the χ² distribution), we here include results with a critical value of 10.83 or greater, taking this to be equivalent to reaching p ≤ 0.001 significance; those results with a critical value of 15.13 or greater are taken to be equivalent to reaching p ≤ 0.0001 significance (cf. Rayson, 2003). Note that if a feature is overused by the Mid group, we do not report the G² for this, and in cases where the relative-frequency ratio or G² is not available, we replace this by ‘-’. The results are found in Tables 1, 2, and ??. In this presentation, we draw attention to features which are characteristic of the High or Low groups, compared with the usage of the feature more generally. In the tables, we distinguish whether a feature is under- or over-used by one of the three groups (High, Mid or Low), relative to the two other groups; this information is given in the final three columns of each table, with over-use indicated by + and under-use by −.

However, a more concise view of the results can be gained in the following way. At least two kinds of features can be associated with (say) High Neuroticism: n-grams which are over-used by High Neurotics; and n-grams which are under-used by Low Neurotics. Thus, Fig. 1 lists, for each dimension and each sub-group, the features which are associated with that group either via their over-use of the feature, or an opposite group’s under-use. However, it will be noted that for the Mid groups on each dimension, we distinguish between unlabelled n-grams (these are the over-used cases), and others labelled specifically as under-use. The reason for this is that the under-used cases could be reallocated to both High and Low groups (because they over-use the n-grams relative to the Mid group); but this would lead to an n-gram being listed under more than one sub-group on a given dimension, and this would make what is already a complex picture even less clear.

We can see that there are reasonable numbers of distinctive collocations. As a short-hand, we will refer to those reaching the most conservative 15.13 critical level (p ≤ 0.0001 significance) as ‘level-1’ collocations, and those reaching between 15.13 and 10.83 (0.0001 < p ≤ 0.001) as ‘level-2’ collocations. There are 15 level-1 collocations for Extraversion; the even larger number of 22 for Neuroticism is, however, somewhat inflated by several n-grams which represent repeated punctuation (most of which corresponds to multiple exclamation marks). Fig. 1 gives a broad view of the level-1 and -2 distinctive collocations. In describing them further, we focus primarily on level-1 collocations, and only refer to level-2 explicitly from time to time.

Several types of distinctive collocation can be identified. These involve: punctuation (as just noted); time expressions; proper nouns; personal pronouns; conjunctions (and adverbs); and verbs and auxiliaries, which can be associated with ability. Obviously, a particular collocation can be a member of more than one type. Equally, a collocation may be distinctive on more than one dimension.
A notable example of this involves the deictic expressions this and that. [that be] is a collocation preferred by Low Extraverts and dispreferred by Mid Neurotics. But [this be] is only relevant to the Psychoticism dimension, where it is preferred by High Psychotics. Let us consider the two personality dimensions in turn, in terms of the six collocation types.
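The level-1 and level-2 short-hand, and the star notation used in the tables, amount to thresholds on G². The sketch below is illustrative only: the function names are hypothetical, the 10.83 and 15.13 critical values are those quoted in the text, and 3.84 and 6.63 are the standard χ² critical values (df = 1) for p < .05 and p < .01, which the text does not state explicitly.

```python
def significance_stars(g2):
    """Map a G2 value (df = 1) to the star notation used in the tables."""
    thresholds = [(15.13, "****"), (10.83, "***"), (6.63, "**"), (3.84, "*")]
    for critical, stars in thresholds:
        if g2 >= critical:
            return stars
    return ""


def collocation_level(g2):
    """Return 1 for level-1, 2 for level-2, None if below the 10.83 cut-off."""
    if g2 >= 15.13:
        return 1
    if g2 >= 10.83:
        return 2
    return None


# Values taken from Table 1, for illustration.
examples = {g2: (significance_stars(g2), collocation_level(g2))
            for g2 in (35.63, 13.47, 10.04, 5.84, 3.81)}
```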

Rank  Feature             High   High     Mid    Mid      Low    Low      High-Mid  Low-Mid   High-Low  High-Mid     Low-Mid      High-Low
                          Freq.  R.Freq.  Freq.  R.Freq.  Freq.  R.Freq.  R.F.Ratio R.F.Ratio R.F.Ratio G²           G²           G²
1     play                3      0.0004   2      0.0002   14     0.0052   2.42      30.31     0.08      0.97         35.63****    22.53****
2     get a               15     0.0021   0      0        5      0.0019   -         -         1.12      28.86****    16.74****    0.05
3     be so               14     0.0020   0      0        0      0        -         -         -         26.93****    -            8.88**
4     i play              0      0        0      0        7      0.0026   -         -         -         -            23.43****    18.24****
5     christmas hpi       10     0.0014   0      0        3      0.0011   -         -         1.24      19.24****    10.04**      0.11
5     year hpi            10     0.0014   0      0        0      0        -         -         -         19.24****    -            6.34*
6     week hpi            0      0        42     0.0036   7      0.0026   -         0.72      -         -            0.69         18.24****
7     i will              28     0.0039   35     0.0030   0      0        1.29      -         -         1.02         -            17.76****
8     hpi take            9      0.0013   0      0        0      0        -         -         -         17.31****    -            5.71*
8     with i              9      0.0013   0      0        0      0        -         -         -         17.31****    -            5.71*
9     be supposed         0      0        0      0        5      0.0019   -         -         -         -            16.74****    13.03***
9     that be             0      0        0      0        5      0.0019   -         -         -         -            16.74****    13.03***
10    bread               0      0        3      0.0003   6      0.0022   -         8.66      -         -            9.87**       15.63****
11    np1 for             8      0.0011   0      0        0      0        -         -         -         15.39****    -            5.07*
11    i really            8      0.0011   0      0        0      0        -         -         -         15.39****    -            5.07*
12    then i              7      0.0010   0      0        3      0.0011   -         -         0.87      13.47***     10.04**      0.04
12    day hpi             7      0.0010   0      0        0      0        -         -         -         13.47***     -            4.44*
12    will have           7      0.0010   0      0        0      0        -         -         -         13.47***     -            4.44*
13    np1 and             21     0.0029   22     0.0019   0      0        1.54      -         -         2.00         -            13.32***
14    be supposed to      0      0        1      0.0001   5      0.0019   -         21.66     -         -            11.75***     13.03***
14    be supposed to be   0      0        1      0.0001   5      0.0019   -         21.67     -         -            11.75***     13.03***
14    supposed to be      0      0        1      0.0001   5      0.0019   -         21.66     -         -            11.75***     13.03***
14    supposed            0      0        1      0.0001   5      0.0019   -         21.65     -         -            11.74***     13.03***
14    supposed to         0      0        1      0.0001   5      0.0019   -         21.66     -         -            11.74***     13.03***
14    fairly              0      0        4      0.0003   5      0.0019   -         5.41      -         -            6.03*        13.03***
14    hpi although        0      0        7      0.0006   5      0.0019   -         3.09      -         -            3.34         13.03***
14    that i              0      0        27     0.0023   5      0.0019   -         0.80      -         -            0.22         13.03***
15    and i               20     0.0028   44     0.0038   0      0        0.73      -         -         1.35         -            12.69***
16    and np1             19     0.0027   13     0.0011   0      0        2.36      -         -         5.84*        -            12.05***
17    take                25     0.0035   13     0.0011   7      0.0026   3.11      2.33      1.33      11.79***     2.93         0.48
18    cool hpi            6      0.0008   0      0        0      0        -         -         -         11.54***     -            3.81
18    from the            6      0.0008   0      0        0      0        -         -         -         11.54***     -            3.81
18    of it               6      0.0008   0      0        0      0        -         -         -         11.54***     -            3.81
18    today hpi           6      0.0008   0      0        0      0        -         -         -         11.54***     -            3.81
18    what i              6      0.0008   0      0        0      0        -         -         -         11.54***     -            3.81

[High Use, Mid Use and Low Use columns (+ over-use, − under-use) are not recoverable from the source layout; Fig. 1 lists the group each feature is associated with.]

Note. * p < .05, ** p < .01, *** p < .001, **** p < .0001, df = 1.

Table 1: Lemmatised n-gram analysis, Extraversion.

Rank  Feature             High   High     Mid    Mid      Low    Low      High-Mid  Low-Mid   High-Low  High-Mid     Low-Mid      High-Low
                          Freq.  R.Freq.  Freq.  R.Freq.  Freq.  R.Freq.  R.F.Ratio R.F.Ratio R.F.Ratio G²           G²           G²
1     hpi it              0      0        45     0.0039   38     0.0053   -         1.37      -         -            2.00         34.37****
2     hpihpihpihpihpi     16     0.0039   2      0.0002   0      0        22.67     -         -         31.66****    -            32.37****
3     np1 hpi             16     0.0039   57     0.0049   0      0        0.80      -         -         0.68         -            32.36****
4     hpihpihpihpi        21     0.0051   10     0.0009   2      0.0003   5.95      0.32      18.37     23.50****    2.65         30.70****
5     hpi as              0      0        0      0        14     0.0020   -         -         -         -            26.97****    12.66***
6     hpihpihpi           43     0.0105   56     0.0048   23     0.0032   2.18      0.67      3.27      13.89***     2.85         22.43****
7     film                11     0.0027   2      0.0002   0      0        15.58     -         -         19.60****    -            22.25****
8     i go                8      0.0020   0      0        8      0.0011   -         -         1.75      21.50****    15.41****    1.23
9     that be             7      0.0017   0      0        11     0.0015   -         -         1.11      18.81****    21.19****    0.05
9     hpi he              0      0        0      0        11     0.0015   -         -         -         -            21.19****    9.95**
10    hpi well            21     0.0051   12     0.0010   10     0.0014   4.96      1.35      3.67      20.43****    0.48         12.53***
11    will be             0      0        39     0.0034   21     0.0029   -         0.87      -         -            0.26         18.99****
11    have be             0      0        37     0.0032   21     0.0029   -         0.92      -         -            0.10         18.99****
12    all the             9      0.0022   9      0.0008   0      0        2.83      -         -         4.67*        -            18.20****
13    hpi so              0      0        34     0.0029   20     0.0028   -         0.95      -         -            0.03         18.09****
14    be in               0      0        0      0        9      0.0013   -         -         -         -            17.34****    8.14**
15    hpihpi well         12     0.0029   4      0.0003   6      0.0008   8.50      2.43      3.50      16.67****    1.94         6.78**
16    film be             6      0.0015   0      0        0      0        -         -         -         16.12****    -            12.13***
16    the film            6      0.0015   0      0        0      0        -         -         -         16.12****    -            12.13***
16    well i              6      0.0015   0      0        0      0        -         -         -         16.12****    -            12.13***
17    year hpi            0      0        0      0        8      0.0011   -         -         -         -            15.41****    7.24**
18    to do               0      0        11     0.0009   17     0.0024   -         2.50      -         -            5.80*        15.37****
19    to the              0      0        24     0.0021   16     0.0022   -         1.08      -         -            0.06         14.47***
20    though hpi          7      0.0017   12     0.0010   0      0        1.65      -         -         1.06         -            14.16***
20    to np1              7      0.0017   25     0.0022   0      0        0.79      -         -         0.31         -            14.16***
21    we                  18     0.0044   119    0.0103   47     0.0066   0.43      0.64      0.67      13.74***     7.13**       2.21
22    still               0      0        30     0.0026   15     0.0021   -         0.81      -         -            0.45         13.57***
23    about it            0      0        0      0        7      0.0010   -         -         -         -            13.48***     6.33*
23    it do               0      0        0      0        7      0.0010   -         -         -         -            13.48***     6.33*
23    rowing              0      0        0      0        7      0.0010   -         -         -         -            13.48***     6.33*
24    and she             5      0.0012   0      0        0      0        -         -         -         13.44***     -            10.11**
24    the film be         5      0.0012   0      0        0      0        -         -         -         13.44***     -            10.11**
24    the time            5      0.0012   0      0        0      0        -         -         -         13.44***     -            10.11**
24    experiment          5      0.0012   0      0        2      0.0003   -         -         4.37      13.44***     3.85*        3.54
25    hpi which           0      0        15     0.0013   14     0.0020   -         1.51      -         -            1.22         12.66***
25    have not            0      0        32     0.0028   14     0.0020   -         0.71      -         -            1.20         12.66***
25    np1 and             0      0        22     0.0019   14     0.0020   -         1.03      -         -            0.01         12.66***
26    stuff               3      0.0007   3      0.0003   13     0.0018   2.83      7.02      0.40      1.56         12.48***     2.38
27    hpihpi we           4      0.0010   34     0.0029   5      0.0007   0.33      0.24      1.40      5.73*        12.45***     0.25
28    i ca                6      0.0015   10     0.0009   0      0        1.70      -         -         1.00         -            12.13***
29    of time             3      0.0007   0      0        6      0.0008   -         -         0.87      8.06**       11.56***     0.04
29    get a               0      0        0      0        6      0.0008   -         -         -         -            11.56***     5.43*
29    go on               0      0        0      0        6      0.0008   -         -         -         -            11.56***     5.43*
29    party hpi           0      0        0      0        6      0.0008   -         -         -         -            11.56***     5.43*
29    stuff hpi           0      0        0      0        6      0.0008   -         -         -         -            11.56***     5.43*
30    have to             21     0.0051   30     0.0026   11     0.0015   1.98      0.59      3.34      5.47*        2.35         11.24***
31    thesis              6      0.0015   1      0.0001   3      0.0004   16.99     4.86      3.50      10.99***     2.24         3.39
32    hpi the             0      0        16     0.0014   12     0.0017   -         1.21      -         -            0.26         10.85***
32    he be               0      0        22     0.0019   12     0.0017   -         0.88      -         -            0.12         10.85***
32    well hpi            0      0        18     0.0016   12     0.0017   -         1.08      -         -            0.04         10.85***

[High Use, Mid Use and Low Use columns (+ over-use, − under-use) are not recoverable from the source layout; Fig. 1 lists the group each feature is associated with.]

Note. * p < .05, ** p < .01, *** p < .001, **** p < .0001, df = 1.

Table 2: Lemmatised n-gram analysis, Neuroticism.


High Extraverts
(1) [be so] [year hpi] [i will] [hpi take] [with i] [np1 for] [i really]
(2) [day hpi] [will have] [np1 and] [and i] [and np1] [take] [cool hpi] [from the] [of it] [today hpi] [what i]
Mid Extraverts
(1) Underuse: [get a] [christmas hpi]
(2) Underuse: [then i]
Low Extraverts
(1) [play] [i play] [week hpi] [be supposed] [that be] [bread]
(2) [be supposed to] [be supposed to be] [supposed to be] [supposed] [supposed to] [fairly] [hpi although] [that i]
High Neurotics
(1) [hpihpihpihpihpi] [np1 hpi] [hpihpihpihpi] [hpihpihpi] [film] [hpi well] [all the] [hpihpi well] [film be] [the film] [well i]
(2) [though hpi] [to np1] [and she] [the film be] [the time] [i ca] [have to] [thesis]
Mid Neurotics
(1) Underuse: [i go] [that be]
(2) [we] [hpihpi we] Underuse: [experiment] [of time]
Low Neurotics
(1) [hpi it] [hpi as] [hpi he] [will be] [have be] [hpi so] [be in] [year hpi] [to do]
(2) [to the] [still] [about it] [it do] [rowing] [hpi which] [have not] [np1 and] [stuff] [get a] [go on] [party hpi] [stuff hpi] [hpi the] [he be] [well hpi]

Figure 1: Summary of tokenised, lemmatised analysis. N-grams reaching: (1) the 15.13 critical level (p ≤ 0.0001); (2) between 15.13 and 10.83 (0.0001 < p ≤ 0.001).

6.2.1 Extraversion

The Extraversion dimension, on this lemmatised analysis, does not involve distinctive multiple punctuation. However, it is worth noting in passing that analysis of a tokenised but unlemmatised version of the corpus does reveal that over-use of ellipsis (multiple periods) is associated with High Extraversion. To the extent that punctuation is involved in distinctive collocations, there is an association with time expressions: High Extraverts (High-E) are associated with [year hpi], while Low Extraverts (Low-E) are associated with [week hpi].

Proper nouns pattern differently for High-E and Low-E. np1 covers singular proper nouns, and we find [np1 for] (and also the level-2 [np1 and] and [and np1]) for the High-E, and no NP collocations for the Low-E. Patterns of proper nouns associating with conjunction also connect with distinctive patterns of personal pronoun use. High-E are associated with [with i], [i really], and [i will]; for Low-E, the only level-1 collocation for a personal pronoun is [i play]. [i will] connects to patterns of verbs and auxiliaries. High-E are also associated with patterns involving [take], whereas Low-E are associated with collocations involving [be supposed] and [play].

Turning finally to expressions that can help form longer constituents and clauses, there are the conjunctions (or inclusives) involving NPs and personal pronouns already noted for the High-E. At level-2, the Mid group under-uses [then i], which means that both High-E and Low-E use it more than the Mid group does. Also at level-2, [hpi although], which LIWC would classify as an exclusive, is associated with the Low-E, along with another expression associated with tentativeness: [fairly].

6.2.2 Neuroticism

The same set of collocation types is useful for discussing Neuroticism. First, multiple punctuation is a feature of High Neuroticism, including sequences of 5, 4 or 3 punctuation marks; closer inspection reveals that these are exclamation marks, rather than ellipses. There are also two collocations involving double punctuation: [hpihpi well] for High Neurotics (High-N), and [hpihpi we] for the Mid group; we will return to these shortly. Considering collocations involving only single punctuation marks, it is the Low Neurotics (Low-N) who show particularly distinctive behaviour: several collocations involve a mark followed by a word: [hpi it], [hpi as], [hpi he], and [hpi so]. Both as and so can help connect clauses together. The Low-N also have a preference for collocations involving a word followed by a punctuation mark: for instance, the level-1 [year hpi] involves a time expression, and is also associated with High-E. High-N have only one collocation involving a word followed by a punctuation mark: the level-2 [though hpi], which can be considered a connective. This is, in fact, reminiscent of the Low-E level-2 use of [hpi although]. High-N do have a pair of level-1 collocations involving punctuation followed by a word: [hpi well] and [hpihpi well], and these pattern with another High-N collocation, [well i].

Looking at personal pronoun collocations more generally, we find that the Mid group over-use both [we] and [hpihpi we], while under-using [i go]. The Low-N only show a preference for collocations involving third person: [hpi it] and [hpi he]. More generally, singular proper nouns again pattern distinctively. High-N show a preference for [np1 hpi] (and at level-2, [to np1], contrasting with the Low-N level-2 [np1 and]). High-N are also associated with collocations involving [the film be]. This last collocation links to verbs and auxiliaries. The Low-N show a level-1 preference for [will be], [have be], [be in] and [to do]. The Mid group under-uses both [i go] and [that be]. On their own, the High-N only have level-2 collocations, associated with the negative [i ca] (the stem of I can’t) and [have to].

To sum up so far: stratified corpus comparison appears to be capable of revealing distinctive collocations for each of the personality dimensions of interest. Notably, the technique reveals features in our e-mail corpus even for Extraversion, where the dictionary-based techniques encountered difficulties. The collocations can be usefully grouped into types, and this helps to organise the bottom-up data.
Interestingly, via the results on proper noun collocations, lemmatisation has already pointed to the utility of parts-of-speech analysis. We pursue this further level of analysis in the next section, before returning to discuss how the current results bear on the hypotheses framed in Section 3.

7 Syntactic Analysis of the Corpus

7.1 Method

The personality corpus was tagged using the Penn part-of-speech tagset and the MXPOST tagger (Ratnaparkhi, 1996). Further processing removed the original words, leaving their associated POS tags. A subsequent stage of processing reduced the POS tags from the detailed Penn tagset to more general syntactic categories. The 45 Penn tags (see Marcus et al., 1994, for more details) were converted to 10 broader categories, as implemented in the electronic version of the Shorter Oxford English Dictionary which is incorporated into the MRC Psycholinguistic Database (Wilson, 1987). These are: Noun (nn), Adjective (adj), Verb (vbn), Adverb (adv), Preposition (prp), Conjunction (conj), Pronoun (prn), Interjection (int), Past Participle (vpp), and Other (o). In addition to these categories, we also make use of hpi, indicating punctuation, and ‘NA’, which indicates that a feature does not belong to any of the above categories and generally represents the hENDi marker, which was used to mark the end of the e-mail texts. Thus, the categories are a superset of those in the MRC database, but it should be noted that the labels for the categories differ in places from those used in both the Penn tagset and the MRC database: for instance, we use ‘prp’ where the MRC uses ‘R’.

The resulting general syntactic version of the corpus was then divided into the High-Mid-Low stratified corpus groups and analysed, as in the previous section. We first display the results of the unigram analysis for each dimension. We then display the results of the overall n-gram analyses (1–5 item sequences) together, as in the previous lemmatised word n-gram analysis.
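The reduction from Penn tags to broad categories can be pictured as a lookup table applied to the tagger output. The sketch below is hypothetical throughout: the text does not give the conversion table, so the entries in `PENN_TO_BROAD`, the set of punctuation tags, the default of ‘o’ for unlisted tags, and the function name `reduce_tags` are assumptions chosen for illustration. Only the broad category labels themselves come from the text.

```python
# Illustrative mapping only; the study's actual conversion table is not given.
# Note that the paper's label 'vbn' denotes the broad Verb category, whereas
# the Penn tag VBN (past participle) is mapped here to the paper's 'vpp'.
PENN_TO_BROAD = {
    "NN": "nn", "NNS": "nn", "NNP": "nn", "NNPS": "nn",
    "JJ": "adj", "JJR": "adj", "JJS": "adj",
    "VB": "vbn", "VBD": "vbn", "VBG": "vbn", "VBP": "vbn", "VBZ": "vbn",
    "VBN": "vpp",
    "RB": "adv", "RBR": "adv", "RBS": "adv", "WRB": "adv",
    "IN": "prp",
    "CC": "conj",
    "PRP": "prn", "PRP$": "prn", "WP": "prn", "WP$": "prn",
    "UH": "int",
}

# Penn punctuation tags, all collapsed to the single punctuation label.
PUNCTUATION_TAGS = {".", ",", ":", "``", "''", "(", ")", "-LRB-", "-RRB-"}


def reduce_tags(tagged_tokens):
    """Drop the words and return the sequence of broad category labels."""
    reduced = []
    for _word, penn_tag in tagged_tokens:
        if penn_tag in PUNCTUATION_TAGS:
            reduced.append("hpi")
        else:
            reduced.append(PENN_TO_BROAD.get(penn_tag, "o"))
    return reduced


# Hypothetical tagger output, for illustration only.
tagged = [("I", "PRP"), ("have", "VBP"), ("played", "VBN"),
          ("football", "NN"), (".", ".")]
broad = reduce_tags(tagged)
```

The resulting label sequence can then be fed to the same n-gram and log-likelihood procedure used for the lemmatised words.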

7.2 Unigram Syntactic Analysis Results

Results of the unigram analysis can be found in Tables 3, 4, and ??. We display the results for all tags present in our data; G² values which achieve significance of p ≤ 0.05 or p ≤ 0.01 are marked by ∗ or ∗∗ respectively. Once again, in the tables, we indicate whether a feature is under- or over-used by one of the three groups (High, Mid or Low), relative to the others. And when presenting a more concise view, in Fig. 2, we again list, for each dimension and each sub-group, features associated with that group either via their over-use of the feature, or an opposite group’s under-use.

High Extraverts  [conj]
Mid Extraverts   –
Low Extraverts   [vpp]
High Neurotics   [conj] [prn]
Mid Neurotics    –
Low Neurotics    [adj] [nn]

Figure 2: Summary of unigram POS analysis: characteristic language

For Extraversion, conjunction (conj) is characteristic of High-E, and past participle verbs (vpp) of Low-E. The Mid Extravert group shows no significant under- or over-use of the general tags. For Neuroticism, conjunction (conj) and pronouns (prn) are characteristic of High-N, and adjectives (adj) and nouns (nn) of Low-N. The Mid Neurotic group shows no significant under- or over-use of the general tags.

For these POS tag unigram results, we note the generally modest levels of significant differences between groups relative to the previous n-gram analyses. We may take this to indicate that the personality groups generally use quite similar proportions of the relevant parts of speech. However, the POSs may also occur in different contexts or sequences, thus indicating differences in the way they are used. We therefore turn to the results of the n-gram analysis of the syntactic tag data.

Rank  Feature             High   High     Mid    Mid      Low    Low      High-Mid  Low-Mid   High-Low  High-Mid     Low-Mid      High-Low
                          Freq.  R.Freq.  Freq.  R.Freq.  Freq.  R.Freq.  R.F.Ratio R.F.Ratio R.F.Ratio G²           G²           G²
1     VPP                 118    0.0173   202    0.0185   66     0.0260   0.93      1.41      0.67      0.34         5.43*        6.73**
2     CONJ                258    0.0378   338    0.0310   88     0.0347   1.22      1.12      1.09      5.80*        0.88         0.50
3     ADV                 562    0.0824   963    0.0882   238    0.0938   0.93      1.06      0.88      1.67         0.71         2.76
4     PRP                 679    0.0995   1100   0.1008   231    0.0910   0.99      0.90      1.09      0.06         2.02         1.40
5     O                   1071   0.1570   1714   0.1570   369    0.1454   1.00      0.93      1.08      0.00         1.82         1.64
6     VBN                 1156   0.1695   1804   0.1652   449    0.1769   1.03      1.07      0.96      0.44         1.65         0.60
7     hpi                 667    0.0978   1048   0.0960   228    0.0898   1.02      0.94      1.09      0.14         0.84         1.23
8     ADJ                 404    0.0592   617    0.0565   136    0.0536   1.05      0.95      1.11      0.53         0.32         1.03
9     NA                  23     0.0034   47     0.0043   9      0.0035   0.78      0.82      0.95      0.95         0.30         0.02
10    PRN                 696    0.1020   1118   0.1024   277    0.1091   1.00      1.07      0.93      0.01         0.89         0.89
11    NN                  1177   0.1725   1945   0.1782   442    0.1742   0.97      0.98      0.99      0.76         0.19         0.03
12    INT                 11     0.0016   21     0.0019   5      0.0020   0.84      1.02      0.82      0.23         0.00         0.13

[High Use, Mid Use and Low Use columns (+ over-use, − under-use) are not recoverable from the source layout; Fig. 2 lists the group each feature is associated with.]

Table 3: Reduced syntactic tag unigram analysis, Extraversion.

Rank  Feature             High   High     Mid    Mid      Low    Low      High-Mid  Low-Mid   High-Low  High-Mid     Low-Mid      High-Low
                          Freq.  R.Freq.  Freq.  R.Freq.  Freq.  R.Freq.  R.F.Ratio R.F.Ratio R.F.Ratio G²           G²           G²
1     ADJ                 193    0.0501   617    0.0565   447    0.0660   0.89      1.17      0.76      2.15         6.15*        10.50**
2     CONJ                155    0.0403   338    0.0310   210    0.0310   1.30      1.00      1.30      7.09**       0.00         6.01*
3     NN                  625    0.1624   1945   0.1782   1230   0.1815   0.91      1.02      0.89      4.13*        0.27         5.22*
4     PRN                 424    0.1102   1118   0.1024   648    0.0956   1.08      0.93      1.15      1.62         1.93         5.06*
5     INT                 9      0.0023   21     0.0019   6      0.0009   1.22      0.46      2.64      0.23         3.19         3.48
6     VPP                 63     0.0164   202    0.0185   146    0.0215   0.88      1.16      0.76      0.74         1.95         3.44
7     VBN                 688    0.1787   1804   0.1652   1132   0.1671   1.08      1.01      1.07      3.04         0.09         1.94
8     NA                  13     0.0034   47     0.0043   19     0.0028   0.78      0.65      1.20      0.63         2.63         0.26
9     PRP                 352    0.0915   1100   0.1008   650    0.0959   0.91      0.95      0.95      2.55         0.99         0.53
10    O                   627    0.1629   1714   0.1570   1035   0.1528   1.04      0.97      1.07      0.62         0.48         1.60
11    ADV                 318    0.0826   963    0.0882   595    0.0878   0.94      1.00      0.94      1.04         0.01         0.78
12    hpi                 382    0.0992   1048   0.0960   657    0.0970   1.03      1.01      1.02      0.31         0.04         0.13

[High Use, Mid Use and Low Use columns (+ over-use, − under-use) are not recoverable from the source layout; Fig. 2 lists the group each feature is associated with.]

Table 4: Reduced syntactic tag unigram analysis, Neuroticism.

7.3 N-gram Syntactic Analysis Results

Results using the reduced syntactic category tags are to be found in Tables 5, 6 and ??. The features here reach higher levels of significance than the unigrams, so we display only those which reach the critical value of 10.83 (i.e., p ≤ 0.001; level-2 in our current short-hand). 32 n-gram features reach this value for Neuroticism and 25 for Extraversion. Of these, the majority for Neuroticism and Extraversion also reach the 15.13 critical value (p ≤ 0.0001; level-1): 23 and 17, respectively. Level-1 features are predominantly bigrams, exceptions being the longer n-grams for punctuation found for Neuroticism. Once again, we distinguish whether a feature is under- or over-used by a group, and the concise view is given in Fig. 3.

Previously, we looked for distinctive word and/or punctuation collocations. In interpreting this new data, we now look for distinctive POS collocations. Table 7 shows, for each sub-group, how many distinctive level-1 and -2 collocations involving each POS were found. In the text, we focus primarily on level-1, except where otherwise stated.

High Extraverts
(1) [conj vbn] [nn nn] [adv hpi] [prn nn] [hpi o] [adv o] [adj hpi] [nn adv] [conj adv] [vpp prp] [adj o] [hpi adj]
(2) [prn o adv] [vbn o nn hpi] [prn o adv vbn] [hpi o vbn adj hpi] [hpihpihpi]
Mid Extraverts
(1) Underuse: [hpi adv] [hpi nn]
Low Extraverts
(1) [adv prp] [prn adv] [vbn prn o]
(2) [vbn prn adv] [conj vbn prn] [vbn hpi prn]
High Neurotics
(1) [vbn prp] [hpi o] [hpihpihpihpihpi] [hpihpihpihpi] [hpihpi] [hpihpihpi] [vbn prn o] [adj prn vbn] [prp adj] [vbn o vbn adv]
(2) [prn vbn prn o adv] [vbn adj conj] [adj prn] [vbn prn o adv vbn] [vbn prn o adv] [adv prn vbn prn]
Mid Neurotics
(1) Underuse: [prn hpi adv]
(2) Underuse: [nn vbn o adj] [nn vbn o adj nn] [prn o vbn hpi]
Low Neurotics
(1) [hpi adv] [prn adv] [adv adv] [adj hpi] [adv o] [vpp adv] [o adv] [adv prn] [conj adv] [adv vpp] [prn adj] [vpp prp]

Figure 3: Summary of POS analysis. N-grams reaching: (1) the 15.13 critical level (p ≤ 0.0001); (2) between 15.13 and 10.83 (0.0001 < p ≤ 0.001).

Rank  Feature             High   High     Mid    Mid      Low    Low      High-Mid  Low-Mid   High-Low  High-Mid     Low-Mid      High-Low
                          Freq.  R.Freq.  Freq.  R.Freq.  Freq.  R.Freq.  R.F.Ratio R.F.Ratio R.F.Ratio G²           G²           G²
1     hpi ADV             76     0.0111   0      0        36     0.0142   -         -         0.79      145.26****   120.11****   1.39
2     hpi NN              68     0.0100   0      0        25     0.0099   -         -         1.01      129.97****   83.41****    0.00
3     CONJ VBN            60     0.0088   0      0        0      0        -         -         -         114.68****   -            37.95****
4     ADV PRP             0      0        0      0        34     0.0134   -         -         -         -            113.44****   88.76****
5     NN NN               116    0.0170   220    0.0202   0      0        0.84      -         -         2.23         -            73.36****
6     ADV hpi             89     0.0130   180    0.0165   0      0        0.79      -         -         3.34         -            56.29****
7     PRN NN              75     0.0110   91     0.0083   0      0        1.32      -         -         3.11         -            47.43****
8     PRN ADV             0      0        0      0        14     0.0055   -         -         -         -            46.71****    36.55****
9     hpi O               71     0.0104   116    0.0106   0      0        0.98      -         -         0.02         -            44.90****
10    ADV O               65     0.0095   101    0.0093   0      0        1.03      -         -         0.03         -            41.11****
11    ADJ hpi             56     0.0082   88     0.0081   0      0        1.02      -         -         0.01         -            35.42****
12    NN ADV              55     0.0081   109    0.0100   0      0        0.81      -         -         1.71         -            34.78****
13    CONJ ADV            41     0.0060   49     0.0045   0      0        1.34      -         -         1.88         -            25.93****
14    VBN PRN O           0      0        36     0.0033   8      0.0032   -         0.96      -         -            0.01         20.89****
15    VPP PRP             33     0.0048   35     0.0032   0      0        1.51      -         -         2.84         -            20.87****
15    ADJ O               33     0.0048   58     0.0053   0      0        0.91      -         -         0.19         -            20.87****
16    hpi ADJ             25     0.0037   23     0.0021   0      0        1.74      -         -         3.65         -            15.81****
17    PRN O ADV           20     0.0029   51     0.0047   1      0.0004   0.63      0.08      7.44      3.31         14.76***     7.22**
18    VBN O NN hpi        25     0.0037   11     0.0010   3      0.0012   3.64      1.17      3.10      14.15***     0.06         4.57*
19    PRN O ADV VBN       19     0.0028   47     0.0043   1      0.0004   0.65      0.09      7.06      2.71         13.25***     6.68**
20    VBN PRN ADV         0      0        18     0.0016   5      0.0020   -         1.20      -         -            0.12         13.05***
20    CONJ VBN PRN        0      0        19     0.0017   5      0.0020   -         1.13      -         -            0.06         13.05***
21    VBN hpi PRN         2      0.0003   17     0.0016   8      0.0032   0.19      2.03      0.09      7.54**       2.46         12.14***
22    hpi O VBN ADJ hpi   6      0.0009   0      0        0      0        -         -         -         11.47***     -            3.79
23    hpihpihpi           18     0.0026   12     0.0011   0      0        2.40      -         -         5.67*        -            11.38***

[High Use, Mid Use and Low Use columns (+ over-use, − under-use) are not recoverable from the source layout; Fig. 3 lists the group each feature is associated with.]

Note. * p < .05, ** p < .01, *** p < .001, **** p < .0001, df = 1.

Table 5: Reduced syntactic tag n-gram analysis, Extraversion.

Rank  Feature             High   High     Mid    Mid      Low    Low      High-Mid  Low-Mid   High-Low  High-Mid     Low-Mid      High-Low
                          Freq.  R.Freq.  Freq.  R.Freq.  Freq.  R.Freq.  R.F.Ratio R.F.Ratio R.F.Ratio G²           G²           G²
1     VBN PRP             84     0.0218   0      0        0      0        -         -         -         225.90****   -            170.58****
2     hpi ADV             0      0        0      0        82     0.0121   -         -         -         -            157.42****   73.77****
3     hpi O               38     0.0099   116    0.0106   0      0        0.93      -         -         0.16         -            77.17****
4     PRN ADV             0      0        0      0        35     0.0052   -         -         -         -            67.19****    31.49****
5     ADV ADV             0      0        110    0.0101   73     0.0108   -         1.07      -         -            0.20         65.68****
6     hpihpihpihpihpi     24     0.0062   0      0        0      0        -         -         -         64.56****    -            48.75****
7     hpihpihpihpi        29     0.0075   2      0.0002   1      0.0001   41.15     0.81      51.06     64.38****    0.03         51.03****
8     ADJ hpi             0      0        88     0.0081   67     0.0099   -         1.23      -         -            1.57         60.28****
9     ADV O               0      0        101    0.0093   63     0.0093   -         1.01      -         -            0.00         56.68****
10    hpihpi              59     0.0153   48     0.0044   17     0.0025   3.49      0.57      6.11      40.46****    4.28*        54.32****
11    VPP ADV             0      0        0      0        27     0.0040   -         -         -         -            51.84****    24.29****
12    O ADV               0      0        89     0.0082   57     0.0084   -         1.03      -         -            0.03         51.28****
13    hpihpihpi           36     0.0094   12     0.0011   4      0.0006   8.51      0.54      15.85     50.08****    1.27         50.70****
14    VBN PRN O           24     0.0062   36     0.0033   0      0        1.89      -         -         5.53*        -            48.74****
15    ADV PRN             0      0        45     0.0041   35     0.0052   -         1.25      -         -            0.99         31.49****
15    CONJ ADV            0      0        49     0.0045   35     0.0052   -         1.15      -         -            0.40         31.49****
15    ADV VPP             0      0        52     0.0048   35     0.0052   -         1.08      -         -            0.14         31.49****
16    PRN ADJ             0      0        35     0.0032   28     0.0041   -         1.29      -         -            0.99         25.19****
17    VPP PRP             0      0        35     0.0032   25     0.0037   -         1.15      -         -            0.29         22.49****
18    ADJ PRN VBN         10     0.0026   2      0.0002   5      0.0007   14.19     4.03      3.52      17.29****    3.15         5.71*
19    PRP ADJ             8      0.0021   39     0.0036   0      0        0.58      -         -         2.18         -            16.25****
20    VBN O VBN ADV       5      0.0013   23     0.0021   1      0.0001   0.62      0.07      8.80      1.06         15.81****    5.65*
21    PRN hpi ADV         2      0.0005   0      0        8      0.0012   -         -         0.44      5.38*        15.36****    1.25
22    PRN VBN PRN O ADV   7      0.0018   3      0.0003   0      0        6.62      -         -         8.42**       -            14.22***
22    VBN ADJ CONJ        7      0.0018   7      0.0006   0      0        2.84      -         -         3.65         -            14.22***
23    ADJ PRN             11     0.0029   5      0.0005   7      0.0010   6.24      2.26      2.77      12.73***     1.97         4.58*
24    VBN PRN O ADV VBN   8      0.0021   2      0.0002   1      0.0001   11.35     0.81      14.09     12.72***     0.03         10.87***
25    VBN PRN O ADV       9      0.0023   3      0.0003   2      0.0003   8.51      1.07      7.92      12.52***     0.01         9.65**
26    NN VBN O ADJ        3      0.0008   0      0        6      0.0009   -         -         0.88      8.07**       11.52***     0.03
26    NN VBN O ADJ NN     2      0.0005   0      0        6      0.0009   -         -         0.59      5.38*        11.52***     0.46
26    PRN O VBN hpi       2      0.0005   0      0        6      0.0009   -         -         0.59      5.38*        11.52***     0.46
27    ADV PRN VBN PRN     6      0.0016   1      0.0001   1      0.0001   17.03     1.61      10.56     11.00***     0.11         7.34**

[High Use, Mid Use and Low Use columns (+ over-use, − under-use) are not recoverable from the source layout; Fig. 3 lists the group each feature is associated with.]

Note. * p < .05, ** p < .01, *** p < .001, **** p < .0001, df = 1.

Table 6: Reduced syntactic tag n-gram analysis, Neuroticism.

POS      Extraversion          Neuroticism           Total
         High   Mid   Low      High   Mid   Low
⟨p⟩         7     2     1         5     2     2        33
adj         4     0     0         4     2     2        14
adv         6     1     3         5     1     9        34
conj        2     0     1         1     0     1         8
nn          4     1     0         0     2     0        17
prn         3     0     5         7     2     3        31
prp         1     0     1         2     0     1        15
vbn         4     0     4         9     3     0        25
vpp         1     0     0         0     0     3         8
o           7     0     1         6     3     2        23
na          0     0     0         0     0     0         1
Total      39     4    16        39    15    23       209

Table 7: Distribution of distinctive collocations involving a given POS and reaching the 10.83 critical level (p ≤ 0.001).

7.3.1 Extraversion

From the unigram analysis, we are particularly interested in collocations involving conjunctions (for the High-E) and past participle verbs (for the Low-E). As far as conjunctions are concerned, High-E are associated with the use of [conj vbn] and [conj adv], while Low-E are associated with the level-2 [conj vbn prn]. The latter offers a particularly distinctive collocation, since the pronoun switches the preference from High to Low. Turning to past participles, we find that High-E prefer [vpp prp], but there are no preferred collocations for Low-E.

Given Table 7, the remaining discrepancies between the High-E and Low-E are as follows. Allowing that there are substantially more distinctive collocations for the High-E overall, we find that the High-E have more collocations involving adjectives and nouns. The Low-E have more collocations involving pronouns (and proportionately more involving verbs). Using a log-likelihood test, only the pronoun tendency is significant (p < 0.05).
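A log-likelihood test of this kind on the Table 7 counts can be illustrated with a short sketch. It assumes a 2 × 2 contingency formulation of G2 (collocations involving a given POS against all other distinctive collocations, for two groups); the text does not state which variant was applied, and the names `g2` and `significance_level` are purely illustrative. The thresholds are the standard chi-square critical values for df = 1.

```python
import math

# Chi-square critical values for df = 1 (illustrative lookup table).
CRITICAL_VALUES = [
    (15.14, "p < .0001"),
    (10.83, "p < .001"),
    (6.63, "p < .01"),
    (3.84, "p < .05"),
]


def g2(count_a, total_a, count_b, total_b):
    """Log-likelihood ratio for a 2 x 2 table: (feature, other) by (group A, group B)."""
    observed = [
        [count_a, total_a - count_a],
        [count_b, total_b - count_b],
    ]
    row_totals = [total_a, total_b]
    col_totals = [count_a + count_b, (total_a - count_a) + (total_b - count_b)]
    grand_total = total_a + total_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            obs = observed[i][j]
            if obs > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / grand_total
                stat += obs * math.log(obs / expected)
    return 2.0 * stat


def significance_level(stat):
    """Return the strictest threshold the statistic reaches, or 'n.s.'."""
    for threshold, label in CRITICAL_VALUES:
        if stat >= threshold:
            return label
    return "n.s."


# Pronoun collocations in Table 7: 3 of 39 (High-E) against 5 of 16 (Low-E).
pronoun_g2 = g2(3, 39, 5, 16)
print(round(pronoun_g2, 2), significance_level(pronoun_g2))  # 4.59 p < .05
```

Under these assumptions the pronoun contrast exceeds the 3.84 threshold, whereas the adjective and noun contrasts (4 of 39 against 0 of 16) give a value of roughly 2.88 and do not.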

7.3.2 Neuroticism

Here, we are most interested in collocations involving pronouns and conjunctions (for the High-N) and adjectives and nouns (for the Low-N). Taking pronouns first, we find a High-N preference for [adj prn vbn] and [vbn prn o]; there are no level-1 collocations involving conjunctions. Both the pronoun collocations also involve verbs. In fact, putting to one side the punctuation collocations, all but one of the High-N preferences involve verbs; the only exception is [prp adj]. Like the High-N, the Low-N have one pronoun collocation involving an adjective, [prn adj], but the other three of their preferred pronoun or conjunction collocations instead involve adverbs: [prn adv], [adv prn] and [conj adv].

Given Table 7, and allowing that there are rather more distinctive collocations for the High-N group overall, we find that the High-N have more collocations involving verbs. The Low-N have more collocations involving past participle verbs and adverbs. All these tendencies are significant (p < 0.05).
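Collocations such as [vbn prn o] are contiguous sequences of reduced POS tags. As a rough illustration, the sketch below counts tag n-grams in a tagged message and converts the counts to relative frequencies. The tag sequence is invented, the function names are hypothetical, and dividing by the number of n-grams of the same length is an assumption about the denominator rather than a description of the procedure used in this study.

```python
from collections import Counter


def tag_ngrams(tags, n):
    """Count contiguous n-grams over a sequence of POS tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def relative_frequencies(counts):
    """Divide each count by the total number of n-grams counted."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ngram: count / total for ngram, count in counts.items()}


# Invented tag sequence using labels from the reduced tagset.
tags = ["prn", "vbn", "prn", "o", "adv", "conj", "vbn", "prn"]

bigrams = tag_ngrams(tags, 2)
trigrams = tag_ngrams(tags, 3)

print(bigrams[("vbn", "prn")])           # occurrences of [vbn prn]
print(trigrams[("conj", "vbn", "prn")])  # occurrences of [conj vbn prn]
print(relative_frequencies(bigrams)[("vbn", "prn")])
```

Counts of this kind, gathered separately for the High, Mid and Low groups, are the sort of input that a pairwise G2 comparison between groups would take.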

8 Discussion

Using data-driven techniques we have been able to investigate linguistic features which characterise the expression of personality in e-mail communication, without being restricted by pre-defined dictionaries. We discuss findings for each dimension in turn, and then sketch the beginnings of a model which could provide an integrated explanation of them.

8.1 Extraversion

The original predictions were as follows:

Fluency: Extraverts will write more words overall. A written reflex of spoken fluency will be that, rather than using full stops between clauses, they will use greater amounts of informal or non-standard punctuation (ellipsis, exclamation, hyphenation), together with conjunctions which help form longer constituents and clauses.

Positivity: Extraverts will use more terms indicating positive affect (pleasure, agreement, compliments, ability), and fewer terms indicating negative affect. They will use fewer tentative expressions, such as hedges like ‘possibly’, and problem talk, including negations and causation words. A positive view of ability may also be reflected in verb-modification patterns.

Implicitness: Extravert language will contain: more adverbs, pronouns, and verbs; and fewer nouns, adjectives (modifiers) and prepositions. Personal pronouns are also expected because of the social tendency to refer to self and other people; verbs are also expected via reference to actions. The use of fewer nouns patterns with the use of fewer articles.

Let us consider the fluency hypotheses first. We noted the existence of non-standard, multiple punctuation for High-E, shown here in distinctive POS collocations, and confirmed in unlemmatised word collocations. There is a mixture of ellipsis and multiple exclamation. The former fits fluency particularly well. The word collocations showed a number of conjunction constructions for High-E, and their distinctive use of conjunction was confirmed by the unigram POS frequency results. It was notable, also, that High-E conjunction collocations included what LIWC would term inclusive words, such as ‘and’ and ‘with’, while the Low-E had ‘although’, an exclusive word.

Next, there is the issue of positivity. High-E showed a preference for collocations containing the positive affect terms ‘cool’ and ‘take’ (from ‘take care’). However, Low-E did use ‘play’, which also has positive connotations. It was notable that the Low-E had collocations containing tentative or negative terms: ‘although’ and ‘fairly’. The expected sociability of High-E was reflected in a bias towards word collocations involving the first person singular; High-E had five of these collocations, while Low-E had only two. We shall return to these patterns of pronoun use while discussing implicitness, below.

From the point of view of ability, we found High-E with collocations involving ‘will’ and Low-E with ‘be supposed to’, with connotations of obligation. Low-E also showed a general preference for past participle verbs. This turns our attention towards the implicitness hypothesis.


Following Dewaele and Furnham, it was predicted that High-E will use more verbs, adverbs and pronouns, and Low-E will use more nouns, adjectives, and prepositions. The unigram POS analysis did not support these predictions. It indicated that High-E use more conjunctions, and that Low-E use more past participle verbs. No other overall differences were found, although it is worth noting that having split past participles from general verbs, our categories are slightly finer-grained, which may affect the result. One reason for the divergence from Dewaele and Furnham’s results may be that they were largely dealing with speech, rather than writing, and with non-native speakers in particular. Perhaps implicitness is more closely related to another dimension, like Neuroticism, for native writers, and more closely related to Extraversion only for non-native speakers. Another possible reason is methodological. As noted earlier, with a small number of Low-E left after stratification of the corpus, it could be that these subjects are unrepresentative. However, their behaviour with respect to fluency and positivity was as expected, and hence the subjects so far appear representative. Thus, before pursuing either of these lines of reasoning, we should also consider the results of the n-gram POS analysis. At least two gross patterns are interesting. First, where a High and Low group do not differ overall in the relative frequency of use of a POS, one group may have rather more types of distinctive collocation involving that POS than the other group. If overall use does not differ, it means that the former group is using the POS in a narrower, or perhaps more stereotypical, range of contexts; the latter group is using the POS more flexibly, in many different contexts. Let us call the greater-range case ‘pervasive’ use. 
Secondly, where a High and Low group do differ in relative frequency of use of a POS, it is interesting to note whether higher frequency is associated with a greater set of collocations involving that POS, or a smaller set. Intuitions here are not firm; but we might expect that greater relative frequency is associated with a greater range of use, and hence perhaps with fewer stereotypical collocations. If so, frequency may track pervasiveness.

So, consider again the implicitness hypothesis: High-E will use more verbs, adverbs and pronouns, and Low-E will use more nouns, adjectives, and prepositions. We find that High-E prefer conjunctions overall, but that it is the Low-E who tend towards POS-collocations involving pronouns. So High-E use of pronouns may not be greater overall, but it is pervasive. Equally, Low-E prefer past participle verbs overall, and we have (weaker) evidence that the High-E tend towards POS-collocations involving adjectives and nouns. So, perhaps Low-E use of adjectives and nouns is pervasive. But more definitely, since Low-E actually use proportionately more vpp, their complete lack of distinctive robust collocations suggests that they use vpp pervasively. No additional results are found for prepositions. Perhaps, pace LIWC, the exclusives and inclusives are preferred by different groups, as suggested above; but these differences are not apparent at the POS level. Thus, we conclude that, by supplementing the frequency findings with the idea of pervasiveness, our results can be seen as consistent with the implicitness hypothesis.
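The notion of pervasiveness can be made concrete with a small tabulation in the style of Table 7. The sketch below is only one possible operationalisation: it counts, for each POS tag, how many distinctive collocations each group has, and treats the group with fewer such collocations as the candidate for pervasive use. The data structure and function names are hypothetical, counting a collocation once per distinct tag is an assumption, and the example uses only the four Extraversion collocations cited in Section 7.3.1.

```python
from collections import defaultdict


def tabulate_collocations(distinctive):
    """Count distinctive collocations per tag and group.

    `distinctive` is an iterable of (tag_sequence, group) pairs.
    """
    table = defaultdict(lambda: defaultdict(int))
    for tag_sequence, group in distinctive:
        for tag in set(tag_sequence):  # count each collocation once per tag
            table[tag][group] += 1
    return table


def pervasive_candidate(table, tag, group_a, group_b):
    """Return the group with fewer distinctive collocations for `tag`, or None if tied."""
    count_a = table[tag][group_a]
    count_b = table[tag][group_b]
    if count_a == count_b:
        return None
    return group_a if count_a < count_b else group_b


# Collocations cited in Section 7.3.1, with the group that prefers each.
distinctive = [
    (("conj", "vbn"), "High-E"),
    (("conj", "adv"), "High-E"),
    (("vpp", "prp"), "High-E"),
    (("conj", "vbn", "prn"), "Low-E"),
]

table = tabulate_collocations(distinctive)
print(pervasive_candidate(table, "prn", "High-E", "Low-E"))  # High-E
print(pervasive_candidate(table, "vpp", "High-E", "Low-E"))  # Low-E
```

The sketch covers only the collocation side of the argument; the discussion combines it with the unigram frequency results, which are not modelled here.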

8.2 Neuroticism

The original predictions were as follows:

Self-concern: High Neurotics are worriers, and we expect this self-preoccupation to be expressed through a preference for first person singular pronouns over second or third person, or plural, pronouns. There should be more inclusive words, associated with a desire for attachment.

Negativity: High Neurotics’ language will use more terms associated with emotion, particularly negative affect. Conceivably, it might also generally use more terms for positive affect, but previous findings suggest there will be fewer positive affect terms. A negative view of ability may also be reflected in verb-modification patterns.

Implicitness: High Neurotic language will contain: more adverbs, pronouns, and verbs; and fewer nouns, adjectives (modifiers) and prepositions. In fact, more emotion in general is also associated with intensified language, such as adverbs and adjectives. So we do expect more adverbs, but the prediction of more adjectives would not fit the general implicitness hypothesis. Pronoun use will be greater, but differentiated as noted above. The use of fewer nouns patterns with the use of fewer articles.

Consider self-concern. From the results on word collocations, we find a High-N tendency towards first person, which contrasts with a Low-N preference for third person collocations. Mid Neurotics show a preference for first person plural. The unigram POS results confirm an overall High-N preference for pronouns, and also show a preference for conjunctions. If the latter involve inclusive words in particular, they fit with a desire for attachment. The POS collocation results add nothing further to this picture.

Next, there is the issue of negativity. The High-N preference for multiple punctuation is revealed by both the word and POS collocation analyses, and reflects a particular use of exclamation marks, which fits with greater emotional expressiveness, but not negativity per se. In this connection, we note the presence of High-N collocations involving clause-initial ‘well’; this is a filler expression with little independent meaning, but with connotations of concession. An expectation of intensity might lead to greater use of both adjectives and adverbs. In fact, the POS unigram results indicated that it was the Low-N who preferred adjectives; we defer the matter of adverbs to the discussion of implicitness. Negativity did appear in both clause connectives and ability-related expressions. High-N preferred ‘though’ to expressions such as ‘as’ and ‘so’. And they preferred ‘can’t’ and ‘have to’ to collocations involving ‘will’, ‘have’ and ‘be’.

Finally, we consider the implicitness hypothesis. Like High-E, High-N are predicted to prefer implicit language: High-N will use more verbs, adverbs and pronouns, and Low-N will use more nouns, adjectives, and prepositions. The unigram analysis partially supported these predictions. It found that High-N use more pronouns (and conjunctions), and that Low-N use more nouns and adjectives. However, no overall differences were found for verbs, adverbs or prepositions.
Taking pervasiveness into account, we find that the Low-N tend towards POS-collocations involving past participle verbs and adverbs. So perhaps High-N use of past participle verbs and adverbs is pervasive. Equally, it is the High-N who tend towards POS-collocations involving verbs. And so perhaps Low-N use of verbs is pervasive. This pattern is not so simple as the Extravert case, in part because we have split the verb category in two, distinguishing past participle verbs from verbs in general. Putting this to one side, however, we do find High-N use of adverbs to be pervasive; and this approach to Neurotic implicitness fits the picture of pervasiveness that seemed to be emerging with Extraversion.

8.3 Towards a model of personality and language production

Our study has used bottom-up methods to reveal personality patterns in e-mail text. Specifically, we have found support for: Extravert fluency, positivity and implicitness; and Neurotic self-concern, negativity and implicitness. But how are these features to be explained at the level of language production?

In seeking a model, we start from the following point. Both Extraversion and Neuroticism are associated with implicit language; the first dimension also involves fluency, while the second also involves self-concern. Heylighen and Dewaele (2002) have indicated why particular parts of speech, like pronouns or adverbs, are likely to group together as implicit. Essentially, they are more context-bound, being better interpretable with respect to the immediate environment, either linguistic or extra-linguistic. They also require less cognitive effort to produce, compared with context-neutral, ‘syntactic mode’ expressions. For our purposes, it follows that a language producer will tend towards implicit language either when they have greater access to salient information about the immediate environment, or when they have more limited processing resources available for linguistic choice.

Now, we can argue that High-E and High-N tend towards implicitness for different reasons. Dewaele (2002a) notes that High Extraverts are considered to have larger working memory capacities, and he suggests that this allows working memory’s central executive to divert resources to the visual-spatial scratchpad. With this resource, utterances can be related to the spatio-temporal context. In turn, the typically briefer context-bound linguistic expressions require less of the phonological loop’s time-limited capacity. We can add that High Extraverts have a relatively large capacity there, anyway, and together these facts mean that a larger number of expressions can be queued in the loop. Enhanced visual-spatial capacity reduces the need for detailed language planning, allowing more words to be produced instead, and thus enables High Extraverts to gain or maintain the conversational floor. This allows Extraverts to pursue the social drive to interact with others. Hence, fluency and implicitness interact, at the level of detailed language planning.

By contrast, Neurotics have generally limited resources because they are preoccupied with personal concerns; this is a key feature of the S-REF model of neuroticism proposed by Wells and Matthews (1994). For us, anxiety about external threats, and processing over internal representations of issues concerning the self, diverts resources away from language planning; hence, Neurotics become more implicit, while becoming more likely to speak about their current personal concerns. Here, self-concern and implicitness interact, but they do so at a stage prior to detailed language planning. High Extraverts have plenty of resource for linguistic interaction, but need to put less of it into detailed planning. High Neurotics have less resource for linguistic interaction in the first place.

If this is so, then it suggests a simple model of the relations between personality variation and language production algorithms. It can be framed in terms of Levelt’s (1989) architecture for human language production, and also in terms of fairly standard natural language generation systems. First, Extraversion finds its effects mostly at the stages of formulation (surface realisation). That is, the process and representations used in realisation differ between High and Low Extraverts. Secondly, Neuroticism finds its effects at the stage of conceptualisation (content selection). That is, the process and representations used in content selection differ between High and Low Neurotics.
Since content selection precedes surface realisation, variations in Neuroticism will have consequences beyond the content selection stage, but this is their primary locus.

The next question is whether the findings on positivity and negativity fit this framework. The answer is that they do, if we advert to some of the work in cognitive neuroscience which relates hemispheric asymmetry to the processing of affect, and to language production. On the one hand, there is evidence that emotion processing is lateralised. In one influential model, the left cerebral hemisphere (LH) is responsible for approach behaviours, and the right (RH) for withdrawal behaviours, and these can be mapped to levels of Extraversion and Neuroticism respectively (for instance, Davidson, 2001; Davidson and Irwin, 1999). On the other hand, it is widely accepted that language processing is lateralised. Areas within the LH (particularly Broca’s area) have long been acknowledged to play a role in syntactic language processing, and in sequencing behaviour more generally. There is also support for the view that the RH is important for relating language to context, and damage can lead to impairment in processing extended discourse, nonliteral meaning, indirect requests, prosody and humour (Beeman, 1998; Stemmer and Joanette, 1998; Beeman et al., 2000). Borod et al. (1998) explicitly review the links between linguistic and emotional asymmetries, and note that while some studies suggest that the RH is specialised for lexical emotion in general (which is their favoured hypothesis), others suggest that valence matters, so that the LH is specialised for positive emotional language and the RH for negative emotional language.

Given this, we could maintain that language production processes involving broad semantic associations are linked to the RH, which is associated with negative emotionality, and level of Neuroticism. Production processes involving surface sequencing are linked to the LH, which is associated with positive emotionality and level of Extraversion. So, this fits with the idea that the two dimensions find their effects at different stages of language production.

In spite of this support, however, an empirically adequate model is unlikely to be simple. One reason is that not all of the emerging results from brain imaging fit with hemispheric models. For instance, on emotion, Canli et al. (2001) report a functional imaging study in which Extraversion does appear to be associated with reactivity to positive stimuli in LH neural clusters. And Neuroticism is associated with reactivity to negative stimuli in other specific neural clusters, but these are not lateralised to the RH. And in arguing for cognitive models of personality, Matthews et al. (2000) discuss studies where trait anxiety is associated with left hemisphere attentional focussing activity. Equally, on language lateralisation, the differential role of the RH in processing concrete (as opposed to abstract) words has recently been questioned (Fiebach and Friederici, 2004). Nonetheless, according to Poeppel and Hickok (2004), ‘one of the main consequences of imaging research has been to highlight the extensive activation of the right hemisphere in language tasks’ (p. 10).

A more complex model of the links between personality and production would certainly move beyond gross anatomical coincidences, and instead look for local relationships between the areas involved in emotional processing and those involved in language processing. At this finer level of detail, it could still be that areas involved in surface realisation are more closely linked to those involving positive affect and approach; and areas involved in content selection are more closely linked with those involving negative affect and withdrawal.

9 Conclusion

So, we have proposed that surface realisation, Extraversion, approach and left hemisphere go together; and that content selection, Neuroticism, withdrawal and right hemisphere go together. This model of the links between language and personality is simple and speculative, but it can be put to the test both in simulation, and in further experiment beyond the current domain. Methods for probing the affect–production link go well beyond corpus studies, and even without using imaging or lesion studies, a range of techniques is available. For instance, tendencies to approach and withdraw are likely to influence measurable aspects of interactive dialogue behaviour (Gill et al., 2004). Whether or not the links are real, the results we have uncovered confirm that there are linguistic differences which can be systematically associated with the differing characters of language users. Specifically, we link Extraversion with fluency, implicitness and positivity; and Neuroticism with self-concern, implicitness and negativity. Those results have been derived using bottom-up stratified corpus comparison methods which are sufficiently sensitive to avoid some of the problems associated with dictionary-based methods. And the results confirm that individual differences persist in the medium of e-mail communication.


Acknowledgements

We are grateful to our colleagues for extensive discussions, advice and assistance. In particular, we would like to thank Elizabeth Austin, Carsten Brockmann, Zoë Bruce, James Curran, Jean-Marc Dewaele, Scott Nowson, Judy Robertson, and Keith Stenning. We are also obliged to anonymous reviewers for, and audience members at, meetings where aspects of this work have been presented. The second author gratefully acknowledges studentship support from the UK Economic and Social Research Council.

References

Aarts, J. and Granger, S. (1998). Tag sequences in learner corpora: A key to interlanguage grammar and discourse. In Granger, S., editor, Learner English on Computer, Studies in Language and Linguistics, pages 132–141. Addison Wesley Longman, New York.
Ball, C. (1994). Automated text analysis: Cautionary tales. Literary and Linguistic Computing, 9:295–302.
Bälter, O. (1998). Electronic Mail in a Working Context. PhD thesis, Royal Institute of Technology, Stockholm.
Banerjee, S. and Pedersen, T. (2003). The design, implementation, and use of the ngram statistics package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City.
Baron, N. (1998). Letters by phone or speech by other means: the linguistics of email. Language and Communication, 18:133–170.
Beeman, M. (1998). Coarse semantic coding and discourse comprehension. In Beeman, M. and Chiarello, C., editors, Right Hemisphere Language Comprehension, pages 255–284. Lawrence Erlbaum Associates, Mahwah, New Jersey.

Beeman, M., Bowden, E., and Gernsbacher, M. (2000). Right and left hemisphere cooperation for drawing predictive and coherence inferences during normal story comprehension. Brain and Language, 71:310–336.
Biber, D. (1995). Dimensions of Register Variation. Cambridge University Press, Cambridge.
Borod, J., Bloom, R., and Santschi-Haywood, C. (1998). Verbal aspects of emotional communication. In Beeman, M. and Chiarello, C., editors, Right Hemisphere Language Comprehension, pages 285–307. Lawrence Erlbaum Associates, Mahwah, New Jersey.
Bradac, J. (1990). Language attitudes and impression formation. In Giles, H. and Robinson, W., editors, Handbook of Language and Social Psychology, pages 387–412. Wiley, Chichester.
Buckingham, R., Charles, M., and Beh, H. (2001). Extraversion and Neuroticism, partially independent dimensions? Personality and Individual Differences, 31:769–777.
Busch, D. (1982). Introversion-Extraversion and the EFL proficiency of Japanese students. Language Learning, 32:109–132.
Campbell, A. and Rushton, J. (1978). Bodily communication and personality. British Journal of Social and Clinical Psychology, 17:31–36.
Campbell, R. and Pennebaker, J. (2003). The secret life of pronouns: Flexibility in writing style and physical health. Psychological Science, 14:60–65.
Canli, T., Zhao, Z., Kang, E., Gross, J., Desmond, J., and Gabrieli, J. (2001). An fMRI study of personality influences on brain reactivity to emotional stimuli. Behavioral Neuroscience, 115:33–42.
Carment, D. W., Miles, C. G., and Cervin, V. B. (1965). Persuasiveness and persuasibility as related to Intelligence and Extraversion. British Journal of Social and Clinical Psychology, 4:1–7.

Carpenter, P., Just, M., and Shell, P. (1990). What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices Test. Psychological Review, 97:404–431.
Chambers, J. and Trudgill, P. (1980). Dialectology. Cambridge University Press, Cambridge.
Colley, A. and Todd, Z. (2002). Gender-linked differences in the style and content of e-mails to friends. Journal of Language and Social Psychology, 21:380–392.
Cope, C. (1969). Linguistic structure and personality development. Journal of Counselling Psychology, 16:1–19.
Costa, P. and McCrae, R. (1984). Personality as a lifelong determinant of well-being. In Malatesta, C. and Izard, C., editors, Affective processes in adult development and aging, pages 141–157. Sage, Beverly Hills, CA.
Costa, P. and McCrae, R. R. (1992). NEO PI-R Professional Manual. Psychological Assessment Resources, Odessa, Florida.
Damasio, A. (1994). Descartes’ Error: Emotion, Reason and the Human Brain. Putnam, New York.
Damerau, F. (1993). Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 29:433–448.
Davidson, R. and Irwin, W. (1999). The functional neuroanatomy of emotion and affective style. Trends in Cognitive Sciences, 3:11–21.
Davidson, R. J. (2001). Toward a biology of personality and emotion. Annals of the NY Academy of Sciences, 935:191–207.
Deary, I. and Matthews, G. (1993). Personality traits are alive and well. The Psychologist, 6:299–311.

Depue, R. and Collins, P. (1999). Neurobiology of the structure of personality: Dopamine, facilitation of incentive motivation, and extraversion. Behavioral and Brain Sciences, 22:491–569.
Dewaele, J.-M. (1993). Extraversion et richesse lexicale dans deux styles d’interlangue française [Extraversion and lexical richness in 2 styles of French interlanguage]. I.T.L Review of Applied Linguistics, 22:87–105.
Dewaele, J.-M. (1995). Variation dans la longueur moyenne d’énoncés dans l’interlangue française [Variation in the mean length of utterances in French interlanguage]. In Beheydt, L., editor, Linguistique appliquée dans les années 90 [Special Issue], volume 16 of ALBA Papers, pages 43–58. ALBA.
Dewaele, J.-M. (1998). Speech rate variation in 2 oral styles of advanced French interlanguage. In Regan, V., editor, Contemporary approaches to second language acquisition in social context: Cross-linguistic perspectives, pages 113–123. University College Academic Press, Dublin.
Dewaele, J.-M. (2001). Interpreting the maxim of quantity: interindividual and situational variation in discourse styles of non-native speakers. In Nèmeth, E., editor, Selected Papers from the 7th International Pragmatics Conference, volume 1, pages 85–99. International Pragmatics Association, Antwerp.
Dewaele, J.-M. (2002a). Individual differences in L2 fluency: the effect of neurobiological correlates. In Cook, V., editor, Portraits of the L2 user, pages 219–250. Multilingual Matters, Clevedon.
Dewaele, J.-M. (2002b). Psychological and sociodemographic correlates of communication anxiety in L2 and L3 production. International Journal of Bilingualism, 6:23–28.
Dewaele, J.-M. and Furnham, A. (1999). Extraversion: The unloved variable in applied linguistic research. Language Learning, 49:509–544.

Dewaele, J.-M. and Furnham, A. (2000). Personality and speech production: a pilot study of second language learners. Personality and Individual Differences, 28:355–365.
Digman, J. (1990). Personality structure: Emergence of the five-factor model. Annual Review of Psychology, 41:417–440.
Eysenck, H. (1970). The Biological Basis of Personality. Thomas, Springfield, IL.
Eysenck, H. and Eysenck, S. B. G. (1975). The Eysenck Personality Questionnaire. Hodder and Stoughton, London.
Eysenck, H. and Eysenck, S. B. G. (1991). The Eysenck Personality Questionnaire-Revised. Hodder and Stoughton, Sevenoaks.
Eysenck, H. J. and Eysenck, S. B. G. (1964). Manual of the Eysenck Personality Inventory. University of London Press.
Eysenck, S., Eysenck, H., and Barrett, P. (1985). A revised version of the psychoticism scale. Personality and Individual Differences, 6:21–29.
Fiebach, C. and Friederici, A. (2004). Processing concrete words: fMRI evidence against a specific right-hemisphere involvement. Neuropsychologia, 42:62–70.
Funder, D. (2001). Personality. Annual Review of Psychology, 52:197–221.
Funder, D. C. (1995). On the accuracy of personality judgement: A realistic approach. Psychological Review, 102:652–670.
Furnham, A. (1990). Language and personality. In Giles, H. and Robinson, W., editors, Handbook of Language and Social Psychology, pages 73–95. Wiley, Chichester.
Gifford, R. and Hine, D. W. (1994). The role of verbal behaviour in the encoding and decoding of interpersonal dispositions. Journal of Research in Personality, 28:115–132.
Gill, A. (2003). Personality and Language: The projection and perception of personality in computer-mediated communication. PhD thesis, University of Edinburgh, Edinburgh, UK.

Gill, A., Harrison, A., and Oberlander, J. (2004). Interpersonality: Individual differences and interpersonal priming. In Proceedings of the 26th Annual Conference of the Cognitive Science Society, pages 464–469.
Goldberg, L. (1993). The structure of phenotypic personality traits. American Psychologist, 48:26–34.
Granger, S. and Rayson, P. (1998). Automatic profiling of learner texts. In Granger, S., editor, Learner English on Computer, Studies in Language and Linguistics, pages 119–131. Addison Wesley Longman, New York.
Heylighen, F. and Dewaele, J.-M. (2002). Variation in the contextuality of language: An empirical measure. Foundations of Science, 7:293–340.
Howeler, M. (1972). Diversity of word usage as a stress indicator in an interview situation. Journal of Psychological Research, 1:243–248.
Kiesler, D. (1983). The 1982 interpersonal Circle: A taxonomy for complementarity in human transactions. Psychological Review, 90:185–214.
Kline, P. (1993). Comments on “personality traits are alive and well”. The Psychologist, 6:304.
Labov, W. (1972). Sociolinguistic Patterns. University of Pennsylvania Press, Philadelphia.
Larstone, R., Jang, K., Livesley, W., Vernon, P., and Wolf, H. (2002). The relationship between Eysenck’s P-E-N model of personality, the five-factor model of personality, and traits delineating personality dysfunction. Personality and Individual Differences, 33:25–37.
Levelt, W. (1989). Speaking: From Intention to Articulation. MIT Press, Cambridge, MA.
Lippa, R. and Dietz, K. (2000). The relations of gender, personality, and intelligence to judges’ accuracy in judging strangers’ personality from brief video segments. Journal of Nonverbal Behavior, 24:25–43.

Marcus, M., Santorini, B., and Marcinkiewicz, M. (1994). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.
Matthews, G., Deary, I., and Whiteman, M. (2003). Personality Traits. Cambridge University Press, Cambridge, 2nd edition.
Matthews, G., Derryberry, D., and Siegle, G. (2000). Personality and emotion: Cognitive science perspectives. In Hampson, S., editor, Advances in personality psychology, volume 1, pages 199–237. Routledge, London.
McCrae, R. and Costa, P. (1987). Validation of the five-factor model of personality across instruments and observers. Journal of Personality and Social Psychology, 52:81–90.
McCrae, R. and Costa, P. (1989). The structure of interpersonal traits: Wiggins’s Circumplex and the five-factor model. Journal of Personality and Social Psychology, 56:586–595.
McCrae, R. and Costa, P. (1997). Personality trait structure as a human universal. American Psychologist, 52:509–516.
McCroskey, J. and Richmond, V. (1990). Willingness to communicate: A cognitive view. Journal of Social Behaviour and Personality, 5:19–37.
Mehl, M. and Pennebaker, J. (2003). The sounds of social life: A psychometric analysis of student’s daily social interactions. Journal of Personality and Social Psychology, 84:857–870.
Milton, J. (1998). Exploiting L1 and interlanguage corpora in the design of an electronic language learning and production environment. In Granger, S., editor, Learner English on Computer, Studies in Language and Linguistics, pages 186–198. Addison Wesley Longman, New York.
Pennebaker, J. W. and Francis, M. (1999). Linguistic Inquiry and Word Count (LIWC). Lawrence Erlbaum Associates, Mahwah, NJ.

Pennebaker, J. W., Francis, M. E., and Booth, R. J. (2001). Linguistic Inquiry and Word Count (LIWC2001). Lawrence Erlbaum Associates, Mahwah, NJ.
Pennebaker, J. W. and King, L. (1999). Linguistic styles: Language use as an individual difference. Journal of Personality and Social Psychology, 77:1296–1312.
Poeppel, D. and Hickok, G. (2004). Towards a new functional anatomy of language. Cognition, 92:1–12.
Ramsay, R. (1968). Speech patterns and personality. Language and Speech, 11:54–63.
Ratnaparkhi, A. (1996). A maximum entropy part-of-speech tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania.
Rayson, P., Leech, G., and Hodges, M. (1997). Social differentiation in the use of English vocabulary: some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics, 2:133–152.
Rayson, P. (2003). Matrix: A statistical method and software tool for linguistic analysis through corpus comparison. PhD thesis, Lancaster University.
Scherer, K. (1979). Personality markers in speech. In Scherer, K. R. and Giles, H., editors, Social Markers in Speech, pages 147–209. Cambridge University Press, Cambridge.
Scherer, K. R. (1978). Inference rules in personality attribution from voice quality: The loud voice of extraversion. European Journal of Social Psychology, 8:467–487.
Siegman, A. W. (1978). The meaning of short pauses in the interview. Journal of Nervous and Mental Disease, 166:387–406.
Siegman, A. W. (1987). The tell-tale voice: Nonverbal messages of verbal communication. In Siegman, A. and Feldstein, S., editors, Nonverbal behaviour and communication, pages 642–654. Erlbaum, Hillsdale, NJ.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford University Press, Oxford.

Smith, C. (1992). Introduction: inferences from verbal material. In Smith, C., editor, Motivation and personality: Handbook of thematic content analysis, pages 1–17. Cambridge University Press, Cambridge.
Stallman, R. (1994). GNU Emacs Manual. Free Software Foundation Press, Boston, MA, 10th edition.
Stemmer, B. and Joanette, Y. (1998). The interpretation of narrative discourse of brain-damaged individuals within the framework of a multilevel discourse model. In Beeman, M. and Chiarello, C., editors, Right Hemisphere Language Comprehension, pages 329–348. Lawrence Erlbaum Associates, Mahwah, New Jersey.
Stephenson, G., Laszlo, J., Ehmann, B., Lefever, R., and Lefever, R. (1997). Diaries of significant events: Socio-linguistic correlates of therapeutic outcomes in patients with addiction problems. Journal of Community and Applied Psychology, 7:389–411.
Tapasak, R., Roodin, P., and Vaught, G. (1979). Effects of extraversion, anxiety, and sex on children’s verbal fluency and coding task performance. The Journal of Psychology, 100:49–55.
Teiger, P. and Barron-Teiger, B. (1998). The Art of SpeedReading People. Little, Brown, Boston.
Thorne, A. (1987). The press of personality: A study of conversations between introverts and extraverts. Journal of Personality and Social Psychology, 53:718–726.
Trapnell, P. D. and Wiggins, J. S. (1990). Extension of the interpersonal adjective scales to include the big five dimensions of personality. Journal of Personality and Social Psychology, 59:781–790.
Tribble, C. (2000). Genres, keywords, teaching: towards a pedagogic account of the language of project proposals. In Burnard, L. and McEnery, T., editors, Rethinking language pedagogy from a corpus perspective, pages 75–90. Peter Lang, Frankfurt.

Wells, A. and Matthews, G. (1994). Attention and emotion: a clinical perspective. Erlbaum, Hove.
Wiggins, J. and Pincus, A. (1992). Personality: Structure and assessment. Annual Review of Psychology, 43:473–504.
Wiggins, J. S. (1979). A psychological taxonomy of trait-descriptive terms: The interpersonal domain. Journal of Personality and Social Psychology, 37:395–412.
Wilson, M. (1987). MRC Psycholinguistic Database: Machine usable dictionary. Technical report, Oxford Text Archive, Oxford.
Wray, A. and Perkins, M. (2000). The functions of formulaic language: an integrated model. Language and Communication, 20:1–28.
Yellen, R., Winniford, M., and Sanford, C. (1995). Extraversion and introversion in electronically-supported meetings. Information & Management, 28:63–74.