A Unified Test for Substantive and Statistical Significance in Political Science

Justin H. Gross
University of North Carolina at Chapel Hill
Department of Political Science
[email protected]

May 31, 2013

Abstract

For over a half-century, various fields in the behavioral and social sciences have debated the appropriateness of null hypothesis significance testing (NHST) in the presentation and assessment of research results. A long list of criticisms has fueled a recurring “significance testing controversy” that has ebbed and flowed in intensity since the 1950s. The most salient problem presented by the NHST framework is that it encourages researchers to devote excessive attention to statistical significance while underemphasizing substantive (scientific, contextual, social, political, etc.) significance. What might best serve as a diagnostic tool for distinguishing signal from noise continues to be mistaken, far too often, for the primary result of interest. Our foremost goal in analyzing data ought to be ascertaining the other type of significance: measuring and interpreting relevant magnitudes. I introduce a simple technique for giving simultaneous consideration to both forms of significance via a unified significance test (UST). This allows the political scientist to test for actual significance, taking into account both sampling error and an assessment of what parameter values should be deemed interesting, given theory.

1 Introduction

It may or may not come as a surprise that many scientists agree with Meehl (1978) that
“excessive reliance on significance testing is a poor way of doing science,” leading to theories that “lack the cumulative character of scientific knowledge, . . . [tending] neither to be refuted nor corroborated, but instead merely fad[ing] away as people lose interest.” Meehl’s insistence that “the almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in [certain areas of psychology] is a terrible mistake, basically unsound, . . . and one of the worst things that ever happened in the history of psychology” (p. 817, my emphasis) may seem an exaggeration, but to the extent that social and behavioral sciences fetishize the rejection of point null hypotheses, instilling the practice with importance out of all proportion to its true contribution, the spirit of the statement is apt. What is at issue is not the scientific practice of generating hypotheses from theory, then utilizing data as evidence in sorting through these hypotheses. Rather, it is the peculiar way this typically plays out in practice that falls well short of what science requires. More than fifteen years have passed since a group of psychologists imagined: “What if there were no significance tests?” in their edited volume of the same name (Harlow, Mulaik and Steiger, 1997). Their rhetorical query may as well have been a flight of fancy on the order of John Lennon’s entreaty to “imagine no possessions.” A more modest, realistic question might be posed: What if significance tests were more meaningful, more, well . . . significant? Regardless of one’s philosophical perspective on statistical inference (i.e., even if one is unwilling to stray from a conservative frequentist interpretation of probability), it is possible to do much better than reflexively reporting statistical significance, signs, and p-values, and making these the focus of discussion. We have become deeply accustomed to declarations of statistical significance and asterisks next to parameter estimates and, while such rituals are regularly misconstrued or distract from more meaningful conversation about the data, they may play a role in communicating simple summaries of findings. After all, we ideally write for more than one audience, wishing to reach scientist and layperson alike, and even scientific audiences will approach different studies with different purposes, often wishing to start with a cursory yet meaningful look at results. As it stands, much of the resistance to simply banishing p-values and tabular asterisks altogether is motivated not by laziness, nor by a desire to have authors spoon-feed us results, as more cynical critics may charge, but by a desire for commonly accepted heuristics in assessing research results. The problem is that p-values and asterisks aren’t quite suited to the task. Moreover, rather than serving as an invitation to the reader to delve more deeply into the substantively meaningful results,
secure in the knowledge that apparent patterns are not likely attributable to sampling error, these heuristics are often offered and accepted as a substitute for interpretation of magnitudes in context, an end rather than a beginning to the conversation. In what follows, I propose a simple way to integrate statistical and substantive (a.k.a. scientific, contextual, social, political, economic, or real-world) significance, in the hope that this may restore some balance to analyses that have dwelled too heavily on the former. At the very least, I hope to encourage political scientists to join a conversation that has been prominent in the social and behavioral sciences generally and yet nearly absent from political science, beyond a few notable recent exceptions (Gill, 1999; Ward, Greenhill and Bakke, 2010; Esarey, 2010). In Section 2, I briefly review the most damning criticisms of null hypothesis significance testing (NHST), noting that the manner in which it is conventionally applied in political science serves as a distraction from—or poor substitute for—close interpretation of results. In Section 3, I propose a simple solution, in the form of a unified statistical and substantive significance test (UST) that overcomes the most troubling shortcomings of NHST. In Section 4, I conclude by suggesting that we pay attention to some recent developments in other fields, such as informative hypothesis testing and minimal important differences within health outcomes research, for tools that may serve us in more carefully setting up reasonable statistical hypotheses and justifying the conclusions we draw from the data we have.

2 The Controversy that Passed Us By: Debating the Merits of NHST

The function of statistical tests is merely to answer: Is the variation great enough for us to place some confidence in the result; or, contrarily, may the latter be merely a happenstance of the specific sample on which the test was made? The question is interesting, but it is surely secondary, auxiliary, to the main question: Does the result show a relationship which is of substantive interest because of its nature and its magnitude? Better still: Is the result consistent with an assumed relationship of substantive interest? (Kish (1959), reprinted in Morrison and Henkel (1970), emphasis Kish’s)

Some time ago, I attended a talk by a political scientist who was doing some research on teacher
training. After explaining his research design, in which teacher success would be operationalized using students’ scores on standardized exams, he presented a slide with a long list of coefficients estimated under several models, offering the standard apology for the vast sea of numbers. To wade through the slide, in keeping with ritual, he called our attention to one or two variables, which, like Dr. Seuss’s star-bellied Sneetches, had enough good fortune to be marked by the asterisks their companions were lacking. He then noted that, according to his results, parents might do well to ask whether their children’s teachers received in-state training. For, after controlling for a long list of other predictors, the estimated “effect” of in-state training was found to be positive and statistically significant. Asked about units, the presenter could only say that the scores were based on some composite standardized measure, but not how the numbers ought to be interpreted. As it turned out, the expected jump in test scores associated with a teacher being trained in-state was around one-fortieth of a standard deviation! Pressed on whether a parent should seriously be concerned by a (predicted) relationship so small, he conceded that the magnitude did not seem too large, but it was statistically significant, after all. This extreme deference to statistical significance, wherein the very term “statistical” is lorded over the audience as if to imply that it is but a more rigorous form of everyday significance, leads to opportunities for mischief and – even more perniciously – rewards laziness. At various points over the past half century or so, individual fields in the behavioral, health, and social sciences have grappled publicly with the issue of what role significance testing of hypotheses should take in the assessment of research results. Psychology, sociology, and economics have devoted volumes to the topic, dedicated special issues of journals to debating the various dimensions of the “controversy,” and even argued over whether policies for publication ought to be changed to reflect the limitations of conventional hypothesis testing (Morrison and Henkel, 1970; Harlow, Mulaik and Steiger, 1997; Altman, 2004).¹

¹ The history of the controversies surrounding the so-called null hypothesis significance testing procedure has typically been traced to passionate disagreements between the towering figures of early twentieth century statistics, Sir R.A. Fisher on one hand and J. Neyman and E.S. Pearson on the other. A thorough historical treatment of statistical significance and the roles of Fisher, Neyman and Pearson is provided by Gigerenzer, Swijtink and Daston (1990, pp. 79–109), with other good summaries found in Gill (1999) and Ziliak and McCloskey (2008). Fisher is often credited with giving us the notion of statistical significance, and indeed he emphasized it in his work and bears much of the responsibility for its eventual dominance, but the notion did not originate with him. In fact, some version of it appeared some two hundred years prior (Arbuthnot, 1710), though its role would be quite minor until Fisher’s work and popularization of his approach. A hybrid of Fisher’s “significance tests” and Neyman-Pearson “hypothesis tests” would come to be codified in a number of mid-twentieth-century teaching texts, and it is this approach, commonly referred to as “null hypothesis significance testing,” that dominates common practice in the social and behavioral sciences.

There is, frankly, much to dislike about the NHST approach to social science, and its shortcomings have been outlined comprehensively elsewhere (see, for example, Cohen (1994); Gill (1999); Ziliak and McCloskey (2008)). A few of the most troubling aspects of NHST are next discussed in brief.

Levels of significance are arbitrary and p-values misunderstood. The basic strategy of using the tail area of the null distribution—the probability, under H0, of observing a test statistic more extreme than the actual one—originated in an earlier test for identifying outliers (see an account in Gigerenzer, Swijtink and Daston (1990)). Critics often note the awkwardness of even the correct interpretation of p-values; according to the oft-repeated quip of Jeffreys, “What the use of P implies is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred” (Jeffreys, 1961). Indeed the typical manner in which social and behavioral scientists report statistical significance is an even less satisfactory compromise; two to four conventional choices for significance level α are considered, and superscripts are placed on the parameter estimates to indicate the lowest such α at which H0 would be rejected if we were to conduct a decision-oriented hypothesis test. Unfortunately, as Gelman and Stern (2006) put it in the title of their article, “the difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” Under the dominant contemporary approach to hypothesis testing in social science, one is presumed to make a discrete decision to either reject the null hypothesis in favor of the alternative or to fail to reject it. Choosing the significance level α in advance allows one to limit the probability of rejecting the null should it in fact be true. The p-value, or “attained significance level,” from the perspective of this approach, does not offer a measure of how significant the finding is, but might rather be considered a form of sensitivity analysis, allowing one to assess whether the result is robust to the arbitrary choice of α. And yet, we too commonly imbue p-values with qualitative meaning that is utterly unjustified. Even textbooks sometimes reinforce this misunderstanding; for example, one popular book would have readers interpret p-values as indicating that a result is “extremely significant,” “highly significant,” “statistically significant,” “somewhat significant,” “could be significant,” or “not significant” based on the interval in which it lies (Verzani, 2005). Even worse, this particular set of criteria seems to imply that if a p-value is small enough (less than 0.01), it somehow implies scientific significance, while p-values in the interval (0.01, 0.05] are indicative of results that are only statistically significant. The typical misrepresentation of p-values is less grotesque, involving an inversion of conditional probabilities. We seem to be constitutionally incapable of not treating Pr(data|H0) as Pr(H0|data). When told that the latter expression is meaningless within a frequentist framework—with no way to assign probabilities to hypotheses unless we take a Bayesian approach—we grasp at the former as the closest substitute for what we seek. One reason we so readily accept the inverted conditional probability as a substitute is the incorrect perception that modus tollens reasoning applies to probabilistic statements. In deductive logic, argument by contrapositive is always permissible: “If A, then B ⇒ If not B, then not A.” This does not extend to statements of the type “If A, then probably B ⇒ If not B, then probably not A.” Yet this is exactly the reasoning on which conventional NHST rests, what Falk and Greenbaum (1995) call “the illusion of probabilistic proof by contradiction” (quoted in Cohen (1994)). We rely on the unjustified argument that Pr(test statistic as extreme as that observed|H0) is small to convince us that H0 is unlikely, given the data from which the test statistic was calculated (Gill 1999, p. 653). The two propositions are not unrelated, but their relationship is more complicated than implied when we rest inferences on this type of reasoning.
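
The gap between Pr(data|H0) and Pr(H0|data) is easy to make concrete with a small simulation. The sketch below, in Python, is purely illustrative: the share of true nulls, the effect size, and the per-group sample size are arbitrary assumptions rather than estimates from any literature. It simply shows that the proportion of “significant” results for which the null hypothesis is in fact true need not be anywhere near α.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_studies = 20000        # hypothetical batch of studies (assumption)
share_true_null = 0.7    # assumed fraction of studies in which H0 is exactly true
n = 50                   # observations per group in each study (assumption)
effect = 0.3             # true standardized effect when H0 is false (assumption)
alpha = 0.05

null_true = rng.random(n_studies) < share_true_null
p_values = np.empty(n_studies)
for i in range(n_studies):
    mu = 0.0 if null_true[i] else effect
    x = rng.normal(0.0, 1.0, n)   # "control" group
    y = rng.normal(mu, 1.0, n)    # "treatment" group
    p_values[i] = stats.ttest_ind(x, y).pvalue

significant = p_values < alpha
# Pr(reject | H0 true) is close to alpha by construction...
print("false-positive rate among true nulls:", significant[null_true].mean())
# ...but Pr(H0 true | reject) is a different quantity altogether.
print("share of 'significant' results where H0 is true:", null_true[significant].mean())

Under these particular assumptions, well over a fifth of the rejections come from studies in which the null is true, even though the test’s nominal error rate is five percent.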

The null hypothesis never has a fighting chance. A common metaphor in teaching NHST is to say that it is a trial in which the null hypothesis is assumed innocent until proven guilty. It is difficult to justify why the null hypothesis is afforded this special treatment in social science. In jurisprudence, the asymmetry reflects an implicit loss function that attributes greater regret to jailing the innocent than to freeing the guilty. On what basis does H0 earn such protection? And if we can never, regardless of the quality or abundance of our data, find in favor of H0—“all you can conclude is that you can’t conclude that the null was false,” in the words of Gill (1999)—then why should we be impressed when we find in favor of H1? After all, as John Tukey bluntly notes, “All we know about the world teaches us that the effects of A and B are always different—in some decimal place—for any A and B. Thus asking ‘Are the effects different?’ is foolish” (Tukey, 1991). The authors of a popular—yet rigorous—textbook on probability and statistics put it this way: From one point of view, it makes little sense to carry out a test of the hypotheses [H0 : µ = µ0 vs. H1 : µ ≠ µ0] in which the null hypothesis H0 specifies a single exact value µ0 for the parameter µ. Since it is inconceivable that µ will be exactly equal to µ0 in any real problem, we know that the hypothesis H0 cannot be true. Therefore H0 should be rejected as soon as it has been formulated (DeGroot and Schervish, 2002, p. 481). In fact, from the Bayesian perspective preferred by the text’s authors, this notion is formalized by the observation that the probability of simple hypothesis H0 being true is 0, so there is no need to even consider data in order to reach the trivial decision against the null. Treating the null hypothesis as a straw dog leads to such absurdities as being less sure of results as our sample size grows larger or, not infrequently, published articles with tables displaying significant coefficient estimates such as 0.000** (not to be confused with −0.000**).
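
Tukey’s point, and the 0.000** phenomenon, can be reproduced in a few lines. In the sketch below, the true difference, the standard deviation, and the sample sizes are invented for illustration: a difference of one-hundredth of a standard deviation, which few would call substantively interesting, earns an arbitrarily small p-value once the sample is large enough, while the estimated magnitude remains negligible.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_diff = 0.01   # a substantively negligible difference, in standard-deviation units (assumption)

for n in (1000, 100000, 1000000):   # observations per group
    x = rng.normal(0.0, 1.0, n)
    y = rng.normal(true_diff, 1.0, n)
    t_stat, p = stats.ttest_ind(x, y)
    print(f"n per group = {n:>9,d}   estimated difference = {y.mean() - x.mean():+.4f}   p = {p:.4g}")

Nothing about the underlying difference changes across the three runs; only the denominator of the test statistic does.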

NHST presents a false dichotomy and star-gazing provides a false escape. As others have pointed out, the use of decision-theoretic hypothesis tests without consideration of an appropriate loss function is a hollow practice. Consideration of expected loss given a decision requires a well-defined function relating each possible state of the world, together with each possible decision by the researcher, to the loss (or gain) associated with that pair. This is rarely mentioned in social science statistics, although at least one recent effort has made the loss function central to testing substantive significance (Esarey, 2010). Since even hypothetical costs and benefits remain invisible to the political scientist—indeed, loss or gain may often be restricted to intellectual insight—to posit a loss function strikes one as insincere, and the very depiction of political scientists as decision-makers borders on delusional. Most social scientists recognize that we are not really making decisions, though some of our
research may help inform decision-makers. So we try to have it both ways, maintaining the conceit of the hypothesis test as binary, while introducing sneaky ways to hedge our bets. A number of contemporary issues make the correct interpretation of levels of significance impossible in practice. As Schrodt (2006) puts it, “the ubiquity of exploratory statistical research has rendered the traditional frequentist significance all but meaningless. Alternative models can now be tested with a few clicks of a mouse . . . Virtually all published research now reports only the final tip of an iceberg of dozens if not hundreds of unpublished alternative formulations.” A related reason that α may not be what it seems is the well-known “file drawer problem” (Rosenthal, 1979). The overwhelming tendency to attribute special properties to numbers such as 0.01 and 0.05 may warp the state of our accumulated knowledge; though foundational figures such as Fisher recognized the importance of circulating well-designed studies regardless of whether a null hypothesis was rejected or not, publication bias in favor of null rejection and self-censorship means that our journals are likely littered with observations sampled from the tail of a null distribution. Finding statistically “significant” results tends to make further investigation less likely, while the reverse is true of null findings. The most serious scientific implication of NHST may be the sleight-of-hand through which the reader’s attention is misdirected to a relatively uninformative presentation of data. Nowhere is this more evident than in the invitation to “stargaze.” In fact, the overreliance on such cues to the reader takes what is worst about significance testing and simply accentuates it. Frank Yates was the first to utilize asterisks as a shorthand indication of whether a null hypothesis would have been rejected at common significance levels. According to a biographer, “Frank must later have regretted its possible encouragement of the excesses of significance testing that he would so often condemn!” Indeed, as Finney (1995) recounts, Yates would eventually “try to stem the tide of research publications that regard a row of asterisks, a correlation coefficient, or the result of a multivariate significance test as indicators of triumph in research or as sufficient summaries of findings from an experiment.” One problem is that the everyday meaning of asterisks leads to the conclusion that *** means these results should be considered scientifically significant rather than simply distinguishable from random noise. When samples are large enough, we are treated to a table in which most or all estimates are accompanied by asterisks, reading like an email in all capital letters, seeming to shout “IT’S ALL IMPORTANT!” while one without any seems to bemoan “we didn’t find anything :(”. All too often, what they really tell us can be found next to the letter n: “THIS IS A BIG SAMPLE!” or “this is a small sample.”

We’re completely missing the point. These legitimate concerns are subsumed by – or even symptomatic of – a deeper problem concerning the very role of statistical significance in the presentation of empirical research. Should a finding of statistical significance (for some parameter of interest) be the focal point of the researcher’s presentation, a climax to which the rest of a scholarly paper is building? Or shall it play a supportive role in service of other aspects of the findings? Put simply, what exactly are we trying to establish? As long as there has existed a technical concept of statistical significance, there have been entreaties to not confuse it with the everyday notion of significance (e.g., Boring (1919), p. 338). As one author, an educational researcher, put it early on, “differences which are statistically significant are not always socially important. The corollary is also true: differences which are not shown to be statistically significant may nevertheless be socially significant” (Tyler, 1931). Indeed, statistical significance is a property of a particular data set with respect to some hypothesis, while social (or scientific) significance is based upon our interpretation of the parameters suggested by the data together with our assessment of the underlying phenomenon itself. And while some (though still too few) of us have learned to clearly distinguish between the two types of significance by modifying the word “significant” as appropriate, or even replacing it altogether with more meaningful phrases such as “distinguishable from zero,” our continued fetishization of statistical significance at the expense of social scientific significance reveals misplaced priorities. The influence of books by Fisher and his followers, together with the contributions of Neyman and Pearson and their adherents, produced a generation of social scientists for whom methodological training instantly brought to mind t-tests, χ²-tests, F-tests, and a slew of others, filling modern training manuals that would come to be derisively referred to as “cookbooks” by many. Even those closely associated with the rise of the NHST paradigm recognized the danger in this. In 1951, Frank Yates, one of the most widely respected statisticians of his time, himself a Fisherian, wrote an essay reflecting upon the influence of Fisher’s book, Statistical Methods for Research Workers (1925), on the trajectory of statistical science. Writing mostly of the ways in which the volume’s ideas had sparked “a revolution in the statistical methods employed in scientific research,” Yates concedes that “the emphasis given to formal tests of significance” has had some unsatisfactory consequences, one of which is that “scientific research workers [were led to] pay undue attention to the results of the tests of significance they perform on their data, . . . and too little to the estimates of the magnitude of the effects they are investigating” (Yates, 1951). To quote two of the most strident contemporary critics, “[s]tatistical ‘significance,’ once a tiny part of statistics, has metastasized” (Ziliak and McCloskey, 2008, p. 4), causing many of us to obsess over the signal-to-noise ratio in our data — even to the point of forgetting to ask what exactly we are measuring. With so many drawbacks (just a few of which are listed above), why has this form of significance testing survived and thrived? Yates (1951), blaming a methodological setting of “utmost confusion” at the time of Fisher’s major contributions, explains that “in the interpretation of their results research workers in particular badly needed the convenience and the discipline afforded by reliable and easily applied tests of significance.” The simplicity and concreteness offered researchers scaffolding on which to build reasonable and reliable habits. Nearly a century after Fisher, we may be ready to let some of that scaffolding fall away in order to discover more flexible approaches to statistical and scientific reasoning. I next propose an easily implemented step in this direction.

3 Unifying the Two Notions of Significance Within a Single Testing Framework

According to an educational psychologist, writing about the significance test controversy in
1998, “everyone in social-science academic circles seems to be talking about it these days,” so much so that the psychologist titled his article “What if there were no more bickering about statistical significance tests?” and pleaded “When do we stand up and say ‘Enough already!’? When do we decide that ample arguments have been uttered and sufficient ink spilled for us to stop talking about it and instead start doing something about it?” (Levin, 1998). The frustration expressed above—if NHST is so bad, then what exactly should we do about it?—is understandable. The truth is, there are a number of things we can do and all are better than the status quo. The best practices of political scientists already render the issue somewhat irrelevant by including comprehensive and often creative discussion and visualization of results. Bayesian methods, which eschew the problems of frequentist significance testing and invite deeper substantive analyses, are increasingly embraced by social scientists. Graphical representation of confidence intervals, accompanied by careful interpretation of estimated parameters in the range of plausible values, has become more common, though not yet the norm. Authors with a sophisticated grasp of statistical methods, and a deep understanding of what these tools can and—just as important—cannot tell us, seem to find ways to appropriately and imaginatively communicate their results to their audience. My proposal, however, does not concern best practices, which vary according to setting, but rather acceptable standard practices. That is, what should we expect at minimum as a point of departure for meaningful discussion of statistical results? What should we consider an acceptable default or a convenient shortcut to the salient results in a paper (a role currently filled by tables of point estimates with asterisks)? And how should we advise referees to appraise submissions so that statistical significance serves substantive analysis rather than overshadowing it? When conducting hypothesis tests, what researchers typically have in mind is not a literal interpretation of the null hypothesis as a single point hypothesis, but rather that “the value of µ is close to some specified value µ0 against the alternative hypothesis that µ is not close to µ0,” as DeGroot and Schervish (2002) put it (p. 481). It thus may make more sense to replace the idealized simple hypothesis with “a more realistic composite null hypothesis, which specifies that µ lies in an explicit interval around the value µ0” (p. 482). It is this suggestion that I take as the basis for a unified significance test that simultaneously takes statistical and substantive (or practical) considerations into account (see also pp. 518–20, 529–30).

One may reasonably protest that such a procedure requires an arbitrary choice of the length of the interval constituting a composite null set. DeGroot and Schervish (2002, p. 519) recommend a posterior probability plot for different values of the interval’s diameter, but no such plot is available to the frequentist. It is nonetheless possible to conduct analyses of sensitivity to the choice of diameter as a comparable robustness check. An explicit discussion of what difference or relationship would be meaningful is an essential, but frequently overlooked, task for the researcher. As DeGroot and Schervish (2002) write, “[f]orcing experimenters to think about what counts as a meaningful difference is a good idea. Testing the (simple) hypothesis . . . at a fixed level, such as 0.05, does not require anyone to think about what counts as a meaningful difference” (p. 520). This observation cuts straight to the heart of why the conventional NHST approach encourages bad habits. Indeed, demanding that political scientists articulate and even debate what would constitute a meaningful effect in the context of their particular research problems, rather than encouraging arbitrary cut-points, puts the emphasis back on experts’ subject-area knowledge; political scientists should welcome the opportunity rather than shrink from it.

3.1 Unified Significance Testing: An integrated approach to detecting meaningful magnitudes in context in light of sampling error

Given parameters of substantive interest, be they real-world quantities (e.g., the difference between mean incomes for two subpopulations) or quantities whose meaning is derived only within a proposed model (e.g., a Poisson regression coefficient), a researcher wishing to simultaneously test for statistical and substantive significance should begin by declaring a set of parameter values to be taken as effectively null. This should be based, when feasible, upon context and defended by the author. Such a choice should emerge from reflection on the question: if one could know precisely the “true” value of a parameter, what values would seem inconsequential and which would seem worthy of note? The resulting null set would include any value that seems practically indistinguishable from the (sharp) null value, or effectively null. The very process of thinking this through and the resulting conversation with others would itself be a healthy development. Indeed, while the precise distinction between what values are effectively null and which ones are of interest may be
somewhat arbitrary, thoughtful consideration should in most cases reveal a range of values that all knowledgeable individuals would take to be effectively null and a range of values that anyone would consider noteworthy. In the examples provided, I will illustrate how one may go about proposing an effective null set, as well as the usefulness of conducting a simple sensitivity analysis in order to indicate how robust one’s results are to both this partition of the parameter space and the chosen level of confidence/statistical significance. The effective null set may be used as a heuristic for the reader who wishes to get a quick sense of what the authors purport to be of value in their results. Suppose, for example, we wish to know whether two groups of laborers, comparable other than with respect to gender, earn the same hourly wage, on average. We are unlikely to care if the true difference is only a few cents, even if this difference were known with absolute certainty. Suppose we declare the effective null set to be Θ0 = [−$0.25, $0.25], so that any discrepancy of twenty-five cents or less is considered inconsequential or insufficiently notable to merit intervention. Then once a confidence interval is constructed from the data (at the preferred level, say 95%), a simple qualitative distinction may be drawn:

1. The confidence interval may lie entirely outside the effective null set, in which case we may say that the difference is meaningful (or substantively/scientifically significant) at 95% confidence.

2. The confidence interval may lie entirely within the effective null set, in which case we may say that there is no effective difference, at 95% confidence.

3. The confidence interval may overlap the effective null set, in which case we may say that it is inconclusive whether there is a meaningful difference at 95% confidence.

To be clear, this is still an oversimplification of the results, but at least an oversimplification that points in the direction of what the reader cares about. If the confidence interval lies mostly within the null set, we might say the evidence leans against a meaningful difference; if barely overlapping Θ0, we might say the difference is likely meaningful. Note, this notion of what is to be taken as meaningful incorporates both statistical and substantive significance. Having an agreed-upon shorthand that may draw the casual reader to closer inspection could be helpful.
It is not the oversimplification represented by p-values and asterisks per se that threatens scientific understanding, but rather that such features draw focus away from what matters most. In a UST presentation, the simplification addresses what is of scientific interest (the magnitude of a parameter) while simultaneously providing evidence as to whether we can have some confidence that a seemingly meaningful result is not a phantom. A unified approach to statistical and practical significance really just constitutes formalized recognition that without substantive reasoning, signal-to-noise ratios are devoid of meaning. A key virtue of such a framework is that it forces an explicit declaration of what the scientist would consider a meaningful, or interesting, result. A good practice is to do the following: ask yourself, if you could have perfect and infinitesimally precise knowledge of a parameter’s value, with zero sampling error, would you know whether you should be impressed with this value? If not, then any inference from a sample is a waste of time. Thus, when employing a testing paradigm, the declaration of thresholds of real-world significance is a necessity. Two other troubling aspects of NHST are addressed by the unified significance testing framework outlined below. First, by never forcing a point null hypothesis to compete with a composite (interval) research hypothesis, one abandons the aforementioned practice of using H0 as a straw dog that is known to be false before data are even examined. Instead, it will be at least hypothetically possible to legitimately find in favor of the null hypothesis. As n gets large, the width of the resulting confidence interval will shrink until it lies entirely in either the effective null interval or the alternative. Finally, the awkward and disingenuous embrace of one-sided hypotheses of the form H0 : θ < θ0 vs. H1 : θ = θ0 is banished as counterintuitive and unjustified. The use of one-tailed tests has itself long been the subject of controversy (see, e.g., Eysenck (1960)), in part because of the nagging suspicion that they are employed more out of a desire to compensate for poor power to reject the null than for theoretically driven reasons. They would be replaced by one-sided hypotheses that are genuinely commensurable: H0 : θ < θ0 vs. H1 : θ ≥ θ0.

In Figure 1, I plot confidence intervals for hypothetical results one might obtain in addressing the wage difference question. Since the results are imagined, let’s suppose, for the sake of concreteness, that they are 95% confidence intervals (one could also superimpose two or three confidence intervals with different α’s). I have indicated with dashed segments the thresholds of my effective null set and $0 as the corresponding sharp null from a conventional analysis. For each interval, compare how results would be interpreted under the proposed unified significance test versus how the corresponding NHST would be interpreted. In neither case should the researcher be satisfied with simply reporting direction and significance. Such a simplified summary of results is useful, but cannot replace a discussion that considers the range of plausible values falling in a confidence interval and interprets their meaning.

Figure 1: Interpretations of confidence intervals according to a unified substantive and statistical significance approach, where a difference of up to 25 cents per hour has been deemed inconsequential (and the corresponding interpretation under conventional NHST).

A. NHST would fail to reject the null hypothesis of no difference, while UST allows us to find evidence of no meaningful difference with 95% confidence (or, equivalently, at significance level .05).

B. Under NHST, one would reject H0 and find a difference in wages to be statistically significant; under UST, the finding still favors no meaningful difference in wages, at α = .05.

C. Under NHST, one would again reject H0 at α = .05, while the corresponding unified test would be inconclusive (though the researcher would indicate that all values in the interval were positive, including both notable values and ones only trivially positive).

D. The NHST and UST both find a significant difference; in the case of the former, the difference may only be called statistically significant, while by the latter, one may say that a meaningful difference is detectable.

E. While the null hypothesis is rejected under NHST (a sensible outcome), UST finds inconclusive results. Sensitivity analysis would confirm what is evident in the picture—a slight change in either α or the defined Θ0 would allow rejection of the null on both statistical and substantive grounds. Furthermore, the wide confidence interval includes only a sliver of the effective null set, but a wide range of consequential values, meaning a nuanced consideration of the results should grant the researcher some legitimacy in claiming evidence of a meaningful wage difference and calling for a study with a greater number of observations to gain precision.

F. Note that NHST, while rejecting the null for interval E, would not reject it for F. UST would again claim inconclusive results, defensible due to the lack of precision.
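
The three-way reading of a confidence interval against the effective null set reduces to a few comparisons. The sketch below shows how the verdicts for the wage example might be produced mechanically; the function name and the interval endpoints are invented for illustration and are not the intervals drawn in Figure 1.

def ust_verdict(ci_low, ci_high, null_low=-0.25, null_high=0.25):
    """Classify a confidence interval against the effective null set [null_low, null_high]."""
    if null_low <= ci_low and ci_high <= null_high:
        return "no meaningful difference"   # interval entirely inside the effective null set
    if ci_high < null_low or ci_low > null_high:
        return "meaningful difference"      # interval entirely outside the effective null set
    return "inconclusive"                   # interval overlaps the boundary of the null set

# Hypothetical 95% intervals for the hourly wage gap, in dollars (illustration only).
for label, (lo, hi) in {"A": (-0.10, 0.15),
                        "B": (0.05, 0.20),
                        "C": (0.10, 0.60),
                        "D": (0.40, 0.90)}.items():
    print(label, (lo, hi), "->", ust_verdict(lo, hi))

The second interval illustrates the divergence from NHST: it excludes zero, so a conventional test rejects the sharp null, yet it sits entirely inside [−0.25, 0.25], so the unified test reports no meaningful difference.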

Stated precisely, in algorithmic form, a unified significance test includes the following steps, assuming a continuous parameter space:

Unified Substantive and Statistical Significance Test (UST)

I. Partition the parameter space Θ for each estimate of interest into Θ0, an effective null set, and Θ1, a set of meaningful values, both non-countable sets corresponding to composite hypotheses.

II. Defend the choices of each partition either through topic-specific theory or everyday explanation.

III. Estimate parameters using confidence intervals.

IV. For an estimated 1 − α confidence interval C,

(a) find in favor of the null hypothesis of no meaningful “effect” if C ⊂ Θ0, with 1 − α confidence,

(b) find in favor of the alternative hypothesis of a meaningful “effect” if C ⊂ Θ1, with 1 − α confidence,

(c) or declare the result inconclusive at 1 − α if C ∩ Θ0 ≠ ∅ and C ∩ Θ1 ≠ ∅, i.e., if the confidence interval overlaps the two sets of values.

V. Employ sensitivity analysis to determine whether other reasonable partitions of Θ or choices of α would have affected the outcome of the test.

VI. Discuss and interpret the confidence intervals in context, noting the range of likely effect sizes. In particular, if the result must be declared inconclusive at the selected α, the analyst should, for example, distinguish between a fairly precise confidence interval containing parameter values either in or close to the effective null set and one that is wide (less precise), but containing mostly values considered to be substantively significant and perhaps even large.
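
Steps IV and V lend themselves to a brief computational sketch. The numbers below are placeholders rather than results from any study discussed here; the point is only that the verdict can, and should, be reported across a grid of defensible choices for the effective-null radius and for α.

import itertools
from scipy import stats

def ust_verdict(est, se, null_radius, alpha):
    """Step IV: classify the (1 - alpha) confidence interval against Theta_0 = [-null_radius, null_radius]."""
    z = stats.norm.ppf(1 - alpha / 2)
    lo, hi = est - z * se, est + z * se
    if -null_radius <= lo and hi <= null_radius:
        return "no meaningful effect"
    if hi < -null_radius or lo > null_radius:
        return "meaningful effect"
    return "inconclusive"

# Placeholder summary statistics: an estimated wage gap of $0.42/hour with SE $0.12.
est, se = 0.42, 0.12

# Step V: sensitivity of the verdict to the partition of Theta and to the choice of alpha.
for radius, alpha in itertools.product((0.10, 0.25, 0.50), (0.10, 0.05, 0.01)):
    verdict = ust_verdict(est, se, radius, alpha)
    print(f"Theta_0 = [-{radius:.2f}, {radius:.2f}], alpha = {alpha:.2f}: {verdict}")

A verdict that survives every reasonable cell of such a grid is worth far more than one that depends on a single, conveniently chosen threshold.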

4 Two Illustrations of Unified Significance Testing

A major problem involved in adjudicating the scientific significance of differences is that we often deal with units of measurement we do not know how to interpret (Carver, 1978).

I shall next illustrate how unified significance testing may enrich the presentation and interpretation of results. In certain instances, the UST framework for considering results may strengthen authors’ arguments; in other cases it makes more obvious the tentativeness with which the results should be viewed. Most importantly, however, a unified approach to significance shifts the emphasis to practical significance, giving it the focus it deserves.

4.1 Media Effects on Public Opinion: Support for Military vs. Diplomatic Response as a Function of Exposure to Television News

Within their article exploring the three most widely studied types of media effects (agenda-setting, priming, and framing) in the context of the lead-up to the Persian Gulf War of 1990-91, Iyengar and Simon (1993) consider whether exposure to television news may predict support for a military response to Iraq’s invasion of Kuwait and the subsequent crisis. Having studied eight
months of prime-time newscasts during the relevant time interval, they note the predominant use of episodic over thematic framing; based on extant theory and previous research, they suspect that this will lead to viewers’ attribution of responsibility to particular individuals and groups rather than broader historical, societal or structural causes, and anticipate that this will translate into support for the use of military force against Saddam Hussein rather than diplomatic strategies by those who consume such media. They regress a variable measuring respondent support for a military over diplomatic response on several predictors, via OLS. According to the logic of a unified significance testing approach, it is essential that one consider what magnitudes of coefficients would be impressive if one were able to observe the parameter values themselves without sampling error. Iyengar and Simon are primarily concerned with the expected effect on military support corresponding to variation in the values of TV News Exposure and Information, measures of, respectively, television news consumption and awareness of political information via identification of political figures in the news. Controlling for party, gender, race, education and general support of defense spending, what sort of coefficients should we view as effectively zero and, conversely, what values would indicate at least a somewhat meaningful relationship? This sort of question, as noted before, is too often left unasked. To the extent that it does arise, it is almost always handled completely informally. To their credit, the authors here distinguish between the two types of significance: “Overall, then, there were statistically significant traces of the expected relationship. Exposure to episodic news programming strengthened, albeit modestly, support for a military resolution of the crisis.” From a unified significance testing—rather than NHST—perspective, the assessment of the degree to which this type of programming corresponds to greater military support is of principal concern.

The Iyengar-Simon model may be written as:

MilitarySupport = β0 + β1 TVnews + β2 Info + β3 (Male × Info) + β4 (nonWhite × Info)
                  + β5 Male + β6 nonWhite + β7 Republican + β8 DefenseSpend + β9 Educ + ε     (1)

The predictors of primary interest are TVnews, the number of self-reported days per week watching TV news, and Information (or Info), the respondent’s score from 0 to 7 on a quiz of recognition of political figures, taken as another proxy for news consumption. In Iyengar and Simon’s model, the contribution of Info (but not TVnews) is allowed to vary by race and gender (through the inclusion of interaction effects), so that one might wish to discover whether the following parameters are of a meaningful magnitude:

δ1 = β2 = effect of Info among White Females
δ2 = β2 + β3 = effect of Info among White Males
δ3 = β2 + β4 = effect of Info among non-White Females
δ4 = β2 + β3 + β4 = effect of Info among non-White Males     (2)

For each demographic category, the associated coefficient is interpreted as the difference in expected level of support for a military solution associated with an additional point on the political knowledge quiz. Thus, for example, if δ1 = 0.25, this means that one might expect an extra correct answer on the quiz taken by a White Female to correspond to an additional quarter-point on the scale from 0 to 4, assuming that the ordinal scale of support for diplomacy vs. military action can be sensibly interpreted as if it were an interval-level measurement. A large difference of four points on the quiz (e.g., correctly identifying six rather than two political figures, or four rather than zero) would be expected to translate into a full unit increase in the support for a militaristic solution on the scale of 0 to 4.² Understanding this allows the researcher to set up reasonable expectations of what might be considered a truly meaningful “effect” and then evaluate whether the data support such a finding in light of sampling error. Following the steps outlined above, one would begin by declaring a reasonable null set. Three

² Specifically, one point was awarded if the respondent supported tougher military action going forward, rather than any of three less hawkish alternatives, and up to three points were awarded for the level of militarism expressed in response to the question of what the United States should have done as an original response to the Persian Gulf crisis.

Table 1: Iyengar and Simon (1993) Results on Exposure to Information/TV News as Predictors of Support for Military Response in Gulf [reprinted with permission (pending) from publishers]

Support for Military Rather than Diplomatic Response

Variable                       b        SE       p level
TV news exposure               0.02     0.01     .03
Information                    0.07     0.03     .03
Male × Information            -0.09     0.04     .02
Non-White × Information        0.10     0.06     .08
Male                           0.67     0.11     < .001
Non-White                     -0.76     0.13     < .001
Republican                     0.07     0.01     < .001
Defense spending (favor)       0.20     0.02     < .001
Education                      0.07     0.02     < .001
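
To give a flavor of how the estimates in Table 1 might be reread through a UST lens, the sketch below forms conventional 95% confidence intervals from the reported coefficients and standard errors for the two news-exposure measures and classifies them against an effective null set. The threshold used here, treating effects smaller than 0.05 points on the 0–4 support scale per unit of the predictor as inconsequential, is a hypothetical choice made only for illustration, not one defended in the original article; the interaction-based group effects are omitted because the coefficient covariances needed for their standard errors are not reported.

from scipy import stats

# Reported (b, SE) pairs from Table 1 for the two news-exposure measures.
reported = {
    "TV news exposure": (0.02, 0.01),
    "Information":      (0.07, 0.03),
}

null_radius = 0.05   # hypothetical effective null set: [-0.05, 0.05] on the 0-4 support scale
alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)

for name, (b, se) in reported.items():
    lo, hi = b - z * se, b + z * se
    if -null_radius <= lo and hi <= null_radius:
        verdict = "no meaningful effect"
    elif hi < -null_radius or lo > null_radius:
        verdict = "meaningful effect"
    else:
        verdict = "inconclusive"
    print(f"{name}: 95% CI = ({lo:+.3f}, {hi:+.3f}) -> {verdict}")

On this hypothetical partition, the statistically significant TV news coefficient would be reported as showing no meaningful effect, while the Information coefficient would be inconclusive, a distinction the asterisks in the original table cannot convey.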