Fall of the p paradigm: Modern methods for reporting linguistic research
Null Hypothesis Significance Testing (NHST) has been thoroughly discredited, yet ignoramus et ignorabimus, many linguists continue to report p values as if they provide their readers with something of value. In fact, “statistically significant” (often p < .05) is so often confused with practical importance that authors who use p (and journal editors who publish their research) can do very little with p to support the qualitative bona fides of otherwise very good research.
The question becomes, how do we report research without relying on the ubiquitous p? NHST is what we know, and often the only thing we know well. We have learned to interpret probability claims like p < .05, and statements like “was found to be significant” or “very significant,” or for something like p = .051 we might see “trending towards significance” or “is approaching significance”. As Geoff Cumming asks, “How do we know that p isn’t running away from significance as fast as it can?” What we do know is that these are all meaningless statements. Clearly, whether p is larger or smaller than your chosen alpha says nothing about the magnitude and direction of the effect you are trying to measure, nor does it affirm or deny a linguistic theory or hypothesis, probabilistic or otherwise. Simply put, p-values do not answer our research questions, and many very smart researchers really don’t know what p means.
What is it then? Rex Kline (2013) defines p < .05 to mean “that the likelihood of the data or results even more extreme given random sampling under the null hypothesis is < .05, assuming that all distributional requirements of the test statistic are satisfied and there are no other sources of error variance”. That is, p(D+ | H0), emphasizing that p is the conditional probability of the data, or data more extreme, under the null hypothesis of no difference (H0), given that all other assumptions are met, like perfect measurement. How many linguistic measurements can we call perfect? In any case, Kline’s definition might sound like statistical mumbo jumbo to many. Here’s what p was invented to do: to gauge how surprising the data would be if the process generating them were purely random. So, properly, p = .05 says that data at least this extreme would turn up about 5% of the time if the null hypothesis were true. It does not say there is a 5% chance that the process behind the data is random; that reading reverses the conditional into p(H0 | D+), a common misinterpretation.
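To make the conditional nature of p concrete, here is a minimal sketch of an exact permutation test in Python. It is an illustration only, not Kline’s procedure: the function name exact_permutation_p and the two groups of scores are made up for this example. The idea is to assume the null hypothesis (group labels are arbitrary), relabel the pooled scores every possible way, and count how often the difference in means is at least as extreme as the one observed.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means.

    Hypothetical helper for illustration: it enumerates every relabelling,
    so it is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))

    total = 0
    as_extreme = 0
    # Under H0 the labels carry no information, so every split of the
    # pooled scores into groups of these sizes is equally likely.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total


# Made-up scores for two small groups (not real data).
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]
p = exact_permutation_p(group_a, group_b)
print(f"p = {p:.4f}")  # 2 of the 252 relabellings are this extreme: p = 0.0079
```

Notice what the number is: the share of relabellings, generated under the null hypothesis, that look at least as extreme as the data. Nothing in the calculation estimates how likely the null hypothesis itself is, or how large the difference is.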
If you are compelled to report p, it is possible to mention it as part of your research preparation or pretesting procedure. For example, you could write:
“I found that my sample size did not give me a p value smaller than my alpha value which I set at 5%. Therefore, I increased my sample size from 22 to 32 participants before proceeding with my study. A sample size of 32 allowed me to cross my 5% alpha threshold, meaning that my sample size may help me get closer to the true population parameter when I perform my statistical analysis.”
As shown above, some elements of NHST might have value if they are used at the outset of your study. Major software programs like SAS include a sample size calculator, and some programs specialize in these and related calculations to help you get off to a good start. One free program is G*Power. These leverage what NHST is good for: planning. And, by the way, statistical power is only meaningful in the context of NHST, which we should abandon anyway.
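If you are curious what a sample size calculator is doing, the sketch below uses the standard normal-approximation formula for comparing two independent group means. It is an assumption-laden illustration, not the algorithm used by SAS or G*Power: the function name n_per_group is hypothetical, and dedicated programs that work from the t distribution will typically ask for one or two more participants per group.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group for a two-group comparison.

    d     : the smallest standardized mean difference you want to detect
    alpha : two-sided alpha level
    power : desired statistical power
    Uses the normal approximation n = 2 * (z_alpha + z_power)^2 / d^2.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Conventional "large", "medium", and "small" values of d, for illustration.
for d in (0.8, 0.5, 0.2):
    print(d, n_per_group(d))
```

With the defaults, this gives 25, 63, and 393 participants per group for d = 0.8, 0.5, and 0.2. The point for planning is how quickly the required sample grows as the effect you hope to detect shrinks.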
It is important to restate in no uncertain terms that NHST, if used at all, should only be part of planning your study; it should not be used for making conclusions about your linguistic hypotheses, or for comparing your study with somebody else’s study. It literally isn’t possible to make conclusions using p. Doing so evokes a cascade of fallacies as described by many authors including the aforementioned Rex Kline in Beyond Significance Testing: Statistics Reform in the Behavioral Sciences published by the APA (American Psychological Association) in 2013.
Now, what do you report?
As the APA recommends, we should report effect sizes and confidence intervals for all important statistics. The size of the effect, or effect size, is actually what researchers care about. Effect sizes answer research questions. Confidence intervals reveal how precise your statistics are when you account for sampling error, an unavoidable part of almost all linguistic research. It’s best to be honest about sampling error by reporting confidence intervals, and using them in your charts and graphics does exactly that. Each arm of a 95% confidence interval is the margin of error, which is roughly 2X the standard error. Confidence intervals are part of the estimation thinking that defines modern methods for reporting research.
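Here is a minimal sketch of how a confidence interval for a mean is put together, assuming a simple random sample. The function name mean_ci and the scores are made up for illustration. By default it uses a normal critical value (about 1.96 for 95%), which is an approximation; for small samples you would pass a t critical value from a table or your statistics program through the crit argument.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(sample, level=0.95, crit=None):
    """Return (mean, lower, upper) for a confidence interval around a mean.

    crit: optional critical value (for example from a t table). If omitted,
    a normal critical value is used, which is too narrow for small samples.
    """
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))  # standard error of the mean
    if crit is None:
        crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = crit * se  # the length of each arm of the interval
    return m, m - half_width, m + half_width


# Made-up scores (not real data).
scores = [118, 122, 120, 116, 124]

m, lower, upper = mean_ci(scores)
print(f"M = {m}, 95% CI [{lower:.1f}, {upper:.1f}] (normal approximation)")

# With only 5 scores, the t critical value for 4 degrees of freedom
# (about 2.776) gives a wider and more honest interval.
m, lower, upper = mean_ci(scores, crit=2.776)
print(f"M = {m}, 95% CI [{lower:.1f}, {upper:.1f}] (t critical value)")
```

Comparing the two printed intervals shows why small samples deserve the t critical value: the interval widens noticeably, which is exactly the sampling error you want your readers to see.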
For estimates of effect sizes, it is common to provide either Pearson’s r, Cohen’s d, or ratios. If you really want to be perfect, use the least biased measures: replace Pearson’s r with Omega-squared, ω², and replace Cohen’s d with Hedges’ g. What is bias? Bias means that your statistic, the number you generated (mean, effect size, regression coefficient, etc.), systematically lands on one side of the true population parameter you are trying to get at. For example, Cohen’s d is inflated by roughly 4% when your total sample size is 20, and by roughly 2% when it is 50. This means that the estimate of Cohen’s d, a standardized mean difference with no fixed upper or lower limit (unlike Pearson’s r, which runs from -1 to +1), is reporting a larger effect than what is real (note that zero means no effect). As you consider it further, you will notice that many statistical biases are against small sample sizes which, as it turns out, are common in some of the most meaningful linguistic research, especially in sociolinguistics and SLA (Second Language Acquisition).
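The relationship between Cohen’s d and Hedges’ g can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a replacement for your statistics program: the function names and scores are made up, d is computed with the pooled standard deviation of two independent groups, and g uses the common approximate correction factor 1 - 3 / (4 × df - 1), where df is the total sample size minus 2.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def hedges_g(a, b):
    """Cohen's d shrunk by the approximate small-sample correction factor."""
    df = len(a) + len(b) - 2
    correction = 1 - 3 / (4 * df - 1)
    return cohens_d(a, b) * correction


# Made-up scores for two small groups (not real data).
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]

print(f"d = {cohens_d(group_a, group_b):.2f}")
print(f"g = {hedges_g(group_a, group_b):.2f}")
```

The correction factor works out to about 0.958 for a total of 20 participants and about 0.984 for 50, which is where inflation figures of a few percent come from. The made-up data also show that d is not confined to the range of a correlation: these two groups differ by more than three pooled standard deviations.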
As for confidence intervals, they are just another tick on the report-what dialogue box in your statistics program. In prose they are reported as such:
“The sample mean was M = 120, 95% CI [115, 125]. Given the confidence interval, it was plausible that the sample was drawn from a population whose mean was somewhere between 115 and 125.”
Important: The statement above did NOT include the word “probability”. Remember, a single computed confidence interval does not give the probability that the population parameter lies inside it; the 95% describes how often intervals built this way would capture the parameter over many repeated samples. Curiously enough, confidence intervals are sometimes put through a frequentist sieve to produce statements like “the confidence interval captures the true population mean,” followed by a citation to somebody who is fettered to NHST. This should be avoided. It represents point thinking (an illusion of accuracy), not estimation thinking. Use the language above as sanctioned by the statisticians who invented confidence interval methodology, and refer to APA guidelines for other elements.
Unfortunately, if you want more information, our published statistics references within linguistics still lean on p-values too often and too heavily. Besides presenting NHST as if it were the only thing that exists in statistics, simple errors, like confusing alpha with p, seem to pop up every few pages in our references. Other issues crop up as well. For example, one noteworthy linguist who authored a book on statistics for linguistic research repeatedly confuses the mathematical framework for Analysis of Variance (ANOVA) with a later development, and then he cites a respected statistics book that did not make this mistake (you have to follow the reference to figure that out). Just be aware that these things are going on, and that it may behoove you to access statistics resources outside of linguistics. Perhaps the very best free resource for learning the ins and outs of NHST is Dr. Geoff Cumming’s YouTube channel. He’s an enjoyable speaker and exceptional communicator. Highly recommended.
In addition, here are two excellent books:
Discovering Statistics using IBM SPSS Statistics by Andy Field, 2013.
Understanding The New Statistics: Effect Sizes, CIs, and Meta-Analysis by Geoff Cumming, 2012. You can rent Dr. Cumming’s book from Amazon.com for under $14 (as of late 2017).
In conclusion, the paradigm of p as a standalone measure of importance has fallen, but many aren’t aware of it. It’s time for linguists to move forward with statistical methods that provide useful, meaningful, and honest information, that encourage replication and communication among researchers, and that actually answer our research questions.