[personal profile] blatherskite
A colleague directed me to a fascinating article on the notion of statistical significance. In case the link breaks, look for the 2015 editorial in Vol. 37, issue 1, of Basic and Applied Social Psychology, by David Trafimow and Michael Marks.

Briefly, the journal has decided to ban what is called "null-hypothesis significance testing". A non-rigorous description would be as follows: It's surprisingly hard to be certain that what you observe during an experiment is real, since it's always possible that something you observed occurred purely by chance. This recognition is what distinguishes scientific research from most other forms of knowledge gathering: science requires replication. If many other researchers from around the world repeat your experiment and get the same results, you can be confident that the result is probably real. The more confirmations you find, the more confident you can be. The more contradictions you observe and can explain (based on still more replicated research), the more confident you can be that you really understand what you're seeing and what factors might change the outcome of an experiment. There's far more depth to this subject, of course, but that's the basic notion.

When you perform the first research to answer a particular question, then by definition, no other researchers have replicated your results: you don't have results yet, and even in the Internet era of instant publishing, it will take considerable time for your results to reach others who might want to try replicating them. Thus, to provide a measure of confidence in your initial results, you replicate the research yourself: you repeat the experiment several times, ideally with a large number of individuals (whether plants, subatomic particles, or people) in each repetition. You can then use the mathematical techniques of statistics and probability to calculate the probability (the "P value") that a result at least as strong as yours would arise purely by chance if nothing real were going on. Nonscientifically, "shit happens", usually because some unnoticed but important characteristic of the experimental conditions causes a result that other researchers can't replicate because their experiments don't use those precise conditions; identifying the cause of that contradiction often leads to really interesting new understandings.
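
If you're curious what such a calculation looks like in practice, here's a minimal Python sketch of a generic two-group comparison. The choice of a t-test, the scenario, and all the numbers are my own invention for illustration; none of it comes from the editorial:

    # Illustration only: a generic two-sample t-test, the kind of calculation
    # that produces the P values discussed above. All data are invented.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # Pretend we measured some response in an untreated and a treated group,
    # 30 individuals per group.
    control = rng.normal(loc=10.0, scale=2.0, size=30)
    treated = rng.normal(loc=11.5, scale=2.0, size=30)

    t_stat, p_value = stats.ttest_ind(treated, control)

    # The P value is the probability of seeing a difference at least this large
    # if the two groups really came from the same population -- not the
    # probability that the effect is "real".
    print(f"t = {t_stat:.2f}, P = {p_value:.4f}")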

As a side note, many soi-disant "sciences" don't do a very good job of replication. Economics (before the advent of experimental economics) failed to meet the standard of evidence expected of more rigorous sciences, and this has led to many serious global problems, since seemingly plausible "theories" don't actually describe the underlying mechanisms of how homo economicus really thinks. Educational research has come under particular criticism for its lack of replication. However, professional organizations such as the American Psychological Association are aware of this issue, and are making efforts to educate their members about the problem and raise the level of their game.

An implicit question raised by the editorial is whether, given the sometimes shaky nature of sociological and psychological research, Basic and Applied Social Psychology might be caving in to pressure from its authors to let them publish shoddy research. The editors clearly indicate that this is not the case. Rather, the editorial is based on several very important points that go right to the heart of how we define the validity of knowledge:

First, far too many scientists don't understand the meaning of P values. The significance level of a statistical test is always a probability, and will never be anything more than that. A very low P value means that a result as strong as yours would rarely arise by chance alone, but even at the fairly rigorous level of P = 0.01, roughly 1 in 100 experiments in which no real effect exists can still be expected to cross that threshold, whether because of random effects that weren't controlled or because we don't fully understand the experimental system and allowed some crucial factor to vary in a way that affected the outcome. At the more commonly accepted level of P = 0.05, about 1 in 20 such "accidents" can be expected. Few of us would play Russian roulette with those odds. Why would we be any more confident in a research result?
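
A quick simulation makes the "1 in 20" point concrete. This is just an illustrative Python sketch of my own, not anything from the editorial: both groups are drawn from the same distribution, so every "significant" result it reports is a pure accident.

    # Simulate many experiments where there is NO real difference between
    # the groups, and count how often the test still declares "significance".
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_experiments = 10_000
    p_values = []

    for _ in range(n_experiments):
        # Both groups come from the same population: the null hypothesis is true.
        a = rng.normal(0.0, 1.0, size=30)
        b = rng.normal(0.0, 1.0, size=30)
        p_values.append(stats.ttest_ind(a, b).pvalue)

    p_values = np.array(p_values)
    print("fraction with P < 0.05:", (p_values < 0.05).mean())  # roughly 0.05
    print("fraction with P < 0.01:", (p_values < 0.01).mean())  # roughly 0.01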

Second, surprisingly many scientists don't seem to understand the difference between practical and statistical significance. For example, one might find that a 1% difference in some blood test result between two populations is statistically significant at P = 0.01 (a reasonably rigorous criterion). But if a 10% or 100% difference is required to improve health outcomes, the statistically significant result is meaningless for all practical purposes.
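
The same point can be shown in a few lines of Python. The blood-test framing and all the numbers below are hypothetical; the only point is that with enough subjects, even a trivially small difference yields an impressively small P value:

    # With very large samples, a 1% difference in means is "statistically
    # significant" even though it may be far too small to matter in practice.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 50_000  # subjects per group

    # True means differ by only 1% (100 vs. 101); individual variation is large.
    group_a = rng.normal(100.0, 15.0, size=n)
    group_b = rng.normal(101.0, 15.0, size=n)

    t_stat, p_value = stats.ttest_ind(group_b, group_a)
    diff = group_b.mean() - group_a.mean()

    print(f"observed difference: {diff:.2f} ({100 * diff / group_a.mean():.1f}%)")
    print(f"P = {p_value:.2e}")  # tiny P value; practical importance is another question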

There are other more abstruse mathematical points that I don't fully grok but that are no less important. Let's leave it at that rather than exploring those murky depths.

All this being said, tests of statistical significance remain important because they provide additional criteria that help us understand how much faith we should place in a given result. Basing our conclusion only on the P value is foolish; including the P value among our criteria for deciding whether a result is important is wise. On this logic, banning significance tests from journal articles is foolish. It makes about as much sense as banning the use of means (averages) on the logic that sometimes the median or mode is more important and relevant*. It would be far more sensible to instruct reviewers to look for the abovementioned problems and correct authors who make these and other mistakes related to statistical reasoning.

* The median represents the point at which half of the measured results are greater and half are smaller (more precisely, the midpoint of a frequency distribution). An example of why it matters: if the CEO of a company earns $10 million annually and all other workers earn less than $40 thousand annually, the mean will appear artificially high because of the distorting effect of the CEO's large salary, whereas the median still reflects what a typical worker earns. The mode is the value that appears most frequently. Using the same salary example, if all employees other than the CEO earn the minimum wage of ca. $10/hour, the mode will be $10/hour, not the much higher value that the CEO's salary would drag a mean toward.
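
For the concretely minded, here's the footnote's example in a few lines of Python. The salary figures are hypothetical, assuming 99 workers at roughly 2,080 working hours per year at $10/hour:

    # One CEO at $10 million and 99 workers at minimum wage: the mean is pulled
    # far above what anyone but the CEO actually earns, while the median and
    # mode both describe a typical worker.
    from statistics import mean, median, mode

    salaries = [20_800] * 99 + [10_000_000]  # 99 workers at ~$10/hr, one CEO

    print("mean:  ", mean(salaries))    # 120,592 -- distorted by the CEO
    print("median:", median(salaries))  # 20,800  -- the typical worker
    print("mode:  ", mode(salaries))    # 20,800  -- the most common salary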

In the Basic and Applied Social Psychology decision to ban the reporting of hypothesis-test results, as in so many other areas, we see an example of how draconian solutions are rarely the best choice. A more practical solution would be to remind peer reviewers and the journal's own editors to rigorously examine papers for logical flaws that result from misunderstanding the purpose of statistical testing and misinterpreting its results. This retains the value of statistical testing (providing an objective criterion) while accounting for the weaknesses of that criterion.

(no subject)

Date: 2015-02-28 04:12 pm (UTC)
From: [personal profile] pharis
This is blowing my mind. With a big caveat for my own ignorance (a bit less than the average person's, much, much greater than that journal's editors'), this seems deeply weird.

<< Second, surprisingly many scientists don't seem to understand the difference between practical and statistical significance. >>

Baaaaahhhhh. Bad scientists. Everyone should understand that.

<< Far more sensible to instruct reviewers to look for the abovementioned problems and correct authors who make these and other mistakes related to statistical reasoning. >>

Which is difficult and time-consuming. Do you think this is a response to flaws in (some) peer review, either in general or for this particular journal?

"Throwing the baby out with the bathwater"?

Date: 2015-03-01 05:42 pm (UTC)
From: (Anonymous)
So is the bottom line here that because P values and other tests of statistical significance cannot be used *in isolation* to judge the practical significance of a result, this journal bans their use altogether? That seems so odd that I'm wondering whether I missed something in the journal's rationale for its decision.

Ginger
