In the first issue of the new, free, and online journal Erasmus Journal for Philosophy and Economics, Aris Spanos reviews Ziliak & McCloskey’s The Cult of Statistical Significance (pp. 154-164; see my rambling review for other reviews and Ziliak’s homepage for more reviews, comments, and much more).
The review is interesting for several reasons. First, Spanos is an expert on econometrics and is highly qualified to understand and engage with Ziliak and McCloskey’s critique. Second, the review is highly critical of the book; Spanos seems to disagree with Ziliak & McCloskey on most central issues, and he likes neither their lack of alternatives to classical significance testing nor their rhetoric. Finally, the review is particularly interesting because Ziliak & McCloskey have been given the opportunity to reply to Spanos’s critique.
First, Spanos blows off some steam about Ziliak & McCloskey’s rhetoric:
[T]hey attempt to make their case using a variety of well-known rhetorical strategies and devices, including themes like battles between good vs. evil, and conceit vs. humility, frequent repetition of words and phrases like ‘oomph’, ‘testimation’, ‘sizeless stare’ and ‘size matters’, picturesque language, metaphor and symbolism, flashback, allusion, parody, sarcasm, and irony. Their discourse in persuasion also includes some ‘novel’ devices like cannibalizing quotations by inserting their own ‘explanatory’ comments to accommodate their preferred interpretation, ‘shaming’ notable academics who ‘should have known better’, and recalling private conversations as well as public events where notable adversaries demonstrated the depth of their ignorance [p. 155].
I’m particularly sympathetic to the above critique; I find the rhetoric in the book disturbing and, at times, simply bad. In parts, the use of visual effects is overdone, which is strange given McCloskey’s warnings about exactly that in her little gem Economical Writing. However, precisely because of McCloskey’s demonstrated wisdom on writing and rhetoric, I suspect that the overuse of visual effects, the insistent repetition, and the defamation of R.A. Fisher are intentional; readers (and professors in particular) are supposed to jump in their chairs, offended.
Spanos comments rather briefly on much of the book, but his review still gives a good idea of what The Cult is after and how it goes about it. Spanos’s review naturally focuses on what he’s most interested in: ‘various philosophical/methodological issues pertaining to the problem’ and what to do about it (p. 155).
Spanos is annoyed by the ‘nontechnical’ flavor of the book; he’d like to see alternative methods demonstrated to fix the ‘problem.’ Obviously, Spanos has worked extensively on related issues; he refers to his own work throughout (so either he is the world-leading expert on the subject, or he has an inflated idea of himself). Spanos claims that an important reason for the ‘confusion in the minds of practitioners concerning the appropriate use and interpretation of frequentist methods’ was that the philosophical foundations of the early development of the methods ‘left a lot to be desired’ (pp. 157-158). I’m not convinced. Ziliak & McCloskey argue that a mechanical test cannot substitute for scientific judgment. ‘Philosophical foundations’ cannot change that.
In the absence of any guidance from the statistics literature, practitioners in different applied fields invented their own favored ways to deal with these issues which often amounted to misusing and/or misinterpreting the original frequentist procedures […]. Such misuses/misinterpretations include, not only the well-known ones relating to the p-value, but also: (i) the observed confidence interval, (ii) the p-value curves, (iii) the effect sizes, (iv) the fallacy of the transposed conditional, (v) Rossi’s real type I error, (vi) Zellner’s random prior odds, and (vii) Leamer’s extreme bounds analysis [p. 158].
Spanos then claims that Ziliak & McCloskey too have misunderstood; they have misunderstood the methods they recommend to remedy the problem of significance testing. He goes on to briefly explain why the recommended methods won’t work. Unfortunately, some of his explanations are too short for me to understand. I’m convinced, however, that Spanos knows what he’s talking about and, in particular, that devices like confidence intervals and p-value curves won’t help much; I wondered myself how confidence intervals, based on standard errors, could alleviate the misuse of significance testing (which is exactly to calculate standard errors and compare the distance between an estimate and a hypothesized value in terms of the standard error).
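To make concrete why a confidence interval is just the significance test in different clothes, here is a minimal sketch of my own (all numbers made up): a 95% interval built from the standard error excludes a hypothesized value exactly when the corresponding two-sided test rejects it at the 5% level.

```python
# Illustrative, made-up numbers: an estimate, its standard error, and a
# hypothesized parameter value (e.g., zero effect).
estimate, se, hypothesized = 1.1, 0.5, 0.0
z_crit = 1.96  # normal critical value: 95% confidence / 5% two-sided test

# Confidence interval: estimate +/- z * SE
ci_low, ci_high = estimate - z_crit * se, estimate + z_crit * se

# Significance test: reject if the estimate is more than z_crit standard
# errors away from the hypothesized value.
rejects = abs(estimate - hypothesized) / se > z_crit

# The interval excludes the hypothesized value exactly when the test
# rejects: both procedures compare the same distance in the same units.
excludes = not (ci_low <= hypothesized <= ci_high)
assert rejects == excludes
```

The duality holds whatever the numbers: reporting the interval instead of the test repackages the very same standard-error comparison, which is why it is hard to see how it alone could cure the misuse.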
Spanos, too, misunderstands, however. As far as I can tell, he does so on p. 160, where he writes:
“A good and sensible rejection of the null is, among other things, a rejection with high power” (Ziliak and McCloskey 2008, 133). And “refutations of the null are easy to achieve if power is low or the sample is large enough” (p. 152).
No! No! You have it backwards. Rejection with high power is actually the main source of the problem of statistical vs. substantive significance, and ‘large enough sample sizes’ n go hand in hand with high power, not low.
Spanos misunderstands Ziliak & McCloskey when they write ‘if power is low or the sample is large enough’ (p. 152; my emphasis). They mean, as they write, or, not and, as Spanos seems to read.
Nonetheless, Spanos still claims that low power is desirable in a significance test. I presume he is correct; I was very confused by Ziliak & McCloskey’s discussion of power in statistical tests because my knowledge of it is limited and their explanations are not exactly textbook. Now, if I’m not wrong, Ziliak & McCloskey claimed that high power is necessary to avoid so-called Type II errors. Spanos doesn’t mention such errors when he claims they have misunderstood power; instead, he talks about Type I errors. Most unfortunately, Ziliak & McCloskey refrain from commenting on power in their reply to Spanos.
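For what it’s worth, here is how I understand the power/sample-size connection at issue, in a toy one-sample z-test of my own devising (not Spanos’s setup): for any fixed true effect, power rises with n, so a large enough sample rejects even a substantively trivial effect. That is the statistical-vs-substantive gap in miniature.

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(effect, sigma, n, z_crit=1.96):
    """Approximate power of a two-sided z-test of 'mean = 0'
    when the true mean is `effect` (illustrative toy setup)."""
    se = sigma / math.sqrt(n)
    shift = effect / se  # true effect measured in standard errors
    # P(|Z + shift| > z_crit) for Z ~ N(0, 1)
    return (1 - normal_cdf(z_crit - shift)) + normal_cdf(-z_crit - shift)

# A substantively tiny effect (0.01 standard deviations) is almost never
# detected at n = 100 ...
small_n = power(0.01, 1.0, 100)
# ... but with n large enough, power approaches 1: the test rejects,
# delivering statistical significance without substantive significance.
large_n = power(0.01, 1.0, 10_000_000)
```

So ‘large enough sample sizes’ do go hand in hand with high power; the two clauses in Ziliak & McCloskey’s ‘or’ are two routes to the same trouble, not independent conditions.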
Spanos ends his review with a discussion of statistical adequacy, which, according to him, is a more fundamental problem in applied statistics and which has been dealt with only recently, and to a large degree by himself (p. 163). He continues:
Where does this leave the authors’ concern with the problem of statistical vs. substantive significance? Shouldn’t they have known that, even if one had a credible procedure to address the problem, one couldn’t make any progress on the basis of statistically misspecified models [p. 163]?
I don’t understand. Do more fundamental problems acquit researchers of misusing significance testing? I don’t think so.
I conclude that Ziliak & McCloskey are right in their critique of Fisher and his methods, but that their suggested methods aren’t exactly watertight either. Spanos is alien (hostile, even) to Bayesian methods, which he doesn’t even bother to discuss in detail; I may end up putting my faith there.
(For the record, my current understanding of the main difference between Fisherian and Bayesian methods is that Fisherian methods assume God [or whatever; Nature] put the truth out there [a parameter value, for example]; they assume the one fixed truth exists and somehow treat the observations as random. Bayesian methods, instead, treat the truth [the parameter value of interest] as random [or floating, or whatever] and treat the observations as given. First, things change and stuff is not constant; the one and only truth doesn’t exist. Second, the observations are in fact the only given entity in a statistical problem, and I cannot help but find it funny to think of them as somehow random. Maybe I’m just not deep enough?)
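My crude understanding above can be put in code. This toy contrast (all numbers invented, standard conjugate normal-normal textbook machinery) estimates a normal mean both ways: the frequentist summary treats the mean as fixed and describes how the estimate would vary over repeated samples, while the Bayesian update conditions on the observed data and puts a distribution on the mean itself.

```python
# Made-up observations; sampling variance assumed known for simplicity.
data = [1.2, 0.8, 1.5, 1.1, 0.9]
n = len(data)
sigma2 = 1.0

# Frequentist view: the true mean is a fixed unknown; the data are the
# random draw. Report the sample mean plus a standard error describing
# its variability over hypothetical repeated samples.
xbar = sum(data) / n
se = (sigma2 / n) ** 0.5

# Bayesian view: the data are fixed once observed; the mean itself gets a
# distribution. With a normal prior N(mu0, tau2), the posterior is normal
# with precision-weighted mean and variance:
mu0, tau2 = 0.0, 10.0  # a vague prior, chosen arbitrarily here
post_var = 1.0 / (1.0 / tau2 + n / sigma2)
post_mean = post_var * (mu0 / tau2 + n * xbar / sigma2)
```

The posterior mean is pulled slightly from the sample mean toward the prior mean; with a vaguer prior or more data the two views give nearly the same number, but they answer differently phrased questions.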
Next, Ziliak & McCloskey’s reply (pp. 165-170). (Warning: a lot of cutting and pasting is about to hit you: the reply is so well written that it best speaks for itself.) They start by repeating the gist of their book:
Over the past century the usual (and the conveniently mechanical) procedure devised by the great statistician, geneticist, and racial eugenicist R. A. Fisher has been shown to be scientifically silly again and again and again. Rarely has anyone actually defended NHST (null hypothesis significance testing). That is because it is logically indefensible. Statistical significance is neither necessary nor sufficient for substantive scientific significance. Everyone knows this, once they stop regressing for a minute and actually think [p. 165].
Obviously, they keep insisting on defaming Fisher; their rhetoric in the review, however, I find more balanced than that in the book.
Ziliak & McCloskey accuse Spanos of being angry and indignant. I find that peculiar: if anything, what I read in Ziliak & McCloskey’s defamation of Fisher is anger and indignation. I would also expect McCloskey to appreciate temperature in a debate. Furthermore, what do they expect when they do what they do to the icon Fisher in fact is? Again, I think Ziliak & McCloskey counted on stirring up feelings, and they use it for what it’s worth: they eloquently turn it against their attackers:
[T]he defenders [of significance testing] are always angry. Ignorant sneering, personal insult, and irrelevant indignation are judged acceptable when defending [significance testing]. We think the anger comes from a psychological tension. The defenders realize uneasily that it is strange to depend for scientific judgment on a sampling statistic without a persuasive context—failing to ask how big is big, which is the only scientific context relevant to a real scientific test. But they have been thoroughly indoctrinated in NHST, and belong to a professional club in which t > 2.0 or p < .05 or whatever is substituted for scientific judgment. The mechanical procedure of their profession is under attack. So they get angry. They have no reply. So they shout and bluster [pp. 165 – 166].
Next (p. 166), Ziliak & McCloskey sidestep the main part of Spanos’s attack:
Spanos throws up a lot of technical smoke that has the effect of obscuring the plain fact that he agrees with us. (The mathematics in his piece is irrelevant to anything of importance. The reader may omit it.)
Omit it! They’ve got some nerve. I still find it unfortunate that Ziliak & McCloskey avoid commenting on Spanos’s claims of errors and misunderstandings in The Cult. Anyway, they see Spanos’s main criticism differently:
The main point of Spanos’s piece is that Ziliak and McCloskey do not offer guidance on how to address substantive scientific significance. Yet even if we had not, it would not be a fault. NHST is intellectually bankrupt, as Spanos agrees it is, and it should be abandoned. If you earn your living robbing banks, you should stop, right now, at once. You should not complain, “But how am I now to earn my living?” Go get honest work. And the honest work in the present case is the exercise of scientific judgment, quantified by relevant magnitudes that the best scientists find persuasive [p. 167].
Good point. And they find it appropriate to pound their main message in The Cult some more:
There is no discipline-independent criterion for importance, calculable from the numbers alone. Read that again. There is no discipline-independent criterion for importance, calculable from the numbers alone. Scientific judgment is scientific judgment, a human matter of the community of scientists. As vital as the statistical calculations are as an input into the judgment, the judgment cannot be made entirely by the statistical machinery [p. 168].
Cannot disagree with that. What is an expert if he cannot offer judgment? An expert is not merely an operator of mathematical machinery; he knows best and should feel obliged to offer his judgment, not hide behind numbers.
Ziliak & McCloskey end with a challenge (and their ‘prior’):
Here is our challenge. If you think, like Spanos, that you have a valid defense of NHST, offer it. Spanos, like Hoover/Siegler, and Anthony O’Brien (2004), have tried. They have failed. But at least they are serious about their intellectual commitments, and believe (given their Bayesian priors) that NHST is defensible. It is not [p. 169].
Hat-tip: Homo Phileconomicus