The vast majority of experimental work in behavioral and biomedical science involves group comparison—in the simplest case, an experimental and a control group. Average data are compared and the variability within each group is used to estimate the probability that any mean difference could have occurred by chance. The estimation method typically used, the Null Hypothesis Statistical Test (NHST) method, was devised by Ronald Fisher in a context to be discussed in a moment.
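The group-comparison logic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Fisher's own procedure: it uses a permutation test (repeatedly shuffling the group labels) rather than a t-test or analysis of variance, and the scores, the function name permutation_p_value, and its parameters are all invented for the example.

```python
import random

def group_mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)

def permutation_p_value(treatment, control, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Estimates how often a mean difference at least as large as the
    observed one arises when group labels are assigned at random,
    i.e. when the null (no-effect) hypothesis holds.
    """
    rng = random.Random(seed)
    observed = abs(group_mean(treatment) - group_mean(control))
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(group_mean(pooled[:n_treat]) - group_mean(pooled[n_treat:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical scores, invented purely for illustration.
treatment_scores = [23.1, 25.4, 22.8, 27.0, 24.6, 26.2, 25.1, 24.0]
control_scores = [21.9, 22.5, 23.0, 21.2, 22.8, 20.7, 23.4, 22.1]

p = permutation_p_value(treatment_scores, control_scores)
print(f"p = {p:.4f}; significant at the .05 level: {p < 0.05}")
```

The p-value here says only how surprising the group difference would be if the labels were arbitrary; it says nothing about any individual subject.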
There are serious problems going from group data to the properties of individuals (the object of inquiry problem: Staddon, 2019), but the first major problem encountered by the NHST method was replication, exposed in a landmark paper by John Ioannidis (2005). The replication problem is fixable and various solutions have been proposed. In July 2017 a letter to Nature Human Behaviour (Benjamin et al., 2017), signed by more than 70 statisticians, suggested that a solution to the NHST-replicability problem is to set the criterion (alpha level) for rejecting the null (no-effect) hypothesis at p = .005 rather than at p = .05, Ronald Fisher’s suggestion and the prevailing standard. The authors argued that a one-in-twenty (or smaller) chance of error is not good enough to accept the hypothesis that your treatment has an effect; the standard should be raised to one in two hundred.
Replicability would sometimes be improved by a tougher criterion, but a p-value this small would also eliminate much social science research that uses NHST; the publication rate in social and biomedical science would plummet. Partly for this reason, more than 80 scientists signed a November 2017 letter to Nature Human Behaviour (Lakens et al., 2018) rejecting the suggestion of Benjamin et al. They recommended instead “that the label ‘statistically significant’ should no longer be used” and concluded “that researchers should transparently report and justify all choices they make when designing a study, including the alpha [critical p-value] level.”
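Why a tougher criterion should help replicability can be shown with a little arithmetic. The sketch below computes the share of "significant" results that are false positives, using the standard relation false positives / (true positives + false positives). The prior probability that a tested hypothesis is true (0.1) and the statistical power (0.8) are assumptions chosen only for illustration, and holding power fixed is itself a simplification: at a fixed sample size, lowering alpha also lowers power.

```python
def false_positive_share(alpha, power, prior_true):
    """Share of statistically significant results that are false positives.

    alpha      -- probability of a significant result when there is no effect
    power      -- probability of a significant result when the effect is real
    prior_true -- assumed fraction of tested hypotheses that are actually true
    """
    true_positives = power * prior_true
    false_positives = alpha * (1 - prior_true)
    return false_positives / (true_positives + false_positives)

# Hypothetical values: 1 in 10 tested hypotheses is true, power is 0.8.
for alpha in (0.05, 0.005):
    share = false_positive_share(alpha, power=0.8, prior_true=0.1)
    print(f"alpha = {alpha}: {share:.1%} of significant results are false positives")
```

Under these assumed numbers the share of false positives falls from about 36 percent at p = .05 to about 5 percent at p = .005, but fewer results clear the bar at all.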
The situation seems to have settled down after that point (for summaries, see Baker, 2016, and Staddon, 2017, 2022). There have been some efforts at mitigation: pre-registering hypotheses, using larger groups, etc. But the NHST method continues to be widely used.
The emphasis in both these long letters is on the replication issue. But a little thought shows that the Fisherian method is, in fact, completely inappropriate for basic science. The reason is embarrassingly simple.
The NHST method was invented by Fisher when he was working in an applied environment, one where a decision had to be made between two fertilizers or other treatments of an agricultural plot, for example. Each fertilizer had a certain estimated net benefit, and the one with the significantly higher benefit was chosen. The cost of an error (a false positive that leads to choosing the worse fertilizer) is small and potentially measurable, so cost is not the issue. It is only necessary to answer the question: which fertilizer is probably better? For that choice, the 5 percent criterion is perfectly appropriate.
In basic science, the situation is very different. The choices are “confirm” or “don’t know,” and the cost of error is much higher. The benefit of correctly confirming a true experimental hypothesis (i.e., rejecting the null hypothesis) is a modest contribution to knowledge. But the cost of error, seeming to confirm a hypothesis that is in fact false (a false positive, or Type I error), may be very high, both for science and society. False positives, just like scientific frauds, can have very damaging effects (see, for example, Ritchie, 2020, and a post by “Andrew”, 2021). Follow-up studies, in some cases very many studies, will go down the rabbit hole and both waste time and, probably, generate more errors. And as recent reports (e.g., Randall et al., 2021) point out, the cost, both human and financial, of public policies based on a scientific error can be enormous.
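The asymmetry between the two settings can be made concrete with a toy expected-value calculation. Everything here is an assumption made for illustration: the benefit and cost units are invented, and p_wrong stands for the chance that a positive claim is mistaken, which is not the same quantity as the alpha level.

```python
def expected_value_of_claim(p_wrong, benefit_if_right, cost_if_wrong):
    """Expected net value of acting on a positive result."""
    return (1 - p_wrong) * benefit_if_right - p_wrong * cost_if_wrong

def break_even_error_rate(benefit_if_right, cost_if_wrong):
    """Largest chance of being wrong at which acting on the result breaks even."""
    return benefit_if_right / (benefit_if_right + cost_if_wrong)

# Hypothetical units. Fertilizer choice: an error costs about as much as a
# correct choice gains. Basic science: an error is assumed to cost 50 times
# the modest gain from a correct confirmation.
fertilizer = expected_value_of_claim(0.05, benefit_if_right=1.0, cost_if_wrong=1.0)
basic_science = expected_value_of_claim(0.05, benefit_if_right=1.0, cost_if_wrong=50.0)

print(f"fertilizer choice, expected value: {fertilizer:+.2f}")
print(f"basic science, expected value:     {basic_science:+.2f}")
print(f"break-even error rate at 50:1 cost: {break_even_error_rate(1.0, 50.0):.3f}")
```

With these invented numbers, a 5 percent chance of error leaves the fertilizer decision comfortably worthwhile but makes the basic-science claim a net loss; the tolerable error rate shrinks as the cost of a false positive grows.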
So the conclusion is simple: The Fisherian method is fine for deciding between two types of fertilizer; it is inadequate for deciding between truth and falsehood. It is simply wrong to rely on the NHST method in basic social or biomedical science.
What is the alternative? There is no obvious answer beyond human ingenuity. It is perhaps worth remembering that Hermann Ebbinghaus discovered the basic laws of memory using just one subject, himself, in studies that would not now meet the algorithmic standards of the NHST community. So not “Bend it like Beckham!” but “Think like Ebbinghaus”?
Baker, M. (2016) 1500 Scientists lift the lid on reproducibility. Nature 533, 452-454.
Benjamin et al. (2017) Redefine Statistical Significance. Nature Human Behaviour 2, 6-10. https://psyarxiv.com/mky9j/
Fisher, R. A. (1934) Statistical methods for research workers (5th edition). Oliver & Boyd. http://www.haghish.com/resources/materials/Statistical_Methods_for_Rese…
Ioannidis, J. P. (2005) Why Most Published Research Findings Are False. PLoS Medicine 2(8), e124.
Kupferschmidt, K. (2018) Researcher at the center of an epic fraud remains an enigma to those who exposed him. Science Aug. 17, 2018. https://www.sciencemag.org/news/2018/08/researcher-center-epic-fraud-re…
Lakens et al. (2018) Justify your alpha. Nature Human Behaviour 2, 168-171. https://psyarxiv.com/9s3y6/
Owens, B. (2018) Replication failures in psychology not due to differences in study populations: Half of 28 attempted replications failed even under near-ideal conditions. Nature, Nov. 19. https://www.nature.com/articles/d41586-018-07474-y
Randall, D., Kindzierski, W. & Young, S. (2021) Shifting Sands: Report I. https://www.nas.org/reports/shifting-sands-report-i/full-report#_ftnref…
Ritchie, S. (2020) Science Fictions: How fraud, bias, negligence, and hype undermine the search for truth. Metropolitan Books.
Staddon, J. (2017) Scientific Method: How science works, fails to work or pretends to work. Taylor and Francis.
Staddon, J. (2019) Object of Inquiry: Psychology’s Other (Non-replication) Problem. Academic Questions 32, 246-256.
This topic is further discussed in my new book, Science in an Age of Unreason (Regnery, 2022).