Social Psychology and Multi-Lab Replications
The problems are real—but not as big as we feared.
Posted September 15, 2022 | Reviewed by Michelle Quirk
- A comprehensive review of multi-lab replications brings some reassuring news.
- Many ostensible failures did not falsify the hypothesis—they merely failed to test it.
- Low participant engagement may be a pervasive problem, as indicated by massive data discarding.
These are tough times for psychology research. The past decade has seen criticisms mount and collegial cordiality dwindle. The so-called replication crisis has been especially hard on social psychology. As far as I can tell, all fields of scientific study have problems with replication, but social psychology’s have been especially burdensome and troubling. Partly this stems from social psychology simply being more willing to face up to the problems. But another part is that the rate of high-profile replication failure has been more severe in social psychology. Brian Nosek’s groundbreaking Open Science Collaboration investigation found that findings from cognitive psychology had trouble being reproduced, but social psychology’s failure rate was considerably worse.
Some have turned to multi-site replications as a possible solution. At first blush, it seems like a great idea: Sign up a dozen or more labs that will all run the same experiment and combine the results. Yet social psychology’s success rate in these projects remains low. What gives?
Two colleagues and I embarked on an ambitious literature review, hoping to understand what the problem is. We assembled all 36 published articles reporting multi-lab or multi-site replication attempts in social psychology. (We also covered, separately, some other projects that had participants do a slew of mini-studies all at once.)
Only four of the 36 could be considered successful replications. By success, we meant that the replication found a statistically significant result in the same direction as the original. Only one of these matched the original effect size; all the others were smaller, often much smaller. A couple of others were classified as mixed successes, meaning that they reported several analyses, some of which were consistent with the original finding and others not.
Still, the most common pattern was that the original study found a significant result, while the multi-site replication did not. How can that be?
There is certainly a loud faction claiming that the original finding has been disproved and discredited. They might take this to mean that all the knowledge built up by several generations of social psychologists is garbage and should be ignored. Some critics carry this further, insinuating if not insisting that much of the original research is fraudulent.
But our view is more respectful and even optimistic. We assume most researchers have tried their best to carry out good research. That includes both the original researchers and the replicators—even though they generally come to different conclusions.
There is some fraud in science, but not much. The costs of fraud are enormous because there is a high likelihood of getting caught sooner or later—and getting caught means the death of one’s career. It discredits all of your work, even if you only cheated once. (It also hurts all your honest coauthors.) I think that’s why most fraudsters get started early, before they really appreciate what the eventual costs will be. Diederik Stapel reported that he began faking data in graduate school or soon after, in the hope of keeping up competitively with other scientists. Once you’ve cheated, there is less danger in doing it again. But I suspect that most scientists are honest, and by the time they get tenure, if not earlier, they live in fear that someone in their lab will fake something once and thereby discredit all the honest work their lab has ever done.
Biases Toward False Negatives
Our conclusion after reading all those multi-site replication attempts is that the method by which these studies are run contains some biases toward false negatives. That is, even if the original researchers found a true effect, it may fail in multi-site replications. There are several reasons for this.
One reason is that research participants in these giant projects seem not very engaged. They don’t care much about the experiment. Perhaps they mainly want to get it over with. They respond casually, indifferently. They aren’t emotionally involved in what’s happening, and so they often don’t really furnish data that test the hypothesis. This was not our initial idea or even on our list, but, as we read the literature, it gradually emerged as a major pattern.
A strong and disturbing sign of this lack of engagement is how much of the data is discarded. Typically, researchers set up (and preregister) criteria in advance for discarding participants who seem not to be paying attention, not following instructions, or not understanding the experiment. In traditional research, you might discard a participant or two on that basis, or five at most. But the multi-site reports often had shockingly high rates of discarding data, in some cases more than a thousand research participants. Sometimes a quarter, a third, even occasionally more than half of the participants were discarded. Half the data! In one case, the researchers had to discard two-thirds of participants in one condition, but only 7.5 percent in the control condition. They correctly pointed out that differential attrition is a serious confound, because your treatment group ends up containing a different type of people from your control group.
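To see why differential attrition matters, here is a minimal simulation (my own illustrative sketch with made-up numbers, not an analysis from our review). It assumes exclusion is correlated with a trait that also nudges the outcome, and mirrors the lopsided discard rates just described:

```python
import random
import statistics

random.seed(0)

N = 10_000  # participants per condition

# Each participant has a latent "engagement" score that also nudges the
# outcome measure. There is NO true treatment effect in this simulation:
# both conditions are drawn from the same distribution.
control = [random.gauss(0, 1) for _ in range(N)]
treatment = [random.gauss(0, 1) for _ in range(N)]

def exclude_lowest(scores, rate):
    """Drop the lowest-scoring fraction: a crude stand-in for
    engagement-correlated exclusion (disengaged participants score low
    and also fail the attention checks)."""
    return sorted(scores)[int(len(scores) * rate):]

# Mirror the differential attrition above: about 7.5% dropped from the
# control condition, two-thirds dropped from the treatment condition.
control_kept = exclude_lowest(control, 0.075)
treatment_kept = exclude_lowest(treatment, 2 / 3)

gap = statistics.mean(treatment_kept) - statistics.mean(control_kept)
print(round(gap, 2))  # a sizable spurious "treatment effect" appears
```

Because far more low-engagement people survive screening in one condition than the other, a large group difference emerges even though the treatment did nothing at all.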
Another crucial point is that many of the multi-site replications (but none of the originals) had nonsignificant manipulation checks. Thus, they failed to test the hypothesis. That raises questions—but clearly does not impugn the original finding. To illustrate, suppose a study manipulates anxiety, but the manipulation check shows that the high-anxiety treatment group was no more anxious than the control group. We cannot learn anything about the effects of anxiety from that finding. We may wonder why a manipulation that worked elsewhere fails to work again. (Low engagement could be a factor here, too.) But it does not dispute or discredit previous findings about anxiety.
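The logic of a failed manipulation check can be illustrated the same way, with another small simulation of my own (again, invented numbers). Here anxiety genuinely affects the outcome, yet when the manipulation fails to raise anxiety, the between-group comparison shows nothing:

```python
import random
import statistics

random.seed(1)

N = 5_000  # participants per condition

# Suppose anxiety genuinely drives the outcome (true effect = 0.5).
TRUE_EFFECT = 0.5

def run_study(manipulation_strength):
    """Return (manipulation check, outcome difference) for one study.

    The treatment condition shifts anxiety by `manipulation_strength`;
    the outcome depends on anxiety plus noise in both conditions.
    """
    control_anxiety = [random.gauss(0, 1) for _ in range(N)]
    treat_anxiety = [random.gauss(manipulation_strength, 1) for _ in range(N)]
    control_outcome = [TRUE_EFFECT * a + random.gauss(0, 1) for a in control_anxiety]
    treat_outcome = [TRUE_EFFECT * a + random.gauss(0, 1) for a in treat_anxiety]
    check = statistics.mean(treat_anxiety) - statistics.mean(control_anxiety)
    effect = statistics.mean(treat_outcome) - statistics.mean(control_outcome)
    return check, effect

# Original lab: the manipulation works, and the effect shows up.
print(run_study(1.0))  # check near 1.0, outcome gap near 0.5
# Replication with disengaged participants: the manipulation fails, so
# both numbers sit near zero, even though anxiety still matters.
print(run_study(0.0))
```

The null result in the second run says nothing about whether anxiety affects the outcome; it only says this particular sample was never made anxious in the first place.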
Bottom line: The multi-site replications give grounds for concern about social psychology. But they are not as dismal a verdict on the state of the literature as one might assume.
Baumeister, R.F., Tice, D.M., & Bushman, B.J. (in press). A review of multi-site replication projects in social psychology: Is it viable to sustain any confidence in social psychology’s knowledge base? Perspectives on Psychological Science.
Bouwmeester, S., Verkoeijen, P. P. J. L., Aczel, B., Barbosa, F., Bègue, L., Brañas-Garza, P., Chmura, T. G. H., Cornelissen, G., Døssing, F. S., Espín, A. M., Evans, A. M., Ferreira-Santos, F., Fiedler, S., Flegr, J., Ghaffari, M., Glöckner, A., Goeschl, T., Guo, L., Hauser, O. P., … Wollbrant, C. E. (2017). Registered replication report: Rand, Greene, and Nowak (2012). Perspectives on Psychological Science, 12(3), 527–542. doi:10.1177/1745691617693624
Ito, H., Barzykowski, K., Grzesik, M., Gülgöz, S., Gürdere, C., Janssen, S. M. J., Khor, J., Rowthorn, H., Wade, K. A., Luna, K., Albuquerque, P. B., Kumar, D., Singh, A. D., Cecconello, W. W., Cadavid, S., Laird, N. C., Baldassari, M. J., Lindsay, D. S., & Mori, K. (2019). Eyewitness memory distortion following co-witness discussion: A replication of Garry, French, Kinzett, and Mori (2008) in ten countries. Journal of Applied Research in Memory and Cognition, 8(1), 68–77. doi:10.1016/j.jarmac.2018.09.004