Can Machine Learning Solve Psychology's Replication Problem?
New research tries to predict replication rates in psychology.
Posted February 7, 2023 | Reviewed by Kaja Perina
- New research develops a machine learning algorithm to predict whether a psychology study will replicate or not.
- This machine learning algorithm relies solely on word choices in psychology papers, ignoring other key information (such as p-values), to make predictions.
- The algorithm has low accuracy overall (68%) and was trained on existing replications that don't cover all of the subfields the authors study.
- Overall, the paper supports several existing findings about replication, such as that rates differ by area (e.g., social vs. personality psychology).
Replicating studies is key to gaining confidence in them. We don’t just want psychological effects that happened once in a lab; we want effects that are broadly true and can be used to help us improve our lives in the real world. But conducting replication studies is difficult, time-consuming, and often fraught with academic in-fighting. What if we could use machine learning to automate this process and generate replication scores for thousands of studies at a time?
New research by Wu Youyou, Yang Yang, and Brian Uzzi in the Proceedings of the National Academy of Sciences tries to do this. They use machine learning to attempt to understand how well psychology research will replicate across several subfields (e.g., clinical psychology, developmental psychology, social psychology). This is an ambitious paper, and it gives some insight into replication in psychology. However, issues with the machine learning approach should make us cautious when interpreting results.
What did they do?
The researchers collected a sample of 388 psychology studies that had been replicated previously and used them to train their machine learning model. These were existing studies that had been conducted for other reasons, such as the Reproducibility Project: Psychology (RPP) and the Life Outcomes of Personality Replication (LOOPR) project. The text of these papers was analyzed using a well-known text-analysis algorithm. Roughly, the algorithm counts how often every word in a paper is used, and then converts these counts into a series of 200 numbers based on common word associations in social science research. These 200-number summaries of the manuscript text are then used to train a machine learning model to predict whether a study successfully replicated or not.
Then, the researchers used the machine learning model trained on existing replications to predict whether other papers would replicate (if someone were to try to replicate them in the future). They made these predictions on a much larger set of papers—over 14,000 papers, covering almost every paper published in six top journals over two full decades. Then they analyzed these predictions to try to understand these subfields better.
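The train-then-predict pipeline described above can be sketched in a few lines of code. This is a minimal illustration, not the authors' actual implementation: the paper used 200-dimensional word-embedding summaries of each manuscript, so here TF-IDF plus truncated SVD stands in for that text-summarization step, logistic regression stands in for their classifier, and the "papers" and labels are toy placeholders.

```python
# Sketch of the pipeline: paper text -> fixed-length numeric summary
# -> classifier predicting replication. TF-IDF + SVD is a stand-in
# for the paper's 200-dimensional word-embedding step; all data here
# are invented toy examples, not the actual 388 replication studies.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy "papers" with known replication outcomes (1 = replicated).
train_texts = [
    "large preregistered sample effect robust across labs",
    "surprising counterintuitive priming effect small sample",
    "longitudinal personality traits stable across outcomes",
    "subtle manipulation marginal significance novel effect",
]
train_labels = [1, 0, 1, 0]

pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),  # the paper used 200 dims
    LogisticRegression(),
)
pipeline.fit(train_texts, train_labels)

# Predict a replication probability for an unseen "paper",
# mirroring how the trained model was applied to 14,000 new papers.
new_paper = ["robust effect replicated in large samples across labs"]
prob = pipeline.predict_proba(new_paper)[0, 1]
print(f"predicted probability of replication: {prob:.2f}")
```

The key design point this sketch captures is that the model never sees the new paper's replication outcome; it only sees the paper's word-use summary, which is why out-of-sample accuracy is the number that matters.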
Potential Problems With the Research
Careful readers of this paper might notice some potential issues right away.
1. How accurate were these predictions?
The accuracy was decent but not great: 68%. So when the model makes predictions for 14,000 new papers, we should expect roughly one in three of those predictions to be wrong.
Further, we can do a quick check of the predicted replication of a field to the actual replication of a field. Sometimes it lines up: for social psychology, the replication rate in completed research is 38%, and the predicted replication rate is 37%. But sometimes it’s far off: for personality psychology, the replication rate in completed research is 77%, but the predicted rate is 55%. This should give us pause when drawing conclusions from this model.
2. Is it really reasonable to expect previous replication studies to predict new ones?
Answering this question means determining whether the previous replication studies do a good job representing any and all possible future replications (at least from these six journals). There are a couple of reasons to think they do not.
First, the previous replication studies don’t include any studies from clinical psychology or developmental psychology. That’s a problem because this paper wants to make predictions about the top papers in both of those fields. Since the model wasn't trained on any papers from these fields, its accuracy is likely to drop below 68% when it encounters this new, different type of paper. (The authors try to address this by noting that the words used in those papers resemble the words used in areas where replications do exist, but this is not entirely convincing.)
Second, even in the areas where there are several existing replications, they don’t represent all areas equally well. For example, more replications have been done of social psychology experiments that can be done quickly on a computer, as compared to those that involve recording interactions and coding or rating behavior. So predictions for these underrepresented types of studies may also be less accurate.
3. Is a model based on lexical associations the best way to evaluate studies that have markers such as p-values?
The use of word vectors (200 numbers derived from the authors' word choices) means that this particular machine learning approach relies on word associations alone. Other factors, beyond just what words were used, are clearly important. For example, we know that studies with p-values that barely cross the threshold for being publishable tend to be less reliable than studies with p-values that clear it by a wide margin. If this information could be incorporated, and accuracy thereby increased by 5-10 percentage points, I’d be much more confident in any conclusions drawn from the predictions.
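The suggestion above—combining text features with study-level statistics like p-values—amounts to concatenating extra columns onto the feature matrix. The sketch below is hypothetical: the published model used text features only, and the feature values and labels here are simulated purely to show the mechanics.

```python
# Hypothetical illustration: augmenting text-derived features with a
# p-value feature. The published model did NOT do this; everything
# below is simulated to show how the combination would work.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_papers = 100

# Stand-in for the 200-number text summaries (only 5 dims here).
text_features = rng.normal(size=(n_papers, 5))
p_values = rng.uniform(0.001, 0.05, size=n_papers)

# How far below the .05 threshold each study's p-value falls:
# a barely-significant study gets a value near zero, a clearly
# significant one gets a larger margin.
p_margin = (0.05 - p_values).reshape(-1, 1)

# Concatenate text features and the p-value feature column-wise.
X = np.hstack([text_features, p_margin])

# Simulated labels in which larger p-value margins replicate more
# often, mimicking the pattern described in the text.
y = (p_margin.ravel() + rng.normal(scale=0.01, size=n_papers) > 0.025).astype(int)

model = LogisticRegression().fit(X, y)
print(f"toy training accuracy: {model.score(X, y):.2f}")
```

Because the simulated labels depend on the p-value margin, the classifier can exploit that column—which is exactly the intuition behind expecting such features to raise accuracy on real data.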
What Can We Learn?
Youyou and colleagues conclude that their "model enables us to conduct the first replication census of nearly all of the papers published in psychology’s top six subfield journals over a 20-year period." While they do generate and analyze predictions from this large set of manuscripts, the concerns over accuracy and applying the algorithm to new types of data (e.g., new subfields, new types of research) make me skeptical of being able to draw reliable conclusions from the algorithm's output.
That said, the authors make several arguments where their algorithm's output matches the existing literature. This match is what makes these points convincing (to me):
- There isn't just one replication rate for psychology; replication rates should really be considered by area (e.g., personality psychology does better than social psychology).
- Lead authors who publish more and in better journals tend to have work that replicates more, but working at a prestigious university doesn't predict better replication rates.
- Studies that get more media attention tend to replicate less—possibly because the media is drawn to flashy, counterintuitive stories that are also less likely to stand the test of time.
Finally, the authors found that experimental research (where psychologists actively manipulate conditions) tends to replicate less than non-experimental research (where psychologists observe behavior and report what is related to what). This is somewhat surprising, but seems to me like it might be explained by the sample used to train the model: personality psychology, which tends to be more methodologically rigorous and observational, replicates more. Social psychology, which tends to be more methodologically lax and experimental, replicates less. Machine learning models pick up on patterns in the data they are trained on. Just as training a crime prediction model on racially biased data will reproduce those biases, training a replication prediction model on data biased toward observational research will reproduce that bias. There may be unique advantages to observational research in psychology over experiments, but I'm not yet convinced.
Overall, this manuscript represents an interesting contribution to a growing literature on using machine learning to evaluate research. A lot of computational work went into developing both the text-based codings and the predictions for the more than 14,000 new studies. While the algorithm isn't yet accurate enough for us to draw strong conclusions, there is potential that in a few years, automated overviews of the field based on this model will be precise enough for us to make confident statements about psychology as a whole.
Youyou, W., Yang, Y., & Uzzi, B. (2023). A discipline-wide investigation of the replicability of Psychology papers over the past two decades. Proceedings of the National Academy of Sciences, 120(6), e2208863120.