

AI Study Evaluates GPT-3 Using Cognitive Psychology

The study used cognitive psychology to analyze GPT-3 with surprising results.

Key points

  • Artificial intelligence is frequently in the news due to the wildly popular conversational chatbot ChatGPT.
  • Researchers recently used a variety of prompts to assess GPT-3’s decision-making.
  • The study found that GPT-3 solved vignette-based tasks as well as or better than human subjects and made decent decisions from descriptions.

A new study published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS) by researchers affiliated with the Max Planck Institute for Biological Cybernetics analyzes the general intelligence of the large language model (LLM) GPT-3 using cognitive psychology.

“We study GPT-3, a recent large language model, using tools from cognitive psychology,” wrote lead author Marcel Binz, Ph.D., along with co-author Eric Schulz, Ph.D. “More specifically, we assess GPT-3’s decision-making, information search, deliberation, and causal reasoning abilities on a battery of canonical experiments from the literature.”

Artificial intelligence (AI) is making daily headlines due to the wildly popular conversational chatbot ChatGPT by San Francisco-based OpenAI. It took only five days for ChatGPT to reach 1 million users when it was opened up to the general public at no charge in November 2022, according to Statista. In comparison, Statista reports that it took Netflix three-and-a-half years, Twitter two years, Facebook 10 months, and Spotify five months to reach 1 million users. According to OpenAI, ChatGPT was fine-tuned from a model in the GPT-3.5 series that completed training in early 2022, and it was trained using Reinforcement Learning from Human Feedback (RLHF).

The precursor to GPT-3.5 is GPT-3, the third generation of the Generative Pre-trained Transformer, a machine-learning model trained on data from the internet. GPT-3 is a deep-learning neural network with 175 billion parameters. The four base models of GPT-3 are Ada, Babbage, Curie, and Davinci. Each original base model used training data up to October 2019 and has its own strengths.

Ada is a fast model that can rapidly perform text parsing, simple classification, address correction, and keyword search. Babbage performs moderate-level classification, especially semantic search classification. Curie is quick and powerful, with the ability to perform more nuanced functions such as complex classification, language translation, sentiment summarization, sentiment classification, and Q&A. Davinci is the top of the line, able to perform any task that Ada, Babbage, or Curie can, often with fewer instructions. According to OpenAI, Davinci excels at tasks that involve logic, cause and effect, complex intent, and summarization.

To perform the study, the researchers focused on the most powerful model, Davinci, and used the public OpenAI API to run all their simulations. The researchers used canonical scenarios from cognitive psychology as prompts for GPT-3, then assessed whether the AI responded correctly.

To assess GPT-3’s decision-making, the researchers prompted the AI with well-known vignette-based brain teasers introduced by Israeli psychologists Daniel Kahneman and Amos Tversky. Specifically, the study prompted GPT-3 with the Linda problem, the Cab problem, and the Hospital problem. The Linda problem illustrates the conjunction fallacy, in which a combination of specific conditions is judged to be more likely than a single general one, even though a conjunction can never be more probable than either of its parts.

“In the standard vignette, a hypothetical woman named Linda is described as ‘outspoken, bright, and politically active,’” the researchers wrote. “Participants are then asked if it was more likely that Linda is a bank teller or that she is a bank teller and an active feminist. GPT-3, just like people, chose the second option, thereby falling for the conjunction fallacy.”

Next, the scientists prompted GPT-3 with the Cab problem, in which a witness says a Blue cab was involved in a hit-and-run accident in a city where 85% of the cabs are Green and 15% are Blue.

“Unlike people, GPT-3 did not fall for the base-rate fallacy, i.e., to ignore the base rates of different colors, but instead provided the (approximately) correct answer,” the researchers reported.
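The base-rate arithmetic behind the Cab problem can be sketched with Bayes' rule. This is an illustration only, not the researchers' code. The 80% witness reliability is an assumption borrowed from the classic formulation of the problem; this article does not state it. The function name `posterior_blue` is hypothetical.

```python
def posterior_blue(prior_blue, witness_accuracy):
    """Probability the cab was Blue, given that the witness said 'Blue'."""
    # Witness says Blue and the cab really was Blue.
    says_blue_and_blue = prior_blue * witness_accuracy
    # Witness says Blue but the cab was actually Green (a mistake).
    says_blue_and_green = (1 - prior_blue) * (1 - witness_accuracy)
    return says_blue_and_blue / (says_blue_and_blue + says_blue_and_green)


# 15% of cabs are Blue (from the article); 80% accuracy is an assumption.
p = posterior_blue(prior_blue=0.15, witness_accuracy=0.80)
print(f"{p:.2f}")  # prints 0.41
```

Under these assumed numbers the cab is still more likely to have been Green, which is why ignoring the 85/15 split leads to the wrong answer.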

Lastly, the researchers prompted GPT-3 with the Hospital problem, which asks whether a larger or a smaller hospital is likely to record more days on which over 60% of the babies born are boys. Again, GPT-3 performed on par with humans.
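The statistical point of the Hospital problem is that small samples deviate from 50% more often than large ones. A minimal sketch, assuming each birth is a boy with probability 0.5 and using the 15 and 45 births per day of the classic formulation (neither figure appears in this article), computes the exact binomial probabilities. The function name is hypothetical.

```python
from math import comb


def prob_over_60_percent_boys(n_births):
    """Exact probability that more than 60% of n_births are boys (p = 0.5)."""
    # Integer comparison (10 * k > 6 * n) avoids floating-point edge cases.
    favorable = sum(
        comb(n_births, k) for k in range(n_births + 1) if 10 * k > 6 * n_births
    )
    return favorable / 2 ** n_births


small = prob_over_60_percent_boys(15)  # assumed size of the smaller hospital
large = prob_over_60_percent_boys(45)  # assumed size of the larger hospital
print(f"small: {small:.3f}, large: {large:.3f}")
```

Under these assumptions the smaller hospital has the higher probability on any given day, so it is expected to record more such days.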

“Of the 12 vignette-based problems presented to GPT-3, it answered six correctly and all 12 in a way that could be described as human-like,” the researchers wrote. “Does this mean that GPT-3 could pass as a human in a cognitive psychology experiment? We believe that the answer, based on the vignette-based tasks alone, has to be 'No.' Since many of the prompted scenarios were taken from famous psychological experiments, there is a chance that GPT-3 has encountered these scenarios or similar ones in its training set.”

The researchers also tested whether GPT-3 could adapt and switch between constraint-seeking and hypothesis-scanning questions. In these tasks, the researchers report that GPT-3 selected the appropriate question in every situation.

On Baron’s congruence bias test, GPT-3 performed like humans and showed similar biases. On Wason’s card selection task, GPT-3 provided the correct answer, outperforming human responses.

To evaluate GPT-3’s capacity for deliberation and cognitive reflection, they used three items from the Cognitive Reflection Test. The AI model answered all three incorrectly.

The scientists evaluated causal reasoning abilities with a version of the Blicket experiment, the Intervene test, and the Mature causal reasoning test for counterfactuals. For the Blicket experiment, GPT-3 performed on par with humans.

“GPT-3, just like people, managed to correctly identify that the first but not the second object is a blicket,” the researchers wrote.

The intervention test assessed GPT-3’s ability to identify the correct object to remove in order to prevent an allergic reaction, and GPT-3 named the right object. The researchers also found that GPT-3 answered multiple counterfactual questions correctly.

Next, the scientists tested GPT-3 on more complex scenarios using the multi-armed bandit paradigm, where information about each option is not given as a description but has to be learned from experience, and the interaction is not limited to a single choice.
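To make the multi-armed bandit paradigm concrete, here is a minimal sketch of a simulated two-armed task. It is not the setup or the agent used in the study: the epsilon-greedy strategy, the Gaussian rewards, the arm means, and every name in the code are assumptions chosen for illustration.

```python
import random


def run_bandit(true_means, n_trials=500, epsilon=0.1, seed=0):
    """Play a bandit task with an epsilon-greedy agent (illustrative only)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm was chosen
    estimates = [0.0] * n_arms   # running average reward per arm
    total_reward = 0.0
    for _ in range(n_trials):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            # exploit: pick the arm with the best estimate so far
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)  # noisy payoff
        counts[arm] += 1
        # incremental update of the sample mean for the chosen arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return counts, estimates, total_reward


counts, estimates, total = run_bandit(true_means=[0.0, 1.0])
print(counts, [round(e, 2) for e in estimates], round(total, 1))
```

The agent never sees the true means. It only observes the rewards of the arms it chooses, which is what distinguishes decisions from experience from decisions from description.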

“We find that much of GPT-3’s behavior is impressive: It solves vignette-based tasks similarly or better than human subjects, is able to make decent decisions from descriptions, outperforms humans in a multi-armed bandit task, and shows signatures of model-based reinforcement learning,” wrote the researchers. “Yet, we also find that small perturbations to vignette-based tasks can lead GPT-3 vastly astray, that it shows no signatures of directed exploration, and that it fails miserably in a causal reasoning task.”

Copyright © 2023 Cami Rosso. All rights reserved.
