Bayesian Social Deduction with Graph-Informed Language Models

1Purdue University, 2University of Pittsburgh, 3Intel Labs, 4Virginia Tech
*Indicates Equal Contribution
Indicates Equal Advising

The agent playing Avalon and using Bayesian reasoning to deduce the roles of other players.

Abstract

Social reasoning -- inferring unobservable beliefs and intentions from partial observations of other agents -- remains a challenging task for large language models (LLMs). We evaluate the limits of current reasoning language models in the social deduction game Avalon and find that while the largest models demonstrate strong performance, they require extensive test-time inference and degrade sharply when distilled to smaller, real-time-capable variants. To address this, we introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model, while using an LLM for language understanding and interaction. Our approach achieves performance competitive with much larger models in agent-agent play and, notably, is the first language agent to defeat human players in a controlled study -- achieving a 67% win rate and receiving higher qualitative ratings than both reasoning baselines and human teammates. We release code, models, and a dataset to support future work on social reasoning in LLM agents.

GRAIL: Graph Reasoning Agent Informed through Language

The overview of the GRAIL agent within the Avalon game

Overview

We introduce GRAIL, a novel agent for the social deduction game Avalon that combines a graph-based belief inference model with a language model for interaction. GRAIL uses a factor graph to model the dependencies and conditional probabilities within the game state, and runs max-product belief propagation to infer beliefs about the hidden roles of players. A neural network estimates the conditional probabilities in the factor functions. In addition, GRAIL uses an LLM to analyze the game messages and compute prior probabilities over the roles.
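As a rough illustration of this pipeline, the sketch below builds a toy factor graph over five players' hidden roles, combines hypothetical LLM-derived priors with hand-written quest factors (the paper learns its factor functions with a neural network), and recovers max-marginal beliefs by brute-force enumeration, which is the quantity that max-product belief propagation computes exactly on tree-structured graphs. All player counts, factor values, and priors here are illustrative assumptions rather than the authors' learned parameters.

# Minimal sketch (not the authors' implementation): belief inference over hidden
# roles in a toy factor graph. Roles are binary (0 = Good, 1 = Evil). Unary
# factors encode hypothetical LLM-derived priors from dialogue; a quest factor
# encodes "a failed quest suggests an Evil participant". Max-marginals are
# computed here by exhaustive enumeration for clarity.
from itertools import product
import numpy as np

NUM_PLAYERS = 5
GOOD, EVIL = 0, 1

# Hypothetical LLM-derived priors P(role = Evil), e.g. suspicion scores the
# language model assigns to each player's statements.
llm_prior_evil = np.array([0.2, 0.6, 0.3, 0.5, 0.4])

def unary_factor(player, role):
    """Prior factor from the LLM for a single player's role."""
    p_evil = llm_prior_evil[player]
    return p_evil if role == EVIL else 1.0 - p_evil

def quest_factor(roles, participants, failed):
    """Soft constraint: a failed quest strongly suggests an Evil participant."""
    has_evil = any(roles[p] == EVIL for p in participants)
    if failed:
        return 0.9 if has_evil else 0.1
    return 0.3 if has_evil else 0.7

# Observed game events: (participants, quest_failed). Made-up example data.
quests = [((0, 1, 2), True), ((0, 2, 3), False)]

def joint_score(roles):
    """Product of all factors for one joint role assignment."""
    score = 1.0
    for p in range(NUM_PLAYERS):
        score *= unary_factor(p, roles[p])
    for participants, failed in quests:
        score *= quest_factor(roles, participants, failed)
    return score

# Max-marginal belief per player: best joint score achievable under each role.
beliefs = np.zeros((NUM_PLAYERS, 2))
for roles in product((GOOD, EVIL), repeat=NUM_PLAYERS):
    s = joint_score(roles)
    for p in range(NUM_PLAYERS):
        beliefs[p, roles[p]] = max(beliefs[p, roles[p]], s)

beliefs /= beliefs.sum(axis=1, keepdims=True)  # normalize for readability
for p in range(NUM_PLAYERS):
    print(f"player {p}: max-marginal belief of Evil ~ {beliefs[p, EVIL]:.2f}")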

To assess the importance of each component, we study ablations of GRAIL, comparing it against a variant that uses only the graph-based beliefs with no prior from the LLM, and a variant that uses only the LLM priors and treats them directly as beliefs. We find that combining the two components is crucial to the agent's performance, and that the LLM provides a strong prior for the beliefs.

Tables demonstrating the winrates of GRAIL and other agent types

Is GRAIL better than Reasoning models?

We develop a baseline agent that prompts the reasoning model DeepSeek-R1 and compare GRAIL's performance against it.

GRAIL is better than LRMs across all model sizes

We compared the win rates of GRAIL and the reasoning agent across different model sizes. We also evaluated GRAIL with DeepSeek-R1 backbones and the reasoning agent with non-reasoning LLMs. The results show that GRAIL achieves strong win rates across model sizes, whereas the reasoning agent's performance degrades as the model gets smaller.

Comparison of the winrates of GRAIL and reasoning models

GRAIL hallucinates less than LRMs

We analyzed the hallucination rates of GRAIL and the reasoning agent across different model sizes using an LLM-as-Judge scheme. Across all model sizes, GRAIL produces fewer hallucinations, and its in-game statements are more consistent with the game history and state.
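The sketch below shows one way such a judge could be set up. The prompt wording, the CONSISTENT/HALLUCINATED rubric, and the call_llm placeholder are illustrative assumptions, not the exact judging scheme used in the paper.

# Minimal sketch of an LLM-as-Judge consistency check (illustrative only).
# `call_llm` is a placeholder for whatever chat-completion client is available;
# it takes a prompt string and returns the model's text response.
from typing import Callable

JUDGE_PROMPT = """You are judging an Avalon game.
Game history:
{history}

A player then said:
"{statement}"

Is the statement consistent with the game history and the rules of Avalon?
Answer with a single word: CONSISTENT or HALLUCINATED."""

def judge_statement(statement: str, history: str, call_llm: Callable[[str], str]) -> bool:
    """Return True if the judge flags the statement as a hallucination."""
    verdict = call_llm(JUDGE_PROMPT.format(history=history, statement=statement))
    return "HALLUCINATED" in verdict.upper()

def hallucination_rate(statements, history, call_llm) -> float:
    """Fraction of an agent's statements flagged as inconsistent by the judge."""
    flags = [judge_statement(s, history, call_llm) for s in statements]
    return sum(flags) / max(len(flags), 1)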

Hallucination rates of GRAIL and reasoning agents across model sizes

Can GRAIL play against humans?

For the human evaluation, we placed three human participants in a game with three agents. Two of the participants formed the Evil team, and the third played alongside the agents on the Good team. We ran the experiments with both GRAIL agents and reasoning agents using GPT-o4-mini.
Good teams with GRAIL win 67% of the time against humans, while with the reasoning agent they win only 27% of the time.

Humans prefer GRAIL to LRM agents

After each game, the participants were asked to rate the players (both agents and other humans) based on two criteria:

  • Q1: The player contributed to the success of the Good team in the game.
  • Q2: The player made helpful comments in the game.

The results show that the participants significantly preferred GRAIL to the reasoning agent.

evaluation of human players

We treat the party votes as a binary classification task, where the goal is to predict whether the proposed party contains an Evil player. We compare the F1 scores of agents and humans on this task. The results show that GRAIL identifies Evil-containing parties significantly better than both the baseline agents and the human players.
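The snippet below illustrates this framing with made-up labels and votes: a proposed party is positive if it truly contains an Evil player, a rejection vote counts as predicting positive, and F1 is computed over all proposed parties. The vote semantics and the example data are assumptions for illustration, not the paper's evaluation code.

# Minimal sketch of the party-vote evaluation as binary classification.
def f1_score(y_true, y_pred):
    """F1 over boolean labels/predictions, computed from scratch."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical example: five proposed parties over the course of a game.
party_contains_evil = [True, False, True, True, False]   # ground-truth labels
agent_rejected      = [True, False, False, True, False]  # agent's reject votes
print(f"party-vote F1: {f1_score(party_contains_evil, agent_rejected):.2f}")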

F1 scores of party-vote classification for agents and human players

Humans are better Evil players than Evil agents

We record the agents' beliefs about the roles of the players and plot their distributions. In the games against LLM agents, the GRAIL agent detects the Evil players with high certainty. As seen in the violin plots below, GRAIL's beliefs converge to the true roles of the players as the game progresses.

belief distribution of GRAIL against LLM agent evil players

However, when we plot the beliefs from the games against humans, we see that the GRAIL agent cannot detect the Evil players with high certainty. We believe this indicates that designing a good and effective Evil agent is still a challenging task.

belief distribution of GRAIL against human evil players

For more results and an in-depth analysis, please refer to the paper.

Demo

BibTeX


@misc{rahimirad2025bayesiansocialdeductiongraphinformed,
  title={Bayesian Social Deduction with Graph-Informed Language Models},
  author={Shahab Rahimirad and Guven Gergerli and Lucia Romero and Angela Qian and Matthew Lyle Olson and Simon Stepputtis and Joseph Campbell},
  year={2025},
  eprint={2506.17788},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.17788},
}