Bayesian Social Deduction with Graph-Informed Language Models

1Purdue University, 2University of Pittsburgh, 3Intel Labs, 4Virginia Tech
*Indicates Equal Contribution
Indicates Equal Advising

The agent playing Avalon and using Bayesian reasoning to deduce the roles of other players.

Abstract

Social reasoning -- inferring unobservable beliefs and intentions from partial observations of other agents -- remains a challenging task for large language models (LLMs). We evaluate the limits of current reasoning language models in the social deduction game Avalon and find that while the largest models demonstrate strong performance, they require extensive test-time inference and degrade sharply when distilled to smaller, real-time-capable variants. To address this, we introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model, while using an LLM for language understanding and interaction. Our approach achieves performance competitive with much larger models in agent-agent play and, notably, is the first language agent to defeat human players in a controlled study -- achieving a 67% win rate and receiving higher qualitative ratings than both reasoning baselines and human teammates. We release code, models, and a dataset to support future work on social reasoning in LLM agents.

GRAIL: Graph Reasoning Agent Informed through Language

Overview of the GRAIL agent within the Avalon game

Overview

We introduce GRAIL, a novel agent for the social deduction game Avalon, which combines a graph-based belief inference model with a language model for interaction. GRAIL uses a factor graph to model the dependencies and conditional probabilities within the game state, and applies max-product belief propagation to infer beliefs about the hidden roles of players. A neural network estimates the conditional probabilities in the factor functions. Furthermore, GRAIL uses an LLM to analyze the game messages and calculate the prior probabilities of the roles.
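To make the belief-inference step concrete, here is a minimal sketch (our own illustration, not the authors' implementation) of max-product belief propagation on a toy two-player factor graph. The unary priors stand in for LLM-derived role probabilities and the pairwise factor stands in for the learned conditional model; all numbers are made up for illustration.

# Minimal sketch: max-product belief propagation over a tiny factor graph
# for two players' hidden roles. Priors and the pairwise factor are
# illustrative stand-ins for the LLM priors and the learned factor model.
import numpy as np

ROLES = ["Good", "Evil"]

# Hypothetical LLM-derived priors P(role) for players A and B.
prior_a = np.array([0.7, 0.3])   # player A: likely Good
prior_b = np.array([0.4, 0.6])   # player B: leaning Evil

# Hypothetical pairwise factor psi(role_a, role_b), e.g. produced by a
# neural network from the game state (here: a failed quest both players
# were on, so "both Good" is down-weighted).
psi = np.array([
    [0.2, 0.8],   # A Good: (B Good, B Evil)
    [0.8, 0.4],   # A Evil: (B Good, B Evil)
])

# Max-product messages on this two-variable tree (exact for trees).
msg_b_to_a = np.max(psi * prior_b[None, :], axis=1)   # maximize over B's role
msg_a_to_b = np.max(psi * prior_a[:, None], axis=0)   # maximize over A's role

belief_a = prior_a * msg_b_to_a
belief_b = prior_b * msg_a_to_b
belief_a /= belief_a.sum()
belief_b /= belief_b.sum()

for name, belief in [("A", belief_a), ("B", belief_b)]:
    print(f"Player {name}: " + ", ".join(f"{r}={p:.2f}" for r, p in zip(ROLES, belief)))

In the real game the graph has one variable per player and factors over quest teams and votes, but the message-passing step has the same form as this two-variable example.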

To show the importance of these components, we study ablations of GRAIL, comparing it to a version that uses only the propagated beliefs with no prior from the LLM, and a version that uses only the LLM priors and treats them as beliefs. We find that the combination of the two components is crucial for the agent's performance, and that the LLM provides a strong prior for the beliefs.

Tables demonstrating the win rates of GRAIL and other agent types

Is GRAIL better than reasoning models?

We develop an agent that prompts the reasoning model DeepSeek-R1, and compare the performance of GRAIL to this agent.

GRAIL is better across all model sizes

We compared the win rate of GRAIL and the reasoning model across model sizes. We also tried GRAIL with DeepSeek-R1 models and the reasoning agent with non-reasoning LLMs. The results demonstrate that GRAIL achieves high win rates across model sizes, while the same cannot be said for the reasoning agent.

Comparison of the win rates of GRAIL and reasoning models

GRAIL hallucinates less than LRMs

We analyzed the hallucination rates of GRAIL and the reasoning agent across different model sizes using an LLM-as-judge scheme with 95% accuracy. Across all model sizes, GRAIL produces fewer hallucinations, and the statements it makes in the game are more consistent with the game history and state.
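As an illustration of this kind of evaluation (our own sketch, not the paper's exact prompts or pipeline), the snippet below shows how an LLM-as-judge check might flag statements that contradict the visible game history; call_llm is a placeholder for whatever chat-completion client is available.

# Illustrative sketch of an LLM-as-judge consistency check.
JUDGE_PROMPT = """You are judging an Avalon game.
Game history:
{history}

Player statement:
"{statement}"

Is the statement consistent with the game history above?
Answer with exactly one word: CONSISTENT or HALLUCINATED."""

def call_llm(prompt: str) -> str:
    """Stand-in for a chat-completion call to the judge model."""
    raise NotImplementedError

def is_hallucination(history: str, statement: str) -> bool:
    # Ask the judge model for a one-word verdict and parse it.
    verdict = call_llm(JUDGE_PROMPT.format(history=history, statement=statement))
    return verdict.strip().upper().startswith("HALLUCINATED")

def hallucination_rate(history: str, statements: list[str]) -> float:
    # Fraction of an agent's statements the judge flags as inconsistent.
    flags = [is_hallucination(history, s) for s in statements]
    return sum(flags) / max(len(flags), 1)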

Hallucination rates of GRAIL and reasoning agents across model sizes

Can GRAIL play against humans?

In human evaluations, we put three participants in a game with three agents. Two of the human participants formed the Evil team, and the third human played alongside the agents on the Good team. We ran the experiments with both GRAIL agents and reasoning agents using GPT-o4-mini.
Good teams with GRAIL win 67% of games against humans, while with the reasoning agent they win only 27% of games

Do humans prefer GRAIL to LRMs?

After each game, the participants were asked to rate the players (both agents and other humans) based on two criteria:

  • Q1: The player contributed to the success of the Good team in the game.
  • Q2: The player made helpful comments in the game.
The results show that the participants significantly preferred GRAIL to a reasoning agent.

Participants' ratings of agent and human players

How good are LLM agents as Evil players?

We record the beliefs of the agents about the roles of the players and plot their distribution. The results show that against humans, the GRAIL agent has a difficult time detecting the Evil players with high certainty. This is in contrast to games against other agents, which result in high certainty about the player roles. For a more detailed exploration of these results, please refer to the paper.
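To make "certainty" concrete, the short sketch below (our own illustration, not the paper's analysis code) quantifies how peaked an agent's per-player role beliefs are, using the maximum probability assigned to any role and the entropy of each per-player distribution.

# Sketch: summarizing belief certainty from a (players x roles) belief matrix.
import numpy as np

def belief_certainty(beliefs: np.ndarray) -> dict:
    """beliefs: array of shape (num_players, num_roles) with P(role | observations)."""
    max_conf = beliefs.max(axis=1)                                  # peak probability per player
    entropy = -(beliefs * np.log2(beliefs + 1e-12)).sum(axis=1)     # uncertainty per player, in bits
    return {"max_confidence": max_conf, "entropy_bits": entropy}

# Example: fairly sure player 0 is Evil, unsure about player 1.
beliefs = np.array([[0.1, 0.9],
                    [0.55, 0.45]])
print(belief_certainty(beliefs))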

Belief distribution of GRAIL against human Evil players

For more results and in-depth analysis, please refer to the paper.

Demo

BibTeX


        @misc{rahimirad2025bayesiansocialdeductiongraphinformed,
          title={Bayesian Social Deduction with Graph-Informed Language Models},
          author={Shahab Rahimirad and Guven Gergerli and Lucia Romero and Angela Qian and Matthew Lyle Olson and Simon Stepputtis and Joseph Campbell},
          year={2025},
          eprint={2506.17788},
          archivePrefix={arXiv},
          primaryClass={cs.AI},
          url={https://arxiv.org/abs/2506.17788},
        }