MENSA: Leveraging Mental Simulation for In-Context Policy Improvement in LLM Agents

1National Taiwan University   2Chang Gung University   3University of Virginia
AAMAS 2026
MENSA Overview: Comparison between model-free and model-based in-context policy improvement frameworks

MENSA performs mental simulation via LLM text completion to forecast future action-state transitions, then retrieves relevant past experiences and provides them in context to improve decision-making.

Abstract

Large Language Model (LLM) powered agents have shown promise in sequential decision-making tasks in interactive environments. However, prior agent frameworks usually rely on advanced LLM capabilities, such as planning or instruction following, to carry out tasks successfully. Effectively improving the performance of an LLM agent without assuming these capabilities remains challenging. To address this issue, we propose MENSA (MENtal Simulation Agent), a novel model-based approach that enhances LLM agents without fine-tuning. MENSA leverages a fundamental ability of any LLM, text completion, to generate forecasts of action-state pairs (i.e., transitions) for future time steps. These forecasts are used to construct a set of relevant past experiences, which are provided to the LLM agent in context to improve its decision-making. We evaluate MENSA in two challenging interactive environments, ScienceWorld and NetHack, and show that MENSA improves performance across various sizes of LLMs. With large models (e.g., GPT-4o-mini), MENSA outperforms previous state-of-the-art methods by +15.8 points in ScienceWorld and by +40.0 points in NetHack. Even with smaller models such as Phi-3-mini, MENSA achieves a gain of +11.9 points in ScienceWorld. Our results further suggest that MENSA is less affected than baselines by an LLM's limitations in instruction following and planning.

MENSA Architecture

Overview of the MENSA Architecture

MENSA is composed of three key components: Actor (performs mental simulation and generates forecasts), Executor (translates LLM-generated actions into admissible ones), and Experience Learner (distills executed trajectories into reusable experiences).
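The interaction between these three components can be sketched as a simple agent loop. This is an illustrative reconstruction only: every name below (`Experience`, `ExperienceLearner`, `mental_simulation`, `step`, the `env` methods) is a hypothetical stand-in rather than the authors' implementation, and the token-overlap retriever is a toy substitute for whatever relevance measure MENSA actually uses.

```python
# Hedged sketch of the MENSA loop: Actor (mental simulation via text
# completion), Executor (mapping to admissible actions), and Experience
# Learner (distilling and retrieving past trajectories). All names are
# illustrative, not the authors' actual code.
from dataclasses import dataclass, field


@dataclass
class Experience:
    """A distilled trajectory fragment stored for later reuse."""
    summary: str


@dataclass
class ExperienceLearner:
    memory: list = field(default_factory=list)

    def distill(self, trajectory):
        # Compress an executed trajectory (list of action strings)
        # into a single reusable experience summary.
        self.memory.append(Experience(summary=" -> ".join(trajectory)))

    def retrieve(self, forecast, k=3):
        # Toy relevance measure: rank stored experiences by token
        # overlap with the forecasted transitions.
        tokens = set(forecast.split())
        ranked = sorted(
            self.memory,
            key=lambda e: len(tokens & set(e.summary.split())),
            reverse=True,
        )
        return ranked[:k]


def mental_simulation(llm, observation, horizon=3):
    # Actor: use plain text completion (no planning or instruction
    # following assumed) to forecast future action-state transitions.
    prompt = f"Observation: {observation}\nForecast next {horizon} steps:"
    return llm(prompt)


def step(llm, env, learner):
    # One decision step: forecast, retrieve relevant experiences,
    # act with them in context. `env` is a hypothetical interface.
    obs = env.observe()
    forecast = mental_simulation(llm, obs)
    experiences = learner.retrieve(forecast)
    context = "\n".join(e.summary for e in experiences)
    raw_action = llm(f"{context}\nObservation: {obs}\nAction:")
    # Executor: map the free-form LLM output onto an admissible action.
    return env.execute(env.closest_admissible(raw_action))
```

The key design point the sketch tries to convey is that the Actor only ever calls the LLM as a text-completion engine, so the loop degrades gracefully when the underlying model is weak at following instructions.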

Key Results

We evaluate MENSA on two challenging interactive environments: ScienceWorld (text-based science experiments) and NetHack (procedurally generated grid-based game).

ScienceWorld Results

In the adaptation setting, MENSA consistently outperforms all baselines across diverse LLMs:

  • GPT-4o-mini: MENSA achieves 70.3 vs. SSO's 54.5 (+15.8 points)
  • Phi-3-small (7B): MENSA achieves 58.8 vs. SSO's 26.4 (+32.4 points)
  • Phi-3-mini (4B): MENSA achieves 32.6 vs. SSO's 20.7 (+11.9 points)
  • Mistral-7B: MENSA achieves 44.3 vs. SSO's 13.1 (+31.2 points)

NetHack Results

MENSA also demonstrates strong performance in NetHack's Crossing Lava task:

  • GPT-4o-mini: MENSA achieves 50.0 vs. SSO's 10.0 (+40.0 points)
  • Llama-3-8B: MENSA achieves 32.0 vs. SSO's 0.0 (+32.0 points)

Transfer Setting

In the transfer setting on ScienceWorld, MENSA surpasses SSO by +18.6 (Llama-3-8B) and +19.9 (Gemma-2-9B), showing its ability to acquire transferable experience.

ScienceWorld Full Results

Setting      Model             ReAct   Reflexion   SSO     MENSA
Adaptation   GPT-4o-mini       22.4*   25.9*       54.5*   70.3*
             Gemma-2-9B        29.9    31.2        30.7*   42.0
             Llama-3-8B        26.7    31.9        41.6*   45.0
             Phi-3-small (7B)  18.1*   23.6*       26.4*   58.8*
             Mistral-7B        24.5    26.4        13.1*   44.3
             Phi-3-mini (4B)   6.5*    7.2*        20.7*   32.6*
             Gemma-2-2B        17.9    20.6        10.5*   29.9
Transfer     Gemma-2-9B        —       —           15.9*   35.8
             Llama-3-8B        —       —           20.8*   39.4

NetHack Full Results

Setting      Model          ReAct   Reflexion   SSO     MENSA
Adaptation   GPT-4o-mini    12.0*   20.0*       10.0*   50.0*
             Llama-3-8B     6.0     8.0         0.0*    32.0

* Result obtained with the instruction-tuned variant of the model.

Poster

Poster will be available soon.

BibTeX

@inproceedings{chang2026mensa,
  title={MENSA: Leveraging Mental Simulation for In-Context Policy Improvement in LLM Agents},
  author={Chung-Che Chang and Erick Chandra and Jane Yung-jen Hsu and Yen-Ling Kuo},
  booktitle={Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)},
  year={2026},
  url={https://roger0426.github.io/MENSA},
  note={Project page. arXiv version forthcoming.}
}