AMA-Bench

AMA-Bench: Leaderboard

Agent Memory Assessment Benchmark - Performance Visualization

🎯 Welcome to AMA-Bench!

Evaluate agent memory itself, not just dialogue.

Built from real agent environment streams and scalable long-horizon trajectories across representative domains, AMA-Bench tests whether LLM agents can recall information, perform causal inference, update state, and abstract state information over long runs.

📄 Paper: https://arxiv.org/abs/2602.22769

Agent Performance Analysis

Explore agent performance across different domains and capabilities.

Radar chart showing agent performance across different domains. Click legend items to isolate specific agents.

Verification Status: Only officially verified entries (✓) are shown. User-submitted results (○) will appear after weekly LLM-as-Judge evaluation.

Scores by Domain
🥇 1 | Qwen3-Embedding-4B ✓ | Qwen3-32B | 62.46% | 76.47% | 46.76% | 61.02% | 66.11% | 55.00% | 62.78%