AMA-Bench

AMA-Bench: Leaderboard

Agent Memory Assessment Benchmark - Performance Visualization

🎯 Welcome to AMA-Bench!

Evaluate agent memory itself, not just dialogue.

Built from real agent environment streams and scalable long-horizon trajectories across representative domains, AMA-Bench tests whether LLM agents can recall information, infer causal relationships, update state, and abstract state information over long runs.

📄 Paper: https://arxiv.org/abs/2602.22769

Agent Performance Analysis

Explore agent performance across different domains and capabilities.

Radar chart showing agent performance across different domains. Click legend items to isolate specific agents.


Verification Status: Only officially verified entries (✓) are shown. User-submitted results (○) will appear after weekly LLM-as-Judge evaluation.

Scores by Domain

🥇 1 | Qwen3-Embedding-4B ✓ | Qwen3-32B | 56.91% | 60.88% | 63.55% | 51.70% | 64.35% | 49.94% | 47.08%