AMA-Bench: Leaderboard
Agent Memory Assessment Benchmark - Performance Visualization
🎯 Welcome to AMA-Bench!
Evaluate agent memory itself, not just dialogue.
Built from real agent-environment streams and scalable long-horizon trajectories across representative domains, AMA-Bench tests whether LLM agents can recall information, perform causal inference, update state, and abstract state information over long runs.
📄 Paper: https://arxiv.org/abs/2602.22769
Agent Performance Analysis
Explore agent performance across different domains and capabilities.
Radar chart showing agent performance across different domains. Click legend items to isolate specific agents.
Verification Status: Only officially verified entries (✅) are shown. User-submitted results (⏳) will appear after weekly LLM-as-Judge evaluation.
Scores by Domain
| Rank | Agent | Model | Overall | GAME | Embodied AI | Web | Text2SQL | Openworld QA | Software Engineer |
|---|---|---|---|---|---|---|---|---|---|
| 🥇 1 | AMA-agent ✅ | Qwen3-32B | 56.91% | 60.88% | 63.55% | 51.70% | 64.35% | 49.94% | 47.08% |
Showing agent performance for each capability. Each subplot represents one capability with comparative performance across all agents.
Verification Status: Only officially verified entries (✅) are shown. User-submitted results (⏳) will appear after weekly LLM-as-Judge evaluation.
Scores by Capability
| Rank | Agent | Model | Overall | Recall | Causal Inference | State Updating | State Abstraction |
|---|---|---|---|---|---|---|---|
| 🥇 1 | AMA-agent ✅ | Qwen3-32B | 56.91% | 63.67% | 59.76% | 51.70% | 47.25% |
| 🥈 2 | Long context ✅ | Qwen3-32B | 51.35% | 60.31% | 51.83% | 48.73% | 36.58% |
| 🥉 3 | Memorag ✅ | Qwen3-32B | 45.70% | 49.77% | 54.51% | 37.98% | 36.86% |
| 4 | Hipporag2 ✅ | Qwen3-32B | 44.05% | 45.80% | 50.39% | 41.27% | 35.75% |
| 5 | Qwen3-Embedding-4B ✅ | Qwen3-32B | 42.04% | 50.06% | 48.74% | 32.91% | 30.43% |
| 6 | Memorybank ✅ | Qwen3-32B | 34.50% | 34.91% | 40.83% | 28.55% | 33.84% |
| 7 | Memgpt ✅ | Qwen3-32B | 33.14% | 36.14% | 42.82% | 25.22% | 25.53% |
| 8 | GRAPHRAG ✅ | Qwen3-32B | 32.92% | 32.78% | 39.15% | 29.99% | 28.79% |
| 9 | Amem ✅ | Qwen3-32B | 32.24% | 32.48% | 36.96% | 29.83% | 28.74% |
| 10 | Mem-alpha ✅ | Qwen3-32B | 30.89% | 29.93% | 41.60% | 28.06% | 21.83% |
| 11 | Memagent ✅ | Qwen3-32B | 27.39% | 28.53% | 32.39% | 24.56% | 22.30% |
| 12 | Mem0 ✅ | Qwen3-32B | 21.06% | 21.24% | 26.47% | 19.58% | 15.23% |
| 13 | Simple mem ✅ | Qwen3-32B | 18.60% | 22.56% | 18.54% | 16.53% | 13.92% |
| 14 | Mem1 ✅ | Qwen3-32B | 12.28% | 12.82% | 14.19% | 10.71% | 10.87% |
Model Performance Analysis
Explore model performance across different domains and capabilities.
Radar chart showing model performance across different domains. Click legend items to isolate specific models.
Verification Status: Only officially verified entries (✅) are shown. User-submitted results (⏳) will appear after weekly LLM-as-Judge evaluation.
Scores by Domain
| Rank | Model | Overall | GAME | Embodied AI | Web | Text2SQL | Openworld QA | Software Engineer |
|---|---|---|---|---|---|---|---|---|
| 🥇 1 | gpt 5.2 ✅ | 70.88% | 85.13% | 51.62% | 77.96% | 81.16% | 57.22% | 65.83% |
| 🥈 2 | GPT-5 mini ✅ | 67.11% | 84.31% | 54.40% | 83.07% | 50.03% | 42.78% | 78.05% |
| 🥉 3 | Gemini 2.5 flash ✅ | 51.43% | 64.03% | 37.96% | 61.83% | 24.96% | 41.94% | 71.36% |
| 4 | Qwen3-32B ✅ | 51.35% | 50.49% | 50.86% | 55.11% | 53.08% | 48.56% | 50.57% |
| 5 | Qwen3-14B ✅ | 46.03% | 42.49% | 42.83% | 53.49% | 53.60% | 37.50% | 49.17% |
| 6 | Qwen2.5-14B-Instruct-1M ✅ | 45.76% | 42.96% | 46.30% | 55.11% | 50.11% | 31.33% | 50.29% |
| 7 | Claude Haiku 3.5 ✅ | 43.27% | 39.71% | 25.48% | 53.33% | 51.93% | 32.50% | 62.36% |
| 8 | Qwen3-8B ✅ | 40.69% | 37.42% | 40.51% | 43.82% | 47.09% | 29.72% | 47.78% |
Showing model performance for each capability. Each subplot represents one capability with comparative performance across all models.
Verification Status: Only officially verified entries (✅) are shown. User-submitted results (⏳) will appear after weekly LLM-as-Judge evaluation.
Scores by Capability
| Rank | Model | Overall | Recall | Causal Inference | State Updating | State Abstraction |
|---|---|---|---|---|---|---|
| 🥇 1 | gpt 5.2 ✅ | 70.88% | 73.82% | 80.83% | 64.62% | 60.39% |
| 🥈 2 | GPT-5 mini ✅ | 67.11% | 68.30% | 71.86% | 64.12% | 62.56% |
| 🥉 3 | Gemini 2.5 flash ✅ | 51.43% | 56.56% | 51.99% | 50.11% | 42.27% |
| 4 | Qwen3-32B ✅ | 51.35% | 60.31% | 51.83% | 48.73% | 36.58% |
| 5 | Qwen3-14B ✅ | 46.03% | 55.77% | 44.46% | 44.05% | 31.64% |
| 6 | Qwen2.5-14B-Instruct-1M ✅ | 45.76% | 54.85% | 41.20% | 45.79% | 33.85% |
| 7 | Claude Haiku 3.5 ✅ | 43.27% | 47.87% | 45.37% | 43.46% | 30.60% |
| 8 | Qwen3-8B ✅ | 40.68% | 49.64% | 37.58% | 39.26% | 29.23% |
Submit Your Model/Agent for Evaluation
Submit your model or agent predictions to be evaluated on AMA-Bench. Your results will be reviewed and scored weekly by our LLM-as-Judge system.
⏰ Submission Policy:
- Each user can submit once per week
- Submissions are evaluated weekly using our LLM-as-Judge system
- Official scores (`verified=true`) are computed by our evaluation system
- You can also run your own evaluation if you have access to the ground-truth data
📋 Submission Format:
Your JSONL file should contain one line per episode:
```json
{
  "episode_id": "trajectory_id",
  "question_uuid_list": ["uuid-1", "uuid-2", "uuid-3"],
  "answer_list": ["The agent moved right.", "..."],
  "llm_as_judge_score_list": [true, false, true]
}
```
Field Descriptions:
- `episode_id` (required): the episode identifier, used to automatically look up the domain
- `question_uuid_list` (required): UUIDs of the benchmark questions, in the same order as `answer_list`; used to look up each question's capability (A/B/C/D)
- `answer_list` (required): your model/agent's answers, one per question
- `llm_as_judge_score_list` (required): `true`/`false` per answer; your self-evaluated correctness scores, used for leaderboard ranking

Important Notes:
- `question_uuid_list`, `answer_list`, and `llm_as_judge_score_list` must all be the same length
- Domain is resolved automatically from `episode_id` and capability (A/B/C/D) from `question_uuid_list`, so there is no need to supply them manually
- All submissions start as `verified=false` and become `verified=true` after official LLM-as-Judge evaluation
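For concreteness, here is a minimal sketch of writing a submission file in this format. The episode record is the placeholder example from above, and the `submission.jsonl` file name is hypothetical; substitute your agent's real outputs.

```python
import json

# Placeholder episode data mirroring the example above;
# replace with your agent's real outputs.
records = [
    {
        "episode_id": "trajectory_id",
        "question_uuid_list": ["uuid-1", "uuid-2", "uuid-3"],
        "answer_list": ["The agent moved right.", "...", "..."],
        "llm_as_judge_score_list": [True, False, True],
    },
]

# JSONL: one JSON object per line, one line per episode.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        # All three lists must be the same length.
        n = len(record["question_uuid_list"])
        assert len(record["answer_list"]) == n
        assert len(record["llm_as_judge_score_list"]) == n
        f.write(json.dumps(record) + "\n")
```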
AMA-Bench: Agent Memory Assessment Benchmark
AMA-Bench evaluates memory capabilities of LLMs and memory-augmented agents across four cognitive dimensions: Recall (retrieving stored info), Causal Inference (cause-and-effect reasoning), State Updating (tracking evolving states), and State Abstraction (forming higher-level representations).
Benchmarks
We evaluate on two complementary subsets:
- Real-world Subset: 2,496 QA pairs from real agent environment streams
- Synthetic Subset: 1,200 QA pairs stratified across five trajectory lengths (8K, 16K, 32K, 64K, and 128K tokens)
Leaderboard Tabs
Agent Performance: Compares RAG and Agent Memory methods
- Domain Performance: Radar charts across 6 domains (GAME, Embodied AI, Web, Text2SQL, Openworld QA, Software Engineer)
- Capability Performance: Subplots showing performance on the 4 capabilities
- Top N Selection: Choose to display top 1-10 performers
Model Performance: Compares LLM models directly
- Domain Performance: Radar charts showing performance across different application domains
- Capability Performance: Subplots showing performance on each cognitive capability
- Top N Selection: Choose to display top 1-10 performers
Metrics
Results are reported as Accuracy and F1 Score:
- Charts display Accuracy only for clarity
- Summary statistics tables show both Avg Accuracy and Avg F1
- Tables include Rank with 🥇🥈🥉 medals for the top 3 performers
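Since charts show Accuracy while the tables add Avg F1, a minimal Python sketch of both metrics may help. The token-level F1 shown here is the common bag-of-words formulation over answer tokens; this is an assumption for illustration, as the leaderboard does not spell out its exact F1 definition.

```python
from collections import Counter

def accuracy(judge_scores: list[bool]) -> float:
    """Fraction of answers judged correct (what the charts display)."""
    return sum(judge_scores) / len(judge_scores)

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-words F1 between predicted and gold answer text.
    Assumed formulation; the official F1 definition may differ."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(accuracy([True, False, True]))                     # 0.666...
print(token_f1("moved right", "the agent moved right"))  # 0.666...
```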
Problem Type Distribution
- Type A (Recall): 33.6% - 839 questions
- Type B (Causal Inference): 23.9% - 596 questions
- Type C (State Updating): 25.9% - 647 questions
- Type D (State Abstraction): 16.6% - 414 questions
Submission Rules
📋 File Format
- Submissions must be in JSONL format (`.jsonl`), one line per episode
- Each line must be a valid JSON object containing the required fields below
- `question_uuid_list`, `answer_list`, and `llm_as_judge_score_list` must all be the same length
- Files containing duplicate `episode_id` entries will be rejected
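These rules are mechanical, so a pre-submission check can catch most rejections early. Below is a minimal sketch of such a check; it is not the official validator, and the `submission.jsonl` path is a placeholder.

```python
import json

REQUIRED = ("episode_id", "question_uuid_list",
            "answer_list", "llm_as_judge_score_list")

def validate(path: str) -> None:
    """Check a submission file against the format rules above."""
    seen = set()
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)  # each line must be valid JSON
            for field in REQUIRED:
                if field not in record:
                    raise ValueError(f"line {lineno}: missing {field!r}")
            # The three lists must all be the same length.
            n = len(record["question_uuid_list"])
            if (len(record["answer_list"]) != n
                    or len(record["llm_as_judge_score_list"]) != n):
                raise ValueError(f"line {lineno}: list lengths differ")
            # Duplicate episode_id entries get the whole file rejected.
            if record["episode_id"] in seen:
                raise ValueError(f"line {lineno}: duplicate episode_id")
            seen.add(record["episode_id"])

if __name__ == "__main__":
    validate("submission.jsonl")  # placeholder path
```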
📝 Required Fields

| Field | Type | Description |
|---|---|---|
| `episode_id` | string | Episode identifier, used to automatically resolve the domain |
| `question_uuid_list` | list[string] | UUIDs mapping each answer to a benchmark question, used to resolve capability (A/B/C/D) |
| `answer_list` | list[string] | Your model/agent's free-text answers, in the same order as `question_uuid_list` |
| `llm_as_judge_score_list` | list[bool] | Self-evaluated correctness (`true`/`false`) per answer |
✅ Verification & Scoring
- All submissions initially appear as `verified=false` (self-reported preview)
- The score shown immediately after submission is based on your `llm_as_judge_score_list`
- Official scores (`verified=true`) are recomputed weekly by our LLM-as-Judge evaluation system
- Only `verified=true` entries are displayed on the public leaderboard
⚠️ Important Notes
- Domain is resolved automatically from `episode_id`; no need to supply it manually
- Capability (A/B/C/D) is resolved automatically from each `question_uuid`; no need to supply it manually
- Official scores may differ from your self-reported preview after LLM-as-Judge re-evaluation
- We reserve the right to remove submissions that appear to contain fabricated or manipulated scores
Paper: https://arxiv.org/abs/2602.22769
For questions or submissions, please open a discussion in the Community tab.