AMA-Bench: Leaderboard
Agent Memory Assessment Benchmark - Performance Visualization
🎯 Welcome to AMA-Bench!
Evaluate agent memory itself, not just dialogue.
Built from real agent-environment streams and scalable long-horizon trajectories across representative domains, AMA-Bench tests whether LLM agents can recall information, perform causal inference, update state, and abstract state information over long runs.
📄 Paper: https://arxiv.org/abs/2602.22769
Agent Performance Analysis
Explore agent performance across different domains and capabilities.
Radar chart showing agent performance across different domains. Click legend items to isolate specific agents.
Verification Status: Only officially verified entries (✅) are shown. User-submitted results (⏳) will appear after weekly LLM-as-Judge evaluation.
Scores by Domain
| Rank | Agent | Model | Overall | GAME | Embodied AI | Web | Text2SQL | Openworld QA | Software Engineer |
|---|---|---|---|---|---|---|---|---|---|
| 🥇 1 | AMA-agent ✅ | Qwen3-32B | 56.91% | 60.88% | 63.55% | 51.70% | 64.35% | 49.94% | 47.08% |
Showing agent performance for each capability. Each subplot represents one capability with comparative performance across all agents.
Verification Status: Only officially verified entries (✅) are shown. User-submitted results (⏳) will appear after weekly LLM-as-Judge evaluation.
Scores by Capability
| Rank | Agent | Model | Overall | Recall | Causal Inference | State Updating | State Abstraction |
|---|---|---|---|---|---|---|---|
| 🥇 1 | AMA-agent ✅ | Qwen3-32B | 56.91% | 63.67% | 59.76% | 51.70% | 47.25% |
| 🥈 2 | Long context ✅ | Qwen3-32B | 51.35% | 60.31% | 51.83% | 48.73% | 36.58% |
| 🥉 3 | Memorag ✅ | Qwen3-32B | 45.70% | 49.77% | 54.51% | 37.98% | 36.86% |
| 4 | Hipporag2 ✅ | Qwen3-32B | 44.05% | 45.80% | 50.39% | 41.27% | 35.75% |
| 5 | Qwen3-Embedding-4B ✅ | Qwen3-32B | 42.04% | 50.06% | 48.74% | 32.91% | 30.43% |
| 6 | Memorybank ✅ | Qwen3-32B | 34.50% | 34.91% | 40.83% | 28.55% | 33.84% |
| 7 | Memgpt ✅ | Qwen3-32B | 33.14% | 36.14% | 42.82% | 25.22% | 25.53% |
| 8 | GRAPHRAG ✅ | Qwen3-32B | 32.92% | 32.78% | 39.15% | 29.99% | 28.79% |
| 9 | Amem ✅ | Qwen3-32B | 32.24% | 32.48% | 36.96% | 29.83% | 28.74% |
| 10 | Mem-alpha ✅ | Qwen3-32B | 30.89% | 29.93% | 41.60% | 28.06% | 21.83% |
| 11 | Memagent ✅ | Qwen3-32B | 27.39% | 28.53% | 32.39% | 24.56% | 22.30% |
| 12 | Mem0 ✅ | Qwen3-32B | 21.06% | 21.24% | 26.47% | 19.58% | 15.23% |
| 13 | Simple mem ✅ | Qwen3-32B | 18.60% | 22.56% | 18.54% | 16.53% | 13.92% |
| 14 | Mem1 ✅ | Qwen3-32B | 12.28% | 12.82% | 14.19% | 10.71% | 10.87% |
Model Performance Analysis
Explore model performance across different domains and capabilities.
Radar chart showing model performance across different domains. Click legend items to isolate specific models.
Verification Status: Only officially verified entries (✅) are shown. User-submitted results (⏳) will appear after weekly LLM-as-Judge evaluation.
Scores by Domain
| Rank | Model | Overall | GAME | Embodied AI | Web | Text2SQL | Openworld QA | Software Engineer |
|---|---|---|---|---|---|---|---|---|
| 🥇 1 | gpt 5.2 ✅ | 70.88% | 85.13% | 51.62% | 77.96% | 81.16% | 57.22% | 65.83% |
| 🥈 2 | GPT-5 mini ✅ | 67.11% | 84.31% | 54.40% | 83.07% | 50.03% | 42.78% | 78.05% |
| 🥉 3 | Gemini 2.5 flash ✅ | 51.43% | 64.03% | 37.96% | 61.83% | 24.96% | 41.94% | 71.36% |
| 4 | Qwen3-32B ✅ | 51.35% | 50.49% | 50.86% | 55.11% | 53.08% | 48.56% | 50.57% |
| 5 | Qwen3-14B ✅ | 46.03% | 42.49% | 42.83% | 53.49% | 53.60% | 37.50% | 49.17% |
| 6 | Qwen2.5-14B-Instruct-1M ✅ | 45.76% | 42.96% | 46.30% | 55.11% | 50.11% | 31.33% | 50.29% |
| 7 | Claude Haiku 3.5 ✅ | 43.27% | 39.71% | 25.48% | 53.33% | 51.93% | 32.50% | 62.36% |
| 8 | Qwen3-8B ✅ | 40.69% | 37.42% | 40.51% | 43.82% | 47.09% | 29.72% | 47.78% |
Showing model performance for each capability. Each subplot represents one capability with comparative performance across all models.
Verification Status: Only officially verified entries (✅) are shown. User-submitted results (⏳) will appear after weekly LLM-as-Judge evaluation.
Scores by Capability
| Rank | Model | Overall | Recall | Causal Inference | State Updating | State Abstraction |
|---|---|---|---|---|---|---|
| 🥇 1 | gpt 5.2 ✅ | 70.88% | 73.82% | 80.83% | 64.62% | 60.39% |
| 🥈 2 | GPT-5 mini ✅ | 67.11% | 68.30% | 71.86% | 64.12% | 62.56% |
| 🥉 3 | Gemini 2.5 flash ✅ | 51.43% | 56.56% | 51.99% | 50.11% | 42.27% |
| 4 | Qwen3-32B ✅ | 51.35% | 60.31% | 51.83% | 48.73% | 36.58% |
| 5 | Qwen3-14B ✅ | 46.03% | 55.77% | 44.46% | 44.05% | 31.64% |
| 6 | Qwen2.5-14B-Instruct-1M ✅ | 45.76% | 54.85% | 41.20% | 45.79% | 33.85% |
| 7 | Claude Haiku 3.5 ✅ | 43.27% | 47.87% | 45.37% | 43.46% | 30.60% |
| 8 | Qwen3-8B ✅ | 40.68% | 49.64% | 37.58% | 39.26% | 29.23% |
Submit Your Model/Agent for Evaluation
Submit your model or agent predictions to be evaluated on AMA-Bench. Your results will be reviewed and scored weekly by our LLM-as-Judge system.
⏰ Submission Policy:
- Each user can submit once per week
- Submissions are evaluated weekly using our LLM-as-Judge system
- Official scores (`verified=true`) are computed by our evaluation system
- You can also run your own evaluation if you have access to the ground-truth data
📋 Submission Format:
Your JSONL file should contain one line per episode:
```json
{
  "episode_id": "trajectory_id",
  "question_uuid_list": ["uuid-1", "uuid-2", "uuid-3"],
  "answer_list": ["The agent moved right.", "..."],
  "llm_as_judge_score_list": [true, false, true]
}
```
Field Descriptions:
- `episode_id` (required): the episode identifier, used to automatically look up the domain
- `question_uuid_list` (required): UUIDs of the benchmark questions, in the same order as `answer_list`; used to look up each question's capability (A/B/C/D)
- `answer_list` (required): your model/agent's answers, one per question
- `llm_as_judge_score_list` (required): `true`/`false` per answer; your self-evaluated correctness scores, used for leaderboard ranking

Important Notes:
- `question_uuid_list`, `answer_list`, and `llm_as_judge_score_list` must all be the same length
- Domain is resolved automatically from `episode_id` and capability (A/B/C/D) from `question_uuid_list`, so there is no need to supply them manually
- All submissions start as `verified=false` and become `verified=true` after official LLM-as-Judge evaluation
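For concreteness, here is a minimal sketch of writing a submission file in this format. The episode record is the placeholder example from above, and the `submission.jsonl` file name is hypothetical; substitute your agent's real outputs.

```python
import json

# Placeholder episode data mirroring the example above;
# replace with your agent's real outputs.
records = [
    {
        "episode_id": "trajectory_id",
        "question_uuid_list": ["uuid-1", "uuid-2", "uuid-3"],
        "answer_list": ["The agent moved right.", "...", "..."],
        "llm_as_judge_score_list": [True, False, True],
    },
]

# JSONL: one JSON object per line, one line per episode.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        # All three lists must be the same length.
        n = len(record["question_uuid_list"])
        assert len(record["answer_list"]) == n
        assert len(record["llm_as_judge_score_list"]) == n
        f.write(json.dumps(record) + "\n")
```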
AMA-Bench: Agent Memory Assessment Benchmark
AMA-Bench evaluates memory capabilities of LLMs and memory-augmented agents across four cognitive dimensions: Recall (retrieving stored info), Causal Inference (cause-and-effect reasoning), State Updating (tracking evolving states), and State Abstraction (forming higher-level representations).
Benchmarks
We evaluate on two complementary subsets:
- Real-world Subset: 2,496 QA pairs from real agent environment streams
- Synthetic Subset: 1,200 QA pairs stratified across five trajectory lengths (8K, 16K, 32K, 64K, and 128K tokens)
Leaderboard Tabs
Agent Performance: Compares RAG and Agent Memory methods
- Domain Performance: Radar charts across 6 domains (GAME, Embodied AI, Web, Text2SQL, Openworld QA, Software Engineer)
- Capability Performance: Subplots showing performance on the 4 capabilities
- Top N Selection: Choose to display top 1-10 performers
Model Performance: Compares LLM models directly
- Domain Performance: Radar charts showing performance across different application domains
- Capability Performance: Subplots showing performance on each cognitive capability
- Top N Selection: Choose to display top 1-10 performers
Metrics
Results are reported as Accuracy and F1 Score:
- Charts display Accuracy only for clarity
- Summary statistics tables show both Avg Accuracy and Avg F1
- Tables include Rank with 🥇🥈🥉 medals for the top 3 performers
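Since charts show Accuracy while the tables add Avg F1, a minimal Python sketch of both metrics may help. The token-level F1 shown here is the common bag-of-words formulation over answer tokens; this is an assumption for illustration, as the leaderboard does not spell out its exact F1 definition.

```python
from collections import Counter

def accuracy(judge_scores: list[bool]) -> float:
    """Fraction of answers judged correct (what the charts display)."""
    return sum(judge_scores) / len(judge_scores)

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-words F1 between predicted and gold answer text.
    Assumed formulation; the official F1 definition may differ."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(accuracy([True, False, True]))                     # 0.666...
print(token_f1("moved right", "the agent moved right"))  # 0.666...
```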
Problem Type Distribution
- Type A (Recall): 33.6% - 839 questions
- Type B (Causal Inference): 23.9% - 596 questions
- Type C (State Updating): 25.9% - 647 questions
- Type D (State Abstraction): 16.6% - 414 questions
Submission Rules
📋 File Format
- Submissions must be in JSONL format (`.jsonl`), one line per episode
- Each line must be a valid JSON object containing the required fields below
- `question_uuid_list`, `answer_list`, and `llm_as_judge_score_list` must all be the same length
- Files containing duplicate `episode_id` entries will be rejected
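These rules are mechanical, so a pre-submission check can catch most rejections early. Below is a minimal sketch of such a check; it is not the official validator, and the `submission.jsonl` path is a placeholder.

```python
import json

REQUIRED = ("episode_id", "question_uuid_list",
            "answer_list", "llm_as_judge_score_list")

def validate(path: str) -> None:
    """Check a submission file against the format rules above."""
    seen = set()
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)  # each line must be valid JSON
            for field in REQUIRED:
                if field not in record:
                    raise ValueError(f"line {lineno}: missing {field!r}")
            # The three lists must all be the same length.
            n = len(record["question_uuid_list"])
            if (len(record["answer_list"]) != n
                    or len(record["llm_as_judge_score_list"]) != n):
                raise ValueError(f"line {lineno}: list lengths differ")
            # Duplicate episode_id entries get the whole file rejected.
            if record["episode_id"] in seen:
                raise ValueError(f"line {lineno}: duplicate episode_id")
            seen.add(record["episode_id"])

if __name__ == "__main__":
    validate("submission.jsonl")  # placeholder path
```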
📝 Required Fields

| Field | Type | Description |
|---|---|---|
| `episode_id` | string | Episode identifier, used to automatically resolve the domain |
| `question_uuid_list` | list[string] | UUIDs mapping each answer to a benchmark question, used to resolve capability (A/B/C/D) |
| `answer_list` | list[string] | Your model/agent's free-text answers, in the same order as `question_uuid_list` |
| `llm_as_judge_score_list` | list[bool] | Self-evaluated correctness (`true`/`false`) per answer |
✅ Verification & Scoring
- All submissions initially appear as `verified=false` (self-reported preview)
- The score shown immediately after submission is based on your `llm_as_judge_score_list`
- Official scores (`verified=true`) are recomputed weekly by our LLM-as-Judge evaluation system
- Only `verified=true` entries are displayed on the public leaderboard
⚠️ Important Notes
- Domain is resolved automatically from `episode_id`; no need to supply it manually
- Capability (A/B/C/D) is resolved automatically from each `question_uuid`; no need to supply it manually
- Official scores may differ from your self-reported preview after LLM-as-Judge re-evaluation
- We reserve the right to remove submissions that appear to contain fabricated or manipulated scores
Paper: https://arxiv.org/abs/2602.22769
For questions or submissions, please open a discussion in the Community tab.