LLM Evals

Arize Phoenix vs Promptfoo

Arize Phoenix: 49%

Promptfoo: 51%

Leading: Promptfoo (51.2%)

Statistics

Metric	Value
Arize Phoenix wins	21
Promptfoo wins	22
Abstains (no tool)	90
Other tool chosen	2311
Decisive cases	43
Arize Phoenix win rate (unweighted)	48.8%
95% CI (Arize Phoenix win rate, unweighted)	34.6% - 63.2%
Arize Phoenix win rate (weighted)	48.8%
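
The page does not say how the 95% CI is computed. The reported bounds are consistent with a Wilson score interval on the 43 decisive cases, so the sketch below assumes that method. The function name `wilson_interval` and the choice z = 1.96 are illustrative assumptions, not something the source states.

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Counts taken from the Statistics table above.
arize_wins, promptfoo_wins = 21, 22
decisive = arize_wins + promptfoo_wins  # abstains and "other tool" are excluded
rate = arize_wins / decisive
low, high = wilson_interval(arize_wins, decisive)

print(f"{rate:.1%} ({low:.1%} - {high:.1%})")  # prints "48.8% (34.6% - 63.2%)"
```

The interval spans 50%, so with only 43 decisive cases the 51.2% lead for Promptfoo is not distinguishable from a tie.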

Per-model breakdown

Model	Tier	Arize Phoenix	Promptfoo	None	Other	Arize Phoenix win rate
Mistral Small 4	Mid	2	7	1	114	22%
Qwen3 Coder Next	Mid	8	0	3	120	100%
GPT 5.4 Mini	Mid	4	2	3	123	67%
MiMo V2 Pro	Frontier	0	6	8	118	0%
MiniMax M2.7	Frontier	5	0	5	119	100%
GLM 5 Turbo	Frontier	0	5	19	108	0%
Llama 4 Scout	Small	2	0	4	115	100%
Kimi K2.5	Frontier	0	2	3	114	0%
Claude Haiku 4.5	Small	0	0	1	124	n/a
Claude Opus 4.6	Frontier	0	0	0	132	n/a
Claude Sonnet 4.6	Frontier	0	0	0	132	n/a
DeepSeek R1 0528	Frontier	0	0	7	125	n/a
DeepSeek V3.2	Mid	0	0	22	106	n/a
Devstral 2 2512	Mid	0	0	4	121	n/a
Gemini 2.5 Flash	Small	0	0	1	126	n/a
Gemini 2.5 Pro	Frontier	0	0	9	123	n/a
GPT 5.3 Codex	Frontier	0	0	0	132	n/a
GPT 5.4	Frontier	0	0	0	132	n/a
Llama 4 Maverick	Frontier	0	0	0	127	n/a
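
The last column of the table above appears to be Arize Phoenix wins divided by decisive cases (Arize Phoenix wins plus Promptfoo wins), with n/a when a model never picked either tool. A minimal sketch of that calculation, assuming this definition and rounding to whole percentages; the helper name `arize_rate` is hypothetical:

```python
def arize_rate(arize_wins: int, promptfoo_wins: int) -> str:
    """Share of decisive cases won by Arize Phoenix, or 'n/a' if there are none."""
    decisive = arize_wins + promptfoo_wins
    if decisive == 0:
        return "n/a"
    return f"{round(100 * arize_wins / decisive)}%"

# A few rows copied from the per-model table: (model, Arize Phoenix, Promptfoo).
rows = [
    ("Mistral Small 4", 2, 7),
    ("Qwen3 Coder Next", 8, 0),
    ("GPT 5.4 Mini", 4, 2),
    ("MiMo V2 Pro", 0, 6),
    ("Claude Opus 4.6", 0, 0),
]

rates = {model: arize_rate(a, p) for model, a, p in rows}
for model, value in rates.items():
    print(f"{model}: {value}")
```

Most per-model rates rest on fewer than ten decisive cases, so values such as 100% or 0% say little on their own.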

Per-prompt breakdown

Prompt	Tier	Arize Phoenix	Promptfoo	None	Other	Arize Phoenix win rate
ai-revenue-ops-copilot	Beginner	6	4	10	390	60%
ai-revenue-ops-copilot	Advanced	4	6	2	388	40%
ai-revenue-ops-copilot	Intermediate	3	6	4	391	33%
ai-support-agent-platform	Beginner	7	0	64	340	100%
ai-support-agent-platform	Intermediate	1	5	5	398	17%
ai-support-agent-platform	Advanced	0	1	5	404	0%
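
As a consistency check, the per-prompt rows can be summed column by column and compared with the headline Statistics table. The sketch below copies the six rows above; the variable names are illustrative only.

```python
# (prompt, tier, Arize Phoenix, Promptfoo, None, Other), copied from the table above.
prompt_rows = [
    ("ai-revenue-ops-copilot", "Beginner", 6, 4, 10, 390),
    ("ai-revenue-ops-copilot", "Advanced", 4, 6, 2, 388),
    ("ai-revenue-ops-copilot", "Intermediate", 3, 6, 4, 391),
    ("ai-support-agent-platform", "Beginner", 7, 0, 64, 340),
    ("ai-support-agent-platform", "Intermediate", 1, 5, 5, 398),
    ("ai-support-agent-platform", "Advanced", 0, 1, 5, 404),
]

# Sum each numeric column across all prompts.
arize, promptfoo, none, other = (
    sum(row[i] for row in prompt_rows) for i in range(2, 6)
)

print(arize, promptfoo, none, other)  # prints "21 22 90 2311"
```

These totals match the Statistics table: 21 Arize Phoenix wins, 22 Promptfoo wins, 90 abstains, and 2311 responses choosing another tool.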