LLM Evals

Arize AI vs Braintrust

Arize AI: 13%
Braintrust: 87%

Leading: Braintrust (87.3%)

Statistics

Metric	Value
Arize AI wins	105
Braintrust wins	720
Abstains (no tool)	90
Other tool chosen	1529
Decisive cases	825
Arize AI win rate (unweighted)	12.7%
95% CI	10.6% - 15.2%
Arize AI win rate (weighted)	12.7%
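
The unweighted win rate and its interval can be reproduced from the win counts in the table. The page does not say which interval method it uses; the sketch below assumes a Wilson score interval with z = 1.96, which happens to match the reported bounds. The function name `wilson_interval` is illustrative, not taken from the source. The weighted rate is not reproduced here because the weights are not given.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts from the Statistics table.
arize_wins = 105
braintrust_wins = 720

# Decisive cases exclude abstains and "other tool" picks.
decisive = arize_wins + braintrust_wins

win_rate = arize_wins / decisive
low, high = wilson_interval(arize_wins, decisive)

print(f"Decisive cases: {decisive}")
print(f"Arize AI win rate: {win_rate:.1%}")
print(f"95% CI: {low:.1%} - {high:.1%}")
```

Only the 825 decisive cases enter the denominator; the 90 abstains and 1529 other-tool picks are left out of the rate.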

Per-model breakdown

Model	Tier	Arize AI wins	Braintrust wins	Abstains	Other tool	Arize AI win rate
GPT 5.3 Codex	Frontier	0	126	0	6	0%
Claude Opus 4.6	Frontier	0	113	0	19	0%
Kimi K2.5	Frontier	0	109	3	7	0%
Claude Haiku 4.5	Small	0	103	1	21	0%
GPT 5.4	Frontier	0	92	0	40	0%
GLM 5 Turbo	Frontier	0	84	19	29	0%
Claude Sonnet 4.6	Frontier	0	58	0	74	0%
Llama 4 Scout	Small	40	0	4	77	100%
MiniMax M2.7	Frontier	1	31	5	92	3%
Gemini 2.5 Flash	Small	27	0	1	99	100%
Llama 4 Maverick	Frontier	13	0	0	114	100%
Gemini 2.5 Pro	Frontier	10	0	9	113	100%
Devstral 2 2512	Mid	8	0	4	113	100%
DeepSeek R1 0528	Frontier	3	0	7	122	100%
GPT 5.4 Mini	Mid	0	3	3	126	0%
Mistral Small 4	Mid	2	0	1	121	100%
Qwen3 Coder Next	Mid	1	0	3	127	100%
MiMo V2 Pro	Frontier	0	1	8	123	0%
DeepSeek V3.2	Mid	0	0	22	106	n/a
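
The last column of the table above appears to be Arize AI wins divided by decisive cases (Arize AI wins plus Braintrust wins), rounded to a whole percent, with "n/a" when a model produced no decisive cases. A minimal sketch of that reading follows; the helper name `arize_rate` is hypothetical, and the rounding rule is an assumption inferred from the displayed values.

```python
def arize_rate(arize_wins: int, braintrust_wins: int) -> str:
    """Arize AI share of decisive cases as a whole percent, or "n/a"."""
    decisive = arize_wins + braintrust_wins
    if decisive == 0:
        # No head-to-head outcome at all, so the rate is undefined.
        return "n/a"
    return f"{round(100 * arize_wins / decisive)}%"


# (Arize AI wins, Braintrust wins) for a few rows of the per-model table.
rows = {
    "MiniMax M2.7": (1, 31),
    "Llama 4 Scout": (40, 0),
    "GPT 5.4 Mini": (0, 3),
    "DeepSeek V3.2": (0, 0),
}

for model, (a, b) in rows.items():
    print(f"{model}: {arize_rate(a, b)}")
```

Several of the 0% and 100% rates rest on very few decisive cases (for example, Qwen3 Coder Next has one), so they say little on their own.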

Per-prompt breakdown

Prompt	Tier	Arize AI wins	Braintrust wins	Abstains	Other tool	Arize AI win rate
ai-revenue-ops-copilot	Advanced	23	153	2	222	13%
ai-revenue-ops-copilot	Beginner	15	143	10	242	9%
ai-support-agent-platform	Advanced	18	123	5	264	13%
ai-revenue-ops-copilot	Intermediate	11	122	4	267	8%
ai-support-agent-platform	Beginner	18	92	64	237	16%
ai-support-agent-platform	Intermediate	20	87	5	297	19%