LLM Evals

Promptfoo vs Braintrust

Share of decisive cases: Promptfoo 4%, Braintrust 96%

Leading: Braintrust (96.5%)

Statistics

Metric	Value
Promptfoo wins	11
Braintrust wins	303
Abstains (no tool)	36
Other tool chosen	650
Decisive cases	314
Promptfoo win rate (unweighted)	3.5%
95% CI	2.0% - 6.2%
Promptfoo win rate (weighted)	3.5%
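
The page does not say how the 95% CI is computed. The reported bounds are consistent with a Wilson score interval over the 314 decisive cases, so the sketch below assumes that method. The function name `wilson_interval` and the fixed `z = 1.96` are illustrative choices, not taken from the page.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumed method)."""
    if n == 0:
        raise ValueError("need at least one decisive case")
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin


# Counts from the Statistics table above.
promptfoo_wins = 11
braintrust_wins = 303

# Decisive cases exclude abstentions and "other tool" picks.
decisive = promptfoo_wins + braintrust_wins
win_rate = promptfoo_wins / decisive
low, high = wilson_interval(promptfoo_wins, decisive)

print(f"{win_rate:.1%} ({low:.1%} - {high:.1%})")  # 3.5% (2.0% - 6.2%)
```

Abstentions (36) and other-tool picks (650) do not enter the denominator, which is why the rate is computed over 314 cases rather than all 1,000 responses.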

Per-model breakdown

Model	Tier	Promptfoo	Braintrust	None	Other	Promptfoo win rate
GPT 5.3 Codex	Frontier	0	51	0	3	0%
Claude Opus 4.6	Frontier	0	46	0	8	0%
Kimi K2.5	Frontier	1	44	0	3	2%
Claude Haiku 4.5	Small	0	44	1	7	0%
GPT 5.4	Frontier	0	39	0	15	0%
GLM 5 Turbo	Frontier	4	32	7	11	11%
Claude Sonnet 4.6	Frontier	0	28	0	26	0%
MiniMax M2.7	Frontier	0	17	1	34	0%
Mistral Small 4	Mid	4	0	0	47	100%
MiMo V2 Pro	Frontier	2	1	2	49	67%
GPT 5.4 Mini	Mid	0	1	1	52	0%
DeepSeek R1 0528	Frontier	0	0	2	52	n/a
DeepSeek V3.2	Mid	0	0	9	43	n/a
Devstral 2 2512	Mid	0	0	1	50	n/a
Gemini 2.5 Flash	Small	0	0	0	52	n/a
Gemini 2.5 Pro	Frontier	0	0	6	48	n/a
Llama 4 Maverick	Frontier	0	0	0	53	n/a
Llama 4 Scout	Small	0	0	3	46	n/a
Qwen3 Coder Next	Mid	0	0	3	51	n/a
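
The last column of the table above appears to be Promptfoo's share of decisive cases for each row, with "n/a" where a model picked neither tool. A minimal sketch of that calculation follows; the helper name `promptfoo_rate` is hypothetical, and the rounding to whole percentages is inferred from the displayed values.

```python
def promptfoo_rate(promptfoo, braintrust):
    """Promptfoo's share of decisive cases, or "n/a" if there are none."""
    decisive = promptfoo + braintrust
    if decisive == 0:
        return "n/a"
    return f"{promptfoo / decisive:.0%}"


# (Promptfoo, Braintrust) counts copied from rows of the table above.
rows = {
    "GLM 5 Turbo": (4, 32),
    "Mistral Small 4": (4, 0),
    "MiMo V2 Pro": (2, 1),
    "DeepSeek R1 0528": (0, 0),
}

rates = {model: promptfoo_rate(a, b) for model, (a, b) in rows.items()}
for model, rate in rates.items():
    print(f"{model}: {rate}")
```

Rows with very few decisive cases, such as Mistral Small 4 (4 cases) and MiMo V2 Pro (3 cases), produce extreme rates that rest on very little data.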

Per-prompt breakdown

Prompt	Tier	Promptfoo	Braintrust	None	Other	Promptfoo win rate
ai-revenue-ops-copilot	Beginner	2	63	4	99	3%
ai-revenue-ops-copilot	Advanced	2	63	1	98	3%
ai-revenue-ops-copilot	Intermediate	4	49	1	110	8%
ai-support-agent-platform	Advanced	1	52	1	115	2%
ai-support-agent-platform	Intermediate	2	37	4	123	5%
ai-support-agent-platform	Beginner	0	39	25	105	0%