LLM Evals

LangSmith vs Braintrust

LangSmith: 54%
Braintrust: 46%

Leading: LangSmith (53.8%)

Statistics

Metric	Value
LangSmith wins	837
Braintrust wins	720
Abstains (no tool)	90
Other tool chosen	797
Decisive cases	1557
LangSmith win rate (unweighted)	53.8%
95% CI	51.3% - 56.2%
LangSmith win rate (weighted)	53.8%
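
The page does not state how the win rate and interval were computed. As a rough sketch, assuming the rate is LangSmith wins divided by decisive cases and the interval is a normal-approximation (Wald) interval, the figures above can be reproduced as follows. The function name `win_rate_with_ci` is hypothetical, and the choice of interval is an assumption that happens to match the published numbers.

```python
import math


def win_rate_with_ci(wins_a, wins_b, z=1.96):
    """Return (rate, lower, upper) for option A over decisive cases only.

    Assumption: a Wald (normal-approximation) interval; the source does
    not say which method it uses.
    """
    decisive = wins_a + wins_b  # abstains and "other tool" are excluded
    rate = wins_a / decisive
    half_width = z * math.sqrt(rate * (1 - rate) / decisive)
    return rate, rate - half_width, rate + half_width


rate, lower, upper = win_rate_with_ci(837, 720)
print(f"{rate:.1%} ({lower:.1%} - {upper:.1%})")  # prints "53.8% (51.3% - 56.2%)"
```

With 1557 decisive cases the interval sits entirely above 50%, which is why LangSmith is shown as leading even though the margin is small.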

Per-model breakdown

Model	Tier	LangSmith	Braintrust	None	Other	LangSmith rate
GPT 5.3 Codex	Frontier	6	126	0	0	5%
Claude Haiku 4.5	Small	13	103	1	8	11%
Claude Opus 4.6	Frontier	1	113	0	18	1%
Claude Sonnet 4.6	Frontier	55	58	0	19	49%
GPT 5.4	Frontier	18	92	0	22	16%
Kimi K2.5	Frontier	0	109	3	7	0%
DeepSeek R1 0528	Frontier	108	0	7	17	100%
Gemini 2.5 Pro	Frontier	106	0	9	17	100%
GLM 5 Turbo	Frontier	18	84	19	11	18%
GPT 5.4 Mini	Mid	97	3	3	29	97%
Mistral Small 4	Mid	94	0	1	29	100%
MiniMax M2.7	Frontier	62	31	5	31	67%
Qwen3 Coder Next	Mid	88	0	3	40	100%
DeepSeek V3.2	Mid	80	0	22	26	100%
MiMo V2 Pro	Frontier	62	1	8	61	98%
Llama 4 Maverick	Frontier	21	0	0	106	100%
Devstral 2 2512	Mid	5	0	4	116	100%
Gemini 2.5 Flash	Small	3	0	1	123	100%
Llama 4 Scout	Small	0	0	4	117	n/a
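
The final rate column in this table appears to be LangSmith's share of decisive picks in each row, ignoring the None and Other columns. The sketch below assumes that definition; it is consistent with the rows shown but is not confirmed by the page. The helper name `langsmith_rate` is hypothetical.

```python
def langsmith_rate(langsmith, braintrust):
    """Share of decisive picks that went to LangSmith, as a display string.

    Assumption: None and Other counts are left out of the denominator.
    """
    decisive = langsmith + braintrust
    if decisive == 0:
        # No decisive cases, so the rate is undefined
        return "n/a"
    return f"{round(100 * langsmith / decisive)}%"


print(langsmith_rate(6, 126))   # GPT 5.3 Codex
print(langsmith_rate(55, 58))   # Claude Sonnet 4.6
print(langsmith_rate(0, 0))     # Llama 4 Scout
```

Under this reading, a row can show 100% on very few decisive cases (for example 3 of 3), so the rate is best read alongside the raw counts.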

Per-prompt breakdown

Prompt	Tier	LangSmith	Braintrust	None	Other	LangSmith rate
ai-revenue-ops-copilot	Intermediate	182	122	4	96	60%
ai-support-agent-platform	Intermediate	195	87	5	122	69%
ai-revenue-ops-copilot	Advanced	124	153	2	121	45%
ai-revenue-ops-copilot	Beginner	119	143	10	138	45%
ai-support-agent-platform	Advanced	134	123	5	148	52%
ai-support-agent-platform	Beginner	83	92	64	172	47%