LLM Evals

HumanFirst vs Langfuse

HumanFirst: 49%

Langfuse: 51%

Leading: Langfuse (51.2%)

Statistics

Metric	Value
HumanFirst wins	41
Langfuse wins	43
Abstains (no tool)	90
Other tool chosen	2270
Decisive cases	84
HumanFirst win rate (unweighted)	48.8%
95% CI	38.4% - 59.3%
HumanFirst win rate (weighted)	48.8%
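
The win rate and interval above can be reproduced from the two win counts. The sketch below assumes the 95% CI is a Wilson score interval with z = 1.96; the page does not state its method, but this choice matches the reported bounds. The function name `wilson_interval` is illustrative and not taken from the source.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumed method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts taken from the Statistics table.
humanfirst_wins = 41
langfuse_wins = 43

# Decisive cases exclude abstentions and picks of other tools.
decisive = humanfirst_wins + langfuse_wins
rate = humanfirst_wins / decisive
low, high = wilson_interval(humanfirst_wins, decisive)

print(f"HumanFirst win rate: {rate:.1%}")
print(f"95% CI: {low:.1%} - {high:.1%}")
```

With 84 decisive cases the interval spans 50%, so the 51.2% lead for Langfuse is within the margin of error.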

Per-model breakdown

Model	Tier	HumanFirst	Langfuse	None	Other	HumanFirst rate
Devstral 2 2512	Mid	41	0	4	80	100%
Qwen3 Coder Next	Mid	0	15	3	113	0%
Claude Sonnet 4.6	Frontier	0	13	0	119	0%
GPT 5.4 Mini	Mid	0	6	3	123	0%
Llama 4 Scout	Small	0	4	4	113	0%
Claude Haiku 4.5	Small	0	3	1	121	0%
DeepSeek V3.2	Mid	0	1	22	105	0%
Mistral Small 4	Mid	0	1	1	122	0%
Claude Opus 4.6	Frontier	0	0	0	132	n/a
DeepSeek R1 0528	Frontier	0	0	7	125	n/a
Gemini 2.5 Flash	Small	0	0	1	126	n/a
Gemini 2.5 Pro	Frontier	0	0	9	123	n/a
GLM 5 Turbo	Frontier	0	0	19	113	n/a
GPT 5.3 Codex	Frontier	0	0	0	132	n/a
GPT 5.4	Frontier	0	0	0	132	n/a
Kimi K2.5	Frontier	0	0	3	116	n/a
Llama 4 Maverick	Frontier	0	0	0	127	n/a
MiMo V2 Pro	Frontier	0	0	8	124	n/a
MiniMax M2.7	Frontier	0	0	5	124	n/a

Per-prompt breakdown

Prompt	Tier	HumanFirst	Langfuse	None	Other	HumanFirst rate
ai-support-agent-platform	Beginner	11	21	64	315	34%
ai-revenue-ops-copilot	Beginner	16	4	10	380	80%
ai-support-agent-platform	Intermediate	3	7	5	394	30%
ai-revenue-ops-copilot	Intermediate	1	8	4	391	11%
ai-support-agent-platform	Advanced	4	3	5	398	57%
ai-revenue-ops-copilot	Advanced	6	0	2	392	100%
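
The rate in the last column of each breakdown table is not defined on the page. It appears to be HumanFirst wins divided by decisive cases (HumanFirst plus Langfuse wins), with n/a when a row has no decisive cases. A minimal sketch under that assumption, using a few rows from the tables above (the helper names are illustrative):

```python
def humanfirst_rate(humanfirst_wins, langfuse_wins):
    """Share of decisive cases won by HumanFirst, or None if there are none."""
    decisive = humanfirst_wins + langfuse_wins
    if decisive == 0:
        return None
    return humanfirst_wins / decisive


def format_rate(rate):
    """Render a rate the way the tables do: whole percent, or n/a."""
    return "n/a" if rate is None else f"{rate:.0%}"


# (label, HumanFirst wins, Langfuse wins), copied from the breakdown tables.
rows = [
    ("ai-support-agent-platform / Beginner", 11, 21),
    ("ai-revenue-ops-copilot / Intermediate", 1, 8),
    ("ai-revenue-ops-copilot / Advanced", 6, 0),
    ("Devstral 2 2512", 41, 0),
    ("Claude Opus 4.6", 0, 0),
]

for label, hf, lf in rows:
    print(f"{label}: {format_rate(humanfirst_rate(hf, lf))}")
```

Rows with only a handful of decisive cases, such as 6 of 6 or 1 of 9, produce extreme percentages, so the per-row rates are best read alongside the raw counts.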