Preseason

© 2026 Preseason. All rights reserved.

LLM Evals

HumanFirst vs Langfuse

HumanFirst 49% vs Langfuse 51%

Leading: Langfuse (51.2%)

Statistics

| Metric | Value |
| --- | --- |
| HumanFirst wins | 41 |
| Langfuse wins | 43 |
| Abstains (no tool) | 90 |
| Other tool chosen | 2270 |
| Decisive cases | 84 |
| HumanFirst win rate (unweighted) | 48.8% |
| 95% CI | 38.4% - 59.3% |
| HumanFirst win rate (weighted) | 48.8% |
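
The unweighted win rate and its interval can be recomputed from the two win counts. The sketch below is an illustration, not the site's own code: the page does not say which interval method it uses, so the Wilson score interval is an assumption here (it happens to reproduce the reported bounds), and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method)."""
    if n == 0:
        raise ValueError("no decisive cases")
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Counts taken from the Statistics table.
humanfirst_wins, langfuse_wins = 41, 43

# Decisive cases exclude abstains and "other tool" picks.
decisive = humanfirst_wins + langfuse_wins
win_rate = humanfirst_wins / decisive
low, high = wilson_interval(humanfirst_wins, decisive)

print(f"{win_rate:.1%}  [{low:.1%}, {high:.1%}]")  # 48.8%  [38.4%, 59.3%]
```

The interval spans 50%, so on 84 decisive cases the 51.2% lead for Langfuse is not distinguishable from an even split.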

Per-model breakdown

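| Model | Tier | HumanFirst | Langfuse | None | Other | A rate |
| --- | --- | --- | --- | --- | --- | --- |
| Devstral 2 2512 | Mid | 41 | 0 | 4 | 80 | 100% |
| Qwen3 Coder Next | Mid | 0 | 15 | 3 | 113 | 0% |
| Claude Sonnet 4.6 | Frontier | 0 | 13 | 0 | 119 | 0% |
| GPT 5.4 Mini | Mid | 0 | 6 | 3 | 123 | 0% |
| Llama 4 Scout | Small | 0 | 4 | 4 | 113 | 0% |
| Claude Haiku 4.5 | Small | 0 | 3 | 1 | 121 | 0% |
| DeepSeek V3.2 | Mid | 0 | 1 | 22 | 105 | 0% |
| Mistral Small 4 | Mid | 0 | 1 | 1 | 122 | 0% |
| Claude Opus 4.6 | Frontier | 0 | 0 | 0 | 132 | n/a |
| DeepSeek R1 0528 | Frontier | 0 | 0 | 7 | 125 | n/a |
| Gemini 2.5 Flash | Small | 0 | 0 | 1 | 126 | n/a |
| Gemini 2.5 Pro | Frontier | 0 | 0 | 9 | 123 | n/a |
| GLM 5 Turbo | Frontier | 0 | 0 | 19 | 113 | n/a |
| GPT 5.3 Codex | Frontier | 0 | 0 | 0 | 132 | n/a |
| GPT 5.4 | Frontier | 0 | 0 | 0 | 132 | n/a |
| Kimi K2.5 | Frontier | 0 | 0 | 3 | 116 | n/a |
| Llama 4 Maverick | Frontier | 0 | 0 | 0 | 127 | n/a |
| MiMo V2 Pro | Frontier | 0 | 0 | 8 | 124 | n/a |
| MiniMax M2.7 | Frontier | 0 | 0 | 5 | 124 | n/a |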
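
The "A rate" column is not defined on the page. The sketch below assumes it is the first-listed tool's (HumanFirst's) share of decisive cases, reported as n/a when a model never picked either tool; that reading matches every row above. The helper names `a_rate` and `format_rate` are hypothetical.

```python
from typing import Optional


def a_rate(a_wins: int, b_wins: int) -> Optional[float]:
    """Share of decisive cases won by tool A; None when there are none."""
    decisive = a_wins + b_wins
    return a_wins / decisive if decisive else None


def format_rate(rate: Optional[float]) -> str:
    """Render a rate the way the table does: whole percent or 'n/a'."""
    return "n/a" if rate is None else f"{rate:.0%}"


# (HumanFirst wins, Langfuse wins) for three rows of the per-model table.
rows = {
    "Devstral 2 2512": (41, 0),
    "Qwen3 Coder Next": (0, 15),
    "Claude Opus 4.6": (0, 0),
}

formatted = {model: format_rate(a_rate(*wins)) for model, wins in rows.items()}
for model, text in formatted.items():
    print(f"{model}: {text}")
```

Under this reading, all 41 HumanFirst wins come from a single model, which is worth keeping in mind when interpreting the near-even headline split.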

Per-prompt breakdown

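| Prompt | Tier | HumanFirst | Langfuse | None | Other | A rate |
| --- | --- | --- | --- | --- | --- | --- |
| ai-support-agent-platform | Beginner | 11 | 21 | 64 | 315 | 34% |
| ai-revenue-ops-copilot | Beginner | 16 | 4 | 10 | 380 | 80% |
| ai-support-agent-platform | Intermediate | 3 | 7 | 5 | 394 | 30% |
| ai-revenue-ops-copilot | Intermediate | 1 | 8 | 4 | 391 | 11% |
| ai-support-agent-platform | Advanced | 4 | 3 | 5 | 398 | 57% |
| ai-revenue-ops-copilot | Advanced | 6 | 0 | 2 | 392 | 100% |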
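
As a consistency check, the per-prompt counts can be summed column by column and compared with the Statistics table. This is a minimal sketch using the row values above; the variable names are illustrative only.

```python
# (prompt, tier, HumanFirst, Langfuse, None, Other) from the per-prompt table.
prompt_rows = [
    ("ai-support-agent-platform", "Beginner", 11, 21, 64, 315),
    ("ai-revenue-ops-copilot", "Beginner", 16, 4, 10, 380),
    ("ai-support-agent-platform", "Intermediate", 3, 7, 5, 394),
    ("ai-revenue-ops-copilot", "Intermediate", 1, 8, 4, 391),
    ("ai-support-agent-platform", "Advanced", 4, 3, 5, 398),
    ("ai-revenue-ops-copilot", "Advanced", 6, 0, 2, 392),
]

# Sum each count column across the six prompt/tier rows.
counts = [row[2:] for row in prompt_rows]
totals = [sum(column) for column in zip(*counts)]
humanfirst, langfuse, abstains, other = totals

# Overall unweighted win rate over decisive cases only.
overall = humanfirst / (humanfirst + langfuse)

print(totals)             # [41, 43, 90, 2270]
print(f"{overall:.1%}")   # 48.8%
```

The column sums match the headline counts (41, 43, 90, 2270), so the per-prompt rows account for every case in the Statistics table.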