Preseason


HumanFirst vs Langfuse

HumanFirst 45% vs Langfuse 55%

Leading: Langfuse (55.0%)

Statistics

Metric                              Value
HumanFirst wins                     18
Langfuse wins                       22
Abstains (no tool)                  36
Other tool chosen                   924
Decisive cases                      40
HumanFirst win rate (unweighted)    45.0%
95% CI                              30.7% - 60.2%
HumanFirst win rate (weighted)      45.0%
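
The page does not say how the 95% CI is computed. The reported bounds are consistent with a Wilson score interval on 18 wins out of 40 decisive cases, so the sketch below assumes that method. The function name `wilson_interval` and the choice of z = 1.96 are illustrative assumptions, not something the page states.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z**2 / n
    # Center is pulled toward 0.5 relative to the raw proportion.
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 18 HumanFirst wins out of 40 decisive cases (18 + 22).
low, high = wilson_interval(18, 40)
print(f"{low:.1%} - {high:.1%}")  # prints "30.7% - 60.2%"
```

With only 40 decisive cases the interval spans 50%, so the 55/45 split on its own does not separate the two tools.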

Per-model breakdown

Model               Tier      HumanFirst  Langfuse  None  Other  A rate (HumanFirst)
Devstral 2 2512     Mid       18          0         1     32     100%
Claude Sonnet 4.6   Frontier  0           7         0     47     0%
Qwen3 Coder Next    Mid       0           6         3     45     0%
GPT 5.4 Mini        Mid       0           4         1     49     0%
Llama 4 Scout       Small     0           3         3     43     0%
Claude Haiku 4.5    Small     0           2         1     49     0%
Claude Opus 4.6     Frontier  0           0         0     54     n/a
DeepSeek R1 0528    Frontier  0           0         2     52     n/a
DeepSeek V3.2       Mid       0           0         9     43     n/a
Gemini 2.5 Flash    Small     0           0         0     52     n/a
Gemini 2.5 Pro      Frontier  0           0         6     48     n/a
GLM 5 Turbo         Frontier  0           0         7     47     n/a
GPT 5.3 Codex       Frontier  0           0         0     54     n/a
GPT 5.4             Frontier  0           0         0     54     n/a
Kimi K2.5           Frontier  0           0         0     48     n/a
Llama 4 Maverick    Frontier  0           0         0     53     n/a
MiMo V2 Pro         Frontier  0           0         2     52     n/a
MiniMax M2.7        Frontier  0           0         1     51     n/a
Mistral Small 4     Mid       0           0         0     51     n/a

Per-prompt breakdown

Prompt                     Tier          HumanFirst  Langfuse  None  Other  A rate (HumanFirst)
ai-support-agent-platform  Beginner      5           11        25    128    31%
ai-revenue-ops-copilot     Beginner      7           2         4     155    78%
ai-support-agent-platform  Intermediate  2           4         4     156    33%
ai-revenue-ops-copilot     Intermediate  1           5         1     157    17%
ai-revenue-ops-copilot     Advanced      2           0         1     161    100%
ai-support-agent-platform  Advanced      1           0         1     167    100%
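
The "A rate" column is not defined on the page. Every row is consistent with HumanFirst wins divided by decisive cases (HumanFirst wins plus Langfuse wins), with "n/a" where a row has no decisive cases. The sketch below assumes that definition; the helper names `a_rate` and `format_rate` are hypothetical.

```python
from typing import Optional


def a_rate(a_wins: int, b_wins: int) -> Optional[float]:
    """Share of decisive cases won by tool A; None when there are none.

    Assumed definition: abstains and "other tool" picks are excluded.
    """
    decisive = a_wins + b_wins
    if decisive == 0:
        return None
    return a_wins / decisive


def format_rate(rate: Optional[float]) -> str:
    """Render a rate the way the tables do: whole percent, or n/a."""
    return "n/a" if rate is None else f"{rate:.0%}"


# (prompt, tier, HumanFirst wins, Langfuse wins) from the per-prompt table.
rows = [
    ("ai-support-agent-platform", "Beginner", 5, 11),
    ("ai-revenue-ops-copilot", "Beginner", 7, 2),
    ("ai-support-agent-platform", "Intermediate", 2, 4),
    ("ai-revenue-ops-copilot", "Intermediate", 1, 5),
    ("ai-revenue-ops-copilot", "Advanced", 2, 0),
    ("ai-support-agent-platform", "Advanced", 1, 0),
]

for prompt, tier, humanfirst, langfuse in rows:
    print(prompt, tier, format_rate(a_rate(humanfirst, langfuse)))

# Row totals should match the headline statistics: 18 and 22 wins.
total_humanfirst = sum(r[2] for r in rows)
total_langfuse = sum(r[3] for r in rows)
print(total_humanfirst, total_langfuse, format_rate(a_rate(total_humanfirst, total_langfuse)))
```

Several rows rest on very few decisive cases (the two Advanced rows have two and one), so a 100% rate there carries little weight.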