Preseason

© 2026 Preseason. All rights reserved.

LLM Evals

MLflow vs Vellum

MLflow 42% vs Vellum 58%

Leading: Vellum (57.6%)

Statistics

| Metric | Value |
| --- | --- |
| MLflow wins | 14 |
| Vellum wins | 19 |
| Abstains (no tool) | 90 |
| Other tool chosen | 2321 |
| Decisive cases | 33 |
| MLflow win rate (unweighted) | 42.4% |
| 95% CI | 27.2% - 59.2% |
| MLflow win rate (weighted) | 42.4% |
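
The unweighted win rate and its interval can be re-derived from the win counts alone. The sketch below is a minimal check, not the site's own code: the page does not say which interval method it uses, so the Wilson score interval with z = 1.96 is an assumption, chosen because it reproduces the reported bounds. The function name `wilson_interval` is hypothetical. The weighted rate is not reproduced, since the weights are not given here.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts taken from the Statistics table.
mlflow_wins, vellum_wins = 14, 19

# Decisive cases are those where one of the two tools was chosen.
decisive = mlflow_wins + vellum_wins
win_rate = mlflow_wins / decisive
low, high = wilson_interval(mlflow_wins, decisive)

print(f"Decisive cases: {decisive}")             # 33
print(f"MLflow win rate: {win_rate:.1%}")        # 42.4%
print(f"95% CI: {low:.1%} - {high:.1%}")         # 27.2% - 59.2%
```

With only 33 decisive cases the interval spans 50%, so the 58% lead for Vellum is not a statistically clear preference.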

Per-model breakdown

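| Model | Tier | MLflow | Vellum | None | Other | A rate (MLflow) |
| --- | --- | --- | --- | --- | --- | --- |
| Devstral 2 2512 | Mid | 0 | 18 | 4 | 103 | 0% |
| Llama 4 Scout | Small | 14 | 0 | 4 | 103 | 100% |
| MiMo V2 Pro | Frontier | 0 | 1 | 8 | 123 | 0% |
| Claude Haiku 4.5 | Small | 0 | 0 | 1 | 124 | n/a |
| Claude Opus 4.6 | Frontier | 0 | 0 | 0 | 132 | n/a |
| Claude Sonnet 4.6 | Frontier | 0 | 0 | 0 | 132 | n/a |
| DeepSeek R1 0528 | Frontier | 0 | 0 | 7 | 125 | n/a |
| DeepSeek V3.2 | Mid | 0 | 0 | 22 | 106 | n/a |
| Gemini 2.5 Flash | Small | 0 | 0 | 1 | 126 | n/a |
| Gemini 2.5 Pro | Frontier | 0 | 0 | 9 | 123 | n/a |
| GLM 5 Turbo | Frontier | 0 | 0 | 19 | 113 | n/a |
| GPT 5.3 Codex | Frontier | 0 | 0 | 0 | 132 | n/a |
| GPT 5.4 | Frontier | 0 | 0 | 0 | 132 | n/a |
| GPT 5.4 Mini | Mid | 0 | 0 | 3 | 129 | n/a |
| Kimi K2.5 | Frontier | 0 | 0 | 3 | 116 | n/a |
| Llama 4 Maverick | Frontier | 0 | 0 | 0 | 127 | n/a |
| MiniMax M2.7 | Frontier | 0 | 0 | 5 | 124 | n/a |
| Mistral Small 4 | Mid | 0 | 0 | 1 | 123 | n/a |
| Qwen3 Coder Next | Mid | 0 | 0 | 3 | 128 | n/a |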

Per-prompt breakdown

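| Prompt | Tier | MLflow | Vellum | None | Other | A rate (MLflow) |
| --- | --- | --- | --- | --- | --- | --- |
| ai-support-agent-platform | Intermediate | 3 | 11 | 5 | 390 | 21% |
| ai-revenue-ops-copilot | Beginner | 4 | 2 | 10 | 394 | 67% |
| ai-revenue-ops-copilot | Intermediate | 2 | 2 | 4 | 396 | 50% |
| ai-support-agent-platform | Beginner | 0 | 4 | 64 | 343 | 0% |
| ai-support-agent-platform | Advanced | 3 | 0 | 5 | 402 | 100% |
| ai-revenue-ops-copilot | Advanced | 2 | 0 | 2 | 396 | 100% |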
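
The per-prompt rows can be checked against the headline statistics. The sketch below makes two assumptions that the page does not state: "A rate" is read as MLflow's share of decisive cases (MLflow / (MLflow + Vellum)), and the run-together digits in each row are split into the column values that make the totals reconcile. The helper `a_rate` and the `rows` layout are hypothetical names used only for illustration.

```python
from typing import Optional

# (prompt, tier, MLflow, Vellum, None, Other), as read from the per-prompt table.
rows = [
    ("ai-support-agent-platform", "Intermediate", 3, 11, 5, 390),
    ("ai-revenue-ops-copilot", "Beginner", 4, 2, 10, 394),
    ("ai-revenue-ops-copilot", "Intermediate", 2, 2, 4, 396),
    ("ai-support-agent-platform", "Beginner", 0, 4, 64, 343),
    ("ai-support-agent-platform", "Advanced", 3, 0, 5, 402),
    ("ai-revenue-ops-copilot", "Advanced", 2, 0, 2, 396),
]


def a_rate(mlflow: int, vellum: int) -> Optional[float]:
    """MLflow share of decisive cases; None when neither tool was chosen (assumed definition)."""
    decisive = mlflow + vellum
    return None if decisive == 0 else mlflow / decisive


# Per-row rates, formatted the way the table shows them.
rates = []
for prompt, tier, mlflow, vellum, none, other in rows:
    rate = a_rate(mlflow, vellum)
    rates.append("n/a" if rate is None else f"{rate:.0%}")
    print(f"{prompt} ({tier}): {rates[-1]}")

# Column totals should match the Statistics table: 14, 19, 90, 2321.
totals = [sum(col) for col in zip(*(r[2:] for r in rows))]
print(totals)
```

The same `a_rate` helper yields `n/a` for the per-model rows in which a model never picked either tool.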