LLM Evals

Humanloop vs HumanFirst

Humanloop: 49%

HumanFirst: 51%

Leading: HumanFirst (51.4%)

Statistics

Metric	Value
Humanloop wins	17
HumanFirst wins	18
Abstains (no tool)	36
Other tool chosen	929
Decisive cases	35
Humanloop win rate (unweighted)	48.6%
95% CI	33.0% - 64.4%
Humanloop win rate (weighted)	48.6%
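
The unweighted win rate and its interval can be reproduced from the two win counts alone. The sketch below assumes the 95% CI is a Wilson score interval with z = 1.96. The page does not name its method, but the reported bounds are consistent with that choice. The function name `wilson_interval` is illustrative and not taken from the source.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion wins / n.

    Assumption: the page's 95% CI uses this method; it is not stated there.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


humanloop_wins, humanfirst_wins = 17, 18
# Decisive cases exclude abstains and runs where another tool was chosen.
decisive = humanloop_wins + humanfirst_wins
win_rate = humanloop_wins / decisive
low, high = wilson_interval(humanloop_wins, decisive)

print(f"win rate {win_rate:.1%}, 95% CI {low:.1%} - {high:.1%}")
# win rate 48.6%, 95% CI 33.0% - 64.4%
```

The interval spans 50%, so with only 35 decisive cases the data do not separate the two tools.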

Per-model breakdown

Model	Tier	Humanloop	HumanFirst	None	Other	Humanloop win rate
Devstral 2 2512	Mid	8	18	1	24	31%
Gemini 2.5 Flash	Small	4	0	0	48	100%
DeepSeek R1 0528	Frontier	2	0	2	50	100%
DeepSeek V3.2	Mid	2	0	9	41	100%
Claude Haiku 4.5	Small	1	0	1	50	100%
Claude Opus 4.6	Frontier	0	0	0	54	n/a
Claude Sonnet 4.6	Frontier	0	0	0	54	n/a
Gemini 2.5 Pro	Frontier	0	0	6	48	n/a
GLM 5 Turbo	Frontier	0	0	7	47	n/a
GPT 5.3 Codex	Frontier	0	0	0	54	n/a
GPT 5.4	Frontier	0	0	0	54	n/a
GPT 5.4 Mini	Mid	0	0	1	53	n/a
Kimi K2.5	Frontier	0	0	0	48	n/a
Llama 4 Maverick	Frontier	0	0	0	53	n/a
Llama 4 Scout	Small	0	0	3	46	n/a
MiMo V2 Pro	Frontier	0	0	2	52	n/a
MiniMax M2.7	Frontier	0	0	1	51	n/a
Mistral Small 4	Mid	0	0	0	51	n/a
Qwen3 Coder Next	Mid	0	0	3	51	n/a
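
The rate in the last column appears to be Humanloop's share of decisive cases, that is Humanloop / (Humanloop + HumanFirst), shown as n/a when a model never picked either tool. This reading is inferred from the rows, not stated on the page. A minimal sketch, with the hypothetical helper names `head_to_head_rate` and `fmt`:

```python
from typing import Optional


def head_to_head_rate(a_wins: int, b_wins: int) -> Optional[float]:
    """Share of decisive cases won by tool A; None if there are no decisive cases."""
    decisive = a_wins + b_wins
    if decisive == 0:
        return None
    return a_wins / decisive


def fmt(rate: Optional[float]) -> str:
    """Format a rate as a whole percentage, or n/a when it is undefined."""
    return "n/a" if rate is None else f"{rate:.0%}"


# (Humanloop wins, HumanFirst wins) for three rows of the table above.
rows = {
    "Devstral 2 2512": (8, 18),
    "Gemini 2.5 Flash": (4, 0),
    "Claude Opus 4.6": (0, 0),
}

for model, (a, b) in rows.items():
    print(f"{model}\t{fmt(head_to_head_rate(a, b))}")
# Devstral 2 2512	31%
# Gemini 2.5 Flash	100%
# Claude Opus 4.6	n/a
```

A rate of 100% here can rest on very few cases (for example 1 of 1), so the raw counts matter more than the percentage.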

Per-prompt breakdown

Prompt	Tier	Humanloop	HumanFirst	None	Other	Humanloop win rate
ai-revenue-ops-copilot	Beginner	5	7	4	152	42%
ai-revenue-ops-copilot	Intermediate	5	1	1	157	83%
ai-support-agent-platform	Beginner	1	5	25	138	17%
ai-revenue-ops-copilot	Advanced	3	2	1	158	60%
ai-support-agent-platform	Intermediate	3	2	4	157	60%
ai-support-agent-platform	Advanced	0	1	1	167	0%
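
As a consistency check, the per-prompt rows can be summed column by column and compared with the totals in the Statistics table. The sketch below only re-adds numbers already shown on the page; the dictionary name `per_prompt` is illustrative.

```python
# (prompt, tier) -> (Humanloop, HumanFirst, None, Other), copied from the table above.
per_prompt = {
    ("ai-revenue-ops-copilot", "Beginner"): (5, 7, 4, 152),
    ("ai-revenue-ops-copilot", "Intermediate"): (5, 1, 1, 157),
    ("ai-support-agent-platform", "Beginner"): (1, 5, 25, 138),
    ("ai-revenue-ops-copilot", "Advanced"): (3, 2, 1, 158),
    ("ai-support-agent-platform", "Intermediate"): (3, 2, 4, 157),
    ("ai-support-agent-platform", "Advanced"): (0, 1, 1, 167),
}

# Sum each column across all prompt/tier rows.
totals = tuple(sum(column) for column in zip(*per_prompt.values()))
print(totals)
# (17, 18, 36, 929)
```

The four sums match the Humanloop wins, HumanFirst wins, abstains, and other-tool counts reported in Statistics.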