Preseason runs a daily benchmark that asks a panel of AI models what tools they would recommend for specific development scenarios. Each benchmark run evaluates every combination of prompt and model in the active season.
Prompt panel: A curated set of development scenarios ranging from simple applications to complex multi-service architectures. The current seeded corpus covers 15 web-app scenarios, each represented at beginner, intermediate, and advanced prompting levels, for a total of 45 prompt variants. Each prompting level changes the technical specificity of the request while keeping the target tool categories consistent within a scenario.
Model panel: A curated panel of 20 current AI models with extra depth in coding-heavy families, including multiple OpenAI GPT and Codex variants plus Anthropic Opus, Sonnet, and Haiku. Each model snapshot stores the company, family, exact version, and explicit frozen inference parameters (temperature, top_p, max_tokens) to ensure reproducibility.
Immutable snapshots: Both prompts and model configurations are frozen as immutable snapshots within each season. This ensures that rankings are always traceable to the exact inputs that produced them.
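The frozen model snapshot described above can be sketched as an immutable record. This is a minimal illustration only: the field names and shape are assumptions, not the production schema.

```python
from dataclasses import dataclass

# Illustrative sketch of a frozen model snapshot. frozen=True makes the
# record immutable, mirroring the "immutable snapshots" guarantee; field
# names are assumptions, not the actual schema.
@dataclass(frozen=True)
class ModelSnapshot:
    company: str        # e.g. "Anthropic"
    family: str         # e.g. "Claude Sonnet"
    version: str        # exact pinned model version string
    temperature: float  # frozen inference parameters
    top_p: float
    max_tokens: int

# Example snapshot; values are placeholders.
snap = ModelSnapshot("Anthropic", "Claude Sonnet", "example-version",
                     temperature=0.0, top_p=1.0, max_tokens=1024)
```

Attempting to mutate a field of a frozen dataclass raises `FrozenInstanceError`, so a snapshot referenced by historical results cannot drift.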
For benchmark runs, each model must return a machine-readable response in a strict format. For every eligible tool category in the prompt, the model provides a decision: recommend a specific tool, or indicate that no tool is needed for that category.
Responses that do not conform to the expected format are marked as invalid and excluded from rankings. There is no heuristic parsing or attempt to rescue malformed outputs. This strict approach ensures data quality at the cost of some data volume.
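The strict accept-or-reject validation can be sketched as follows. The JSON shape (one decision per eligible category, each either a recommendation or an explicit "none") and the field names are assumptions for illustration; the production format may differ.

```python
# Sketch of strict response validation, assuming each eligible category
# maps to either {"action": "recommend", "tool": "..."} or
# {"action": "none"}. Field names are illustrative assumptions.
def is_valid_response(response: dict, eligible_categories: set[str]) -> bool:
    # Must decide every eligible category, and nothing else.
    if set(response) != eligible_categories:
        return False
    for decision in response.values():
        if not isinstance(decision, dict):
            return False
        if decision.get("action") == "recommend" and isinstance(decision.get("tool"), str):
            continue
        if decision.get("action") == "none" and "tool" not in decision:
            continue
        return False  # malformed: excluded outright, never heuristically repaired
    return True
```

Anything that fails these checks is simply marked invalid; there is no rescue path, which keeps the pass/fail boundary unambiguous.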
The fundamental unit of measurement is a case decision: one model's tool choice for one category in one prompt evaluation.
Support rate: The fraction of eligible decisions that selected a given tool. Shown as a percentage alongside the raw count (e.g., 35.3% from 42/119 decisions).
Confidence interval: A Wilson 95% confidence interval on the raw support rate. Narrower intervals indicate more reliable rankings. The CI is computed on unweighted counts.
Model coverage: The percentage of distinct model snapshots that recommended this tool. High coverage means broad consensus across different AI models.
Prompt coverage: The percentage of distinct prompt versions that produced a recommendation for this tool. High coverage means the tool is recommended across diverse development scenarios.
Trend: The change in support rate compared to the previous non-overlapping time window of the same type.
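The Wilson 95% interval on the raw support rate can be computed from the unweighted counts alone. A minimal sketch (the function name is ours; z = 1.96 corresponds to the 95% level):

```python
import math

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval on a raw (unweighted) support rate."""
    if total == 0:
        return (0.0, 1.0)  # no data: maximally wide interval
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - margin, center + margin)

# For the 42/119 example above, this gives roughly (0.273, 0.442):
# a fairly wide interval, reflecting the modest sample size.
low, high = wilson_interval(42, 119)
```

Unlike the simpler normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at small counts, which matters for categories near the publication thresholds.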
Each case decision can carry a weight based on its model's capability tier (frontier, mid, small). The weight configuration is versioned and snapshotted per run, so historical results always reference the exact weights that produced them.
Season 1 uses uniform weights (all model tiers = 1.0). This means every model gets one equal vote. We believe this is the most transparent and defensible approach for a first release. Non-uniform weighting may be introduced in future seasons once we have data to justify tier differentiation.
Both weighted and unweighted metrics are always computed and displayed. When weights are uniform, they are identical.
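The relationship between the weighted and unweighted metrics can be sketched over a list of case decisions. The tier names and the uniform Season 1 weights come from the text above; the data shape (a list of `(tool, tier)` pairs) is an illustrative assumption.

```python
# Season 1 weight config as stated in the text: every tier votes equally.
SEASON_1_WEIGHTS = {"frontier": 1.0, "mid": 1.0, "small": 1.0}

def support_rates(decisions: list[tuple[str, str]], tool: str,
                  weights: dict[str, float]) -> tuple[float, float]:
    """Return (unweighted, weighted) support rate for `tool`.

    `decisions` is a list of (chosen_tool, model_tier) pairs; the
    representation is illustrative, not the production data model.
    """
    hits = sum(1 for t, _ in decisions if t == tool)
    unweighted = hits / len(decisions)
    total_weight = sum(weights[tier] for _, tier in decisions)
    hit_weight = sum(weights[tier] for t, tier in decisions if t == tool)
    return unweighted, hit_weight / total_weight
```

With uniform weights the two numbers coincide exactly; only a non-uniform config in a future season would make them diverge.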
A category ranking is published as authoritative only when it meets minimum data thresholds.
Categories below these thresholds display “Insufficient benchmark data” rather than publishing potentially misleading rankings. Head-to-head comparisons require at least 30 decisive cases.
When a model recommends a tool, we match it against our database of known tools and approved aliases. If no match is found, the tool name enters a review queue for manual resolution. Unresolved tools are excluded from rankings until reviewed.
Tools are never auto-created from model output. This prevents hallucinated or misspelled tool names from polluting the database.
Rankings are computed over explicit, non-overlapping time windows.
Rankings reflect what AI models recommend when asked about tool choices for development scenarios. They are not independent quality evaluations of the tools themselves.
A tool's ranking is influenced by its presence in AI training data, its popularity in developer communities, and how well it fits the specific scenarios in our prompt panel.
The current prompt panel focuses on web application development scenarios. It covers 15 recurring scenario slugs, each represented at beginner, intermediate, and advanced prompting levels, with a bias toward full-stack and SaaS-style products. Rankings should be interpreted within this scope.
Categories with limited prompt coverage (few prompts mentioning that category) will show reduced confidence and may fall below publication thresholds. This is by design — we prefer honesty about coverage gaps over thin rankings.
Every published ranking can be traced back to the exact prompt versions, model snapshots, inference parameters, and weight configuration that produced it. The active weight configuration is always visible. If non-uniform weights are ever used, the exact values will be listed here.
Current weight config: Uniform (frontier = 1.0, mid = 1.0, small = 1.0).