Preseason runs a daily benchmark that asks a panel of AI models what tools they would recommend for specific development scenarios. Each benchmark run evaluates every combination of prompt and model in the active season.
Prompt panel: A curated set of development scenarios ranging from simple applications to complex multi-service architectures. The current seeded corpus covers 15 web-app scenarios, each represented at beginner, intermediate, and advanced prompting levels, for a total of 45 prompt variants. Each prompting level changes the technical specificity of the request while keeping the target tool categories consistent within a scenario.
Model panel: A curated panel of 20 current AI models with extra depth in coding-heavy families, including multiple OpenAI GPT and Codex variants plus Anthropic Opus, Sonnet, and Haiku. Each model snapshot stores the company, family, exact version, and explicit frozen inference parameters (temperature, top_p, max_tokens) to ensure reproducibility.
Immutable snapshots: Both prompts and model configurations are frozen as immutable snapshots within each season. This ensures that rankings are always traceable to the exact inputs that produced them.
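The frozen model snapshot described above can be sketched as an immutable record. This is a minimal illustration only: the field names and shape are assumptions, not the production schema.

```python
from dataclasses import dataclass

# Illustrative sketch of a frozen model snapshot. frozen=True makes the
# record immutable, mirroring the "immutable snapshots" guarantee; field
# names are assumptions, not the actual schema.
@dataclass(frozen=True)
class ModelSnapshot:
    company: str        # e.g. "Anthropic"
    family: str         # e.g. "Claude Sonnet"
    version: str        # exact pinned model version string
    temperature: float  # frozen inference parameters
    top_p: float
    max_tokens: int

# Example snapshot; values are placeholders.
snap = ModelSnapshot("Anthropic", "Claude Sonnet", "example-version",
                     temperature=0.0, top_p=1.0, max_tokens=1024)
```

Attempting to mutate a field of a frozen dataclass raises `FrozenInstanceError`, so a snapshot referenced by historical results cannot drift.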
For benchmark runs, each model must return a machine-readable response in a strict format. For every eligible tool category in the prompt, the model provides a decision: recommend a specific tool, or indicate that no tool is needed for that category.
Responses that do not conform to the expected format are marked as invalid and excluded from rankings. There is no heuristic parsing or attempt to rescue malformed outputs. This strict approach ensures data quality at the cost of some data volume.
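The strict accept-or-reject validation can be sketched as follows. The JSON shape (one decision per eligible category, each either a recommendation or an explicit "none") and the field names are assumptions for illustration; the production format may differ.

```python
# Sketch of strict response validation, assuming each eligible category
# maps to either {"action": "recommend", "tool": "..."} or
# {"action": "none"}. Field names are illustrative assumptions.
def is_valid_response(response: dict, eligible_categories: set[str]) -> bool:
    # Must decide every eligible category, and nothing else.
    if set(response) != eligible_categories:
        return False
    for decision in response.values():
        if not isinstance(decision, dict):
            return False
        if decision.get("action") == "recommend" and isinstance(decision.get("tool"), str):
            continue
        if decision.get("action") == "none" and "tool" not in decision:
            continue
        return False  # malformed: excluded outright, never heuristically repaired
    return True
```

Anything that fails these checks is simply marked invalid; there is no rescue path, which keeps the pass/fail boundary unambiguous.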
The fundamental unit of measurement is a case decision: one model's tool choice for one category in one prompt evaluation.
Support rate: The fraction of eligible decisions that selected a given tool. Shown as a percentage alongside the raw count (e.g., 35.3% from 42/119 decisions).
Confidence interval: A Wilson 95% confidence interval on the raw support rate. Narrower intervals indicate more reliable rankings. The CI is computed on unweighted counts.
Model coverage: The percentage of distinct model snapshots that recommended this tool. High coverage means broad consensus across different AI models.
Prompt coverage: The percentage of distinct prompt versions that produced a recommendation for this tool. High coverage means the tool is recommended across diverse development scenarios.
Trend: The change in support rate compared to the previous non-overlapping time window of the same type.
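The Wilson 95% interval on the raw support rate can be computed from the unweighted counts alone. A minimal sketch (the function name is ours; z = 1.96 corresponds to the 95% level):

```python
import math

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval on a raw (unweighted) support rate."""
    if total == 0:
        return (0.0, 1.0)  # no data: maximally wide interval
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - margin, center + margin)

# For the 42/119 example above, this gives roughly (0.273, 0.442):
# a fairly wide interval, reflecting the modest sample size.
low, high = wilson_interval(42, 119)
```

Unlike the simpler normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at small counts, which matters for categories near the publication thresholds.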
Each case decision can carry a weight based on its model's capability tier (frontier, mid, small). The weight configuration is versioned and snapshotted per run, so historical results always reference the exact weights that produced them.
Season 1 uses uniform weights (all model tiers = 1.0). This means every model gets one equal vote. We believe this is the most transparent and defensible approach for a first release. Non-uniform weighting may be introduced in future seasons once we have data to justify tier differentiation.
Both weighted and unweighted metrics are always computed and displayed. When weights are uniform, they are identical.
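The relationship between the weighted and unweighted metrics can be sketched over a list of case decisions. The tier names and the uniform Season 1 weights come from the text above; the data shape (a list of `(tool, tier)` pairs) is an illustrative assumption.

```python
# Season 1 weight config as stated in the text: every tier votes equally.
SEASON_1_WEIGHTS = {"frontier": 1.0, "mid": 1.0, "small": 1.0}

def support_rates(decisions: list[tuple[str, str]], tool: str,
                  weights: dict[str, float]) -> tuple[float, float]:
    """Return (unweighted, weighted) support rate for `tool`.

    `decisions` is a list of (chosen_tool, model_tier) pairs; the
    representation is illustrative, not the production data model.
    """
    hits = sum(1 for t, _ in decisions if t == tool)
    unweighted = hits / len(decisions)
    total_weight = sum(weights[tier] for _, tier in decisions)
    hit_weight = sum(weights[tier] for t, tier in decisions if t == tool)
    return unweighted, hit_weight / total_weight
```

With uniform weights the two numbers coincide exactly; only a non-uniform config in a future season would make them diverge.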
A category ranking is published as authoritative only when it meets minimum data thresholds.
Categories below these thresholds display “Insufficient benchmark data” rather than publishing potentially misleading rankings. Head-to-head comparisons require at least 30 decisive cases.
When a model recommends a tool, we match it against our database of known tools and approved aliases. If no match is found, the tool name enters a review queue for manual resolution. Unresolved tools are excluded from rankings until reviewed.
Tools are never auto-created from model output. This prevents hallucinated or misspelled tool names from polluting the database.
Rankings are computed over explicit, non-overlapping time windows.
Rankings reflect what AI models recommend when asked about tool choices for development scenarios. They are not independent quality evaluations of the tools themselves.
A tool's ranking is influenced by its presence in AI training data, its popularity in developer communities, and how well it fits the specific scenarios in our prompt panel.
The current prompt panel focuses on web application development scenarios. It covers 15 recurring scenario slugs, each represented at beginner, intermediate, and advanced prompting levels, with a bias toward full-stack and SaaS-style products. Rankings should be interpreted within this scope.
Categories with limited prompt coverage (few prompts mentioning that category) will show reduced confidence and may fall below publication thresholds. This is by design — we prefer honesty about coverage gaps over thin rankings.
Every published ranking can be traced back to the exact prompt versions, model snapshots, inference parameters, and weight configuration that produced it. The active weight configuration is always visible. If non-uniform weights are ever used, the exact values will be listed here.
Current weight config: Uniform (frontier = 1.0, mid = 1.0, small = 1.0).