Documentation

The benchmark is reproducible from the repository. V0.6 adds cleaner ranking semantics for its public suite: capability score, pass rate, coverage, render success, and failure buckets.

V0.6 validity release: Read why V0.6 requires fresh full-suite reruns before publishing official rankings.
V0.6 operator guide: Generate, render, report, and publish the active public suite.
Methodology: Task coverage, sandbox policy, source checks, manifests, visual sanity, and usage accounting.
Blog: Notes on methodology, scoring design, and benchmark decisions.
FAQ: Answers about official runs, the public suite, and reading the leaderboard.
Quickstart: Install the package, list V0.6 tasks, generate outputs, and render a run.
V0.6 suite: Read the six active task definitions and output contract.
OpenRouter: Default API gateway for public models with published OpenRouter slugs.
Cursor Composer: How Composer 2.5 runs through Cursor Agent CLI.
Reproduce: Use run manifests, suite paths, and output folders to verify results.
Scoring: See how source checks, render checks, visual sanity checks, and review fit together.
Model workspaces: Create file-backed workspaces for any AI coding agent.
Run comparison: Auto-discover ready model outputs and build a comparison report.
Publishing: Prepare report bundles and publish results to the public site.
Publish to site: Draft and live publish flow for the separate site repository.
Launcher: Interactive guided run flow for local and container sandbox backends.
Repository: Full source, task definitions, and issue tracker on GitHub.