V0.6 validity release

V0.6 makes the leaderboard a cleaner model ranking by separating capability score from the reasons a task failed.

The earlier engine folded every failed check into a single headline fail state. That was useful for debugging but too blunt for ranking. A missing provider output, a Manim crash, a visual sanity failure, and a missing implementation hint are different facts. V0.6 publishes them separately.

What changed

Capability score is primary. The leaderboard ranks by the score after automated checks and severity caps.
Pass rate is separate. Pass/fail remains an operational gate for required source, render, timing, label, and visual checks.
Coverage is visible. Missing source files are reported as coverage loss instead of being hidden inside a generic fail label.
Failure buckets are published. Reports distinguish missing source, render crash, visual sanity, labels, sections, source parse, and other failure classes.
Source terms are advisory. Required source terms still affect score, but they no longer fail an otherwise valid rendered animation by themselves.
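
The separation described above can be pictured as one record per task, summarized into one row per model. This is only a sketch: the field names, the bucket labels, and the choice to average score over covered tasks are assumptions for illustration and do not mirror the actual V0.6 export schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


# Hypothetical record; field and bucket names are illustrative only.
@dataclass
class TaskResult:
    task_id: str
    score: float                          # capability score in [0, 1]
    passed: bool                          # operational pass/fail gate
    failure_bucket: Optional[str] = None  # None when the task passed


def leaderboard_row(results):
    """Summarize one model's results without merging the four signals."""
    total = len(results)
    covered = [r for r in results if r.failure_bucket != "missing_source"]
    buckets = Counter(r.failure_bucket for r in results if r.failure_bucket)
    return {
        # Assumption: score is averaged over tasks that produced source.
        "score": sum(r.score for r in covered) / len(covered) if covered else 0.0,
        "pass_rate": sum(r.passed for r in results) / total if total else 0.0,
        "coverage": len(covered) / total if total else 0.0,
        "failure_buckets": dict(buckets),
    }


results = [
    TaskResult("t1", 1.0, True),
    TaskResult("t2", 0.5, False, "visual_sanity"),
    TaskResult("t3", 0.0, False, "render_crash"),
    TaskResult("t4", 0.0, False, "missing_source"),
]
row = leaderboard_row(results)
```

In this toy run the missing file lowers coverage and pass rate, while the score reflects only the three tasks that actually produced source.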

Why source terms changed

Task prompts often ask for exact Manim constructs because they are good proxies for animation intent. They are not perfect proxies for quality. A scene can show a coherent derivative explanation without using the exact transition helper the scorer expected. V0.6 still records that miss, but it does not erase the rest of the evidence.
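
One way to picture the difference between a gate check and an advisory check is the sketch below. The check names, the cap value, and the idea that a source-term miss is applied as a score cap are all assumptions; the real scorer may weigh the miss differently.

```python
# Hypothetical check names and cap value, for illustration only.
GATE_CHECKS = {"source", "render", "timing", "labels", "visual"}
ADVISORY_CHECKS = {"source_terms"}
ADVISORY_CAP = 0.9  # assumed ceiling applied when an advisory check misses


def evaluate(raw_score, failed_checks):
    """Return (score, passed) for one task.

    raw_score is assumed to already reflect the other automated checks.
    Advisory misses can lower the score but never flip the pass gate.
    """
    passed = not (failed_checks & GATE_CHECKS)
    score = raw_score
    if failed_checks & ADVISORY_CHECKS:
        score = min(score, ADVISORY_CAP)
    return score, passed


advisory_only = evaluate(0.95, {"source_terms"})
gate_failure = evaluate(0.95, {"render"})
```

Here a scene that misses only an expected source term keeps its pass and most of its score, while a render failure still fails the gate.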

Keyword stuffing is still a hard failure. If a model hides required terms in comments, dead strings, or unrelated text, the scorer treats that as benchmark gaming rather than animation quality.
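
A rough way to tell a term that is used in code from one that only appears in a comment or string is to compare the raw text with the identifiers in the parsed syntax tree. The sketch below uses Python's standard `ast` module. It is not the scorer's actual detector, the function name and verdict labels are hypothetical, and it does not catch dead code such as functions that are never called.

```python
import ast


def classify_terms(source, required):
    """Label each required term as 'used', 'stuffed', or 'missing'."""
    tree = ast.parse(source)
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            used.add(node.id)        # e.g. Create in Create(circle)
        elif isinstance(node, ast.Attribute):
            used.add(node.attr)      # e.g. play in scene.play(...)
    verdicts = {}
    for term in required:
        if term in used:
            verdicts[term] = "used"
        elif term in source:
            # Present in the text but not as an identifier:
            # it sits in a comment or a string literal.
            verdicts[term] = "stuffed"
        else:
            verdicts[term] = "missing"
    return verdicts


sample = '''
# TransformMatchingTex
note = "FadeIn"
def build(scene):
    scene.play(Create(circle))
'''
verdicts = classify_terms(sample, ["Create", "TransformMatchingTex", "FadeIn", "Write"])
```

Comments never reach the syntax tree and string literals are stored as constants rather than identifiers, so both fall into the "stuffed" branch.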

What must be rerun

All official rankings need a fresh V0.6 full-suite run. Result JSON from earlier versions is still useful for debugging, but it is not comparable to V0.6 leaderboard data because both the pass gate and the exported schema changed.

manimbench generate-batch --models <models> --provider auto --output-dir outputs --parallel 2 --force
manimbench run-file-matrix --model-output <model>=outputs/<model> --sandbox container --parallel 2 --run-id v06-<date>
manimbench report --run-dir runs/v06-<date>
manimbench publish --run-dir runs/v06-<date> --target draft --site-repo ../manimbench-site

How to read V0.6

Start with score, then check pass rate, coverage, render results, and failure buckets. A high score with low coverage is not the same result as a high score across every task. A low score dominated by render crashes points to Manim API reliability. A low score dominated by visual sanity failures points to layout and readability.
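
The reading order above can be turned into a small triage helper. The bucket labels and the fallback hint are hypothetical; only the two interpretations in the mapping come from the guidance above.

```python
from collections import Counter

# Interpretations taken from the reading guidance; keys are assumed labels.
HINTS = {
    "render_crash": "Manim API reliability",
    "visual_sanity": "layout and readability",
}


def dominant_failure(buckets):
    """Return (bucket, hint) for the most common failure, or None if clean."""
    if not buckets:
        return None
    name, _count = Counter(buckets).most_common(1)[0]
    return name, HINTS.get(name, "inspect individual task reports")


diagnosis = dominant_failure(["render_crash", "render_crash", "visual_sanity"])
```

A model with no failures returns `None`, which is a cue to look at coverage before trusting the score.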

The benchmark is still automated. Human review remains important for wrong math, shallow explanations, or polished animations that miss the point. V0.6 makes the automated evidence less misleading before that review happens.