- Suite: V0.6 public suite
- Models: 0
- Updated: Not yet
- Top capability: n/a
V0.6 validity release: cleaner model ranking
- Capability score
- Pass rate
- Coverage rate
- Render success
- Failure buckets
- Schema 0.6.0
Leaderboard
| # | Model | Score | Pass | Coverage | Render | Failures | Cost | Time | Review |
|---|---|---|---|---|---|---|---|---|---|
V0.6 capability score
Higher is better. Models are ranked by this score alone; pass rate, coverage, and render success are reported separately.
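The ranking rule can be sketched in Python as follows. This is a minimal illustration, assuming a hypothetical `ModelResult` record whose field names are not ManimBench's report schema; how the capability score itself is computed is not described on this page, so the sketch takes it as an input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    # Hypothetical record; field names are illustrative, not ManimBench's schema.
    model: str
    score: float        # capability score, higher is better
    pass_rate: float    # 0.0-1.0, published next to the score
    coverage: float     # 0.0-1.0, published next to the score
    render_rate: float  # 0.0-1.0 render success, published next to the score


def rank_by_score(results: list[ModelResult]) -> list[tuple[int, ModelResult]]:
    """Assign 1-based ranks by capability score alone.

    Pass rate, coverage, and render success are carried along unchanged so
    they can be shown as evidence, but they never affect the ordering.
    """
    ordered = sorted(results, key=lambda r: r.score, reverse=True)
    return list(enumerate(ordered, start=1))
```

In this sketch a model with a higher pass rate can still rank below one with a higher capability score. Because Python's `sorted` is stable, tied scores keep their input order here; the page does not say how ManimBench breaks ties.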
V0.6 is active
V0.6 changes how ranking validity is judged: source-term hints are advisory, while coverage, render success, pass rate, and failure buckets are published separately from the score.
- Ranking. Capability score ranks models. Pass rate, coverage, render success, and failure buckets explain the evidence behind it.
- Validity. Missing source and render crashes remain serious failures. Exact source-term misses no longer fail otherwise valid rendered animations by themselves.
- Publishing. Official V0.6 rankings require a fresh full-suite rerun.
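The validity rules above can be sketched as a per-task classifier. The `TaskCheck` fields, the `classify` function, and the failure labels are all hypothetical; `other_checks_passed` stands in for whatever remaining checks the scorer applies, which this page does not list.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"


@dataclass
class TaskCheck:
    # Hypothetical per-task observations; names are illustrative only.
    source_present: bool       # a generated .py file exists for the task
    render_crashed: bool       # rendering the scene raised or exited non-zero
    other_checks_passed: bool  # stand-in for the scorer's remaining checks
    missing_source_terms: list[str] = field(default_factory=list)


def classify(check: TaskCheck) -> tuple[Outcome, list[str]]:
    """Return an outcome plus notes, following the V0.6 validity rules."""
    # Missing source and render crashes remain serious failures.
    if not check.source_present:
        return Outcome.FAIL, ["missing source"]
    if check.render_crashed:
        return Outcome.FAIL, ["render crash"]
    # Source-term misses are advisory: reported, but never a failure on their own.
    notes = [f"advisory: missing term {t!r}" for t in check.missing_source_terms]
    outcome = Outcome.PASS if check.other_checks_passed else Outcome.FAIL
    return outcome, notes
```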
Run V0.6 locally
Use this order: install the engine, generate source files, render and score those exact files, then build the report from the same run id.

- Install the engine and local render extras. Run this once from the repository root:

      python -m pip install -e ".[dev,render]"

- Generate outputs for API models. This writes one Python file per task under `outputs/<model>/`. Add `--smoke` only when you want a one-task setup check.

      manimbench generate-batch --models gpt-5-5,opus-4-8 --provider auto --output-dir outputs --parallel 2

- Generate Composer output if you include Composer. Composer runs through the Cursor Agent CLI, not OpenRouter.

      manimbench generate --model composer-2-5 --provider cursor --output-dir outputs

- Render and score the generated folders. Include one `--model-output` entry for each model you generated. Remove the Composer line if you are not testing it.

      manimbench run-file-matrix \
        --model-output gpt-5-5=outputs/gpt-5-5 \
        --model-output opus-4-8=outputs/opus-4-8 \
        --model-output composer-2-5=outputs/composer-2-5 \
        --sandbox container \
        --parallel 4 \
        --run-id v06-public

- Build the report from that run. Review score, pass rate, coverage, render success, and failure buckets before publishing.

      manimbench report --run-dir runs/v06-public

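To keep the render step and the report step tied to the same run id, both commands can be assembled from one value. This is a hypothetical helper, not part of `manimbench`: it uses only the flags shown above, assumes each model's folder under `outputs/` is named after its model id (as in `outputs/<model>/`), and assumes the report directory is `runs/<run-id>`, matching the `v06-public` example.

```python
from pathlib import Path


def run_file_matrix_argv(outputs: Path, run_id: str,
                         sandbox: str = "container", parallel: int = 4) -> list[str]:
    """Hypothetical helper: build the run-file-matrix argv from folders on disk.

    Assumes each model's files live in outputs/<model>/ and that the folder
    name is the model id; folders without any .py file are skipped.
    """
    model_dirs = sorted(
        d for d in outputs.iterdir() if d.is_dir() and any(d.glob("*.py"))
    )
    argv = ["manimbench", "run-file-matrix"]
    for d in model_dirs:
        # One --model-output entry per generated model: <model>=<folder>.
        argv += ["--model-output", f"{d.name}={d.as_posix()}"]
    argv += ["--sandbox", sandbox, "--parallel", str(parallel), "--run-id", run_id]
    return argv


def report_argv(run_id: str) -> list[str]:
    """Hypothetical helper: assumes the run directory is runs/<run-id>."""
    return ["manimbench", "report", "--run-dir", f"runs/{run_id}"]
```

Each list can be passed to `subprocess.run(argv, check=True)` from the repository root. Deriving the `--model-output` entries from the folders on disk means a model you did not generate, such as Composer, is left out rather than removed by hand.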
Output contract: every generated file must define one ManimCE MainScene. V0.6 expects six files: coordinate_system_animation.py, derivative_motion_story.py, matrix_transformation_grid.py, geometric_area_proof.py, probability_distribution_simulation.py, and fourier_series_decomposition.py.
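Before spending render time, a quick static check of the output contract can catch missing files or a misnamed scene class. The sketch below is a hypothetical pre-check built on Python's standard `ast` module, not a `manimbench` command: it only confirms that each expected file parses and defines exactly one top-level class named `MainScene`. It does not verify that the class subclasses a ManimCE `Scene` or that it renders; rendering is still the job of `run-file-matrix`.

```python
import ast
from pathlib import Path

# The six task files V0.6 expects, as listed in the output contract.
EXPECTED_FILES = {
    "coordinate_system_animation.py",
    "derivative_motion_story.py",
    "matrix_transformation_grid.py",
    "geometric_area_proof.py",
    "probability_distribution_simulation.py",
    "fourier_series_decomposition.py",
}


def contract_problems(model_dir: Path) -> list[str]:
    """Hypothetical pre-check: list contract violations (empty list = looks OK)."""
    problems = []
    present = {p.name for p in model_dir.glob("*.py")}
    for name in sorted(EXPECTED_FILES - present):
        problems.append(f"{name}: missing")
    for name in sorted(EXPECTED_FILES & present):
        source = (model_dir / name).read_text(encoding="utf-8")
        try:
            tree = ast.parse(source, filename=name)
        except SyntaxError as exc:
            problems.append(f"{name}: syntax error on line {exc.lineno}")
            continue
        # Count top-level class definitions named MainScene (static check only).
        scenes = [n for n in tree.body
                  if isinstance(n, ast.ClassDef) and n.name == "MainScene"]
        if len(scenes) != 1:
            problems.append(f"{name}: expected one MainScene class, found {len(scenes)}")
    return problems
```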