FAQ

Common questions about rankings, official runs, and how to use or verify ManimBench.

What is ManimBench?

ManimBench is a public benchmark that measures how well AI models write Manim Community Edition animations. Each model receives the same task prompts, writes Python scene code, and is scored under identical sandboxed rendering conditions. V0.6 rankings report capability score, pass rate, coverage, render success, failure buckets, cost, runtime, and tokens as separate metrics.

Do I need to run the benchmark myself?

No. This site publishes results, methodology, and task definitions. Clone the repository to verify a run, reproduce a score, or submit new outputs.

Which models appear on the leaderboard?

Any model with a complete official run on the active public suite appears. Pro-only or enterprise-only tiers are excluded. Public API models are run through OpenRouter when OpenRouter publishes an official slug.

What is the V0.6 public suite?

Six focused tasks: coordinate system animation, derivative motion story, matrix transformation grid, geometric area proof, probability distribution simulation, and Fourier series decomposition. Each task is one file named outputs/<task_id>.py with a single MainScene class.
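As a rough illustration of the file convention above, the sketch below checks that a task file defines exactly one top-level class named MainScene. It uses only Python's standard ast module, so it parses the source without importing Manim. The helper name has_single_main_scene and the example source are hypothetical, and reading "single" as "exactly one top-level class with that name" is an assumption, not the benchmark's own validation logic.

```python
import ast


def has_single_main_scene(source: str) -> bool:
    """Return True when the source defines exactly one top-level class named MainScene.

    Hypothetical helper: an assumption about what "a single MainScene class"
    means, not ManimBench's actual check.
    """
    tree = ast.parse(source)
    names = [node.name for node in tree.body if isinstance(node, ast.ClassDef)]
    return names.count("MainScene") == 1


# Illustrative task file contents. The source is only parsed, never executed,
# so Manim does not need to be installed for this check.
EXAMPLE = '''
from manim import Circle, Create, Scene


class MainScene(Scene):
    def construct(self):
        self.play(Create(Circle()))
'''

if __name__ == "__main__":
    print(has_single_main_scene(EXAMPLE))
```

A check like this can run before rendering, so a misnamed scene class is caught without spending render time.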

How do I benchmark Composer 2.5?

Use the Cursor Agent CLI through the engine: run cursor-agent login, then manimbench generate --model composer-2-5 --provider cursor. OpenRouter does not currently publish a Composer model slug, and substituting another OpenRouter model would not produce an actual Composer result.

What counts as an official run?

Official runs use the container sandbox with network disabled, resource limits recorded, 60 FPS rendering, and a 120 second per-task cap. Local subprocess runs are useful for development but are marked non-official in result metadata.
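The conditions above can be read as a simple predicate over run settings. The sketch below is a minimal illustration of that reading. The RunConfig class, its field names, and the "container" and "subprocess" values are all hypothetical, and the real result metadata schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run settings; field names are assumptions, not the real schema."""

    sandbox: str            # assumed values: "container" or "subprocess"
    network_enabled: bool
    limits_recorded: bool
    fps: int
    task_timeout_s: int


def is_official(cfg: RunConfig) -> bool:
    """Apply the official-run conditions stated in the FAQ answer above."""
    return (
        cfg.sandbox == "container"
        and not cfg.network_enabled
        and cfg.limits_recorded
        and cfg.fps == 60
        and cfg.task_timeout_s == 120
    )


official = RunConfig("container", False, True, 60, 120)
local_dev = RunConfig("subprocess", True, False, 60, 120)

if __name__ == "__main__":
    print(is_official(official), is_official(local_dev))
```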

How is the score calculated?

V0.6 ranks models by capability score, computed after source, render, timing, and visual checks. Pass rate is reported separately. Required source terms are advisory score evidence; missing them is not a hard failure by itself. Missing source, render crashes, label failures, and visual sanity failures are published as failure buckets. See the V0.6 release note and repository docs for detail.

Why save outputs if ManimBench can generate through APIs?

Saved outputs make runs reproducible and avoid paying for the same generation twice. Generation writes outputs/<model>/<task_id>.py, then rendering and scoring read those files. Reruns skip complete files unless --force is passed.
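The skip-unless-forced behavior can be sketched as follows. This is an illustration, not the engine's code: the function names are hypothetical, and treating "complete" as "the file exists and is non-empty" is an assumption, since the FAQ does not define completeness.

```python
from pathlib import Path


def output_path(root: Path, model: str, task_id: str) -> Path:
    """Build the outputs/<model>/<task_id>.py path described above."""
    return root / model / f"{task_id}.py"


def needs_generation(path: Path, force: bool = False) -> bool:
    """Decide whether a task file should be (re)generated.

    Assumption: a file counts as complete when it exists and is non-empty.
    The real engine may apply a stricter definition.
    """
    if force:
        return True
    return not (path.is_file() and path.stat().st_size > 0)
```

With this shape, a rerun calls the model API only for paths where needs_generation returns True, so existing outputs incur no further generation cost.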

How can I submit results for my model?

Generate all six task files, run the benchmark with the container sandbox, generate the report, then publish a draft or live bundle with manimbench publish. A live publish requires a complete run unless a partial run is explicitly allowed.
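Before publishing, it can help to confirm that every task file is present. The sketch below shows one way to do that. The task IDs are hypothetical slugs derived from the task descriptions in this FAQ (the real IDs live in the suite definition), and missing_tasks and can_publish_live are illustrative helpers, not part of the manimbench CLI.

```python
from pathlib import Path

# Hypothetical task IDs, derived from the task descriptions in this FAQ.
# The actual IDs are defined by the suite and may be named differently.
TASK_IDS = [
    "coordinate_system_animation",
    "derivative_motion_story",
    "matrix_transformation_grid",
    "geometric_area_proof",
    "probability_distribution_simulation",
    "fourier_series_decomposition",
]


def missing_tasks(model_dir: Path, task_ids: list[str]) -> list[str]:
    """Return the task IDs that have no <task_id>.py file in model_dir."""
    return [t for t in task_ids if not (model_dir / f"{t}.py").is_file()]


def can_publish_live(model_dir: Path, task_ids: list[str], allow_partial: bool = False) -> bool:
    """A live publish needs every task file unless a partial run is explicitly allowed."""
    return allow_partial or not missing_tasks(model_dir, task_ids)
```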

Where did V0.4 go?

V0.4 remains in the repository for historical comparison and reproduction. V0.6 is the active benchmark release. Use --suite benchmarks/v0.4/suite.yaml if you need to run the older tasks.

More detail lives in the blog and the repository documentation.