ManimBench: open benchmark for ManimCE animation quality (V0.4 public suite)

FAQ

Common questions about rankings, official runs, and what you need to do to use or verify ManimBench.

What is ManimBench?

ManimBench is a public benchmark for AI models that write Manim Community Edition animations. Each model receives the same task prompts, writes Python scene code, and is scored under identical sandboxed rendering conditions. Rankings combine automated checks, visual sanity probes, cost, runtime, and token usage.

Do I need to run the benchmark myself?

No. This site publishes results, methodology, and task definitions. You only need to clone the repository if you want to verify a run, reproduce a score, or submit new outputs.

Which models appear on the leaderboard?

Any model with a complete official run on the active public suite. Pro-only or enterprise-only tiers are excluded unless they have a standard API path with the same task contract.

What is the V0.4 public suite?

Six focused tasks: basic layout, calculus derivative graph, linear algebra transformation, geometry measurement, probability distribution, and advanced math explanation. Each task is one file named outputs/<task_id>.py with a single MainScene class.
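As a rough illustration of this file contract, the sketch below checks that an outputs directory holds one file per task and that each file defines exactly one top-level class named MainScene. The task IDs and the helper names (check_outputs, count_main_scene_classes) are hypothetical and exist only for this example; the real IDs are in the published task definitions, and the actual runner may validate files differently.

```python
import ast
import tempfile
from pathlib import Path

# Hypothetical task IDs for illustration; the real IDs are in the task definitions.
TASK_IDS = [
    "basic_layout",
    "calculus_derivative_graph",
    "linear_algebra_transformation",
    "geometry_measurement",
    "probability_distribution",
    "advanced_math_explanation",
]


def count_main_scene_classes(source: str) -> int:
    """Count top-level classes named MainScene in a source string."""
    tree = ast.parse(source)
    return sum(
        1
        for node in tree.body
        if isinstance(node, ast.ClassDef) and node.name == "MainScene"
    )


def check_outputs(outputs_dir: Path) -> dict:
    """Return one status per task: ok, missing, syntax error, or bad class count."""
    status = {}
    for task_id in TASK_IDS:
        path = outputs_dir / f"{task_id}.py"
        if not path.is_file():
            status[task_id] = "missing"
            continue
        try:
            count = count_main_scene_classes(path.read_text(encoding="utf-8"))
        except SyntaxError:
            status[task_id] = "syntax error"
            continue
        status[task_id] = "ok" if count == 1 else "bad class count"
    return status


# A minimal scene file of the expected shape. It is only parsed here,
# so Manim does not need to be installed to run this sketch.
EXAMPLE_SCENE = '''
from manim import Scene, Text, Write

class MainScene(Scene):
    def construct(self):
        self.play(Write(Text("Hello")))
'''

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "outputs"
    demo_dir.mkdir()
    (demo_dir / "basic_layout.py").write_text(EXAMPLE_SCENE, encoding="utf-8")
    demo_status = check_outputs(demo_dir)

print(demo_status["basic_layout"])          # ok
print(demo_status["geometry_measurement"])  # missing
```

A pre-flight check like this can catch a missing or misnamed file before a full sandboxed run is spent on it.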

What counts as an official run?

Official runs use the container sandbox with network disabled, resource limits recorded, 60 FPS rendering, and a 120-second per-task cap. Local subprocess runs are useful for development but are marked non-official in result metadata.
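The conditions above can be read as a simple predicate over a run's settings. The sketch below is one way to express that, assuming a hypothetical RunConfig record; the field names and the is_official helper are invented for this example and are not the benchmark's actual metadata schema.

```python
from dataclasses import dataclass

OFFICIAL_FPS = 60          # from the FAQ: 60 FPS rendering
OFFICIAL_TASK_CAP_S = 120  # from the FAQ: 120-second per-task cap


@dataclass(frozen=True)
class RunConfig:
    # All field names are hypothetical; the real metadata schema may differ.
    sandbox: str            # for example "container" or "subprocess"
    network_enabled: bool
    limits_recorded: bool
    fps: int
    task_timeout_s: int


def is_official(cfg: RunConfig) -> bool:
    """True only when every official-run condition listed in the FAQ holds."""
    return (
        cfg.sandbox == "container"
        and not cfg.network_enabled
        and cfg.limits_recorded
        and cfg.fps == OFFICIAL_FPS
        and cfg.task_timeout_s == OFFICIAL_TASK_CAP_S
    )


official = RunConfig("container", False, True, 60, 120)
local_dev = RunConfig("subprocess", True, False, 60, 120)

print(is_official(official))   # True
print(is_official(local_dev))  # False
```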

How is the score calculated?

Scores combine AST source checks (required labels and constructs must appear in executable code, not in comments), render success, runtime metadata, and visual sanity sampling on rendered frames. Optional human visual review can adjust the published visual component. See the scoring blog post and repository docs for details.
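To make the AST source check more concrete, here is a minimal sketch using Python's standard ast module. It treats a label as present only if it appears in a string literal that is not a docstring, and a construct as present only if it is actually called. The function names, the sample requirements, and the decision to exclude docstrings are assumptions for illustration; the real scorer's rules are described in the repository docs and may differ.

```python
import ast


def executable_strings_and_calls(source: str):
    """Collect string literals (excluding docstrings) and called names."""
    tree = ast.parse(source)

    # Docstrings are bare string expressions opening a module, class, or function.
    docstring_ids = set()
    scopes = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if isinstance(node, scopes) and node.body:
            first = node.body[0]
            if (
                isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)
            ):
                docstring_ids.add(id(first.value))

    strings, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if id(node) not in docstring_ids:
                strings.add(node.value)
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                calls.add(func.id)
            elif isinstance(func, ast.Attribute):
                calls.add(func.attr)
    return strings, calls


def missing_requirements(source, required_labels, required_calls):
    """Return the labels and constructs that do not appear in executable code."""
    strings, calls = executable_strings_and_calls(source)
    missing_labels = [
        label for label in required_labels
        if not any(label in s for s in strings)
    ]
    missing_calls = [name for name in required_calls if name not in calls]
    return missing_labels, missing_calls


SAMPLE = '''
from manim import Axes, Scene, Text, Create, Write

class MainScene(Scene):
    def construct(self):
        # Tangent line  <- a comment does not count
        axes = Axes()
        title = Text("Derivative")
        self.play(Create(axes), Write(title))
'''

print(missing_requirements(SAMPLE, ["Derivative", "Tangent line"], ["Axes", "MathTex"]))
# (['Tangent line'], ['MathTex'])
```

Because comments never reach the syntax tree, a label mentioned only in a comment is reported as missing. This sketch would still accept a string assigned to an unused variable, so a stricter check would also need to confirm the value reaches the scene.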

Why file-backed outputs instead of live API generation?

File-backed submission works with any generator. Any model or agent can participate if it produces the six required Python files. The runner scores saved source, not a particular IDE or SDK.

How can I submit results for my model?

Generate all six task files, run the benchmark with the container sandbox, and publish the report bundle. Contact details for inclusion on the public leaderboard will be listed here once the V0.5 publish pipeline is live. Until then, open a GitHub issue with your run manifest and report archive.

What is changing in V0.5?

V0.5 adds OpenRouter-backed generation, a refreshed task suite, draft vs live site publishing, and an operator TUI. V0.4 results remain available for historical comparison. Follow the blog for release notes.