Blog
Notes on how ManimBench works, what we measure, and how the benchmark is built.
- Introducing ManimBench: Can coding models write ManimCE animations that explain math clearly? Fixed prompts, sandboxed renders, and public rankings.
- The V0.4 public suite: Six tasks, one Python file each, and a shared MainScene contract. What changed from the old single-video showcase.
- How scoring works: Source checks, sandbox renders, visual sanity probes, and optional human review.
- Sandbox execution and visual sanity: Why official runs use an isolated container, what goes in the manifest, and what frame sampling checks.
- Why the benchmark is file-backed: Generation and scoring are separate. Same prompts, saved outputs, any tool or API.