ManimBench asks whether coding models can write Manim Community Edition (ManimCE) animations that explain math clearly, not just code that exits without error.

A scene can render cleanly and still fail as explanation. Labels overlap. Graphs clip at the frame edge. A model can hardcode the final answer in static text and skip the steps the prompt asked for. ManimBench scores those failures alongside render success.

Why Manim

Most coding benchmarks check symbolic output or unit tests. ManimCE also requires layout, timing, labeling, and coordinate systems. A model has to build a scene graph and animate objects a viewer can follow.

ManimCE is widely used and maintained. It is expressive enough to be hard to fake with a static image or a block of LaTeX alone.

What gets measured

Each model gets the same six public tasks. Every task is one Python file with a single MainScene class, rendered at 60 FPS inside a sandboxed container.
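
As a rough illustration of that contract, a harness could render one submission along these lines. This is a sketch, not ManimBench's actual runner: the helper names build_render_command and render_submission and the 120-second cap are hypothetical, and it assumes the manim command-line tool is installed inside the container.

```python
import subprocess
from pathlib import Path

SCENE_NAME = "MainScene"  # the single scene class each task file defines
FPS = 60                  # frame rate stated for benchmark renders


def build_render_command(scene_file: Path) -> list[str]:
    """Build a ManimCE CLI call that renders MainScene at 60 FPS."""
    return ["manim", "render", "--fps", str(FPS), str(scene_file), SCENE_NAME]


def render_submission(scene_file: Path, time_cap_s: float = 120.0) -> bool:
    """Return True if the render exits cleanly within the time cap.

    The 120-second default is a placeholder, not the benchmark's real cap.
    """
    try:
        result = subprocess.run(
            build_render_command(scene_file),
            capture_output=True,
            timeout=time_cap_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Timed out, or the manim executable is not available.
        return False
    return result.returncode == 0
```

Keeping the command builder separate from the subprocess call makes the exact invocation easy to log in a manifest and to check without running a render.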

Scores combine:

  • Source structure. The required scene class is present, forbidden shortcuts are absent, and the labels named in the prompt appear in executable code.
  • Render success. Manim completes within the time cap and produces valid output media.
  • Visual sanity. Sampled frames checked for blank output, clipping, clutter, and low contrast.
  • Efficiency. Runtime, tokens, and estimated cost recorded next to quality.
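
The source-structure check in the list above can be approximated with Python's ast module. The sketch below is an assumption about how such a check might look, not the benchmark's own code: the FORBIDDEN_CALLS set, the check_source helper, and the sample submission are all hypothetical. Because it works on the parsed syntax tree, text that appears only in comments does not count as a label.

```python
import ast

# Hypothetical submission: one file, one MainScene class.
SAMPLE_SUBMISSION = '''
from manim import *

class MainScene(Scene):
    def construct(self):
        axes = Axes(x_range=[-3, 3], y_range=[0, 9])
        curve = axes.plot(lambda x: x**2)
        label = Text("y = x^2").next_to(axes, UP)
        self.play(Create(axes), Create(curve))
        self.play(Write(label))
'''

# Hypothetical shortcut list; the real forbidden set is not given here.
FORBIDDEN_CALLS = {"ImageMobject"}


def check_source(source: str, required_labels: list[str]) -> dict[str, bool]:
    """Run three structural checks on a submission without executing it."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))

    class_names = [n.name for n in nodes if isinstance(n, ast.ClassDef)]

    # Collect the names of everything that is called, e.g. Axes or plot.
    called = set()
    for n in nodes:
        if isinstance(n, ast.Call):
            if isinstance(n.func, ast.Name):
                called.add(n.func.id)
            elif isinstance(n.func, ast.Attribute):
                called.add(n.func.attr)

    # String literals in the parsed code; comments never reach the tree.
    strings = [
        n.value for n in nodes
        if isinstance(n, ast.Constant) and isinstance(n.value, str)
    ]

    return {
        "single_main_scene": class_names == ["MainScene"],
        "no_forbidden_calls": not (called & FORBIDDEN_CALLS),
        "labels_present": all(
            any(label in s for s in strings) for label in required_labels
        ),
    }


report = check_source(SAMPLE_SUBMISSION, ["y = x^2"])
```

A static check like this runs before any render, so a file that fails it never has to enter the sandbox.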

Automated checks cover the obvious cases. Human review can adjust the visual score when reviewers mark a task pass, partial, or fail.
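
One way to picture the interplay between automated checks and human review is the sketch below. It is illustrative only: the blank-frame test, the REVIEW_SCORES mapping, and the rule that a reviewer verdict replaces the automated value are assumptions made for this example, not the published scoring formula. Frames are modeled as plain nested lists of grayscale values so the example needs no image library.

```python
from typing import Optional

# A frame as rows of grayscale luminance values in [0, 1].
Frame = list[list[float]]

# Hypothetical mapping from reviewer verdict to visual score.
REVIEW_SCORES = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def is_blank(frame: Frame, tolerance: float = 0.02) -> bool:
    """Treat a frame as blank when all pixels have nearly equal luminance."""
    pixels = [p for row in frame for p in row]
    return max(pixels) - min(pixels) <= tolerance


def automated_visual_score(frames: list[Frame]) -> float:
    """Fraction of sampled frames that are not blank (0.0 if none sampled)."""
    if not frames:
        return 0.0
    return sum(not is_blank(f) for f in frames) / len(frames)


def visual_score(frames: list[Frame], review: Optional[str] = None) -> float:
    """Use the reviewer verdict when one exists, else the automated score."""
    if review is not None:
        return REVIEW_SCORES[review]
    return automated_visual_score(frames)


blank = [[0.0, 0.0], [0.0, 0.0]]
drawn = [[0.0, 1.0], [0.0, 0.0]]
auto = visual_score([blank, drawn])                  # half the frames show content
reviewed = visual_score([blank, drawn], review="pass")
```

Clipping, clutter, and contrast would need their own checks; blank detection is shown here because it is the simplest of the cases named above.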

Scope

ManimBench is not a general video benchmark. Official runs disable network access so scores reflect submitted Python, not downloaded assets.

A high score means a model cleared a fixed, repeatable bar. It does not replace classroom judgment.

Site and repository

This site publishes rankings, methodology, and task definitions. The repository holds manifests, container digests, and archived source for verification.

V0.5

V0.4 fixed the six-task public suite and the file-backed submission contract. V0.5 will add OpenRouter-backed generation, a refreshed task set, draft and live publishing, and batch tooling. The output contract stays the same: one scene per file, sandboxed renders, recorded manifests.

See the leaderboard, the V0.4 suite write-up, and how scoring works for details.