Models submit six Python files. The runner scores saved source in a fixed sandbox. Generation tooling does not matter for grading.

Many benchmarks tie API calls and grading into one pipeline. ManimBench requires only a folder of files with predictable names. How those files were produced is out of scope.

Why split generation and scoring

Reproducibility

Archived source can be re-rendered and re-scored without another API call, even if the model version that produced it is no longer available.

Fair comparison

Every model receives identical task text and is held to the same six filenames, scene class name, FPS, and timeout rules.

Tool freedom

OpenRouter, direct APIs, local agents, or hand edits all work if the output matches the contract. V0.5 adds batch generation helpers but keeps the same file layout.

Output contract

V0.4 expects six files under a model output directory:

  • basic_manim_layout.py
  • calculus_derivative_graph.py
  • linear_algebra_transformation.py
  • geometry_measurement_diagram.py
  • probability_distribution.py
  • advanced_math_explanation.py

Each file imports ManimCE and defines a scene class named MainScene. The runner maps files to task IDs, renders in the sandbox, and writes a report bundle.
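
The contract above can be checked before any rendering happens. The following is a minimal sketch, not the runner's actual validation: it assumes "imports ManimCE" means importing the `manim` package and that MainScene is a top-level class. The names check_submission, imports_manim, defines_main_scene, and SAMPLE_SOURCE are hypothetical and exist only for this illustration.

```python
import ast
from pathlib import Path

# The six filenames listed in the output contract.
EXPECTED_FILES = (
    "basic_manim_layout.py",
    "calculus_derivative_graph.py",
    "linear_algebra_transformation.py",
    "geometry_measurement_diagram.py",
    "probability_distribution.py",
    "advanced_math_explanation.py",
)

# Illustrative source for a file that would pass this check (never executed here).
SAMPLE_SOURCE = '''\
from manim import *


class MainScene(Scene):
    def construct(self):
        self.play(Write(Text("hello")))
'''


def imports_manim(tree: ast.Module) -> bool:
    """Return True if the module imports the `manim` package anywhere."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == "manim" for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.level == 0 and (node.module or "").split(".")[0] == "manim":
                return True
    return False


def defines_main_scene(tree: ast.Module) -> bool:
    """Return True if a top-level class named MainScene exists."""
    return any(
        isinstance(node, ast.ClassDef) and node.name == "MainScene"
        for node in tree.body
    )


def check_submission(model_dir) -> list[str]:
    """Return human-readable problems; an empty list means the folder looks valid."""
    problems = []
    root = Path(model_dir)
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"{name}: missing")
            continue
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=name)
        except SyntaxError as exc:
            problems.append(f"{name}: syntax error on line {exc.lineno}")
            continue
        if not imports_manim(tree):
            problems.append(f"{name}: no manim import")
        if not defines_main_scene(tree):
            problems.append(f"{name}: no top-level MainScene class")
    return problems
```

Calling check_submission on a model output directory returns an empty list when all six files are present, parse cleanly, and meet both conditions. The check is static only; it does not confirm that a scene renders within the sandbox limits.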

Common misconceptions

File-backed does not mean manual copy-paste for every run. Operators can automate generation as long as on-disk files match the contract before scoring.
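
As an illustration of automated generation, the sketch below writes already-generated source strings to disk under the contract filenames. It assumes the operator has some generation step of their own that yields a mapping from filename to source. The function write_outputs is hypothetical and is not part of the ManimBench repository.

```python
from pathlib import Path
from typing import Mapping

# The six filenames from the output contract.
EXPECTED_FILES = frozenset({
    "basic_manim_layout.py",
    "calculus_derivative_graph.py",
    "linear_algebra_transformation.py",
    "geometry_measurement_diagram.py",
    "probability_distribution.py",
    "advanced_math_explanation.py",
})


def write_outputs(model_dir, sources: Mapping[str, str]) -> list[Path]:
    """Write generated sources to model_dir, refusing any off-contract filename set."""
    names = set(sources)
    if names != EXPECTED_FILES:
        missing = sorted(EXPECTED_FILES - names)
        extra = sorted(names - EXPECTED_FILES)
        raise ValueError(f"filename mismatch: missing={missing} extra={extra}")

    root = Path(model_dir)
    root.mkdir(parents=True, exist_ok=True)

    written = []
    for name in sorted(sources):
        path = root / name
        path.write_text(sources[name], encoding="utf-8")
        written.append(path)
    return written
```

Whatever produces the strings (an API client, a local agent, a hand-edited draft) stays outside this function, which mirrors the split between generation and scoring.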

Local non-container runs are fine for debugging. They are not labeled official on the public leaderboard.

Site vs repository

This site publishes rankings. The repository holds manifests, prompts, and rerunnable scoring.

See the V0.4 suite overview, the FAQ, or the leaderboard.