Scoring runs in layers: source checks, render checks, visual sanity, and optional human review.

There is no single function that decides whether an explainer video worked. ManimBench aggregates evidence from each layer. A severe failure at any stage can cap the final score.
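One way to picture a capped score is as a base value limited by the tightest cap any layer reports. The sketch below is illustrative only: the `LayerResult` structure, the plain averaging, and the cap values are assumptions, not ManimBench's actual formula.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerResult:
    """Evidence from one scoring layer (hypothetical structure)."""
    name: str
    score: float                 # layer score in [0, 1]
    cap: Optional[float] = None  # set when a severe failure limits the final score


def aggregate(layers: list[LayerResult]) -> float:
    """Average the layer scores, then apply the tightest cap, if any."""
    if not layers:
        return 0.0
    base = sum(layer.score for layer in layers) / len(layers)
    caps = [layer.cap for layer in layers if layer.cap is not None]
    return min([base] + caps)


layers = [
    LayerResult("source", 1.0),
    LayerResult("render", 1.0),
    LayerResult("visual", 0.5, cap=0.4),  # e.g. mostly blank frames (made-up values)
]
final_score = aggregate(layers)  # the visual cap wins over the higher average
```

Taking the minimum means strong source and render results cannot mask a severe visual failure.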

Source checks

Before Manim runs, the scorer parses submitted Python. It checks for a MainScene class, forbidden imports, and required labels in executable code rather than comments or dead strings.

Models often submit placeholder scenes that render a title card but skip the graph or diagram the prompt asked for. The scorer flags these, and it flags keyword stuffing in comments the same way.
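The source checks above can be sketched with Python's standard `ast` module. This is a minimal sketch, not the scorer's real code: the `check_source` function, the deny-list, and the rule "a label counts only when it is a string argument to a call" are all assumptions made for illustration.

```python
import ast

FORBIDDEN_IMPORTS = {"os", "subprocess", "socket"}  # hypothetical deny-list


def check_source(source: str, required_labels: list[str]) -> dict:
    """Return simple pass/fail evidence for one submission."""
    tree = ast.parse(source)  # parses only; the submission is never executed

    has_main_scene = any(
        isinstance(node, ast.ClassDef) and node.name == "MainScene"
        for node in ast.walk(tree)
    )

    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])

    # Collect string constants passed as call arguments. Comments never reach
    # the AST, and bare string statements are not call arguments.
    used_strings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            for arg in list(node.args) + [kw.value for kw in node.keywords]:
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                    used_strings.append(arg.value)

    missing = [
        label for label in required_labels
        if not any(label in s for s in used_strings)
    ]
    return {
        "has_main_scene": has_main_scene,
        "forbidden_imports": sorted(imported & FORBIDDEN_IMPORTS),
        "missing_labels": missing,
    }


submission = '''
from manim import *
import os

class MainScene(Scene):
    def construct(self):
        # derivative tangent slope
        "tangent line"
        self.add(Text("derivative"))
'''
report = check_source(submission, ["derivative", "tangent line"])
```

Here "tangent line" appears only in a comment and a dead string, so it is reported as missing, while "derivative" is accepted because it is passed to `Text`. The sketch does not detect strings inside functions that are never called; a real check would need more than this.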

Render checks

Official runs use a container sandbox with network disabled, resource limits, and a timeout. The runner records exit code, output media, FPS, and duration.
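A rough sketch of such a runner follows. The Docker flags shown (`--network none`, `--memory`, `--cpus`) are standard `docker run` options, but the image name, the limit values, the mount layout, and both helper functions are placeholders chosen for illustration, not the project's configuration. The sketch records only exit code and duration; output media and FPS would need to be read from the rendered file.

```python
import subprocess
import sys
import time


def sandbox_command(image: str, workdir: str, inner: list[str]) -> list[str]:
    """Build a `docker run` invocation with networking off and resources limited."""
    return [
        "docker", "run", "--rm",
        "--network", "none",   # no network access inside the container
        "--memory", "2g",      # hypothetical memory limit
        "--cpus", "2",         # hypothetical CPU limit
        "-v", f"{workdir}:/work",
        "-w", "/work",
        image,
        *inner,
    ]


def run_with_timeout(cmd: list[str], timeout_s: float) -> dict:
    """Run a command and record exit code, timeout status, and wall-clock duration."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        exit_code, timed_out = proc.returncode, False
    except subprocess.TimeoutExpired:
        exit_code, timed_out = None, True
    return {
        "exit_code": exit_code,
        "timed_out": timed_out,
        "duration_s": time.monotonic() - start,
    }


# Build (but do not run) a sandboxed render command; the image name is a placeholder.
cmd = sandbox_command(
    "example/manim-image:tag", "/tmp/task",
    ["manim", "-ql", "scene.py", "MainScene"],
)

# Exercise the timeout wrapper with a harmless local command instead of Docker.
result = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30)
```

Note that a timeout in `subprocess.run` stops the `docker` client process; a production runner would also need to make sure the container itself is stopped.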

Manim version, container digest, and per-task runtime go in the manifest so comparisons use documented conditions.
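As an illustration of what such a manifest might hold, the sketch below serializes the three items named above to JSON. The `RunManifest` class, its field names, and the placeholder values are assumptions; the real manifest schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunManifest:
    """Conditions under which a run was produced (hypothetical schema)."""
    manim_version: str
    container_digest: str
    task_runtimes_s: dict[str, float] = field(default_factory=dict)


manifest = RunManifest(
    manim_version="<manim version>",       # placeholder, filled in by the runner
    container_digest="sha256:<digest>",    # placeholder, filled in by the runner
)
manifest.task_runtimes_s["task-001"] = 12.5  # made-up task id and runtime

# Sorted keys keep the file stable across runs, which makes diffs easier to read.
manifest_json = json.dumps(asdict(manifest), indent=2, sort_keys=True)
```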

Visual sanity

A clean render may still be unreadable. After export, the runner samples frames and checks for:

  • Blank or flat frames
  • Overlapping labels and foreground clutter
  • Objects clipped at the frame edge
  • Low contrast between text and background

These checks are triage, not a full quality review. Severe visual failures cap the automated score even when required terms appear in source.
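Two of these checks are easy to sketch in plain Python. The example below treats a frame as a list of rows of RGB tuples, flags a frame as flat when its luminance barely varies, and measures text/background contrast with the WCAG 2.x contrast-ratio formula. Using WCAG here is a substitution made for illustration, and the flatness threshold is a made-up value; the runner's actual metrics are not specified in this document. Overlap and clipping detection are not covered.

```python
from statistics import pstdev

RGB = tuple[int, int, int]


def luminance(rgb: RGB) -> float:
    """WCAG 2.x relative luminance of an sRGB colour with 0-255 channels."""
    def linear(channel: int) -> float:
        v = channel / 255
        return v / 12.92 if v <= 0.03928 else ((v + 0.055) / 1.055) ** 2.4

    r, g, b = (linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: RGB, bg: RGB) -> float:
    """WCAG contrast ratio, from 1 (identical) up to about 21 (black on white)."""
    hi, lo = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)


def is_flat(frame: list[list[RGB]], threshold: float = 0.01) -> bool:
    """True when luminance barely varies across the frame (hypothetical threshold)."""
    values = [luminance(px) for row in frame for px in row]
    return pstdev(values) < threshold


BLACK, WHITE, DARK_GREY = (0, 0, 0), (255, 255, 255), (40, 40, 40)

blank_frame = [[BLACK] * 4 for _ in range(4)]
split_frame = [[BLACK, BLACK, WHITE, WHITE] for _ in range(4)]

blank_is_flat = is_flat(blank_frame)   # every pixel identical
split_is_flat = is_flat(split_frame)   # half black, half white
low_contrast = contrast_ratio(DARK_GREY, BLACK)
high_contrast = contrast_ratio(WHITE, BLACK)
```

Checks like these are cheap enough to run on every sampled frame, which fits their role as triage.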

Human review

Reviewers score tasks on a 0–5 rubric: math correctness, Manim usage, clarity, pacing, prompt faithfulness, depth, and reproducibility.

Review files adjust the published visual column. The leaderboard shows when review is still pending.

Cost and runtime

Wall-clock time, token counts, and estimated cost are published beside scores. They are separate columns, not folded into one composite number.

Reading the leaderboard

Check the overall score, the per-task breakdown, and whether the run is official (container) or local. A pending review means the score rests on automated evidence only.

Manifests and archived source in the repository support reruns. See scoring.md and sandbox execution for more.