How scoring works | ManimBench

The V0.6 public suite runs in layers: capability score, hard pass gates, failure buckets, and optional human review.

There is no single function that decides whether an explainer video worked. ManimBench aggregates evidence from each layer. A severe failure at any stage can cap the final score, but the leaderboard now separates rank score from diagnostic fields.
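One way to read "a severe failure can cap the final score" is as a minimum over the capability score and any caps triggered by later layers. The sketch below assumes that reading. The class, the field names, the score scale, and the cap values are hypothetical and are not taken from the ManimBench scorer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LayerEvidence:
    """Evidence from one scoring layer (hypothetical structure)."""
    name: str
    # Upper bound this layer places on the rank score; None means no cap.
    cap: Optional[float] = None
    # Diagnostic labels, reported separately from the rank score.
    failure_buckets: List[str] = field(default_factory=list)


def aggregate(capability_score: float, layers: List[LayerEvidence]) -> dict:
    """Apply every triggered cap and keep diagnostics in their own field."""
    caps = [layer.cap for layer in layers if layer.cap is not None]
    rank_score = min([capability_score, *caps])
    buckets = [b for layer in layers for b in layer.failure_buckets]
    return {"rank_score": rank_score, "failure_buckets": buckets}


if __name__ == "__main__":
    layers = [
        LayerEvidence("render"),
        LayerEvidence("visual", cap=40.0, failure_buckets=["blank_frames"]),
    ]
    print(aggregate(82.0, layers))
```

Keeping the failure buckets out of the rank score mirrors the separation between rank score and diagnostic fields described above.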

Source checks

Before Manim runs, the scorer parses submitted Python. It checks for a MainScene class, forbidden imports, and required labels in executable code rather than comments or dead strings.

Required source terms are advisory in V0.6. Missing terms still lower the score, but on their own they no longer fail an otherwise valid rendered animation. Term stuffing is treated differently: models that put terms into comments, dead strings, or unrelated text still fail the hard gate.

Render checks

Official runs use a container sandbox with network disabled, resource limits, and a timeout. The runner records exit code, output media, FPS, and duration.
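As a rough illustration of such a sandboxed run, the sketch below builds a `docker run` command with networking disabled and memory and CPU limits, then runs it under a timeout. The image name, mount layout, limit values, and function names are assumptions for the example; the official runner's configuration may differ.

```python
import subprocess


def build_sandbox_command(image, workdir, scene_file, memory="2g", cpus="2"):
    """Return a docker command that renders MainScene with no network access."""
    return [
        "docker", "run", "--rm",
        "--network", "none",        # no network inside the container
        "--memory", memory,         # memory limit
        "--cpus", cpus,             # CPU limit
        "-v", f"{workdir}:/work",   # hypothetical mount point
        "-w", "/work",
        image,
        "manim", "render", scene_file, "MainScene",
    ]


def run_sandboxed(command, timeout_s=120.0):
    """Run the command and record the exit code, or note a timeout."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
        return {"exit_code": proc.returncode, "timed_out": False}
    except subprocess.TimeoutExpired:
        # The timeout stops the docker client process; a real runner would
        # also need to stop the container itself.
        return {"exit_code": None, "timed_out": True}
```

FPS and duration would be read from the exported media after the run, which this sketch leaves out.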

Manim version, container digest, and per-task runtime go in the manifest so comparisons use documented conditions.
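A manifest carrying those three fields might look like the following. This is only a sketch: the class name, the key names, and the JSON layout are hypothetical, and the values are placeholders and not real run data.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class RunManifest:
    """Conditions under which a run was produced (hypothetical schema)."""
    manim_version: str
    container_digest: str
    # Per-task runtime in seconds, keyed by task identifier.
    task_runtimes_s: Dict[str, float] = field(default_factory=dict)


def manifest_to_json(manifest: RunManifest) -> str:
    """Serialize with sorted keys so manifests diff cleanly between runs."""
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)


if __name__ == "__main__":
    manifest = RunManifest(
        manim_version="<manim version>",       # placeholder
        container_digest="sha256:<digest>",    # placeholder
        task_runtimes_s={"<task id>": 0.0},    # placeholder
    )
    print(manifest_to_json(manifest))
```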

Visual sanity

A clean render may still be unreadable. After export, the runner samples frames and checks for:

Blank or flat frames
Overlapping labels and foreground clutter
Objects clipped at the frame edge
Low contrast between text and background

These checks are triage, not a full quality review. Severe visual failures cap the automated score even when required terms appear in source.
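Two of the checks listed above, flat frames and low text contrast, can be sketched in plain Python. The thresholds, function names, and the choice of a WCAG-style contrast ratio are assumptions for illustration; the source does not say how the runner measures either property.

```python
from statistics import pstdev


def is_flat_frame(gray_pixels, min_stddev=2.0):
    """True when a frame's 0-255 luminance values barely vary.

    min_stddev is a hypothetical threshold, not a ManimBench setting.
    """
    return pstdev(gray_pixels) < min_stddev


def _linearize(channel):
    """Convert one 0-255 sRGB channel to linear light."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (r, g, b) color with 0-255 channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """WCAG-style contrast ratio, from 1 (identical) up to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    print(is_flat_frame([12] * 1000))                      # uniform frame
    print(contrast_ratio((255, 255, 255), (0, 0, 0)))      # white on black
    print(contrast_ratio((120, 120, 120), (100, 100, 100)))  # gray on gray
```

A frame sampler would feed `is_flat_frame` with grayscale pixel values from each sampled frame, and `contrast_ratio` with the text color and the background color behind it.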

Human review

Reviewers score tasks on a 0–5 rubric: math correctness, Manim usage, clarity, pacing, prompt faithfulness, depth, and reproducibility.

Review files adjust the published visual column. The leaderboard shows when review is still pending.

Cost and runtime

Wall-clock time, token counts, and estimated cost are published beside scores. They appear as separate columns and are not folded into one composite number.

Reading the leaderboard

Check capability score first, then pass rate, coverage, render success, and failure buckets. A pending review means the result rests on automated evidence only.

Manifests and archived source in the repository support reruns. See scoring.md and sandbox execution for more.