ManimBench · V0.4 public suite · Open benchmark for ManimCE animation quality

Methodology

ManimBench scores file-backed ManimCE submissions under fixed prompts and a shared sandbox. Rankings compare models on animation quality, not just whether Manim exits cleanly.

Task coverage

Six separate tasks cover layout, calculus graphs, linear transforms, measured geometry, probability visuals, and a longer math explanation.

Output contract

Each task produces one Python file at outputs/<task_id>.py. The file must contain the line "from manim import *" and define a single class named MainScene.

Sandbox render

Official runs use Docker with the network disabled, a pinned ManimCE image, resource limits, 60 FPS output, and a 120-second cap per task.
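A sandboxed render of this kind could be launched roughly as follows. This is a minimal sketch, not the benchmark's actual runner: the image tag, the memory and CPU limits, and the helper names build_render_command and render are assumptions made for illustration. Only the disabled network, the 60 FPS setting, and the 120-second cap come from the description above.

```python
import subprocess
from pathlib import Path

# Placeholder image reference (assumption); official runs pin a specific image.
IMAGE = "manimcommunity/manim:stable"
TIMEOUT_SECONDS = 120  # per-task cap from the methodology
FPS = 60               # output frame rate from the methodology


def build_render_command(task_id: str, workdir: Path) -> list[str]:
    """Build a docker argv that renders outputs/<task_id>.py with networking disabled."""
    return [
        "docker", "run", "--rm",
        "--network", "none",   # no network access inside the container
        "--memory", "4g",      # assumed memory limit
        "--cpus", "2",         # assumed CPU limit
        "-v", f"{workdir.resolve()}:/manim",
        IMAGE,
        "manim", "render", "--fps", str(FPS),
        f"outputs/{task_id}.py", "MainScene",
    ]


def render(task_id: str, workdir: Path) -> bool:
    """Return True when the render exits with code 0 inside the time cap."""
    try:
        result = subprocess.run(
            build_render_command(task_id, workdir),
            timeout=TIMEOUT_SECONDS,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Note that the subprocess timeout only stops waiting on the docker client. A production runner would also need to stop the container itself when the cap is hit.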

Source checks

The source is parsed into an AST to verify that required labels and constructs appear in executable code rather than in comments or unused text. Placeholders, keyword stuffing, and inactive scenes are penalized.
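The idea behind these checks can be sketched with Python's standard ast module. The function name check_source, the sample submission, and the rule that a label only counts when it is a string argument to a call inside MainScene are illustrative assumptions, not the benchmark's real rules.

```python
import ast

# Hypothetical submission used only to exercise the checker below.
SAMPLE = '''
from manim import *

class MainScene(Scene):
    def construct(self):
        title = Text("Derivative")
        # Tangent line  <- comments never reach the AST
        self.play(Write(title))
'''


def check_source(source: str, required_labels: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    tree = ast.parse(source)

    # The output contract requires the star import from manim.
    has_star_import = any(
        isinstance(node, ast.ImportFrom)
        and node.module == "manim"
        and any(alias.name == "*" for alias in node.names)
        for node in tree.body
    )
    if not has_star_import:
        problems.append("missing 'from manim import *'")

    # Require exactly one top-level class, and it must be MainScene.
    classes = [node for node in tree.body if isinstance(node, ast.ClassDef)]
    if len(classes) != 1 or classes[0].name != "MainScene":
        problems.append("expected a single MainScene class")

    # Collect string literals passed as call arguments inside MainScene.
    live_strings = []
    for cls in classes:
        if cls.name != "MainScene":
            continue
        for node in ast.walk(cls):
            if isinstance(node, ast.Call):
                values = list(node.args) + [kw.value for kw in node.keywords]
                for value in values:
                    if isinstance(value, ast.Constant) and isinstance(value.value, str):
                        live_strings.append(value.value)

    for label in required_labels:
        if not any(label in text for text in live_strings):
            problems.append(f"required label not found in live code: {label!r}")
    return problems
```

With this sketch, check_source(SAMPLE, ["Derivative"]) reports no problems, while asking for "Tangent line" fails because that text appears only in a comment.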

Visual sanity

Sampled frames are checked for blank output, clutter, edge clipping, low contrast, and likely label overlap after render.
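A few of these probes can be illustrated on a single grayscale frame. This is a simplified sketch under stated assumptions: frames are plain nested lists of 0–255 integers, the background value, contrast threshold, and margin are made-up defaults, and frame_flags is a hypothetical helper. Clutter and label-overlap detection are not covered here.

```python
def frame_flags(frame, background=0, contrast_min=40, margin=1):
    """Flag a grayscale frame (rows of 0-255 ints) as blank, low-contrast, or edge-touching."""
    pixels = [p for row in frame for p in row]
    flags = []

    # Blank: every pixel equals the background value.
    if all(p == background for p in pixels):
        flags.append("blank")
    # Low contrast: brightest and darkest pixels are too close together.
    elif max(pixels) - min(pixels) < contrast_min:
        flags.append("low_contrast")

    # Edge clipping: any non-background pixel inside the border margin.
    height, width = len(frame), len(frame[0])
    for y, row in enumerate(frame):
        for x, p in enumerate(row):
            on_edge = (
                y < margin or y >= height - margin
                or x < margin or x >= width - margin
            )
            if on_edge and p != background:
                flags.append("edge_clipping")
                return flags
    return flags
```

A real pipeline would sample several frames per render and aggregate the flags. The thresholds above are placeholders rather than the benchmark's tuned values.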

Usage accounting

Reports record generation time, input and output tokens, total tokens, and estimated USD cost from the model registry.
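The cost estimate is presumably a per-token price lookup. The sketch below assumes a registry keyed by model name with prices in USD per million tokens; the registry layout, the model name, the prices, and the usage_report helper are all hypothetical.

```python
# Hypothetical registry entry: prices in USD per one million tokens.
MODEL_REGISTRY = {
    "example-model": {"input_per_mtok": 3.00, "output_per_mtok": 15.00},
}


def usage_report(model, generation_seconds, input_tokens, output_tokens):
    """Assemble the usage fields for one task, including an estimated USD cost."""
    prices = MODEL_REGISTRY[model]
    cost = (
        input_tokens * prices["input_per_mtok"]
        + output_tokens * prices["output_per_mtok"]
    ) / 1_000_000
    return {
        "model": model,
        "generation_seconds": generation_seconds,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": input_tokens + output_tokens,
        "estimated_cost_usd": cost,
    }
```

For example, 2,000 input tokens and 1,000 output tokens at the assumed prices give (2000 × 3 + 1000 × 15) / 1,000,000 = 0.021 USD.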

Public suite (V0.4)

The default public suite is benchmarks/v0.4/suite.yaml. Every model gets the same master prompt plus task-specific requirements. Each task targets a different layout or math failure mode so one strong scene cannot hide weak ones.

  • basic_manim_layout · titles, grouping, alignment, pacing
  • calculus_derivative_graph · function plot, point, tangent line, labels
  • linear_algebra_transformation · matrix acting on a grid or basis vectors
  • geometry_measurement_diagram · lengths, angles, shaded region
  • probability_distribution · distribution visual with readable parameters
  • advanced_math_explanation · multi-step Fourier heat equation intuition

Older suites remain in the repository for history. Public rankings on this site use V0.4 only.

How scoring works

Scores are built in layers. A failure at an early layer can cap or zero out the task before later checks run.

  1. Source. Required scene class, forbidden imports, required labels in live code, suspicious placeholder patterns.
  2. Render. Sandbox exit code, timeout, output file present, FPS and duration within bounds.
  3. Visual sanity. Frame sampling and layout probes for unreadable or empty output.
  4. Human review (optional). Reviewers score math correctness, Manim usage, clarity, pacing, and prompt faithfulness on a 0–5 rubric.

Severe visual failures can cap the automated score even when required strings appear in source. The leaderboard shows review status and adjusted visual scores when review files exist.
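The layering and capping described above can be sketched as a short function. The 0–100 scale, the cap value, and the decision to zero out the task on source or render failure are assumptions chosen for illustration; optional human review is left out.

```python
SEVERE_VISUAL_CAP = 40.0  # assumed cap on an assumed 0-100 scale


def score_task(source_ok: bool, render_ok: bool,
               severe_visual_failure: bool, raw_score: float) -> float:
    """Apply the layers in order; an early failure short-circuits later checks."""
    if not source_ok:
        return 0.0  # layer 1: source checks failed
    if not render_ok:
        return 0.0  # layer 2: render failed or timed out
    if severe_visual_failure:
        # Layer 3: cap the score even though the source checks passed.
        return min(raw_score, SEVERE_VISUAL_CAP)
    return raw_score
```

Under these assumptions, a task with a raw score of 90 and a severe visual failure is published as 40, while a task already scoring 25 is unaffected by the cap.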

Official vs local runs

Only container sandbox runs with network disabled count as official for public comparison. The run manifest records Manim version, container digest, resource limits, and per-task runtime.

Local subprocess renders are supported for debugging. They are marked official: false and do not appear on the public leaderboard.
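The official flag could be derived from the run manifest along these lines. The field names runner and network, and the example manifest values, are hypothetical; the actual manifest schema is not shown here.

```python
def is_official(manifest: dict) -> bool:
    """A run counts as official only if it used a container with networking disabled."""
    return manifest.get("runner") == "container" and manifest.get("network") == "none"


# Hypothetical manifests for a sandboxed run and a local debugging run.
sandbox_run = {
    "runner": "container",
    "network": "none",
    "manim_version": "<pinned version>",
    "container_digest": "<image digest>",
}
local_run = {"runner": "subprocess", "network": "host"}

for run in (sandbox_run, local_run):
    run["official"] = is_official(run)
```

Here sandbox_run ends up with official set to True and local_run with official set to False, so only the first would be eligible for the public leaderboard.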

What counts as passing

A task generally needs valid source, a successful render inside the time cap, and no severe visual sanity failure. Human review can still lower the published visual column if the animation is misleading or hard to follow.

Reading the leaderboard

Columns on the homepage table:

  • Score · overall automated score across the suite
  • Visual · review-adjusted visual score when review data exists
  • Cost / Time / Output tokens · efficiency metadata from the run report
  • Review · pending, complete, or partial human review state

Use the metric tabs to re-rank by cost, wall-clock time, or tokens. Selecting a focus model highlights its row in the chart.