Methodology

ManimBench scores generated ManimCE submissions under fixed prompts and a shared sandbox. V0.6 ranks models by capability score while keeping pass rate, coverage, render success, and failure buckets separate.

Task coverage

Six separate V0.6 tasks cover coordinate systems, calculus motion, linear transforms, area proof, probability simulation, and Fourier decomposition.

Output contract

Each task produces one Python file at outputs/<task_id>.py that contains from manim import * and defines a single MainScene class.
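The contract above could be checked statically with something like the following. This is a minimal sketch, not the harness's actual checker: the function name check_output_contract is hypothetical, and it assumes "single" means exactly one top-level class named MainScene.

```python
import ast


def check_output_contract(source: str) -> list[str]:
    """Return contract violations; an empty list means the file conforms."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []

    # Look for a top-level `from manim import *`.
    has_star_import = any(
        isinstance(node, ast.ImportFrom)
        and node.module == "manim"
        and any(alias.name == "*" for alias in node.names)
        for node in tree.body
    )
    if not has_star_import:
        problems.append("missing 'from manim import *'")

    # Assumption: "a single MainScene class" means exactly one top-level
    # class with that name.
    main_scenes = [
        node
        for node in tree.body
        if isinstance(node, ast.ClassDef) and node.name == "MainScene"
    ]
    if len(main_scenes) != 1:
        problems.append(
            f"expected exactly one MainScene class, found {len(main_scenes)}"
        )
    return problems
```

Because the check only parses the file, it never imports or executes the submission, so it is safe to run outside the sandbox.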

Sandbox render

Official runs use Docker with the network off, a pinned ManimCE image, resource limits, 60 FPS output, and a 120-second cap per task.
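As an illustration, a render invocation with these constraints might be assembled as below. This is a sketch under stated assumptions: the helper names, the memory and CPU limits, and the /manim mount point are hypothetical, and the real harness command is not shown in this document. Only the network-off, 60 FPS, and 120-second settings come from the text above.

```python
import subprocess


def build_render_command(workdir, task_file, image, fps=60):
    """Build a docker argv list for one task render (not executed here)."""
    return [
        "docker", "run", "--rm",
        "--network", "none",        # network off, as required for official runs
        "--memory", "2g",           # hypothetical resource limit
        "--cpus", "2",              # hypothetical resource limit
        "-v", f"{workdir}:/manim",  # assumed mount point inside the image
        image,                      # pinned image reference supplied by the caller
        "manim", "render", "--fps", str(fps), task_file, "MainScene",
    ]


def run_with_cap(cmd, cap_seconds=120):
    """Run a command and classify the outcome under a wall-clock cap."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=cap_seconds)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "render_crash"
```

Passing the image as a parameter keeps the pinned digest in configuration rather than in code. Note that a client-side timeout stops the docker process but may leave the container running, so a production harness would likely also need to stop the container.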

Source checks

AST parsing verifies required labels in executable code. Source-term hints are scored as advisory evidence; placeholders, keyword stuffing, and inactive scenes remain hard failures.
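One way to restrict label checks to executable code is to read string literals from the syntax tree and ignore comments and docstrings. The sketch below assumes that approach; the function names and the example labels are hypothetical, and the real scorer may apply stricter rules (for example, requiring the string to reach a Manim text object).

```python
import ast


def live_string_literals(source: str) -> list[str]:
    """Collect string literals that are part of executable expressions."""
    tree = ast.parse(source)

    # A bare string used as a statement (docstrings included) does nothing
    # at runtime, so it is treated as inert. Comments never reach the AST.
    inert = {
        id(node.value)
        for node in ast.walk(tree)
        if isinstance(node, ast.Expr)
        and isinstance(node.value, ast.Constant)
        and isinstance(node.value.value, str)
    }
    return [
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and id(node) not in inert
    ]


def missing_labels(source: str, required: list[str]) -> list[str]:
    """Return required labels that appear in no live string literal."""
    literals = live_string_literals(source)
    return [
        label
        for label in required
        if not any(label in text for text in literals)
    ]
```

With this approach, a label that appears only in a comment or docstring is reported as missing, which is the behavior needed to reject keyword stuffing outside live code.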

Visual sanity

Sampled frames are checked for blank output, clutter, edge clipping, low contrast, and likely label overlap after render.
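Two of these probes, blank output and edge clipping, can be sketched on a grayscale frame represented as a 2D list of 0–255 values. The thresholds, the function names, and the assumption of a dark background are all hypothetical; the actual probes and their tuning are not described here.

```python
def luminance_range(frame):
    """Difference between the brightest and darkest pixel in a frame."""
    flat = [pixel for row in frame for pixel in row]
    return max(flat) - min(flat)


def is_blank(frame, min_range=8):
    """Treat a frame with almost no luminance variation as blank."""
    return luminance_range(frame) < min_range


def clips_edge(frame, background=0, margin=1, tolerance=8):
    """Report whether non-background pixels touch the frame border."""
    height, width = len(frame), len(frame[0])
    for y, row in enumerate(frame):
        for x, pixel in enumerate(row):
            on_border = (
                y < margin or y >= height - margin
                or x < margin or x >= width - margin
            )
            if on_border and abs(pixel - background) > tolerance:
                return True
    return False


def visual_flags(frame):
    """Collect the names of the probes that fired for one frame."""
    flags = []
    if is_blank(frame):
        flags.append("blank")
    elif clips_edge(frame):
        flags.append("edge_clipping")
    return flags
```

Clutter, low contrast, and label overlap need richer signals (edge density, foreground/background contrast, bounding boxes) and are left out of this sketch.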

Usage accounting

Reports record provider route, request IDs, generation time, tokens, cost when available, and run-level manifest data.

Public suite (V0.6)

The default public suite is benchmarks/v0.6/suite.yaml. Every model gets the same master prompt plus task-specific requirements. Each task targets a different animation or math failure mode so one strong scene cannot hide weak ones.

coordinate_system_animation · axes, points, vectors, and coordinate labels
derivative_motion_story · moving point, tangent line, and rate-of-change labels
matrix_transformation_grid · matrix action on grid and basis vectors
geometric_area_proof · annotated decomposition and area equality
probability_distribution_simulation · distribution evolution with readable parameters
fourier_series_decomposition · partial sums, target wave, and coefficient explanation

Older suites remain in the repository for history. Public rankings on this site use V0.6.

Generation routes

OpenRouter is the default gateway for public API models with official OpenRouter slugs. Composer 2.5 is generated through Cursor Agent CLI because OpenRouter does not currently publish a Composer model slug.

Generation is checkpointed. Complete outputs/<model>/<task_id>.py files are skipped unless the operator passes --force. Every generation call is logged as JSONL with provider route, model ID, task ID, request ID, elapsed time, token counts, cost when available, and status.
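The checkpoint and logging behavior might look roughly like this. It is a sketch with hypothetical helper names, and it assumes "complete" means the output file exists and is non-empty; the real completeness test may differ.

```python
import json
import tempfile
from pathlib import Path


def should_generate(out_dir, model, task_id, force=False):
    """Skip tasks whose output file already exists unless force is set."""
    target = Path(out_dir) / model / f"{task_id}.py"
    # Assumption: a file counts as complete when it exists and is non-empty.
    complete = target.is_file() and target.stat().st_size > 0
    return force or not complete


def log_call(log_path, record):
    """Append one generation record as a single JSON line."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        task = "matrix_transformation_grid"
        if should_generate(tmp, "example-model", task):
            # Field names mirror the list above; the values are placeholders.
            log_call(
                Path(tmp) / "generation.jsonl",
                {"model_id": "example-model", "task_id": task, "status": "ok"},
            )
```

Appending one JSON object per line means an interrupted run leaves a readable log, and a later run can resume without rewriting earlier records.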

How scoring works in V0.6

Scores are built in layers. A failure at an early layer can cap or zero out the task before later checks run. The capability score is the primary ranking field, not the only evidence shown.

Source. Required scene class, forbidden imports, required labels in live code, suspicious placeholder patterns, and advisory source-term evidence.
Render. Sandbox exit code, timeout, output file present, FPS and duration within bounds.
Visual sanity. Frame sampling and layout probes for unreadable or empty output.
Human review (optional). Reviewers score math correctness, Manim usage, clarity, pacing, and prompt faithfulness on a 0–5 rubric.

Required source terms are no longer fatal by themselves. A rendered animation can pass while missing an implementation hint, but the miss still lowers the capability score and appears in per-task evidence. Keyword stuffing remains a hard failure.

Severe visual failures can cap the automated score even when required strings appear in source. The leaderboard shows review status and adjusted visual scores when review files exist.
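The layering can be pictured as a sequence of gates followed by deductions and caps. The sketch below is illustrative only: the starting score of 100, the per-term penalty, the visual cap, and the field names are hypothetical and are not the published V0.6 weights.

```python
from dataclasses import dataclass


@dataclass
class TaskEvidence:
    source_valid: bool
    keyword_stuffing: bool
    missing_source_terms: int
    rendered: bool
    severe_visual_failure: bool


def capability_score(evidence, term_penalty=5.0, visual_cap=50.0):
    """Apply layers in order: source gate, render gate, deductions, visual cap."""
    # Source layer: hard failures zero the task before later checks run.
    if not evidence.source_valid or evidence.keyword_stuffing:
        return 0.0

    # Render layer: without a successful render there is nothing to score.
    if not evidence.rendered:
        return 0.0

    # Missing source-term hints are advisory: they lower the score only.
    score = max(100.0 - term_penalty * evidence.missing_source_terms, 0.0)

    # Visual layer: a severe failure caps the score regardless of source.
    if evidence.severe_visual_failure:
        score = min(score, visual_cap)
    return score
```

For example, under these placeholder numbers a rendered task missing two source-term hints scores 90, while the same task with a severe visual failure is capped at 50.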

Official vs local runs

Only container sandbox runs with network disabled count as official for public comparison. The run manifest records suite metadata, prompt hash, task hashes, provider route, configured OpenRouter slugs, Manim version, Docker image digest, git commit, scoring version, and per-task runtime.

Local subprocess renders are supported for debugging. They are marked official: false and do not appear on the public leaderboard.
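A reduced version of the manifest logic is sketched below. The use of SHA-256, the function names, and the sandbox labels "docker" and "subprocess" are assumptions for illustration; only the prompt hash, task hashes, and official flag from the text are modeled.

```python
import hashlib


def sha256_text(text: str) -> str:
    """Hex digest of a UTF-8 string (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_manifest(prompt, tasks, sandbox, network_enabled):
    """Build a partial run manifest from the prompt and task definitions."""
    return {
        "prompt_hash": sha256_text(prompt),
        "task_hashes": {
            task_id: sha256_text(body)
            for task_id, body in sorted(tasks.items())
        },
        # Only container runs with the network disabled count as official.
        "official": sandbox == "docker" and not network_enabled,
    }
```

Hashing the prompt and each task lets a reader confirm that two runs were scored against identical inputs without publishing the inputs alongside every report.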

What counts as passing

A task needs valid source, required labels, a successful render inside the time cap, generated media, and no severe visual sanity failure. It also needs a capability score of at least 70. Human review can still lower the published visual column if the animation is misleading or hard to follow.
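Expressed as a predicate, the pass rule combines the hard gates with the score threshold. The dictionary keys below are hypothetical names for the conditions listed above; the threshold of 70 comes from the text.

```python
PASS_THRESHOLD = 70  # minimum capability score stated above


def task_passes(result: dict) -> bool:
    """Return True only when every hard gate holds and the score clears 70."""
    hard_gates = (
        result["source_valid"],
        result["required_labels_present"],
        result["render_ok"],
        result["within_time_cap"],
        result["media_generated"],
        not result["severe_visual_failure"],
    )
    return all(hard_gates) and result["capability_score"] >= PASS_THRESHOLD
```

Human review is deliberately absent from this predicate: per the text, it adjusts the published visual column rather than the automated pass decision.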

Reading the leaderboard

Columns on the homepage table:

Score · V0.6 capability score across the suite
Pass · operational pass rate after hard gates
Coverage · share of assigned tasks with source available for scoring
Render · share of assigned tasks that produced media successfully
Failures · top failure buckets such as missing source, render crash, visual sanity, or labels
Cost / Time / Output tokens · efficiency metadata from the run report
Review · pending, complete, or partial human review state

Use the metric tabs to re-rank by cost, wall-clock time, or tokens. Selecting a focus model highlights one row in the chart.
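The rate columns and the re-ranking can be reproduced from per-task results along these lines. This is a sketch with hypothetical field names; it assumes every rate uses the number of assigned tasks as its denominator and that rows with an unavailable metric sort last.

```python
from collections import Counter


def summarize(tasks):
    """Compute Pass, Coverage, Render, and failure counts for one model."""
    assigned = len(tasks)
    if assigned == 0:
        return {"pass": 0.0, "coverage": 0.0, "render": 0.0, "failures": {}}
    failures = Counter(t["failure"] for t in tasks if t["failure"])
    return {
        "pass": sum(t["passed"] for t in tasks) / assigned,
        "coverage": sum(t["has_source"] for t in tasks) / assigned,
        "render": sum(t["rendered"] for t in tasks) / assigned,
        "failures": dict(failures),
    }


def rerank(rows, metric):
    """Sort rows ascending by a metric, placing unavailable values last."""
    return sorted(
        rows,
        key=lambda row: (row.get(metric) is None, row.get(metric) or 0.0),
    )
```

Keeping the rates separate from the capability score is what lets a model with a high score but low coverage stand out in the table.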

More detail is available in the V0.6 release note and in the scoring, sandbox, and V0.6 suite docs.