Task coverage
Six separate tasks cover layout, calculus graphs, linear transforms, measured geometry, probability visuals, and a longer math explanation.
ManimBench scores file-backed ManimCE submissions under fixed prompts and a shared sandbox. Rankings compare models on animation quality, not just whether Manim exits cleanly.
Each task is submitted as one Python file at outputs/<task_id>.py. The file uses from manim import * and defines a single MainScene class.
Official runs use Docker with the network off, a pinned ManimCE image, resource limits, 60 FPS output, and a 120-second cap per task.
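The sandbox settings above can be sketched as a docker run invocation built from Python. This is only an illustration: the image tag, the memory and CPU limits, and the mount layout are assumptions, not the benchmark's actual configuration. Only the network-off flag, the 60 FPS setting, the 120-second cap, and the outputs/<task_id>.py path come from the description.

```python
import subprocess

# Hypothetical values: the real image tag and resource limits are not stated here.
IMAGE = "manimbench/manimce:pinned"
TIME_CAP_SECONDS = 120
FPS = 60


def build_docker_argv(task_id: str, workdir: str) -> list[str]:
    """Build a `docker run` command for one task without executing it."""
    return [
        "docker", "run", "--rm",
        "--network", "none",        # network disabled
        "--memory", "2g",           # assumed memory limit
        "--cpus", "2",              # assumed CPU limit
        "-v", f"{workdir}:/work",   # assumed mount layout
        "-w", "/work",
        IMAGE,
        "manim", "render", "--fps", str(FPS),
        f"outputs/{task_id}.py", "MainScene",
    ]


def run_task(task_id: str, workdir: str) -> bool:
    """Return True if the render exits cleanly within the time cap."""
    try:
        result = subprocess.run(
            build_docker_argv(task_id, workdir),
            capture_output=True,
            timeout=TIME_CAP_SECONDS,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Note that the subprocess timeout stops the docker client process, so a real harness would likely also need to stop the container itself.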
AST parsing verifies required labels and constructs in executable code. Placeholders, keyword stuffing, and inactive scenes are penalized.
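A minimal sketch of this kind of check, using Python's standard ast module, might look like the following. The function name, the problem messages, and the rule that a label must appear as a string argument to a call inside construct are assumptions made for illustration; the benchmark's real rules may differ.

```python
import ast


def check_submission(source: str, required_labels: list[str]) -> list[str]:
    """Return a list of problems found in a submission's source code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []

    # Require a module-level `from manim import *`.
    has_star_import = any(
        isinstance(node, ast.ImportFrom)
        and node.module == "manim"
        and any(alias.name == "*" for alias in node.names)
        for node in tree.body
    )
    if not has_star_import:
        problems.append("missing `from manim import *`")

    # Require exactly one top-level class named MainScene.
    scenes = [
        node for node in tree.body
        if isinstance(node, ast.ClassDef) and node.name == "MainScene"
    ]
    if len(scenes) != 1:
        problems.append("expected exactly one MainScene class")
        return problems

    construct = next(
        (node for node in scenes[0].body
         if isinstance(node, ast.FunctionDef) and node.name == "construct"),
        None,
    )
    if construct is None:
        problems.append("MainScene has no construct method")
        return problems

    # Collect string literals passed to calls inside construct().
    # Comments and docstrings never reach this set, so keyword
    # stuffing in non-executable text does not satisfy a label.
    literals = set()
    for node in ast.walk(construct):
        if isinstance(node, ast.Call):
            values = list(node.args) + [kw.value for kw in node.keywords]
            for value in values:
                if isinstance(value, ast.Constant) and isinstance(value.value, str):
                    literals.add(value.value)

    for label in required_labels:
        if not any(label in text for text in literals):
            problems.append(f"required label not used in a call: {label!r}")
    return problems
```

A label that appears only in a comment is reported as missing, which is the behavior the keyword-stuffing penalty suggests.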
Sampled frames are checked for blank output, clutter, edge clipping, low contrast, and likely label overlap after render.
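Three of these frame checks (blank output, low contrast, and edge clipping) can be approximated with simple pixel statistics. The sketch below assumes a frame is a 2D list of grayscale values from 0 to 255; the thresholds and flag names are invented for illustration and are not the benchmark's actual values. Clutter and label overlap need more than pixel ranges and are left out.

```python
from collections import Counter


def frame_flags(frame, blank_range=8, contrast_range=40, margin=2, bg_tolerance=8):
    """Return a set of warning flags for one grayscale frame.

    frame: 2D list of ints in 0..255 (rows of pixels).
    All thresholds are illustrative assumptions.
    """
    pixels = [p for row in frame for p in row]
    spread = max(pixels) - min(pixels)

    # Nearly uniform intensity: nothing was drawn.
    if spread <= blank_range:
        return {"blank"}

    flags = set()
    if spread < contrast_range:
        flags.add("low_contrast")

    # Treat the most common value as the background, then look for
    # non-background pixels inside the outer margin of the frame.
    background = Counter(pixels).most_common(1)[0][0]
    height, width = len(frame), len(frame[0])
    for y, row in enumerate(frame):
        for x, p in enumerate(row):
            on_border = (
                y < margin or y >= height - margin
                or x < margin or x >= width - margin
            )
            if on_border and abs(p - background) > bg_tolerance:
                flags.add("edge_clipping")
    return flags
```

A real pipeline would run this over several sampled frames per render and aggregate the flags.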
Reports record generation time, input and output tokens, total tokens, and estimated USD cost from the model registry.
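As a rough sketch, the cost estimate could be derived from token counts and per-model prices like this. The registry layout, the model name, and the prices are hypothetical placeholders; only the reported fields (generation time, input, output, and total tokens, estimated USD cost) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens (assumed pricing unit)."""
    input_per_million: float
    output_per_million: float


# Hypothetical registry entry; real model names and prices are not given here.
REGISTRY = {"example-model": Pricing(input_per_million=3.00, output_per_million=15.00)}


def usage_report(model: str, input_tokens: int, output_tokens: int,
                 generation_seconds: float) -> dict:
    """Build the usage record for one generation."""
    price = REGISTRY[model]
    cost = (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000
    return {
        "model": model,
        "generation_seconds": generation_seconds,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": input_tokens + output_tokens,
        "estimated_usd": round(cost, 6),
    }
```

For example, 2,000 input tokens and 1,000 output tokens at the placeholder prices give (2000 × 3 + 1000 × 15) / 1,000,000 = 0.021 USD.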
The default public suite is benchmarks/v0.4/suite.yaml. Every model gets the same master prompt plus task-specific requirements. Each task targets a different layout or math failure mode so one strong scene cannot hide weak ones.
basic_manim_layout · titles, grouping, alignment, pacing
calculus_derivative_graph · function plot, point, tangent line, labels
linear_algebra_transformation · matrix acting on a grid or basis vectors
geometry_measurement_diagram · lengths, angles, shaded region
probability_distribution · distribution visual with readable parameters
advanced_math_explanation · multi-step Fourier heat equation intuition
Older suites remain in the repository for history. Public rankings on this site use V0.4 only.
Scores are built in layers. A failure at an early layer can cap or zero out the task before later checks run.
Severe visual failures can cap the automated score even when required strings appear in source. The leaderboard shows review status and adjusted visual scores when review files exist.
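The layering and capping described above can be illustrated with a small function. The equal weighting, the 0 to 100 scale, and the cap value of 40 are assumptions chosen for the example; the actual weights and caps are not stated here.

```python
def task_score(source_valid: bool, rendered: bool,
               static_score: float, visual_score: float,
               severe_visual_failure: bool,
               visual_cap: float = 40.0) -> float:
    """Combine layered checks into one task score on a 0-100 scale.

    Early layers gate later ones: invalid source or a failed render
    zeroes the task before any later score is considered.
    """
    if not source_valid or not rendered:
        return 0.0

    # Assumed equal weighting of static (AST) and visual checks.
    score = 0.5 * static_score + 0.5 * visual_score

    # A severe visual failure caps the score even if the source
    # contains every required string.
    if severe_visual_failure:
        score = min(score, visual_cap)
    return score
```

With these placeholder numbers, a task scoring 90 on static checks and 80 on visual checks gets 85, but drops to 40 if a severe visual failure is flagged.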
Only container sandbox runs with network disabled count as official for public comparison. The run manifest records Manim version, container digest, resource limits, and per-task runtime.
Local subprocess renders are supported for debugging. They are marked official: false and do not appear on the public leaderboard.
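The official-run rule can be expressed as a filter over run manifests. The manifest field names used below (sandbox, network_disabled, official) are hypothetical; the real manifest schema is not shown on this page.

```python
def is_official(manifest: dict) -> bool:
    """A run counts as official only for a container sandbox with network off."""
    return (
        manifest.get("sandbox") == "container"
        and manifest.get("network_disabled") is True
    )


def leaderboard_runs(manifests: list[dict]) -> list[dict]:
    """Tag every manifest with its official flag and keep only official runs."""
    tagged = [{**m, "official": is_official(m)} for m in manifests]
    return [m for m in tagged if m["official"]]
```

Local subprocess runs pass through the same tagging step and simply end up with official set to false.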
A task generally needs valid source, a successful render inside the time cap, and no severe visual sanity failure. Human review can still lower the published visual column if the animation is misleading or hard to follow.
Columns on the homepage table:
Use the metric tabs to re-rank by cost, wall-clock time, or tokens. The Focus model control highlights one row in the chart.