Six tasks, six files, one MainScene per file. V0.4 replaced the earlier single-video showcase.
A single polished animation could hide weak results on other kinds of tasks. V0.4 splits the public suite into six submissions, each aimed at a different failure mode.
Task design
Each task targets roughly two minutes of render time at 60 FPS. Tasks cover layout, graphs, linear transforms, measured geometry, probability visuals, and a longer math explanation. Most scoring is automated; human review handles edge cases.
Shared output contract:
- One Python file per task: outputs/<task_id>.py
- from manim import *
- One primary scene class named MainScene
- 60 FPS, 120 second cap per task
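The static parts of the contract can be checked without rendering anything. The sketch below is one way such a check might look, not the suite's actual validator: check_contract and the constants are hypothetical names, and it only assumes the import line and the scene class sit at the top level of the file.

```python
import ast

REQUIRED_SCENE = "MainScene"      # scene class name from the contract
FPS = 60                          # frame rate from the contract
MAX_SECONDS = 120                 # per-task cap from the contract
MAX_FRAMES = FPS * MAX_SECONDS    # frame budget at the cap


def check_contract(source):
    """Return a list of contract violations found in a submission (hypothetical helper)."""
    problems = []
    tree = ast.parse(source)

    # Look for a top-level `from manim import *`.
    has_star_import = any(
        isinstance(node, ast.ImportFrom)
        and node.module == "manim"
        and any(alias.name == "*" for alias in node.names)
        for node in tree.body
    )
    if not has_star_import:
        problems.append("missing `from manim import *`")

    # Look for a top-level class named MainScene.
    has_scene = any(
        isinstance(node, ast.ClassDef) and node.name == REQUIRED_SCENE
        for node in tree.body
    )
    if not has_scene:
        problems.append(f"missing class {REQUIRED_SCENE}")

    return problems
```

An empty list means the file passes these two checks. The frame rate and duration cap still have to be enforced at render time, since they are not visible in the source alone.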
Models get a master prompt plus task-specific requirements. Required labels are checked in source and expected on screen where applicable.
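A source-level label check could be as simple as scanning string literals. This is a minimal sketch under that assumption; missing_labels is a hypothetical helper, and the real checker may match labels differently (for example, after normalizing LaTeX).

```python
import ast


def missing_labels(source, required_labels):
    """Return the required labels that appear in no string literal of the source."""
    tree = ast.parse(source)
    # Collect every string constant in the file, e.g. arguments to Text(...) or MathTex(...).
    literals = [
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    ]
    return [
        label
        for label in required_labels
        if not any(label in literal for literal in literals)
    ]
```

A label found in source is not guaranteed to be visible in the render, which is why the on-screen expectation is a separate check.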
The six tasks
Basic Manim layout (basic_manim_layout)
Titles, grouped objects, alignment, and pacing on a simple scene.
Calculus derivative graph (calculus_derivative_graph)
Plot a function, mark a point, draw a tangent, label axes and equations.
Linear algebra transformation (linear_algebra_transformation)
Animate a matrix acting on a grid or basis vectors with readable labels.
Geometry measurement diagram (geometry_measurement_diagram)
Lengths, angles, and a shaded region in one annotated diagram.
Probability distribution (probability_distribution)
Bars or curves with parameter labels; a common failure mode is overlapping text.
Advanced math explanation (advanced_math_explanation)
Multi-step intuition around the Fourier heat equation across a longer scene.
Out of scope for V0.4
No 3D scenes, external asset pipelines, or open-ended prompts. That keeps official runs reproducible in a network-disabled container.
Older suites (v1 44-task set, v0.3 showcase) remain in the repo for history. Public rankings use benchmarks/v0.4/suite.yaml.
Interpreting suite scores
The suite score aggregates per-task results. One failed task on an otherwise strong run still matters for production use.
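To see why one failed task matters, it helps to report the weakest task next to the aggregate. The sketch below is an illustration only: the official aggregation is described in the scoring docs, and the plain mean, the 0 to 1 score range, the pass_threshold value, and the summarize helper are all assumptions made here.

```python
from statistics import mean


def summarize(task_scores, pass_threshold=0.5):
    """Summarize per-task scores; the mean and the threshold are illustrative assumptions."""
    values = list(task_scores.values())
    # Tasks scoring below the (hypothetical) threshold are listed by name.
    failed = sorted(task for task, score in task_scores.items() if score < pass_threshold)
    return {
        "mean": mean(values),
        "worst": min(values),
        "failed": failed,
    }
```

With five perfect tasks and one zero, the mean still looks strong while "worst" and "failed" expose the broken task.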
V0.4 results will stay archived when V0.5 ships a new task set. Task semantics are not edited in place after a suite goes public.
See public suite docs and how scoring works.