Six tasks, six files, one MainScene per file. V0.4 replaced the earlier single-video showcase.

A single polished animation could hide weak performance on other kinds of task. V0.4 splits the public suite into six submissions, each aimed at a different failure mode.

Task design

Each task is budgeted at roughly two minutes of render time at 60 FPS, in line with the 120-second cap in the output contract. The tasks cover layout, graphs, linear transforms, measured geometry, probability visuals, and a longer math explanation. Most scoring is automated; human review handles edge cases.

Shared output contract:

  • One Python file per task: outputs/<task_id>.py
  • from manim import *
  • One primary scene class named MainScene
  • 60 FPS, 120-second cap per task
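
The source-level parts of this contract can be checked without rendering anything. The following is a minimal sketch, not the suite's actual checker: the function name check_contract and the SAMPLE file are hypothetical, and it only inspects the import line and the scene class name using Python's standard ast module.

```python
import ast

REQUIRED_MODULE = "manim"
REQUIRED_SCENE = "MainScene"


def check_contract(source: str) -> list[str]:
    """Return the contract violations found in one submission file.

    Hypothetical helper: only the import and class-name rules are checked.
    """
    problems = []
    tree = ast.parse(source)

    # Rule: the file contains a module-level `from manim import *`.
    has_star_import = any(
        isinstance(node, ast.ImportFrom)
        and node.module == REQUIRED_MODULE
        and any(alias.name == "*" for alias in node.names)
        for node in tree.body
    )
    if not has_star_import:
        problems.append("missing `from manim import *`")

    # Rule: a top-level class named MainScene exists.
    class_names = [n.name for n in tree.body if isinstance(n, ast.ClassDef)]
    if REQUIRED_SCENE not in class_names:
        problems.append("missing class MainScene")

    return problems


# A made-up submission that satisfies both rules.
SAMPLE = '''
from manim import *

class MainScene(Scene):
    def construct(self):
        title = Text("Basic layout")
        self.play(Write(title))
        self.wait(1)
'''

print(check_contract(SAMPLE))
```

Parsing with ast rather than importing the file means the check runs even where Manim is not installed, and it never executes submitted code. The frame rate and time cap are render-time settings and are outside what this sketch covers.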

Models get a master prompt plus task-specific requirements. Required labels are checked in source and expected on screen where applicable.
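
As an illustration of what a source-level label check might look like, here is a small sketch. It assumes that "checked in source" means each required label appears inside some string literal of the submission; the function name missing_labels and the example labels are hypothetical, and the real checker may work differently.

```python
import ast


def missing_labels(source: str, required: list[str]) -> list[str]:
    """Return the required labels that appear in no string literal of the file.

    Assumption: a label counts as present if it is a substring of any
    string constant in the source. On-screen visibility is not checked here.
    """
    tree = ast.parse(source)
    literals = [
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    ]
    return [
        label
        for label in required
        if not any(label in literal for literal in literals)
    ]


# A made-up submission fragment with two of three labels present.
SAMPLE = '''
from manim import *

class MainScene(Scene):
    def construct(self):
        caption = Text("Tangent line")
        formula = MathTex("f(x) = x^2")
        self.add(caption, formula)
'''

print(missing_labels(SAMPLE, ["Tangent line", "f(x)", "Slope"]))
```

A check like this only shows that the label text exists in the file. Whether the label is actually visible and readable in the rendered frames needs a separate check on the output.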

The six tasks

Basic Manim layout (basic_manim_layout)

Titles, grouped objects, alignment, and pacing on a simple scene.

Calculus derivative graph (calculus_derivative_graph)

Plot a function, mark a point, draw a tangent, label axes and equations.

Linear algebra transformation (linear_algebra_transformation)

Animate a matrix acting on a grid or basis vectors with readable labels.

Geometry measurement diagram (geometry_measurement_diagram)

Lengths, angles, and a shaded region in one annotated diagram.

Probability distribution (probability_distribution)

Bars or curves with parameter labels. A common failure mode is overlapping text.

Advanced math explanation (advanced_math_explanation)

Multi-step intuition for Fourier's heat equation, built up across a longer scene.

Out of scope for V0.4

No 3D scenes, external asset pipelines, or open-ended prompts. These limits keep official runs reproducible in a network-disabled container.

Older suites (the 44-task v1 set and the v0.3 showcase) remain in the repo for history. Public rankings use benchmarks/v0.4/suite.yaml.

Interpreting suite scores

The suite score aggregates per-task results. One failed task on an otherwise strong run still matters for production use.
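
To see why a single failure deserves attention, consider a sketch with made-up numbers. It assumes, purely for illustration, that per-task scores are on a 0 to 100 scale and that the suite score is their plain mean; the actual aggregation is defined by the scoring docs, not by this example.

```python
# Illustrative per-task scores on an assumed 0-100 scale (not real results).
scores = {
    "basic_manim_layout": 90,
    "calculus_derivative_graph": 90,
    "linear_algebra_transformation": 90,
    "geometry_measurement_diagram": 90,
    "probability_distribution": 90,
    "advanced_math_explanation": 0,  # one task fails outright
}

# Assumed aggregation: a plain mean over the six tasks.
suite_score = sum(scores.values()) / len(scores)

# The weakest task, which the mean alone does not reveal.
worst_task = min(scores, key=scores.get)

print(f"suite score: {suite_score}")
print(f"weakest task: {worst_task} ({scores[worst_task]})")
```

Under these assumptions the aggregate still looks respectable while one task produced nothing usable, which is why the per-task breakdown is worth reading alongside the suite score.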

V0.4 results will stay archived when V0.5 ships a new task set. Task semantics are not edited in place after a suite goes public.

See public suite docs and how scoring works.