Official results come from container runs with network off, resource limits, and manifests that record the environment.
An official row means the render used a pinned Docker image, no network, documented CPU and memory limits, and a per-task timeout. Only the submitted Python files count as input.
Why use a sandbox
Without isolation, a model could download fonts, images, or clips during scoring. Network-disabled containers block that for official comparisons.
Containers also limit damage from hung renders or bad system calls. Local subprocess runs remain available for development and are marked non-official in metadata.
Container policy
Each official task render uses:
- No network during evaluation
- A pinned ManimCE image with digest in the manifest
- Process and memory limits
- 120-second timeout per task (V0.4)
- 60 FPS output
If Manim or system libraries change, the digest changes. Suite versioning ties results to the environment they ran in.
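The policy above maps fairly directly onto standard `docker run` flags. The following is a minimal sketch, not the runner's actual code: the function name `build_docker_argv`, the image name, and the default CPU, memory, and process limits are all assumptions chosen for illustration. Only the disabled network, the digest pin, and the 120-second timeout come from the policy itself.

```python
from typing import List, Sequence

TIMEOUT_SECONDS = 120  # per-task timeout from the policy (V0.4)


def build_docker_argv(
    image: str,
    digest: str,
    render_cmd: Sequence[str],
    cpus: str = "2",        # assumed limit, not stated in the policy
    memory: str = "4g",     # assumed limit, not stated in the policy
    pids_limit: int = 256,  # assumed limit, not stated in the policy
) -> List[str]:
    """Build a `docker run` argv with networking off and resource limits set."""
    if not digest.startswith("sha256:"):
        raise ValueError("expected a digest of the form sha256:<hex>")
    return [
        "docker", "run", "--rm",
        "--network", "none",               # no network during evaluation
        "--cpus", cpus,                    # CPU limit
        "--memory", memory,                # memory limit
        "--pids-limit", str(pids_limit),   # process limit
        f"{image}@{digest}",               # pin by digest, not by tag
        *render_cmd,
    ]
```

The resulting list could be passed to `subprocess.run(argv, timeout=TIMEOUT_SECONDS)`, which raises `subprocess.TimeoutExpired` when the limit is hit. That timeout stops the Docker client process, so a real runner would probably also need to stop the container itself.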
Visual sanity checks
Exit code zero does not mean the video is usable. Common problems: empty frames after a title, stacked equations, clipped axis labels, gray-on-gray text.
After render, the runner samples frames and runs simple layout checks. This is triage, not a full quality review.
Blank or flat frames
A blank or flat frame often means the scene stopped after a title Write and never drew the main content.
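One simple way to detect a flat frame is to measure how much its pixel values vary. The sketch below is an illustration under assumptions, not the runner's check: frames are taken to be nested lists of grayscale values, and `is_flat`, `flat_fraction`, and the `min_stdev` threshold are hypothetical names and values.

```python
from statistics import pstdev
from typing import List

Frame = List[List[int]]  # grayscale rows, values 0-255 (assumed representation)


def is_flat(frame: Frame, min_stdev: float = 2.0) -> bool:
    """Return True when pixel values barely vary across the frame."""
    pixels = [value for row in frame for value in row]
    return pstdev(pixels) < min_stdev


def flat_fraction(frames: List[Frame], min_stdev: float = 2.0) -> float:
    """Fraction of sampled frames that look blank or flat."""
    if not frames:
        return 0.0
    flat = sum(1 for frame in frames if is_flat(frame, min_stdev))
    return flat / len(frames)
```

A high flat fraction in the later samples would match the pattern described here, where a title appears and nothing follows.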
Clutter and overlap
Dense regions mark a scene for possible human review even when its frames are not empty.
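Density can be approximated by splitting a frame into a grid and measuring how much of each cell is covered by non-background pixels. This is a sketch under assumptions: the frame representation, the name `dense_cells`, the grid size, and the `max_fill` threshold are hypothetical. Fill ratio is also only a proxy, since it cannot tell overlapping objects from one large solid shape.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale rows, values 0-255 (assumed representation)


def dense_cells(
    frame: Frame,
    background: int = 0,
    grid: int = 4,
    max_fill: float = 0.6,  # assumed threshold for "too dense"
) -> List[Tuple[int, int]]:
    """Return (row, col) grid cells whose foreground fill exceeds max_fill."""
    height, width = len(frame), len(frame[0])
    flagged = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            if not cell:
                continue  # frame smaller than the grid in this direction
            fill = sum(1 for value in cell if value != background) / len(cell)
            if fill > max_fill:
                flagged.append((gy, gx))
    return flagged
```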
Edge clipping
Text or shapes cut off at the border usually mean bad scaling or objects placed outside the frame.
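A cheap signal for clipping is foreground content touching the outermost pixels of the frame. The sketch below assumes grayscale frames as nested lists and a known background value. The name `touches_border` and the `margin` parameter are hypothetical, and the actual check may work differently.

```python
from typing import List

Frame = List[List[int]]  # grayscale rows, values 0-255 (assumed representation)


def touches_border(frame: Frame, background: int = 0, margin: int = 1) -> bool:
    """Return True when any non-background pixel lies within `margin` of an edge."""
    height, width = len(frame), len(frame[0])
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            on_edge = (
                y < margin or y >= height - margin
                or x < margin or x >= width - margin
            )
            if on_edge and value != background:
                return True
    return False
```

Scenes with intentional full-bleed backgrounds or borders would trigger this, which is one reason to treat such a check as triage only.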
Low contrast
Text can pass source-level checks and still be hard to read on screen, for example gray text on a gray background.
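One established way to quantify readability is the WCAG contrast ratio between a text color and its background. The sketch below implements that formula as an illustration. Whether the runner uses it is an assumption, and the function names are hypothetical.

```python
from typing import Tuple

RGB = Tuple[int, int, int]  # 8-bit sRGB channels


def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: RGB) -> float:
    """WCAG relative luminance: 0.0 for black, 1.0 for white."""
    r, g, b = (_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: RGB, background: RGB) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)
```

WCAG AA asks for at least 4.5:1 for normal-size text. Two mid-grays such as (128, 128, 128) on (110, 110, 110) fall far below that, even though the source string itself is valid.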
Automation limits
Frame checks miss wrong math or bad animations that still look fine to heuristics. Severe automated failures cap scores; ambiguous cases go to human review.
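The cap-or-escalate rule can be pictured as a small decision function. This is a hypothetical sketch: the `Severity` levels, the name `apply_triage`, and the `severe_cap` value are assumptions for illustration, not the suite's scoring rules.

```python
from enum import Enum
from typing import Tuple


class Severity(Enum):
    NONE = "none"
    AMBIGUOUS = "ambiguous"
    SEVERE = "severe"


def apply_triage(
    raw_score: float,
    severity: Severity,
    severe_cap: float = 0.3,  # assumed cap value, for illustration only
) -> Tuple[float, bool]:
    """Return (score, needs_human_review) after automated triage."""
    if severity is Severity.SEVERE:
        # A severe automated failure caps the score.
        return min(raw_score, severe_cap), False
    if severity is Severity.AMBIGUOUS:
        # Ambiguous cases keep their score and go to a human.
        return raw_score, True
    return raw_score, False
```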
Reproducing a run
Manifests and docs in the repository list the flags and image digest. The same suite, source, backend, and digest should reproduce the render results.
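One way to make that comparison concrete is to reduce the four inputs to a single fingerprint and compare fingerprints between runs. The sketch below is an assumption about how this could be done, not the manifest format the repository uses. The field names and `run_fingerprint` are hypothetical.

```python
import hashlib
import json


def run_fingerprint(
    suite: str,
    source_sha256: str,
    backend: str,
    image_digest: str,
) -> str:
    """Hash the inputs that should determine a render result."""
    payload = json.dumps(
        {
            "suite": suite,
            "source_sha256": source_sha256,
            "backend": backend,
            "image_digest": image_digest,
        },
        sort_keys=True,          # stable key order
        separators=(",", ":"),   # stable whitespace
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs with equal fingerprints used the same declared inputs. Differing fingerprints point to a change in at least one of them, such as a new image digest.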
See also the pages on how scoring works and on file-backed submission.