Sandbox execution and visual sanity

Official results come from container runs with network off, resource limits, and manifests that record the environment.

A result row is official only if its render used a pinned Docker image, no network, documented CPU and memory limits, and a per-task timeout. Only the submitted Python files count as input.

Why use a sandbox

Without isolation, a model could download fonts, images, or clips during scoring. Network-disabled containers block that for official comparisons.

Containers also limit damage from hung renders or bad system calls. Local subprocess runs remain available for development and are marked non-official in metadata.

Container policy

Each official task render uses:

No network during evaluation
A pinned ManimCE image with digest in the manifest
Process and memory limits
120 second timeout per task
60 FPS output

If Manim or system libraries change, the digest changes. Suite versioning ties results to the environment they ran in.
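The policy above maps fairly directly onto standard docker run flags. The following is a minimal sketch, not the runner's actual code: the helper names, the mount layout, and the CPU, memory, and process limits are assumptions, and the image digest is left as a placeholder. Only the 120 second timeout and 60 FPS come from the policy.

```python
import subprocess

TIMEOUT_SECONDS = 120  # per-task timeout from the policy
FPS = 60               # output frame rate from the policy


def build_render_command(image_ref, workdir, scene_file, scene_name,
                         cpus="2", memory="4g", pids_limit="256"):
    """Return a docker run argv for one sandboxed render.

    cpus, memory, and pids_limit are placeholder values (assumptions),
    not the benchmark's documented limits.
    """
    return [
        "docker", "run", "--rm",
        "--network", "none",         # no network during evaluation
        "--cpus", cpus,              # CPU limit
        "--memory", memory,          # memory limit
        "--pids-limit", pids_limit,  # process limit
        "-v", f"{workdir}:/work",    # mount the submitted files (assumed layout)
        "-w", "/work",
        image_ref,                   # pinned as name@sha256:<digest>
        "manim", "render", "--fps", str(FPS), scene_file, scene_name,
    ]


def run_render(cmd):
    """Run the render; raises subprocess.TimeoutExpired after the timeout."""
    return subprocess.run(cmd, capture_output=True, timeout=TIMEOUT_SECONDS)
```

Note that the timeout in this sketch stops the docker client process; a real runner would likely also need to stop the container itself.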

Visual sanity checks

Exit code zero does not mean the video is usable. Common problems: empty frames after a title, stacked equations, clipped axis labels, gray-on-gray text.

After render, the runner samples frames and runs simple layout checks. This is triage, not a full quality review.
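How the runner picks frames is not specified here. One simple option is evenly spaced sampling that always includes the first and last frame, sketched below with a hypothetical helper name.

```python
def sample_frame_indices(total_frames, samples):
    """Return up to `samples` evenly spaced, distinct frame indices."""
    if total_frames <= 0 or samples <= 0:
        return []
    if samples >= total_frames:
        return list(range(total_frames))  # fewer frames than samples: take all
    if samples == 1:
        return [total_frames - 1]         # a single sample: use the final frame
    step = (total_frames - 1) / (samples - 1)
    return [round(i * step) for i in range(samples)]
```

Including the last frame matters for triage, since a scene that ends on an empty screen is one of the failures listed above.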

Blank or flat frames

A blank or flat frame often means the scene stopped after a title Write and never drew the main content.
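A flat frame can be approximated as one whose pixel intensities barely vary. The sketch below assumes frames are already decoded into 2D lists of grayscale values; the function names and the threshold are illustrative, not the runner's.

```python
from statistics import pstdev


def is_flat_frame(gray_frame, min_stddev=2.0):
    """True when intensities barely vary, i.e. the frame is nearly one color."""
    pixels = [value for row in gray_frame for value in row]
    return pstdev(pixels) < min_stddev


def trailing_flat_run(frames, min_stddev=2.0):
    """Count consecutive flat frames at the end of the sampled sequence."""
    count = 0
    for frame in reversed(frames):
        if not is_flat_frame(frame, min_stddev):
            break
        count += 1
    return count
```

A long trailing run of flat frames after an early non-flat one matches the title-then-nothing pattern described above.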

Clutter and overlap

Densely packed regions, such as stacked equations, flag a scene for possible human review even when its frames are not empty.
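One rough way to find dense regions is to split the frame into a grid and measure how much of each cell is covered by non-background pixels. This is a sketch under assumed parameters (grayscale 2D lists, a known background value, arbitrary cell size and fill threshold), not the runner's actual heuristic.

```python
def dense_cells(gray_frame, background=0, tolerance=10, cell=4, max_fill=0.6):
    """Return (row, col) grid cells whose non-background share exceeds max_fill."""
    height, width = len(gray_frame), len(gray_frame[0])
    flagged = []
    for top in range(0, height, cell):
        for left in range(0, width, cell):
            # Collect the pixels of this cell, clamped at the frame edge.
            block = [
                gray_frame[y][x]
                for y in range(top, min(top + cell, height))
                for x in range(left, min(left + cell, width))
            ]
            ink = sum(1 for v in block if abs(v - background) > tolerance)
            if ink / len(block) > max_fill:
                flagged.append((top // cell, left // cell))
    return flagged
```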

Edge clipping

Text or shapes cut off at the border usually mean bad scaling or objects placed outside the frame.
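A simple clipping heuristic checks whether any non-background pixel lies within a small margin of the frame border. The sketch below assumes grayscale 2D lists and a known background value; the name `clipped_edges` and its defaults are hypothetical.

```python
def clipped_edges(gray_frame, background=0, tolerance=10, margin=1):
    """Return the set of edges that have non-background pixels within margin."""
    height, width = len(gray_frame), len(gray_frame[0])
    edges = set()
    for y, row in enumerate(gray_frame):
        for x, value in enumerate(row):
            if abs(value - background) <= tolerance:
                continue  # background pixel, nothing drawn here
            if y < margin:
                edges.add("top")
            if y >= height - margin:
                edges.add("bottom")
            if x < margin:
                edges.add("left")
            if x >= width - margin:
                edges.add("right")
    return edges
```

Content that intentionally reaches the border, such as a full-width background shape, would also trigger this check, which is one reason it serves as triage only.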

Low contrast

Text strings can pass source checks yet remain hard to read on screen, for example gray text on a gray background.
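The source does not say which contrast measure the runner uses. One common choice is the WCAG 2 contrast ratio, sketched here for a foreground and background color given as 8-bit sRGB triples.

```python
def _linear(channel):
    """Convert an 8-bit sRGB channel to linear light."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color, from 0 (black) to 1 (white)."""
    r, g, b = (_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """Contrast ratio from 1 (identical colors) up to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)
```

WCAG AA asks for at least 4.5:1 for normal-size text; gray-on-gray pairs such as (120, 120, 120) on (100, 100, 100) fall far below that.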

Automation limits

Frame checks miss wrong math or bad animations that still look fine to heuristics. Severe automated failures cap scores; ambiguous cases go to human review.
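The capping rule can be sketched as follows. The cap value, the severity labels, and the function name are all assumptions made for illustration; the real values live in the scoring documentation.

```python
SEVERE_CAP = 0.3  # hypothetical cap, not the benchmark's actual value


def apply_visual_findings(raw_score, findings):
    """Return (final_score, needs_human_review).

    findings is a list of (check_name, severity) tuples, where severity is
    "severe", "ambiguous", or "info" (assumed labels).
    """
    severe = any(severity == "severe" for _, severity in findings)
    ambiguous = any(severity == "ambiguous" for _, severity in findings)
    # A severe failure caps the score; it never raises a lower one.
    score = min(raw_score, SEVERE_CAP) if severe else raw_score
    return score, ambiguous
```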

Reproducing a run

Manifests and docs in the repository list the flags and image digest. A run with the same suite version, source files, backend, and image digest should reproduce the render results.
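Before comparing renders, it helps to confirm that the local setup matches the recorded manifest. The sketch below assumes a manifest is a flat dictionary; the key names and helper functions are hypothetical, not the repository's actual schema.

```python
import hashlib

# Assumed field names; check the real manifest for the actual keys.
REPRO_KEYS = ("suite_version", "source_sha256", "backend", "image_digest")


def source_fingerprint(files):
    """Hash submitted files ({name: bytes}) in sorted name order."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[name])
        digest.update(b"\0")
    return digest.hexdigest()


def manifest_mismatches(recorded, local):
    """Return {key: (recorded_value, local_value)} for each differing field."""
    return {
        key: (recorded.get(key), local.get(key))
        for key in REPRO_KEYS
        if recorded.get(key) != local.get(key)
    }
```

An empty result from `manifest_mismatches` means the four inputs agree, so any difference in the render points to something outside them.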

See also: how scoring works, file-backed submission.