V0.5 turns ManimBench from a file-backed runner into a full benchmark engine: generation, resume, render, report, and publish are now part of one reproducible path.

The core contract stays deliberately plain. Every model still produces one Python file per task, and every file still defines MainScene. The difference is that the engine can now create those files through provider adapters, track the calls, skip finished work, render in parallel, and publish a complete bundle.
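As a rough illustration of that contract, the sketch below checks whether a generated file defines a top-level MainScene class. The helper name defines_main_scene and the sample sources are hypothetical and are not part of ManimBench. It parses the file with Python's ast module, so Manim does not need to be installed to run the check.

```python
import ast


def defines_main_scene(source: str) -> bool:
    """Return True if the module source defines a top-level class named MainScene."""
    tree = ast.parse(source)  # parses only; the generated code is never executed
    return any(
        isinstance(node, ast.ClassDef) and node.name == "MainScene"
        for node in tree.body
    )


# Hypothetical example of a file that satisfies the contract.
GOOD = '''
from manim import Scene, Circle, Create

class MainScene(Scene):
    def construct(self):
        self.play(Create(Circle()))
'''

# Hypothetical example of a file that does not: the scene class has another name.
BAD = '''
from manim import Scene

class Intro(Scene):
    def construct(self):
        pass
'''

print(defines_main_scene(GOOD))
print(defines_main_scene(BAD))
```

A static check like this only confirms the class exists; whether the scene actually renders is still decided by the render step.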

New default suite

The active public suite is now benchmarks/v0.5/suite.yaml. It has six task IDs:

  • coordinate_system_animation
  • derivative_motion_story
  • matrix_transformation_grid
  • geometric_area_proof
  • probability_distribution_simulation
  • fourier_series_decomposition

V0.4 remains runnable by path for historical comparison. Publishing V0.5 results does not change V0.4 semantics, so existing V0.4 results keep their original meaning.

Generation routes

OpenRouter is the default route for public API models with official OpenRouter slugs. That gives the benchmark a single gateway for most public model runs and keeps provider metadata in one registry.

Composer 2.5 is handled differently. OpenRouter does not currently publish a Cursor Composer slug, so the engine routes composer-2-5 through the Cursor Agent CLI. That keeps the label honest: a Composer result is generated by Composer, not by a nearby OpenRouter model.

cursor-agent login
manimbench generate --model composer-2-5 --provider cursor

Do not pay twice

Generation now records checkpoint state under .manimbench/runs/<run_id>/state.json. If an output file is complete and still matches the prompt hash, a rerun skips it unless --force is passed.
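The skip rule can be sketched as follows. This is a minimal illustration, not the engine's code: the state.json layout used here (a "tasks" mapping with "status" and "prompt_hash" keys), the use of SHA-256, and the function names are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def prompt_hash(prompt: str) -> str:
    """Hash the prompt text (SHA-256 is an assumption; the real hash may differ)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def should_skip(state_path: Path, task_id: str, prompt: str,
                output_path: Path, force: bool = False) -> bool:
    """Skip only when the output exists, is marked complete, and the prompt is unchanged."""
    if force or not state_path.exists() or not output_path.exists():
        return False
    state = json.loads(state_path.read_text(encoding="utf-8"))
    entry = state.get("tasks", {}).get(task_id)  # hypothetical schema
    if entry is None:
        return False
    return (entry.get("status") == "complete"
            and entry.get("prompt_hash") == prompt_hash(prompt))


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    output = root / "coordinate_system_animation.py"
    output.write_text("class MainScene: ...\n", encoding="utf-8")
    state_file = root / "state.json"
    state_file.write_text(json.dumps({
        "tasks": {
            "coordinate_system_animation": {
                "status": "complete",
                "prompt_hash": prompt_hash("draw the axes"),
            }
        }
    }), encoding="utf-8")

    print(should_skip(state_file, "coordinate_system_animation", "draw the axes", output))
    print(should_skip(state_file, "coordinate_system_animation", "draw the axes", output, force=True))
```

Comparing the prompt hash means that editing a task prompt invalidates the checkpoint without any manual cleanup, while an unchanged prompt costs no new API call.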

Each call also gets a JSONL entry in generation.log with provider route, model ID, task ID, request ID, elapsed time, token counts, cost when available, and status. The same data rolls into outputs/<model>/usage.json.
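To make the log-then-roll-up flow concrete, here is a small sketch that appends JSONL entries and aggregates them per model. The field names (model_id, prompt_tokens, completion_tokens, cost_usd, and so on) and both function names are assumptions for illustration; the actual schema of generation.log and usage.json may differ.

```python
import json
import tempfile
from pathlib import Path


def append_log_entry(log_path: Path, entry: dict) -> None:
    """Append one call record as a single JSON line."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")


def summarize_usage(log_path: Path) -> dict:
    """Aggregate call counts, tokens, and cost per model from a JSONL log."""
    totals: dict = {}
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            entry = json.loads(line)
            bucket = totals.setdefault(entry["model_id"], {
                "calls": 0, "prompt_tokens": 0,
                "completion_tokens": 0, "cost_usd": 0.0,
            })
            bucket["calls"] += 1
            bucket["prompt_tokens"] += entry.get("prompt_tokens", 0)
            bucket["completion_tokens"] += entry.get("completion_tokens", 0)
            # Cost is recorded only "when available", so treat None as zero.
            bucket["cost_usd"] += entry.get("cost_usd") or 0.0
    return totals


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "generation.log"
    append_log_entry(log, {
        "provider_route": "cursor", "model_id": "composer-2-5",
        "task_id": "geometric_area_proof", "request_id": "req-1",
        "elapsed_s": 12.5, "prompt_tokens": 100, "completion_tokens": 400,
        "cost_usd": None, "status": "ok",
    })
    append_log_entry(log, {
        "provider_route": "cursor", "model_id": "composer-2-5",
        "task_id": "derivative_motion_story", "request_id": "req-2",
        "elapsed_s": 9.0, "prompt_tokens": 120, "completion_tokens": 300,
        "cost_usd": 0.5, "status": "ok",
    })
    usage = summarize_usage(log)
    print(json.dumps(usage, indent=2))
```

One line per call keeps the log append-only, so an interrupted run leaves earlier entries intact.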

Render, manifest, publish

run-file-matrix now accepts --parallel N, so model-task renders can run with bounded concurrency. Official runs record the Docker image digest before publish.
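Bounded concurrency over the model-task matrix might look like the sketch below. It is an assumption-laden illustration rather than the run-file-matrix implementation: render_matrix, fake_render, and the result strings are hypothetical, and a thread pool stands in for whatever worker model the engine really uses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product


def render_matrix(models, tasks, render_one, parallel=2):
    """Render every (model, task) pair with at most `parallel` renders in flight."""
    results = {}
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {
            pool.submit(render_one, model, task): (model, task)
            for model, task in product(models, tasks)
        }
        for future in as_completed(futures):
            pair = futures[future]
            try:
                results[pair] = future.result()
            except Exception as exc:
                # One failed render should not abort the rest of the matrix.
                results[pair] = f"error: {exc}"
    return results


def fake_render(model, task):
    """Hypothetical stand-in for a real render; returns the would-be video path."""
    if task == "broken_task":
        raise RuntimeError("render failed")
    return f"outputs/{model}/{task}.mp4"


results = render_matrix(
    ["composer-2-5", "model-b"],
    ["geometric_area_proof", "broken_task"],
    fake_render,
    parallel=2,
)
for pair in sorted(results):
    print(pair, results[pair])
```

Capping the pool size keeps CPU and memory use predictable, since each render is a heavy job.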

Run manifests are immutable. They include suite metadata, prompt hash, task hashes, OpenRouter slugs where available, provider route, git commit, scoring version, Docker digest, and a reference to publish history.
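One way to picture an immutable manifest is a frozen dataclass carrying the fields listed above. The class name, field names, and placeholder values here are assumptions for illustration only; the real manifest format is not shown in this post.

```python
from dataclasses import FrozenInstanceError, asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical in-memory shape of a run manifest; frozen means read-only."""
    suite: str
    prompt_hash: str
    task_hashes: Tuple[Tuple[str, str], ...]  # (task_id, hash) pairs
    provider_route: str
    openrouter_slug: Optional[str]  # None when no OpenRouter slug exists
    git_commit: str
    scoring_version: str
    docker_digest: str
    publish_history_ref: str


manifest = RunManifest(
    suite="benchmarks/v0.5/suite.yaml",
    prompt_hash="<prompt-hash>",          # placeholder value
    task_hashes=(("geometric_area_proof", "<task-hash>"),),
    provider_route="cursor",
    openrouter_slug=None,
    git_commit="<git-commit>",            # placeholder value
    scoring_version="<scoring-version>",  # placeholder value
    docker_digest="<docker-digest>",      # placeholder value
    publish_history_ref="<publish-history-ref>",
)

try:
    manifest.git_commit = "something-else"
    mutated = True
except FrozenInstanceError:
    mutated = False

print(mutated)
print(sorted(asdict(manifest)))
```

Freezing the object only guards against accidental mutation in memory; immutability of a published manifest on disk has to be enforced by the publish process itself.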

Publishing has two targets. Draft pushes a complete bundle to the draft branch for preview. Live pushes a complete bundle to main for Cloudflare production. In both cases the engine validates required files, commits once, then pushes, so the public site is never left half-updated.
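The validate-then-commit-once order can be sketched as a function that refuses to plan any git command until the bundle is complete. Everything specific here is an assumption: the required file names, the remote name origin, the commit message, and the helper names are hypothetical, and the function only returns the commands rather than running them.

```python
import tempfile
from pathlib import Path

# Hypothetical required files; the real bundle contents are not listed in this post.
REQUIRED_FILES = ("manifest.json", "results.json")

# Target-to-branch mapping as described above: draft previews, live goes to main.
BRANCH_FOR_TARGET = {"draft": "draft", "live": "main"}


def missing_files(bundle_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from the bundle directory."""
    root = Path(bundle_dir)
    return [name for name in required if not (root / name).is_file()]


def plan_publish(bundle_dir, target):
    """Validate first, then return one add, one commit, and one push command."""
    if target not in BRANCH_FOR_TARGET:
        raise ValueError(f"unknown publish target: {target}")
    missing = missing_files(bundle_dir)
    if missing:
        raise FileNotFoundError(f"bundle incomplete, missing: {', '.join(missing)}")
    branch = BRANCH_FOR_TARGET[target]
    return [
        ["git", "add", "--all"],
        ["git", "commit", "-m", f"Publish bundle to {target}"],
        ["git", "push", "origin", branch],  # "origin" is an assumed remote name
    ]


with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    (bundle / "manifest.json").write_text("{}", encoding="utf-8")
    print(missing_files(bundle))
    (bundle / "results.json").write_text("{}", encoding="utf-8")
    for command in plan_publish(bundle, "draft"):
        print(" ".join(command))
```

Because validation happens before any commit is created, an incomplete bundle fails locally and nothing reaches the draft or main branch.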

What is not in V0.5a

The TUI is intentionally not part of this release. V0.5a exposes a stable Python orchestrator API first, so the future TUI can call engine functions directly instead of shelling through CLI internals.

Start with the V0.5 operator guide, the suite docs, and the Cursor Composer guide.