Reproducibility

Every leaderboard row can be reproduced from a single command.

What's in a result

Each results/<run_id>.json captures:

Reproducing a row

Given a row, find the agent and the task list, then run:


cae run --agent <agent> --task <task_id> [--docker]

For the official leaderboard, always use --docker so the run is reproducible across machines. Without it, results depend on the local Python/library versions in the workdir.

Upstream drift

If the SWE-bench dataset is updated, old results can still be re-run because task_source.swe_bench_commit is recorded in the result JSON. The importer writes this at import time.