Every leaderboard row can be reproduced from a single command.
Each results/<run_id>.json captures:
- agent, agent_version, model: what was run
- mode: local or docker
- started_at, harness_git_sha: when and with which code
- task_source: the SWE-bench split and upstream commit
- patch, test_results: what the agent produced and how the tests ran

Given a row, find the agent and the task list, then run:
cae run --agent <agent> --task <task_id> [--docker]
For the official leaderboard, always use --docker so the run is reproducible across machines. Without it, results depend on the local Python/library versions in the workdir.
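As a rough sketch of the steps above, the snippet below builds one `cae run` invocation per task from a loaded result record. It assumes the record is a dict with the `agent` and `mode` keys listed earlier. The `build_commands` helper, the sample record, and the task ID are hypothetical, and the task list is passed in separately because the fields above do not say where it is stored.

```python
import shlex


def build_commands(result, task_ids, force_docker=True):
    """Return one `cae run` argv list per task ID.

    `result` is a dict loaded from a results/<run_id>.json file.
    `force_docker=True` mirrors the leaderboard advice to always pass --docker;
    with False, --docker is added only if the original run used docker mode.
    """
    use_docker = force_docker or result.get("mode") == "docker"
    commands = []
    for task_id in task_ids:
        argv = ["cae", "run", "--agent", result["agent"], "--task", task_id]
        if use_docker:
            argv.append("--docker")
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Hypothetical record and task ID, for illustration only.
    record = {"agent": "example-agent", "mode": "docker"}
    for argv in build_commands(record, ["example-task-1"]):
        print(shlex.join(argv))
```

Returning argv lists rather than shell strings keeps the commands safe to pass to `subprocess.run` without extra quoting.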
If the SWE-bench dataset is updated, old results can still be re-run because task_source.swe_bench_commit is recorded in the result JSON. The importer writes this at import time.
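Before re-running an old row, it may be worth confirming that the pinned commit is actually present. The following is a minimal sketch, assuming `task_source` is a JSON object with a `swe_bench_commit` key (as the dotted name suggests); the `pinned_swe_bench_commit` helper is hypothetical and not part of the harness.

```python
import json


def pinned_swe_bench_commit(result_json):
    """Return task_source.swe_bench_commit from a result JSON string, or None."""
    record = json.loads(result_json)
    task_source = record.get("task_source")
    # Assumption: task_source is a nested object; anything else yields None.
    if not isinstance(task_source, dict):
        return None
    return task_source.get("swe_bench_commit")
```

A `None` return means the record does not pin a dataset commit in the assumed location, so a re-run could pick up a newer dataset version.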