FormulaCode Leaderboard
Advantage (Adv)
Human-relative advantage. A value of 0 means the agent performs exactly as well as a human expert; positive values indicate superhuman performance, and negative values mean the agent falls short of the expert.
Speedup
Geometric mean of the speedup ratios across all workloads. A value above 1.0 means faster than the baseline; a value below 1.0 means slower.
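As a rough illustration of how a geometric mean of speedup ratios behaves, here is a minimal Python sketch. It assumes each ratio is the baseline runtime divided by the new runtime, which the leaderboard does not spell out; the function name and the sample ratios are hypothetical.

```python
import math

def geometric_mean_speedup(ratios):
    """Geometric mean of per-workload speedup ratios.

    Assumption: each ratio is baseline_time / new_time, so >1.0 is faster.
    """
    if not ratios:
        raise ValueError("need at least one speedup ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("speedup ratios must be positive")
    # Average in log space, then exponentiate; this is equivalent to the
    # n-th root of the product but avoids overflow on long lists.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical workloads: one runs twice as fast, one runs half as fast.
print(geometric_mean_speedup([2.0, 0.5]))
```

Unlike an arithmetic mean, the geometric mean treats a 2x speedup and a 2x slowdown as cancelling out, which is why it is a common choice for averaging ratios.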
RP Rank
Rank produced by the Ranked Pairs algorithm, which aggregates pairwise comparisons between entries into a single ranking.
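For readers unfamiliar with Ranked Pairs, the sketch below shows the textbook (Tideman) procedure in Python: sort pairwise victories by margin, lock each one in unless it would create a cycle, then read the ranking off the locked graph. How FormulaCode defines a pairwise win and how it breaks ties is not stated here, so the `wins` input, the tie-breaking by name, and the entry names `A`, `B`, `C` are all assumptions for illustration.

```python
from itertools import permutations

def ranked_pairs(candidates, wins):
    """Rank candidates; wins[(a, b)] = number of comparisons where a beat b."""
    # 1. Collect every pair with a strict majority, together with its margin.
    majorities = []
    for a, b in permutations(candidates, 2):
        margin = wins.get((a, b), 0) - wins.get((b, a), 0)
        if margin > 0:
            majorities.append((margin, a, b))
    # 2. Strongest margins first (ties broken by name; an assumption).
    majorities.sort(key=lambda m: (-m[0], m[1], m[2]))

    # 3. Lock pairs in that order, skipping any that would close a cycle.
    locked = {c: set() for c in candidates}

    def reaches(src, dst):
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(locked[node])
        return False

    for _, a, b in majorities:
        if not reaches(b, a):
            locked[a].add(b)

    # 4. Repeatedly take a candidate that no remaining candidate beats.
    remaining, ranking = list(candidates), []
    while remaining:
        top = next(c for c in remaining
                   if not any(c in locked[o] for o in remaining if o != c))
        ranking.append(top)
        remaining.remove(top)
    return ranking

# Hypothetical cyclic preferences: C beats A, A beats B, B beats C.
wins = {("A", "B"): 7, ("B", "A"): 3,
        ("B", "C"): 6, ("C", "B"): 4,
        ("C", "A"): 8, ("A", "C"): 2}
print(ranked_pairs(["A", "B", "C"], wins))
```

Because the ranking comes from pairwise outcomes rather than a single averaged score, an entry can rank above another whose aggregate Advantage is higher.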
Global Leaderboard
| RP Rank | Agent | Model | Advantage | Speedup |
|---|---|---|---|---|
| #1 | OpenHands | Claude 4.0 Sonnet | -0.0112 | 1.0539x |
| #2 | OpenHands | Qwen 3 Coder | -0.0301 | 1.0346x |
| #3 | OpenHands | GPT-5 | -0.0209 | 1.0825x |
| #4 | Terminus 2 | Claude 4.0 Sonnet | -0.0410 | 1.0987x |
| #5 | Terminus 2 | Qwen 3 Coder | -0.0454 | 1.0677x |
| #6 | Terminus 2 | Gemini 2.5 Pro | -0.0433 | 1.0963x |
| #7 | Terminus 2 | GPT-5 | -0.0504 | 1.0585x |
Stratified Leaderboard
Advantage broken down by optimization scope: Level 1 (Module), Level 2 (Class), Level 3 (Function).
| Agent | Model | Overall Adv | Level 1 (Module) | Level 2 (Class) | Level 3 (Function) |
|---|---|---|---|---|---|
| OpenHands | Claude 4.0 Sonnet | -0.0112 | 0.2985 | 0.0156 | -0.0270 |
| OpenHands | GPT-5 | -0.0209 | -0.0119 | 0.0515 | 0.0280 |
| OpenHands | Qwen 3 Coder | -0.0301 | -0.0286 | -0.0223 | -0.0260 |
| Terminus 2 | Claude 4.0 Sonnet | -0.0410 | -0.0450 | -0.0491 | -0.0465 |
| Terminus 2 | Gemini 2.5 Pro | -0.0433 | -0.0370 | -0.0280 | -0.0225 |
| Terminus 2 | Qwen 3 Coder | -0.0454 | -0.0580 | -0.1103 | -0.1052 |
| Terminus 2 | GPT-5 | -0.0504 | -0.0464 | -0.0606 | -0.0676 |
Submit Your Model
To evaluate your own agent on FormulaCode, follow our installation guide.