FormulaCode Leaderboard

  • Advantage (Adv)

    Human-relative advantage. A value of 0 means the agent performs exactly as well as a human expert. Positive values indicate superhuman performance; negative values mean the agent falls short of the human expert.

  • Speedup

    Geometric mean of the speedup ratios across all workloads. A value above 1.0 means faster than the baseline.

  • RP Rank

    Rank assigned by the Ranked Pairs algorithm, which aggregates pairwise comparisons between entries into a single ordering.
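
To make the Speedup metric concrete, here is a minimal sketch of a geometric mean over per-workload speedup ratios. It assumes each ratio is baseline time divided by optimized time, which matches "above 1.0 means faster" but is not confirmed by the leaderboard itself. The function name and the timings are hypothetical.

```python
import math


def geometric_mean_speedup(baseline_times, optimized_times):
    """Geometric mean of per-workload ratios baseline / optimized.

    Assumption: a workload's speedup ratio is its baseline runtime
    divided by its optimized runtime, so values above 1.0 are faster.
    """
    if not baseline_times or len(baseline_times) != len(optimized_times):
        raise ValueError("need two non-empty lists of equal length")
    log_sum = 0.0
    for base, opt in zip(baseline_times, optimized_times):
        if base <= 0 or opt <= 0:
            raise ValueError("runtimes must be positive")
        # Averaging logs and exponentiating gives the geometric mean.
        log_sum += math.log(base / opt)
    return math.exp(log_sum / len(baseline_times))


# Hypothetical runtimes in seconds, for illustration only.
baseline = [2.0, 4.0, 1.0]
optimized = [1.0, 4.0, 2.0]
speedup = geometric_mean_speedup(baseline, optimized)
print(f"{speedup:.4f}x")
```

The geometric mean treats a 2x speedup and a 2x slowdown as cancelling out, which is why it is a common choice for averaging ratios.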

Global Leaderboard

RP Rank  Agent       Model              Advantage  Speedup
#1       OpenHands   Claude 4.0 Sonnet  -0.0112    1.0539x
#2       OpenHands   Qwen 3 Coder       -0.0301    1.0346x
#3       OpenHands   GPT-5              -0.0209    1.0825x
#4       Terminus 2  Claude 4.0 Sonnet  -0.0410    1.0987x
#5       Terminus 2  Qwen 3 Coder       -0.0454    1.0677x
#6       Terminus 2  Gemini 2.5 Pro     -0.0433    1.0963x
#7       Terminus 2  GPT-5              -0.0504    1.0585x
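
The RP Rank column comes from Ranked Pairs aggregation, so it need not follow the Advantage or Speedup order. The sketch below shows the general technique under stated assumptions. The input format (a dictionary of pairwise win counts), the alphabetical tie-breaking, and the entry names are hypothetical, and this is not the leaderboard's actual implementation.

```python
from itertools import combinations


def ranked_pairs(candidates, wins):
    """Order candidates with Ranked Pairs.

    wins[(a, b)] is the number of comparisons in which a beat b
    (hypothetical input format chosen for this sketch).
    """
    # 1. Tally: keep the winner and margin of each head-to-head pair.
    pairs = []
    for a, b in combinations(candidates, 2):
        ab = wins.get((a, b), 0)
        ba = wins.get((b, a), 0)
        if ab > ba:
            pairs.append((ab - ba, a, b))
        elif ba > ab:
            pairs.append((ba - ab, b, a))

    # 2. Sort by margin, largest first; names break ties deterministically.
    pairs.sort(key=lambda p: (-p[0], p[1], p[2]))

    # 3. Lock in each pair unless it would create a cycle.
    locked = {c: set() for c in candidates}

    def reaches(src, dst):
        """Depth-first search over the locked edges."""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(locked[node])
        return False

    for _, winner, loser in pairs:
        if not reaches(loser, winner):
            locked[winner].add(loser)

    # 4. Read the ranking off the acyclic locked graph, sources first.
    ranking, remaining = [], set(candidates)
    while remaining:
        sources = sorted(
            c for c in remaining
            if not any(c in locked[other] for other in remaining)
        )
        ranking.append(sources[0])
        remaining.remove(sources[0])
    return ranking


# Hypothetical win counts that form a cycle: A > B > C > A.
example_wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("C", "A"): 6, ("A", "C"): 4,
}
example_ranking = ranked_pairs(["A", "B", "C"], example_wins)
print(example_ranking)
```

In this example the A-over-B result has the largest margin and is locked first, followed by B-over-C. The C-over-A result would close a cycle, so it is skipped.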

Stratified Leaderboard

Performance broken down by optimization scope: Level 1 (Module), Level 2 (Class), Level 3 (Function).

Agent       Model              Overall Adv  Level 1 (Module)  Level 2 (Class)  Level 3 (Function)
OpenHands   Claude 4.0 Sonnet  -0.0112       0.2985            0.0156          -0.0270
OpenHands   GPT-5              -0.0209      -0.0119            0.0515           0.0280
OpenHands   Qwen 3 Coder       -0.0301      -0.0286           -0.0223          -0.0260
Terminus 2  Claude 4.0 Sonnet  -0.0410      -0.0450           -0.0491          -0.0465
Terminus 2  Gemini 2.5 Pro     -0.0433      -0.0370           -0.0280          -0.0225
Terminus 2  Qwen 3 Coder       -0.0454      -0.0580           -0.1103          -0.1052
Terminus 2  GPT-5              -0.0504      -0.0464           -0.0606          -0.0676

Submit Your Model

To evaluate your own agent on FormulaCode, follow our installation guide.