
FormulaCode: Evaluating Agentic Optimization on Large Codebases

1UT Austin, 2Caltech, 3Cornell
* Equal contribution

Abstract

Rapid advances in LLM agents have demonstrated the ability to optimize code at the repository level, leading to an urgent need for benchmarks that measure this ability to drive impact in real-world use cases. Existing code benchmarks, which rely on synthetic or LLM-generated tasks, single-objective workloads, or binary pass/fail outcomes, offer a constrained evaluation landscape compared to these emerging capabilities.

To bridge this gap, we introduce FormulaCode, a novel benchmark designed for evaluating agentic optimization on large codebases, with a focus on real-world multi-objective performance optimization.

FormulaCode is a live benchmark consisting of 961 real-world performance-bottleneck tasks mined from scientific GitHub repositories, with an average of 1,532 workloads per task. As such, FormulaCode represents the first large-scale analysis of the holistic ability of LLM agents to optimize codebases.

We find that FormulaCode proves to be a challenging dataset for frontier LLMs and agent frameworks, with unrestricted repository exploration emerging as the primary factor in finding performance inefficiencies. FormulaCode is also robust to data leakage: simply copying the online solution yields no leaderboard improvement.

FormulaCode’s Leaderboard (Tentative)

Snapshot of latest results on FormulaCode. Updated Monthly!

Agent | L1: Parameter | L2: Function | L3: Class | L4: Module | Overall
--- | --- | --- | --- | --- | ---
Terminus-2 (Claude Sonnet 4.0) | 0.0189 | 0.0148 | 0.0194 | 0.0186 | 0.0177
Terminus-2 (GPT-5) | 0.0112 | 0.0112 | 0.0107 | 0.0105 | 0.0110
Expert | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000

Don’t see your model on the leaderboard?

To evaluate an agent on FormulaCode, follow the Installation instructions and run:

harbor run -d formulacode@0.1.0.post20251025 -a oracle

The next sections dive into FormulaCode’s analysis with interactive visualizations on a representative subset of FormulaCode. For up-to-date results and insights, please read the paper!

Read the paper
[Interactive figure: cumulative distribution function of speedup (x-axis: Speedup, y-axis: Cumulative %) for Claude, GPT-5, and Oracle.]

Your codebase isn’t as fast as it used to be, and you want to use an agent to optimize the code. You’ve got no preference for a model or agent framework, but you want it to work without any intervention. Which agent-model pair do you choose?

Couldn’t decide? Maybe this info will help: Terminus 2 + GPT-5 has the highest advantage at producing module-level optimizations, but it often overlooks small optimizations. Terminus 2 + Claude Sonnet 4.0 finds function-level optimizations pretty well, but it might not be the best for deep optimizations. How do we know? Keep scrolling.

We scraped 110+ GitHub repositories with crowdsourced performance workloads and identified all pull requests that were intended to improve the performance of a specific piece of code. Then we measured the repository’s runtime before and after each PR to check whether its performance improvement was statistically significant.
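To make the before/after comparison concrete, here is a minimal sketch of what such a significance check could look like, assuming repeated timing samples per workload and a one-sided Mann-Whitney U test; the function name and the acceptance threshold are our own, and the benchmark’s actual statistical procedure is described in the paper.

```python
# Illustrative sketch only: compares timing samples collected before and after a PR.
# Assumes each workload is timed repeatedly on both commits; FormulaCode's actual
# test and threshold may differ.
from scipy.stats import mannwhitneyu

def is_significant_speedup(times_before, times_after, alpha=0.01):
    """Return True if the post-PR timings are significantly faster."""
    # One-sided test: are the pre-PR samples stochastically larger (i.e., slower)?
    _, p_value = mannwhitneyu(times_before, times_after, alternative="greater")
    return p_value < alpha

# Made-up timing samples (seconds) for a single workload:
before = [1.21, 1.19, 1.23, 1.20, 1.22, 1.24, 1.18, 1.21]
after  = [0.87, 0.85, 0.88, 0.86, 0.89, 0.84, 0.87, 0.86]
print(is_significant_speedup(before, after))  # True for this synthetic example
```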

After analyzing 1M+ PRs, we identified 961 performance-improving tasks with 1,472,080 total performance workloads across all tasks. For each of these problems, we asked a frontier LLM agent to optimize the code, given the same tools available to the human developers, and then measured performance after rejecting optimizations that broke the code. Read more in the methodology.

Here’s a cumulative distribution function of the speedup ratio for each of our models. Hover over a model to see more details! A CDF is essentially the running integral of the histogram: the more slowly the CDF curve rises, the more benchmarks lie in the high-speedup region, and the better the model.
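As a concrete illustration, here is a minimal sketch of how an empirical CDF of speedup ratios can be computed; the speedup values below are hypothetical and not taken from the benchmark.

```python
import numpy as np

def empirical_cdf(speedups):
    """Return sorted speedups and the cumulative fraction of workloads at or below each value."""
    xs = np.sort(np.asarray(speedups, dtype=float))
    ys = np.arange(1, len(xs) + 1) / len(xs)  # cumulative fraction in (0, 1]
    return xs, ys

# Hypothetical speedup ratios (baseline time / patched time) for a handful of workloads.
speedups = [0.98, 1.01, 1.03, 1.05, 1.10, 1.42, 2.30]
xs, ys = empirical_cdf(speedups)
for x, y in zip(xs, ys):
    print(f"speedup <= {x:.2f}x for {y:.0%} of workloads")
```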

At first glance, it looks like our agents are doing pretty well! For GPT-5 and Claude Sonnet 4.0, there are a lot of jagged bumps, and about 3-5% of all benchmarks are outliers where both models show extreme code optimizations. However, 75-80% of all benchmarks are modest improvements, with a speedup of less than 10%.

However, with a median of 81 benchmarks per task, good performance on a lone workload doesn’t tell us much about the holistic performance of such agents. What we really care about is whether models have a consistent advantage at optimizing code.

What emerges from the above analysis is that speedup alone doesn’t capture the full picture.

Performance optimizations rarely have isolated effects; an optimization in one part of the code could significantly slow down or speed up another part of the code.

Instead, we hypothesize that good performance optimizations produce an aggregate advantage. This requires reasoning about multiple workloads across multiple functionalities and target resources, and producing speedups consistently.

To understand more, let’s dive deeper into the data.

[Interactive scatterplot: Agent Speedup on the x-axis, Oracle Speedup on the y-axis.]

Instead of looking at the expert-produced speedup and the model-produced speedup separately, let’s look at them together on a scatterplot.

The Human Speedup is on the y-axis, so the better the human speedup, the closer the point is to the top. The Model Speedup is on the x-axis.

Each data point represents a statistically significant workload captured in our benchmark.

The highlighted workload lies at position x=1.11 and y=1.38. That is, the human engineer optimized this workload to be 38% faster than the baseline while the agent’s optimization was only 11% faster.

The agent’s achievements are much less impressive now because the agent demonstrates no Advantage over the oracle.

So, where do the most impressive speedups lie? Let’s load the entire dataset and demarcate some regions of interest.

The identity function line depicts Equal advantage. For any workload on this line, an agent-written patch is exactly as good as a human-written patch.

Workloads that cause slowdowns will have a speedup less than 1.00x.

The No oracle speedup and No agent speedup lines, both centered at 1.00, help visualize this.

Now, we have 4 regions of interest.

The Bottom Left region characterizes Regressions; these are all the workloads where the agent and the oracle both caused a Performance Regression.

This could be an intentional tradeoff, or just a tricky workload for both agents and humans.

The Top left region shows sub-optimal benchmarks – the benchmarks where the oracle achieved a speedup but the agent caused a regression.

This is the worst region for an agent.

The Top right region shows under-optimized benchmarks – the agent still achieves some speedup but the expert-provided solution was much better.

Depending on how resources are prioritized, any workload here may still represent a worthwhile tradeoff.

What we are really interested in are Super optimizations – these are the workloads where the agent produced optimizations that were better than the oracle’s optimizations and better than the baseline.
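As a rough illustration of these regions, here is a hypothetical classification helper based only on the descriptions above; it is not code from FormulaCode.

```python
def classify_workload(agent_speedup: float, oracle_speedup: float) -> str:
    """Classify one workload into the regions described above (illustrative only)."""
    if agent_speedup < 1.0 and oracle_speedup < 1.0:
        return "regression"          # both the agent and the oracle slowed the workload down
    if agent_speedup < 1.0 <= oracle_speedup:
        return "sub-optimal"         # the oracle sped it up, the agent caused a regression
    if agent_speedup >= 1.0 and agent_speedup > oracle_speedup:
        return "super-optimization"  # the agent beat both the baseline and the oracle
    return "under-optimized"         # the agent sped it up, but the oracle did better

print(classify_workload(1.11, 1.38))  # 'under-optimized' (the highlighted example above)
```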

This allows us to define a notion of agent advantage. Mathematically, we treat the oracle speedups and the agent speedups as two dimensionless vectors.

We can then define a metric for overall performance by calculating the average distance of each point from the equal advantage line.

Intuitively, the closer a point is to the equal advantage line, the lower its score.
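As a minimal sketch of one way to write such a metric, assuming that “distance” means the signed perpendicular distance from the equal-advantage line y = x (the paper gives the exact definition):

```latex
% Sketch only: assumes the advantage is the mean signed perpendicular distance
% from the equal-advantage line y = x; the paper's exact definition may differ.
\[
\mathbf{s}^{\mathrm{oracle}},\ \mathbf{s}^{\mathrm{agent}} \in \mathbb{R}^{n}_{>0},
\qquad
\mathrm{Advantage}
  \;=\; \frac{1}{n} \sum_{i=1}^{n}
        \frac{s^{\mathrm{agent}}_{i} - s^{\mathrm{oracle}}_{i}}{\sqrt{2}} .
\]
```

Under this convention, points below the line (agent faster than the oracle) contribute positively, points above it contribute negatively, and points on the line contribute nothing.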

What if an agent tries to mimic the Human’s steps?

Unsurprisingly, all the points lie on the equal advantage line. This means that an agent simply replicating a memorized solution would get an advantage of 0.0.

Here’s the Human vs. Claude plot.

Most benchmarks are either super-optimized or under-optimized!

Claude’s advantage score here is 0.0749, which means Claude does slightly better than the expert on these problems.

The Human vs. GPT-5 comparison is similar.

We see a few super-optimizations but mostly sub-optimizations.

GPT-5’s advantage score is -0.02, so it does slightly worse than the human experts.

This is surprising. Is Claude truly better than GPT-5 and humans?

This is a good time to talk about our grouping scheme.

In the bottom left corner, notice that the data points aren’t being aggregated yet, so we’re still looking at individual workloads.

To investigate holistic optimization abilities, we can group workloads together based on their prefix strings (e.g., aggregate all workloads under pandas.algorithm.*), as sketched below.
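Here is a minimal sketch of this kind of prefix grouping; the workload names and the geometric-mean summary are hypothetical choices, and the real pipeline may aggregate differently.

```python
from collections import defaultdict
from statistics import geometric_mean

# Hypothetical per-workload speedups keyed by dotted benchmark name.
speedups = {
    "pandas.algorithm.take.TimeTake1D": 1.42,
    "pandas.algorithm.take.TimeTake2D": 1.05,
    "pandas.algorithm.duplicated.TimeDuplicated": 0.97,
    "pandas.groupby.TimeGroupByAgg": 1.12,
}

def aggregate_by_prefix(speedups, depth=2):
    """Group workloads by the first `depth` components of their dotted name
    and summarize each group with a geometric mean (one plausible choice)."""
    groups = defaultdict(list)
    for name, s in speedups.items():
        prefix = ".".join(name.split(".")[:depth])
        groups[prefix].append(s)
    return {prefix: geometric_mean(vals) for prefix, vals in groups.items()}

print(aggregate_by_prefix(speedups, depth=2))
# {'pandas.algorithm': ~1.13, 'pandas.groupby': 1.12}
```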

This is the same Human vs. Claude plot, but aggregated on Modules.

The oracle’s performance increases significantly and most of Claude’s optimizations disappear! The new advantage score is -0.0002.

So, Claude’s aggregate performance optimization capabilities are much weaker than its individual performance optimization capabilities.

With the same aggregation, GPT-5’s advantage score is 0.0034, so the two models’ ranking flips.

But all this is conditioned on our definition of what counts as equal advantage. What if the minimum acceptable speedup is different?

Use the sliders to set your own criteria for equal advantage, and keep scrolling to see a model-by-model breakdown based on your selection.
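Purely as an illustration of what such a threshold might mean (the slider semantics on the page may differ), here is a hypothetical filter that treats any speedup below a minimum ratio as no speedup at all:

```python
def apply_min_speedup(agent_speedup, oracle_speedup, min_speedup=1.02):
    """Clamp speedups below `min_speedup` to 1.0 (i.e., treat them as no change).
    Hypothetical interpretation of a 'minimum acceptable speedup' slider."""
    clamp = lambda s: s if s >= min_speedup else 1.0
    return clamp(agent_speedup), clamp(oracle_speedup)

# With a 2% floor, a 1.01x agent speedup is treated as no change:
print(apply_min_speedup(1.01, 1.38))  # (1.0, 1.38)
```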

Terminus 2 - Claude Sonnet 4.0

Median Agent Speedup: 1.03x
Median Oracle Speedup: 1.03x
Total Benchmarks: 518
Agent Advantage: 0.07

[Interactive histograms: agent and oracle speedup distributions (vs. no-op baseline) for terminus-2 + Claude Sonnet 4.0 across 518 benchmarks.]

Terminus 2 - GPT-5

Median Agent Speedup: 1.05x
Median Oracle Speedup: 1.04x
Total Benchmarks: 600
Agent Advantage: 0.01

[Interactive histograms: agent and oracle speedup distributions (vs. no-op baseline) for terminus-2 + GPT-5 across 600 benchmarks.]

Terminus 2 - Oracle

Median Agent Speedup: 1.02x
Median Oracle Speedup: 1.02x
Total Benchmarks: 1,075
Agent Advantage: 0.00

[Interactive histograms: agent and oracle speedup distributions (vs. no-op baseline) for terminus-2 + oracle across 1,075 benchmarks.]

Leaderboard

This leaderboard displays the agent advantage scores by aggregation level. Higher scores indicate better performance relative to the oracle.

Use the thresholding filters above and see how they change the leaderboard.

Agent | L1: Parameter | L2: Function | L3: Class | L4: Module | Overall
--- | --- | --- | --- | --- | ---
Terminus 2 - Claude Sonnet 4.0 | 0.0955 | 0.0291 | 0.1424 | -0.0002 | 0.0749
Terminus 2 - GPT-5 | 0.0090 | 0.0092 | 0.0060 | 0.0034 | 0.0079
Terminus 2 - Oracle | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000