FormulaCode: Evaluating Agentic Superoptimization on Large Codebases

1UT Austin, 2Cornell, 3Caltech. *Equal Contribution
Test cases streamline evaluation but constrain coding agents (e.g., AlphaEvolve) to a pass/fail reward, a signal too sparse to drive iterative optimization. FormulaCode introduces a live, repository-level benchmark that complements existing work such as SWE-Bench (shown in gray) by challenging agents to optimize 451 real-world performance bottlenecks against human solutions drawn from community-maintained benchmarks (shown in light blue). These community benchmarks provide evaluation functions that capture fine-grained performance signals, are less susceptible to data leakage, and expose a larger optimization surface to coding agents.

Abstract

Rapid advances in LLM agents have demonstrated the ability to optimize code against continuous objective functions, a significant leap beyond traditional code generation techniques. However, novel benchmarks are urgently needed to measure this capability and translate it into real-world impact. Current code benchmarks, which typically rely on binary pass/fail outcomes, offer an evaluation framework too limited to capture the full potential of these emerging capabilities.
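To make the sparsity argument concrete, here is a minimal Python sketch of the two reward regimes. The function names and harness are our own illustration, not FormulaCode's API: a pass/fail reward collapses every correct program to the same score, whereas a correctness-gated speedup reward orders them, giving an iterative agent a signal it can climb.

import statistics
import time
from typing import Callable

def pass_fail_reward(candidate: Callable, tests: list[Callable]) -> float:
    # Binary signal: 1.0 only if every test passes.
    # Two correct candidates are indistinguishable, even if one is 10x faster.
    return float(all(test(candidate) for test in tests))

def continuous_reward(candidate: Callable, baseline: Callable,
                      workload, tests: list[Callable], trials: int = 5) -> float:
    # Continuous signal: speedup over a baseline, gated on correctness.
    # Every incremental optimization moves the score.
    if not all(test(candidate) for test in tests):
        return 0.0

    def median_runtime(fn: Callable) -> float:
        samples = []
        for _ in range(trials):
            start = time.perf_counter()
            fn(workload)
            samples.append(time.perf_counter() - start)
        return statistics.median(samples)

    return median_runtime(baseline) / median_runtime(candidate)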

To bridge this gap, we introduce FormulaCode, a benchmark for evaluating agentic superoptimization on large codebases, with a focus on real-world performance optimization. Constructed from 451 real-world performance bottlenecks automatically mined from GitHub, FormulaCode enables comprehensive testing of an agent's ability to triage, diagnose, and resolve inefficiencies in realistic software environments.
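As one illustration of how such mined instances could be represented and scored, the sketch below is our own simplification, not the paper's actual schema; field names like benchmark_cmd and human_patch_speedup are hypothetical.

from dataclasses import dataclass

@dataclass
class FormulaCodeTask:
    # Hypothetical shape of one mined instance: a performance bottleneck
    # paired with the community-maintained benchmark that exposed it.
    repo_url: str               # repository containing the bottleneck
    base_commit: str            # commit where the slowdown reproduces
    benchmark_cmd: str          # command that runs the benchmark suite
    human_patch_speedup: float  # speedup achieved by the human fix (assumed > 1.0)

def normalized_score(agent_speedup: float, task: FormulaCodeTask) -> float:
    # Normalize against the human solution: 0.0 means no improvement,
    # 1.0 means the agent matched the human patch, and values above 1.0
    # leave headroom for superoptimization.
    if agent_speedup <= 1.0:
        return 0.0
    return (agent_speedup - 1.0) / (task.human_patch_speedup - 1.0)

Normalizing against the human patch is one natural design choice: it grounds the score in a known-achievable fix while leaving room above 1.0 for agents that beat it.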

FormulaCode proves to be a challenging benchmark for frontier LLMs and agentic frameworks, with unrestricted repository exploration emerging as a key factor in finding performance inefficiencies. By introducing FormulaCode, we aim to drive the development of next-generation optimization algorithms that meet the rigorous demands of real-world software projects.

⚠️ Work in progress. Check back in a few days for updates!

Related Links

This project would not be possible without the excellent work of the community. The following papers provide helpful context for the premise of our work:

BibTeX

If you found this post interesting, please read our paper for mathematical details and experimental results. You can cite our paper as follows:

@misc{sehgal2025formulacode,
	title={FormulaCode: Evaluating Agentic Superoptimization on Large Codebases},
	author={Atharva Sehgal and Patrick Yuan and Ziniu Hu and Yisong Yue and Jennifer J. Sun and Swarat Chaudhuri},
	year={2025},
	eprint={????.?????},
	archivePrefix={arXiv},
	primaryClass={cs.SE},
	url={https://arxiv.org/abs/????.?????}, 
}