
FormulaCode: Evaluating Agentic Optimization on Large Codebases

1UT Austin, 2Caltech, 3Cornell
* Equal contribution

Abstract

Rapid advances in LLM agents have demonstrated the ability to optimize code at the repository level, leading to an urgent need for benchmarks that measure this ability to drive impact in real-world use cases. Existing code benchmarks, which rely on synthetic or LLM-generated tasks, single-objective workloads, or binary pass/fail outcomes, offer a constrained evaluation landscape compared to these emerging capabilities.

To bridge this gap, we introduce FormulaCode, a novel benchmark designed for evaluating agentic optimization on large codebases, with a focus on real-world multi-objective performance optimization.

FormulaCode is a live benchmark consisting of 961 real-world performance-bottleneck tasks mined from scientific GitHub repositories, with an average of 1,532 workloads per task. As such, FormulaCode represents the first large-scale analysis of the holistic ability of LLM agents to optimize codebases.

We find that FormulaCode proves to be a challenging dataset for frontier LLMs and agent frameworks, with unrestricted repository exploration emerging as the primary factor in finding performance inefficiencies. FormulaCode is also robust to data leakage: simply copying the online solution yields no leaderboard improvement.

FormulaCode’s Leaderboard (Tentative)

Snapshot of latest results on FormulaCode. Updated Monthly!

Agent | L1: Parameter | L2: Function | L3: Class | L4: Module | Overall
--- | --- | --- | --- | --- | ---
Terminus-2 (Claude Sonnet 4.0) | 0.0189 | 0.0148 | 0.0194 | 0.0186 | 0.0177
Terminus-2 (GPT-5) | 0.0112 | 0.0112 | 0.0107 | 0.0105 | 0.0110
Expert | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000

Don’t see your model on the leaderboard?

To evaluate an agent on FormulaCode, follow the Installation instructions and run:

harbor run -d formulacode@0.1.0.post20251025 -a oracle

The next sections dive into FormulaCode’s analysis with interactive visualizations on a representative subset of FormulaCode. For up-to-date results and insights, please read the paper!

Read the paper
[Interactive figure: cumulative distribution function of speedup (x-axis: Speedup, y-axis: Cumulative %) for Claude, GPT-5, and Oracle.]

Your codebase isn’t as fast as it used to be, and you want to use an agent to optimize the code. You’ve got no preference for a model or agent framework, but you want it to work without any intervention. Which agent-model pair do you choose?

Couldn’t decide? Maybe this info will help: Terminus 2 + GPT-5 has the highest advantage at producing module-level optimizations, but it often overlooks small optimizations. Terminus 2 + Claude Sonnet 4.0 finds function-level optimizations pretty well, but it might not be the best for deep optimizations. How do we know? Keep scrolling.

We scraped 110+ GitHub repositories with crowdsourced performance workloads and identified all pull requests that were intended to improve the performance of a specific piece of code. Then we measured the repository’s runtime before and after each PR to check whether its performance improvement was statistically significant.
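To make the before/after comparison concrete, here is a minimal sketch of what such a significance check could look like, assuming repeated timing samples per workload and a one-sided Mann-Whitney U test; the function name and the acceptance threshold are our own, and the benchmark’s actual statistical procedure is described in the paper.

```python
# Illustrative sketch only: compares timing samples collected before and after a PR.
# Assumes each workload is timed repeatedly on both commits; FormulaCode's actual
# test and threshold may differ.
from scipy.stats import mannwhitneyu

def is_significant_speedup(times_before, times_after, alpha=0.01):
    """Return True if the post-PR timings are significantly faster."""
    # One-sided test: are the pre-PR samples stochastically larger (i.e., slower)?
    _, p_value = mannwhitneyu(times_before, times_after, alternative="greater")
    return p_value < alpha

# Made-up timing samples (seconds) for a single workload:
before = [1.21, 1.19, 1.23, 1.20, 1.22, 1.24, 1.18, 1.21]
after  = [0.87, 0.85, 0.88, 0.86, 0.89, 0.84, 0.87, 0.86]
print(is_significant_speedup(before, after))  # True for this synthetic example
```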

After analyzing 1M+ PRs, we identified 961 performance-improving tasks with 1,472,080 total performance workloads across all tasks. For each of these problems, we asked a frontier LLM agent to optimize the code, given the same tools available to the human developers, and then measured performance after rejecting optimizations that broke the code. Read more in the methodology.

Here’s a cumulative distribution function of the speedup ratio for each of our models. Hover over a model to see more details! A CDF is essentially the running integral of the histogram: the more slowly the CDF curve rises, the more benchmarks lie in the high-speedup region, and the better the model.
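As a concrete illustration, here is a minimal sketch of how an empirical CDF of speedup ratios can be computed; the speedup values below are hypothetical and not taken from the benchmark.

```python
import numpy as np

def empirical_cdf(speedups):
    """Return sorted speedups and the cumulative fraction of workloads at or below each value."""
    xs = np.sort(np.asarray(speedups, dtype=float))
    ys = np.arange(1, len(xs) + 1) / len(xs)  # cumulative fraction in (0, 1]
    return xs, ys

# Hypothetical speedup ratios (baseline time / patched time) for a handful of workloads.
speedups = [0.98, 1.01, 1.03, 1.05, 1.10, 1.42, 2.30]
xs, ys = empirical_cdf(speedups)
for x, y in zip(xs, ys):
    print(f"speedup <= {x:.2f}x for {y:.0%} of workloads")
```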

At first glance, it looks like our agents are doing pretty well! For GPT-5 and Claude Sonnet 4.0, there are a lot of jagged bumps, and about 3-5% of all benchmarks are outliers where both models show extreme code optimizations. However, 75-80% of all benchmarks are modest improvements, with a speedup of less than 10%.

However, with a median of 81 benchmarks per task, good performance on a lone workload doesn’t tell us much about the holistic performance of such agents. What we really care about is whether models have a consistent advantage at optimizing code.

What emerges from the above analysis is that speedup alone doesn’t capture the full picture.

Performance optimizations rarely have isolated effects; an optimization in one part of the code could significantly slow down or speed up another part of the code.

Instead, we hypothesize that good performance optimizations produce an aggregate advantage. This requires reasoning about multiple workloads across multiple functionalities and target resources, and producing speedups consistently.

To understand more, let’s dive deeper into the data.

[Interactive scatterplot: Agent Speedup on the x-axis, Oracle Speedup on the y-axis.]

Instead of looking at the expert-produced speedup and the model-produced speedup separately, let’s look at them together on a scatterplot.

The Human Speedup is on the y-axis, so the better the human speedup, the closer the point is to the top. The Model Speedup is on the x-axis.

Each data point represents a statistically significant workload captured in our benchmark.

The highlighted workload lies at position x=1.11 and y=1.38. That is, the human engineer optimized this workload to be 38% faster than the baseline while the agent’s optimization was only 11% faster.

The agent’s achievements are much less impressive now because the agent demonstrates no Advantage over the oracle.

So, where do the most impressive speedups lie? Let’s load the entire dataset and demarcate some regions of interest.

The identity function line depicts Equal advantage. For any workload on this line, an agent-written patch is exactly as good as a human-written patch.

Workloads that cause slowdowns will have a speedup less than 1.00x.

The No oracle speedup and No agent speedup lines, both centered at 1.00, help visualize this.

Now, we have 4 regions of interest.

The Bottom Left region characterizes Regressions; these are all the workloads where the agent and the oracle both caused a Performance Regression.

This could be an intentional tradeoff, or just a tricky workload for both agents and humans.

The Top left region shows sub-optimal benchmarks – the benchmarks where the oracle achieved a speedup but the agent caused a regression.

This is the worst region for an agent.

The Top right region shows under-optimized benchmarks – the agent still achieves some speedup but the expert-provided solution was much better.

Depending on how resources are prioritized, any workload here may still represent a worthwhile tradeoff.

What we are really interested in are Super optimizations – these are the workloads where the agent produced optimizations that were better than the oracle’s optimizations and better than the baseline.
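As a rough illustration of these regions, here is a hypothetical classification helper based only on the descriptions above; it is not code from FormulaCode.

```python
def classify_workload(agent_speedup: float, oracle_speedup: float) -> str:
    """Classify one workload into the regions described above (illustrative only)."""
    if agent_speedup < 1.0 and oracle_speedup < 1.0:
        return "regression"          # both the agent and the oracle slowed the workload down
    if agent_speedup < 1.0 <= oracle_speedup:
        return "sub-optimal"         # the oracle sped it up, the agent caused a regression
    if agent_speedup >= 1.0 and agent_speedup > oracle_speedup:
        return "super-optimization"  # the agent beat both the baseline and the oracle
    return "under-optimized"         # the agent sped it up, but the oracle did better

print(classify_workload(1.11, 1.38))  # 'under-optimized' (the highlighted example above)
```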

This allows us to define a notion of agent advantage. Mathematically, we treat the oracle speedups and the agent speedups as two dimensionless vectors.

We can then define a metric for overall performance by calculating the average distance of each point from the equal advantage line.

Intuitively, the closer a point is to the equal advantage line, the lower its score.
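As a minimal sketch of one way to write such a metric, assuming that “distance” means the signed perpendicular distance from the equal-advantage line y = x (the paper gives the exact definition):

```latex
% Sketch only: assumes the advantage is the mean signed perpendicular distance
% from the equal-advantage line y = x; the paper's exact definition may differ.
\[
\mathbf{s}^{\mathrm{oracle}},\ \mathbf{s}^{\mathrm{agent}} \in \mathbb{R}^{n}_{>0},
\qquad
\mathrm{Advantage}
  \;=\; \frac{1}{n} \sum_{i=1}^{n}
        \frac{s^{\mathrm{agent}}_{i} - s^{\mathrm{oracle}}_{i}}{\sqrt{2}} .
\]
```

Under this convention, points below the line (agent faster than the oracle) contribute positively, points above it contribute negatively, and points on the line contribute nothing.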

What if an agent tries to mimic the Human’s steps?

Unsurprisingly, all the points lie on the equal advantage line. This means that an agent simply replicating a memorized solution would get an advantage of 0.0.

Here’s the Human vs. Claude plot.

Most benchmarks are either super-optimized or under-optimized!

Claude’s advantage score here is 0.0749, which means Claude does slightly better than the expert on these problems.

The Human vs. GPT-5 comparison is similar.

We see a few super-optimizations but mostly sub-optimizations.

GPT-5’s advantage score is -0.02, so it does slightly worse than the human experts.

This is surprising. Is Claude truly better than GPT-5 and humans?

This is a good time to talk about our grouping scheme.

In the bottom left corner, notice that the data points aren’t being aggregated yet, so we’re still looking at individual workloads.

To investigate holistic optimization abilities, we can group workloads together based on their prefix strings (e.g., aggregate all workloads under pandas.algorithm.*), as sketched below.
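Here is a minimal sketch of this kind of prefix grouping; the workload names and the geometric-mean summary are hypothetical choices, and the real pipeline may aggregate differently.

```python
from collections import defaultdict
from statistics import geometric_mean

# Hypothetical per-workload speedups keyed by dotted benchmark name.
speedups = {
    "pandas.algorithm.take.TimeTake1D": 1.42,
    "pandas.algorithm.take.TimeTake2D": 1.05,
    "pandas.algorithm.duplicated.TimeDuplicated": 0.97,
    "pandas.groupby.TimeGroupByAgg": 1.12,
}

def aggregate_by_prefix(speedups, depth=2):
    """Group workloads by the first `depth` components of their dotted name
    and summarize each group with a geometric mean (one plausible choice)."""
    groups = defaultdict(list)
    for name, s in speedups.items():
        prefix = ".".join(name.split(".")[:depth])
        groups[prefix].append(s)
    return {prefix: geometric_mean(vals) for prefix, vals in groups.items()}

print(aggregate_by_prefix(speedups, depth=2))
# {'pandas.algorithm': ~1.13, 'pandas.groupby': 1.12}
```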

This is the same Human vs. Claude plot, but aggregated on Modules.

The oracle’s performance increases significantly and most of Claude’s optimizations disappear! The new advantage score is -0.0002.

So, Claude’s aggregate performance optimization capabilities are much weaker than its individual performance optimization capabilities.

With the same aggregation, GPT-5’s advantage score is 0.0034, so the two models’ ranking flips.

But all this is conditioned on our definition of what counts as equal advantage. What if the minimum acceptable speedup is different?

Use the sliders to set your own criteria for equal advantage, and keep scrolling to see a model-by-model breakdown based on your selection.
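Purely as an illustration of what such a threshold might mean (the slider semantics on the page may differ), here is a hypothetical filter that treats any speedup below a minimum ratio as no speedup at all:

```python
def apply_min_speedup(agent_speedup, oracle_speedup, min_speedup=1.02):
    """Clamp speedups below `min_speedup` to 1.0 (i.e., treat them as no change).
    Hypothetical interpretation of a 'minimum acceptable speedup' slider."""
    clamp = lambda s: s if s >= min_speedup else 1.0
    return clamp(agent_speedup), clamp(oracle_speedup)

# With a 2% floor, a 1.01x agent speedup is treated as no change:
print(apply_min_speedup(1.01, 1.38))  # (1.0, 1.38)
```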

Terminus 2 - Claude Sonnet 4.0

Median Agent Speedup: 1.03x
Median Oracle Speedup: 1.03x
Total Benchmarks: 518
Agent Advantage: 0.07

[Interactive histograms: agent and oracle speedup distributions (vs. no-op baseline) for terminus-2 + Claude Sonnet 4.0 across 518 benchmarks.]

Terminus 2 - GPT-5

Median Agent Speedup: 1.05x
Median Oracle Speedup: 1.04x
Total Benchmarks: 600
Agent Advantage: 0.01

[Interactive histograms: agent and oracle speedup distributions (vs. no-op baseline) for terminus-2 + GPT-5 across 600 benchmarks.]

Terminus 2 - Oracle

Median Agent Speedup: 1.02x
Median Oracle Speedup: 1.02x
Total Benchmarks: 1,075
Agent Advantage: 0.00

[Interactive histograms: agent and oracle speedup distributions (vs. no-op baseline) for terminus-2 + oracle across 1,075 benchmarks.]

Leaderboard

This leaderboard displays the agent advantage scores by aggregation level. Higher scores indicate better performance relative to the oracle.

Use the thresholding filters above and see how they change the leaderboard.

Agent | L1: Parameter | L2: Function | L3: Class | L4: Module | Overall
--- | --- | --- | --- | --- | ---
Terminus 2 - Claude Sonnet 4.0 | 0.0955 | 0.0291 | 0.1424 | -0.0002 | 0.0749
Terminus 2 - GPT-5 | 0.0090 | 0.0092 | 0.0060 | 0.0034 | 0.0079
Terminus 2 - Oracle | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000