Rapid advances in LLM agents have demonstrated an ability to optimize code against continuous objective functions, a significant leap beyond traditional code generation. However, there is an urgent need for benchmarks that can measure this capability and translate it into real-world impact. Current code benchmarks, which typically rely on binary pass/fail outcomes, offer a limited evaluation signal that falls short of capturing the full potential of these emerging capabilities.
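To make the contrast concrete, here is a minimal sketch in Python of how a continuous, performance-aware score differs from a binary pass/fail check. The function names, the test-harness layout, and the speedup-over-baseline metric are illustrative assumptions for this post, not FormulaCode's actual evaluation protocol.

```python
import statistics
import time
from typing import Callable

def binary_score(candidate: Callable, tests: list[tuple]) -> bool:
    """Traditional pass/fail evaluation: correct on every test, or not."""
    return all(candidate(*args) == expected for args, expected in tests)

def continuous_score(candidate: Callable, baseline: Callable,
                     tests: list[tuple], repeats: int = 5) -> float:
    """Performance-aware evaluation: a correct candidate earns a score
    proportional to its measured speedup over the baseline implementation."""
    if not binary_score(candidate, tests):
        return 0.0  # correctness remains a hard gate

    def runtime(fn: Callable) -> float:
        # Median of several timed runs over the whole test suite.
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            for args, _ in tests:
                fn(*args)
            samples.append(time.perf_counter() - start)
        return statistics.median(samples)

    return runtime(baseline) / runtime(candidate)  # > 1.0 means faster

# Example: an agent replaces a linear-time sum of squares with a closed form.
baseline = lambda n: sum(i * i for i in range(n))
optimized = lambda n: (n - 1) * n * (2 * n - 1) // 6
tests = [((10,), 285), ((100,), 328350)]
print(binary_score(optimized, tests))                # True: pass/fail sees no difference
print(continuous_score(optimized, baseline, tests))  # speedup factor, well above 1.0
```

Under such a metric, a submission that is merely correct scores around 1.0, while a genuinely optimized one is rewarded in proportion to its speedup.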
To bridge this gap, we introduce FormulaCode, a benchmark for evaluating agentic superoptimization on large codebases, with a focus on real-world performance optimization. Constructed from 451 real-world performance bottlenecks automatically mined from GitHub, FormulaCode comprehensively tests an agent's ability to triage, diagnose, and resolve inefficiencies in realistic software environments.
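As a rough illustration of how such bottlenecks can be surfaced, the sketch below greps a repository's commit history for performance-related keywords. The keyword list, the git invocation, and the helper name candidate_perf_commits are assumptions made for the sake of the example, not the paper's mining pipeline.

```python
import subprocess

# Hypothetical keyword heuristic; the actual mining criteria may differ.
PERF_KEYWORDS = ["speedup", "speed up", "optimize", "faster", "performance"]

def candidate_perf_commits(repo_path: str, limit: int = 200) -> list[str]:
    """Return commit hashes whose messages mention a performance keyword.

    Each hit is a candidate bottleneck: the parent commit gives the slow
    "before" code, and the commit itself gives a human-written fix that an
    agent's optimization can be compared against.
    """
    hits: list[str] = []
    for keyword in PERF_KEYWORDS:
        completed = subprocess.run(
            ["git", "-C", repo_path, "log", "--all", "-i",
             f"--grep={keyword}", "--format=%H"],
            capture_output=True, text=True, check=True,
        )
        hits.extend(completed.stdout.split())
    # Deduplicate while preserving order.
    seen: set[str] = set()
    unique = [h for h in hits if not (h in seen or seen.add(h))]
    return unique[:limit]
```

A real pipeline would additionally need to confirm that each candidate commit comes with a reproducible benchmark and a measurable improvement, which is what makes curating such a dataset non-trivial.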
FormulaCode proves challenging for frontier LLMs and agentic frameworks, and unrestricted repository exploration emerges as a key factor in finding performance inefficiencies. With FormulaCode, we aim to drive the development of next-generation optimization algorithms that meet the rigorous demands of real-world software projects.
This project would not have been possible without the excellent work of the community. The following papers provide useful background on the premise of our work:
If you found this post interesting, please read our paper for mathematical details and experimental results. You can cite our paper as follows:
@misc{sehgal2025formulacode,
  title={Evaluating Agentic Superoptimization on Large Codebases},
  author={Atharva Sehgal and Patrick Yuan and Ziniu Hu and Yisong Yue and Jennifer J. Sun and Swarat Chaudhuri},
  year={2025},
  eprint={????.?????},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/????.?????},
}