Home

fc-data is a Python package for automatically curating and managing FormulaCode tasks. After installation, fc-data is designed to run as a monthly cron job that updates the FormulaCode dataset with new commits and repositories.
FormulaCode is a continually updating benchmark for evaluating the holistic ability of LLM agents to optimize codebases. FormulaCode consists of two parts: a pipeline to construct performance optimization tasks, and an execution harness that connects a language model to our terminal sandbox.
How it works
graph LR
A --->|scrape| B
A2 <-->|sync| B
B -->|publish| C
B -->|publish| D
A[GitHub]
A2[Supabase]
B["fc-data<br/>(This repository)"]
C[DockerHub]
D[HuggingFace]
Get started
Most interaction with fc-data goes through a single command, `fc-data`, which runs all 7 pipeline stages: repo discovery, PR scraping, LLM classification, dependency resolution, problem rendering, Docker synthesis, and publishing. See the Pipeline guide for the full CLI reference.
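The stage ordering, together with the `--resume`, `--stage`, and `--dry-run` options that the CLI supports, can be pictured with a small sketch. It is illustrative only: the stage identifiers, the `run_pipeline` function, and the `completed` argument are hypothetical names chosen for this example, and the way fc-data actually selects and tracks stages may differ.

```python
from typing import Callable, Iterable, Optional

# Hypothetical stage identifiers mirroring the seven stages named in the text;
# the names fc-data uses internally may differ.
STAGES = [
    "repo_discovery",
    "pr_scraping",
    "llm_classification",
    "dependency_resolution",
    "problem_rendering",
    "docker_synthesis",
    "publishing",
]


def run_pipeline(
    handlers: dict[str, Callable[[], None]],
    completed: Iterable[str] = (),
    resume: bool = False,
    stage: Optional[str] = None,
    dry_run: bool = False,
) -> list[str]:
    """Run stages in order and return the names of the stages selected.

    resume  -- skip stages already recorded in `completed`
    stage   -- restrict the run to a single named stage
    dry_run -- select stages but do not execute their handlers
    """
    if stage is not None and stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    done = set(completed)
    selected = []
    for name in STAGES:
        if stage is not None and name != stage:
            continue
        if resume and name in done:
            continue
        selected.append(name)
        if not dry_run:
            handlers[name]()
    return selected


if __name__ == "__main__":
    log: list[str] = []
    handlers = {name: (lambda n=name: log.append(n)) for name in STAGES}
    # Resume after the first two stages have already finished.
    print(run_pipeline(handlers, completed=STAGES[:2], resume=True))
    # Dry run of a single stage: the stage is selected but its handler is not called.
    print(run_pipeline(handlers, stage="publishing", dry_run=True))
```

Keeping the stages in one ordered list makes resuming a matter of filtering out finished names, and a dry run reuses the same selection logic without side effects.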
Key features
- Single-command pipeline: `fc-data` runs all stages with `--resume`, `--stage`, and `--dry-run` support
- GitHub scraping: Async `httpx` client with automatic token rotation across multiple `GH_TOKENS`
- LLM classification: DSPy-based agents classify PRs by performance category and difficulty
- Docker synthesis: Automatically generate Docker build contexts using coding agents (Claude, Codex, Gemini)
- Scalable runners: Async runners with concurrency control, Supabase progress tracking, and per-item error isolation
- Dataset publishing: Versioned Parquet datasets on HuggingFace, Docker images on DockerHub
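To make the token rotation and runner features above more concrete, here is a minimal standard-library sketch. It assumes round-robin rotation and a semaphore-bounded set of workers; `TokenRotator`, `run_all`, and `fake_fetch` are hypothetical names introduced for illustration, and the real fc-data code (which uses `httpx` and Supabase) may be organized quite differently.

```python
import asyncio
import itertools
from typing import Awaitable, Callable


class TokenRotator:
    """Hand out tokens round-robin (an assumed rotation policy)."""

    def __init__(self, tokens: list[str]) -> None:
        if not tokens:
            raise ValueError("at least one token is required")
        self._cycle = itertools.cycle(tokens)

    def next(self) -> str:
        return next(self._cycle)


async def run_all(
    items: list[str],
    worker: Callable[[str, str], Awaitable[str]],
    rotator: TokenRotator,
    max_concurrency: int = 4,
) -> tuple[dict[str, str], dict[str, Exception]]:
    """Process items concurrently; a failing item never aborts the others."""
    semaphore = asyncio.Semaphore(max_concurrency)
    results: dict[str, str] = {}
    errors: dict[str, Exception] = {}

    async def guarded(item: str) -> None:
        async with semaphore:  # cap the number of in-flight workers
            try:
                results[item] = await worker(item, rotator.next())
            except Exception as exc:  # isolate the failure to this item
                errors[item] = exc

    await asyncio.gather(*(guarded(item) for item in items))
    return results, errors


async def fake_fetch(item: str, token: str) -> str:
    """Stand-in for a network call; fails for one item to show isolation."""
    await asyncio.sleep(0)
    if item == "bad/repo":
        raise RuntimeError("simulated failure")
    return f"{item} fetched with {token}"


if __name__ == "__main__":
    rotator = TokenRotator(["token-a", "token-b"])
    results, errors = asyncio.run(
        run_all(["org/one", "bad/repo", "org/two"], fake_fetch, rotator, max_concurrency=2)
    )
    print(results)
    print(errors)
```

Collecting failures in a separate `errors` mapping, rather than letting them propagate out of `asyncio.gather`, is what keeps one bad item from cancelling the rest of the batch.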
Quick links
- Installation: Set up your development environment
- Pipeline guide (`fc-data`): The primary entrypoint, with the full CLI reference and stage descriptions
- Quickstart: Python API examples
- Configuration: `tokens.env` and environment variables